RapidMiner in Amazon EC2
I've recently been running R programs in Amazon EC2. It is just fantastic. I love this model in which all you need is a terminal.
I saw a question here about RapidMiner & EC2 a while ago. Has anybody experimented with using both since then?
(I'm new to the whole cloud thing, but it would be nice if somebody made available an AMI capable of running RapidMiner.)
Regards,
\E.
Answers
I heard from someone who has already done this. In fact, there was a talk about it at the last OSBI...
If you find or create such an AMI, please let us know! It would be quite useful for everybody here.
Greetings,
Sebastian
Here's a quick sketch of what I did (I'm leaving out the steps for getting an Amazon account, getting the credentials, etc.):
1) Launch a Windows 2008 Server image (for instance, Basic 64-bit Microsoft Windows Server 2008, AMI Id: ami-d9e40db0). Make sure that you open the RDP port (3389), because that's how you are going to communicate with the virtual machine.
2) Install Java (www.java.com/downloads)
3) Install RapidMiner
The whole process takes about 20-25 minutes depending on how quickly the instance is available. If you guys want a step-by-step guide, you can download the instructions at https://s3.amazonaws.com/mirlitus/RapidMinerAmazonEC2.pdf. The only things that I've left out are the creation of an account and of initial security credentials (I have had an account with AWS for 3 years now. I've forgotten what I did, but it is fairly easy.)
For benchmarking purposes, I ran a small program on the following machines:
- Lenovo X201, 8 GB RAM, Windows 7, dual-core
- Dell Precision T5400, 4 GB RAM, Windows XP, quad-core
The program was simple: finding the best subset of variables to estimate a logistic regression using the operator (Optimization Brute Force Parallel).
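To give an idea of why this kind of job parallelizes so well, here is a minimal Python sketch of brute-force subset search. It is not what RapidMiner does internally; the feature names and the `toy_score` function are hypothetical stand-ins for "fit a logistic regression on these variables and return its performance". The point is only that every subset is scored independently, so the work can be split across cores.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every non-empty subset of the feature names."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset

def best_subset(features, score):
    """Return the highest-scoring subset (first one found on ties)."""
    return max(all_subsets(features), key=score)

# Hypothetical stand-in for training and evaluating a model on a subset.
def toy_score(subset):
    contribution = {"age": 0.3, "income": 0.5, "noise": -0.1}
    return sum(contribution[name] for name in subset)

features = ["age", "income", "noise"]
candidates = list(all_subsets(features))
print(len(candidates))                    # 2**3 - 1 subsets to evaluate
print(best_subset(features, toy_score))
```

With n variables there are 2**n - 1 candidate subsets, so the number of independent model fits grows very quickly as variables are added.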
I mounted the image on the best machine available (26 "cores", 68 GB RAM). By the way, the download speeds are awesome (I downloaded Java and RapidMiner in a few seconds).
Times:
- Lenovo (without Parallel): 28 min
- Lenovo (with Parallel): 14 min
- Dell (with Parallel): 11.5 min
- Amazon: 2.5 min
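For anyone who wants the relative speedups, here is a small Python sketch that computes them from the times above. Using the Lenovo run without Parallel as the baseline is my choice for illustration; since the machines differ, the ratios mix hardware differences with the effect of parallelization.

```python
# Run times in minutes, taken from the list above.
baseline_min = 28.0  # Lenovo without Parallel (chosen as the baseline)
timings_min = {
    "Lenovo (with Parallel)": 14.0,
    "Dell (with Parallel)": 11.5,
    "Amazon": 2.5,
}

def speedup(baseline, elapsed):
    """How many times faster a run is than the baseline run."""
    return baseline / elapsed

speedups = {name: speedup(baseline_min, t) for name, t in timings_min.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

This gives roughly 2.0x for the Lenovo with Parallel, 2.4x for the Dell, and 11.2x for the Amazon instance.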
In the next few days, I'll try to put together the image and I will let you know. You have to understand that I'm still learning the whole thing, but it looks promising.
Another possibility for you guys at RapidMiner is to talk to the guys at Bitnami (http://bitnami.org/). What they do is create images for open-source programs. One of the first examples I tried was setting up a Moodle server (a course management system). All I did was select the Bitnami image, and I had the server working in minutes.
Thank you for the information. Sounds really promising...
Greetings,
Sebastian
AMI ID: ami-c31bf0aa
Name: windows2008-rapidminer5.0
Description: Windows Server 2008 with Rapid Miner 5.0 Installed
Source: 618748120321/windows2008-rapidminer5.0
Owner: 618748120321
Visibility: Public
Architecture: x86_64
Platform: Windows
Root Device Type: ebs
Root Device: /dev/sda1
Image Size: 30 GiB
Virtualization: hvm
One last thing: The password to access the Windows Server is 'mirlitus'. Change it as soon as you log in.
As you can see, it is a Windows 2008 version (these images are a little more expensive because Microsoft has to be paid). If there is enough interest, I could prepare an Ubuntu image; the hourly charges drop significantly.
I installed Java, Firefox, and OpenOffice too. I'm planning a second image that will have other programs installed, namely R, text editors, etc.
You can run this image on instances from m1.large ($0.48/hour regular price, or about $0.22 on the spot market) to m2.4xlarge ($2.88/hour regular price, or about $1.20 on the spot market).
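To get a feel for what a job costs at these prices, here is a rough Python sketch. It assumes that each started hour is billed as a full hour (check the current billing rules before relying on this), and the spot prices are only the approximate figures quoted above, since spot prices fluctuate. The `job_cost` helper is just for illustration.

```python
import math

# Hourly prices in USD as quoted above; spot prices are approximate.
PRICES = {
    "m1.large": {"regular": 0.48, "spot": 0.22},
    "m2.4xlarge": {"regular": 2.88, "spot": 1.20},
}

def job_cost(minutes, hourly_rate):
    """Estimated cost of a job, rounding the run time up to whole hours."""
    # Assumption: every started hour is charged in full.
    billed_hours = math.ceil(minutes / 60)
    return round(billed_hours * hourly_rate, 2)

# The 2.5-minute benchmark on the largest instance at the regular price.
print(job_cost(2.5, PRICES["m2.4xlarge"]["regular"]))

# A 10-hour job on the smallest instance at the approximate spot price.
print(job_cost(600, PRICES["m1.large"]["spot"]))
```

Under that billing assumption, even a job of a few minutes costs one full hour, so short experiments are best batched into the same session.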
This is as far as my skills can take me. There is one type of instance I haven't been able to work with (cc1.4xlarge). It seems to be the most promising one, since you can cluster them at will. Amazon recently put together a cluster of 880 units and placed 145th among the Top 500 supercomputers in the world (see http://www.zdnet.com/blog/btl/amazon-web-services-tackles-high-performance-computing-instances/36632). But I'm not a computer scientist... :-(
Again, if anybody is successful playing with those instances, please share the info with us. I'm very interested. Likewise, if I can be of any help to any of you with the lesser instances, I'll be glad to help.
I'm curious about RapidAnalytics and the possibility of running it in Amazon EC2. How can I get my hands on the Community version?
Better late than never
Does anyone have a use case along the lines of: Mac OS X remote client, EC2 Windows 2008 Server, RM5, DisPaRe, GridGain & Amazon Elastic MapReduce?
Please share your experience.
My experience is negative, I guess. I could not get RapidMiner to run faster on Amazon than on my home PC.
My home PC is only an i7 at 2.667 GHz.
The paper "Distributed Pattern Recognition in RapidMiner" is imho very helpful:
"However, it has only limited support for parallelization and it lacks functionality to spread long-running computations over multiple machines. A solution to this is distributed computing with paradigms like MapReduce. In this paper, we present a system called DisPaRe, which integrates distributed computing frameworks into RapidMiner. " (cited)
e.g. k-means could be "easily" scaled and "run faster" ...
cheers
I agree with Clemens. It depends on the type of job you are running and on the type of Amazon instance you choose.
Jobs that can take advantage of parallelization (like cross-validation operators or feature selection, both in their Parallel versions of course) will run much faster in the cloud.
Jobs that consume a lot of memory would also benefit from the cloud.
Amazon now offers different types of instances. The really cheap ones are not going to beat your laptop; others will. The best machine I've used is the equivalent of an 8-core with 68 GB of memory, but it costs $2.8/hour.
\E
Has anyone tried running RapidAnalytics on an Amazon-type instance? It would be great to have an instance so real-time applications could be developed for web applications, etc.
But I was surprised how easy it is to configure (believe me, I'm not an expert in computers by any standard).
Steps:
1) Spin up a Linux machine of your choice (I used CentOS because I wanted the fastest machines with the largest memory)
2) Install Java
3) Install MySQL
4) Follow, step by step, the instructions that come as documentation for RapidAnalytics
5) That's it.
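Before starting step 4, it can help to confirm that Java and MySQL are actually installed. Here is a small Python sketch for that check. It assumes the commands are called `java` and `mysql` and are on the PATH, which may differ on your distribution; it only checks that the commands exist, not their versions or whether the MySQL server is running.

```python
import shutil

# Assumed command names; adjust them for your distribution if needed.
REQUIRED_COMMANDS = ["java", "mysql"]

def check_prerequisites(commands=REQUIRED_COMMANDS):
    """Map each command name to True if it is found on the PATH."""
    return {name: shutil.which(name) is not None for name in commands}

if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```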
There is one trick you need, which has to do with changing the hostname. See this post: http://rapid-i.com/rapidforum/index.php/topic,2930.0.html.
Does anyone know if the community license for RapidAnalytics allows for commercial use? So far I've just been doing a lot of research.
About commercial use: My understanding is that you can use the community version for commercial purposes. However, you don't get the technical support you would get from Rapid-I. If you want that support (and I can imagine that it has to be superb, since the folks at Rapid-I are so nice to those of us who use the community version), the enterprise version is a good option. The enterprise version also gives you additional functionality not present in the community version.
They will make you an offer precisely tailored to your needs.
So you pay only for what you need.
Even though the (virtual) internal memory can be large enough to load the database, won't it be impossible to process it unless the RapidMiner algorithms are updated for parallelization? Or will my BIG database get loaded and processed by any of RapidMiner's algorithms, assuming that they really see the cloud's internal memory as one big chunk of memory?
Anyone? My knowledge of parallel computing is limited.