"Creating Rapidminer Server Cluster"
Hi,
Does anyone have a link to how we can create RM server clusters? I want to be able to distribute my jobs across multiple machines without Hadoop.
Thanks
Answers
@data123
The following documentation should be helpful in setting up the Server cluster.
https://docs.rapidminer.com/latest/server/installation/job-agents.html
https://docs.rapidminer.com/latest/server/administration/job-containers.html
https://docs.rapidminer.com/latest/server/administration/
Pavithra's links to the new Job Agent architecture for RM Server describe how to create distributed processing nodes that listen for and take jobs from a single RM Server--but they don't create an inherently parallel architecture. I think Data123 wants RM to decompose RM jobs into tasks that are executed in parallel...
Today, Hadoop is the only distributed processing framework that will run RapidMiner jobs in a horizontally scaled parallel architecture. That said, you can decompose jobs using the ScheduleProcess operator in a RapidMiner workflow, and if you have multiple Job Agents connected to your Server, you can achieve the grid-computing-style paradigm you seek.
Imagine you had something you were going to do 40 times, with each of the 40 iterations independent of the others--you could loop over the entities and call ScheduleProcess, passing an iteration number / primary key / whatever as a macro to each submitted process. If you have multiple Job Agents (which, when licensed, can be deployed on multiple hosts), you have a simple grid without Hadoop.
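Just to illustrate the fan-out pattern (not RapidMiner itself): here is a minimal Python sketch of the same idea, where a worker pool stands in for the queue of Job Agents and `run_job` is a hypothetical stand-in for one ScheduleProcess submission that receives its macro as an argument:

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(key):
    # Hypothetical stand-in for one ScheduleProcess submission: each job
    # receives its "macro" (here, the key) and runs independently of the rest.
    return key, key * key  # placeholder result for the sketch

keys = range(40)  # 40 independent iterations
with ThreadPoolExecutor(max_workers=4) as pool:  # 4 workers ~ 4 Job Agents
    results = dict(pool.map(run_job, keys))

print(len(results))  # 40 jobs completed, in whatever order workers picked them up
```

The key property is the same one that makes the Job Agent grid work: no job reads or writes anything another job needs, so the pool can schedule them in any order on any worker.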
@mschmitz / @sgenzer did I miss anything? Please correct me if I'm wrong
@RandyLeBlanc,
all good. The only thing to add: there is a good reason for it. Many of the algorithms that are part of RM are not "decomposable". Spark's MLlib is the state of the art for what is possible in distributed ML, and you can see from its list of supported algorithms that it is not that much compared to the hundreds of learners in RM.
Best,
Martin
Dortmund, Germany
well IMHO the answer is yes and no @RandyLeBlanc and @mschmitz - certainly Radoop is the "right" way to do this for all the reasons explained. But I could certainly envision a cleverly written series of processes that spawns jobs across separate Job Agents to improve efficiency, e.g. batch processing a large file.
Scott
So what I'm referring to is what is known as "coarse-grained parallelism," and it's fairly common in many RM users' workflows. Imagine I'm a retailer with 1,000 products and I'm building one model for each product... this would be super simple to do in a loop by submitting 1,000 ScheduleProcess calls. I don't need to combine the results at the end--I just need 1,000 models trained, and if I have multiple Job Agents on the queue with multiple Job Containers, they can go off and eat up all the computing resources we want.
What it doesn't accomplish is "fine-grained parallelism." Imagine a stochastic model where I run a million simulations and want to carve them up into 10 chunks of 100,000 simulations each, then average or aggregate the results into a single output. You can't do that with the ML algorithms in RM Core--that is what MLlib does on Spark, and it is not what we can achieve with my ScheduleProcess grid. But if the level of granularity is "a big enough chunk that distribution makes sense" and "this chunk can complete without any regard to any other chunk," then it's a usable scenario.
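To make the split-and-aggregate shape concrete (again outside RapidMiner): here is a small Python sketch using a Monte Carlo estimate of pi as a hypothetical stand-in for the stochastic model. Each chunk runs independently, but the final averaging step is exactly the part a plain ScheduleProcess grid doesn't do for you:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate_chunk(n, seed):
    # One independently runnable chunk: count random points in the unit
    # quarter-circle. Each chunk gets its own seeded RNG so runs are independent.
    rng = random.Random(seed)
    return sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))

chunks, per_chunk = 10, 100_000  # 10 chunks of 100,000 simulations each
with ThreadPoolExecutor() as pool:
    hits = pool.map(simulate_chunk, [per_chunk] * chunks, range(chunks))
    # The aggregation step: combine all chunk results into a single output.
    pi_estimate = 4 * sum(hits) / (chunks * per_chunk)

print(pi_estimate)  # close to 3.14159
```

The map phase here is the same coarse-grained pattern as before; the difference is the reduce phase at the end, which is what frameworks like Spark/MLlib handle for you inside the learner.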
Wait, you said what I said in two sentences and I took 20. Oops