
"Creating Rapidminer Server Cluster"

data123 Member Posts: 23 Maven
edited June 2019 in Help

Hi,

Does anyone have a link to how we can create RM server clusters? I want to be able to distribute my jobs across multiple machines without Hadoop.


Thanks

Answers

  • Pavithra_Rao Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 123 RM Data Scientist
  • RandyLeBlanc Employee-RapidMiner, Member Posts: 6 RM Team Member

    Pavithra's links to the new Job Agent architecture for RM Server describe a way to create distributed processing nodes that listen to and take jobs from a single RM Server--but they don't create inherently parallel architectures. I think @data123 wants RM to decompose RM jobs into tasks that are executed in parallel...


    Today, Hadoop is the only distributed processing framework that will run RapidMiner jobs in a horizontally scaled parallel architecture. That said, you can decompose jobs yourself using the ScheduleProcess operator in a RapidMiner workflow, and if you have multiple Job Agents connected to your Server, you can achieve the grid-computing-style paradigm you seek.


    Imagine you had something you were going to do 40 times, and each of the 40 iterations was independent of the others--you could loop over the entities and call ScheduleProcess, passing an iteration number / primary key / whatever as a macro to each scheduled process. If you have multiple Job Agents (which can be deployed, when licensed, on multiple hosts), you have a simple grid without Hadoop.
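
    To make that concrete, here is a minimal Python sketch of the one-job-per-key pattern. To be clear: the Server URL, repository path, credentials, and the exact REST endpoint and payload shape below are placeholders and assumptions on my part--check the REST API docs for your Server version rather than copying this verbatim.

        import requests  # third-party HTTP client

        SERVER = "http://rm-server.example.com:8080"   # placeholder Server URL
        PROCESS = "//repo/processes/handle_one_key"    # placeholder process path
        QUEUE = "DEFAULT"                              # queue the Job Agents listen on

        for key in range(40):  # the 40 independent iterations
            # Each POST queues one job; whichever Job Agent is free pulls it off
            # the queue, so the 40 jobs spread across all connected agents.
            # NOTE: endpoint and payload shape are assumptions, not documented API.
            resp = requests.post(
                f"{SERVER}/executions/jobs?queueName={QUEUE}",
                json={
                    "location": PROCESS,
                    "context": {"macros": {"iteration_key": str(key)}},
                },
                auth=("admin", "changeit"),  # placeholder credentials
                timeout=30,
            )
            resp.raise_for_status()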


    @mschmitz / @sgenzer did I miss anything? Please correct me if I'm wrong :)

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    @RandyLeBlanc,


    all good. The only thing to add: there is a good reason for it. A lot of the algorithms that are part of RM are not "decomposable". Spark's MLlib is the state of the art for what's possible in distributed ML, and you can see from its list of supported algorithms that this is not thaaat much compared to the hundreds of learners in RM.


    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    well IMHO the answer is yes and no @RandyLeBlanc and @mschmitz - certainly Radoop is the "right" way to do this for all the reasons explained. But I could certainly envision a cleverly written series of processes that would spread work across separate Job Agents in order to improve efficiency, e.g. batch-processing a large file.


    Scott


  • RandyLeBlanc Employee-RapidMiner, Member Posts: 6 RM Team Member

    So what I'm referring to is known as "coarse-grained parallelism", and it's fairly common in many RM users' workflows. Imagine I'm a retailer with 1,000 products and I'm building one model per product... this would be super simple to do in a loop that submits 1,000 ScheduleProcess calls. I don't need to compose the results at the end--I just need 1,000 models trained, and if I have multiple JAs (Job Agents) on the queue, each with multiple JCs (Job Containers), they can go off and eat up all the computing resources we want.


    What it doesn't accomplish is "fine-grained parallelism." Imagine a stochastic model where I run a million simulations, and I want to carve the run up into 10 chunks of 100,000 simulations each and then average or aggregate the results into a single output. You can't do that with the ML algorithms in RM Core--that is what MLlib does on Spark, but it is not what we can achieve with my ScheduleProcess grid... Still, if the level of granularity is "a chunk big enough that distribution makes sense" and "this chunk can complete without any regard to any other chunk", then it's a usable scenario.
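
    For illustration, here is a toy Python sketch of that scatter-gather idea--run independent chunks, then aggregate the partial results into one answer. To be clear, this is a local multiprocessing stand-in for the pattern, not RapidMiner or Spark code.

        from multiprocessing import Pool
        import random

        def run_chunk(args):
            # One worker runs an independent chunk of simulations and returns
            # a partial aggregate (sum and count, enough to compute a mean).
            seed, n_sims = args
            rng = random.Random(seed)
            total = sum(rng.gauss(0.0, 1.0) for _ in range(n_sims))
            return total, n_sims

        if __name__ == "__main__":
            chunks = [(seed, 100_000) for seed in range(10)]  # 10 x 100,000 sims
            with Pool(processes=10) as pool:
                partials = pool.map(run_chunk, chunks)        # scatter
            grand_total = sum(t for t, _ in partials)         # gather
            grand_count = sum(n for _, n in partials)
            print("mean of 1,000,000 simulations:", grand_total / grand_count)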

  • RandyLeBlancRandyLeBlanc Employee-RapidMiner, Member Posts: 6 RM Team Member

    Wait, you said what I said in two sentences and I took 20. Oops
