Join example sets with a loop
Dear RapidMiner Community,
I have looked through several posts about loops and joins in the forum already but I haven't found what I am looking for.
What I am trying to do:
I have 7 example sets that can be joined with the "Join" operator, using an outer join with "date" as the join attribute. Instead of building a process in which I join all the example sets manually, I would like to do this with a loop. The loop should simply go through the example sets and join them together into one big file. Is this possible? I have gone through the available loop operators but haven't found a solution: if I put the "Join" operator inside a loop, it still needs two inputs to join anything.
How do I handle this? Is there an operator which can do this?
Best regards
Felix
Best Answer
BalazsBarany (Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert)
Hi,
please try this example.
<?xml version="1.0" encoding="UTF-8"?>
<process version="8.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="remember" compatibility="8.1.000" expanded="true" height="68" name="Remember" width="90" x="246" y="34">
        <parameter key="name" value="Current set"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="8.1.000" expanded="true" height="68" name="Retrieve Iris (2)" width="90" x="112" y="136">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="8.1.000" expanded="true" height="68" name="Retrieve Iris (3)" width="90" x="112" y="238">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="8.1.000" expanded="true" height="68" name="Retrieve Iris (4)" width="90" x="112" y="340">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="8.1.000" expanded="true" height="68" name="Retrieve Iris (5)" width="90" x="112" y="442">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="collect" compatibility="8.1.000" expanded="true" height="145" name="Collect" width="90" x="380" y="136"/>
      <operator activated="true" class="loop_collection" compatibility="8.1.000" expanded="true" height="68" name="Loop Collection" width="90" x="514" y="136">
        <process expanded="true">
          <operator activated="true" class="recall" compatibility="8.1.000" expanded="true" height="68" name="Recall" width="90" x="112" y="34">
            <parameter key="name" value="Current set"/>
          </operator>
          <operator activated="true" class="concurrency:join" compatibility="8.1.000" expanded="true" height="82" name="Join" width="90" x="313" y="85">
            <parameter key="remove_double_attributes" value="false"/>
            <list key="key_attributes">
              <parameter key="id" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="remember" compatibility="8.1.000" expanded="true" height="68" name="Remember (2)" width="90" x="514" y="85">
            <parameter key="name" value="Current set"/>
          </operator>
          <connect from_port="single" to_op="Join" to_port="right"/>
          <connect from_op="Recall" from_port="result" to_op="Join" to_port="left"/>
          <connect from_op="Join" from_port="join" to_op="Remember (2)" to_port="store"/>
          <portSpacing port="source_single" spacing="84"/>
          <portSpacing port="sink_output 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="recall" compatibility="8.1.000" expanded="true" height="68" name="Final set" width="90" x="648" y="136">
        <parameter key="name" value="Current set"/>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Remember" to_port="store"/>
      <connect from_op="Retrieve Iris (2)" from_port="output" to_op="Collect" to_port="input 1"/>
      <connect from_op="Retrieve Iris (3)" from_port="output" to_op="Collect" to_port="input 2"/>
      <connect from_op="Retrieve Iris (4)" from_port="output" to_op="Collect" to_port="input 3"/>
      <connect from_op="Retrieve Iris (5)" from_port="output" to_op="Collect" to_port="input 4"/>
      <connect from_op="Collect" from_port="collection" to_op="Loop Collection" to_port="collection"/>
      <connect from_op="Final set" from_port="result" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Like the one in the other thread, this process uses Remember and Recall. You "Remember" the first example set and group the others into a collection. Then, inside Loop Collection, you recall the current state of the joined example set, join it with the next one, and remember the new state. At the end you recall the final set.
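For readers who think in code, the Remember/Recall loop above behaves like a running accumulator. The following is only a minimal Python sketch of that idea, not RapidMiner code: the function names `outer_join` and `join_all`, the dict-based layout (join key mapped to a dict of attributes), and the sample values are all assumptions made for illustration. It also assumes each key occurs at most once per set and that attribute names do not clash.

```python
def outer_join(left, right):
    """Outer-join two example sets given as {key: {attribute: value}}.

    Illustrative assumption: keys are unique within each set. Attributes
    missing on one side are simply absent from the joined row.
    """
    joined = {}
    for key in sorted(left.keys() | right.keys()):
        # Merge the attributes from both sides for this key.
        joined[key] = {**left.get(key, {}), **right.get(key, {})}
    return joined


def join_all(example_sets):
    """Fold a list of example sets into one, mirroring the process above."""
    if not example_sets:
        return {}
    current = example_sets[0]            # "Remember" the first example set
    for next_set in example_sets[1:]:    # "Loop Collection" over the others
        # "Recall" the current state, "Join" it, "Remember" the new state.
        current = outer_join(current, next_set)
    return current                       # final "Recall"


# Hypothetical sample data keyed by date.
sets = [
    {"2018-01-01": {"a": 1}, "2018-01-02": {"a": 2}},
    {"2018-01-02": {"b": 20}, "2018-01-03": {"b": 30}},
    {"2018-01-01": {"c": 100}},
]
result = join_all(sets)
print(result["2018-01-02"])  # {'a': 2, 'b': 20}
```

The loop body never needs to know how many sets there are, which is the same property that lets the RapidMiner process handle a collection of any size.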
Regards,
Balázs
Answers
Hi!
You could try the Collect operator, and then Loop Collection.
Regards,
Balázs
Hi Balazs,
I already tried to implement the process that one of your colleagues (Edin_Klapic) posted in the forum (https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Problem-with-combining-all-example-set-from-IO-Object-Collection/m-p/38582), but it doesn't work for me. I think the problem is that what I am feeding into the loop_collection is an IOObjectCollection and not an example set. The join attribute in this process doesn't work for me.
Hi Balazs,
thank you very much for your help, now it works!
But overall I have to say that this whole process is only slightly handier than simply joining the files together manually.
But anyway, thank you for your help!
Hi felix_w,
You're absolutely right. But this kind of process can cope with a variable number of incoming example sets.
These kinds of collections come out of many loops in RapidMiner. You could read in a set of database tables, CSV files, web APIs etc. in loops and you would get such a collection. With this process (or a variant of it) you'd be able to automatically process these.
Regards,
Balázs