Creative Misuse of RapidMiner
One of the most fun events at the RapidMiner Wisdom conference is the live predictive analytics process design competition "Who Wants to be a Data Miner?" In this competition, participants must design RapidMiner processes for a given goal within a few minutes. The tasks are related to predictive analytics and data analysis in general, but are rather uncommon. In fact, most of the challenges ask for things RapidMiner was never supposed to do.
During RapidMiner Wisdom 2016 in New York City, we again had two tasks prepared for the audience. Three brave contestants battled against each other and the clock to find the right solution (or at least something which is close enough). The first task this year was:
Create the full lyrics to “99 Bottles of Beer on the Wall”
According to Wikipedia, “99 Bottles of Beer is an anonymous United States folk song dating to the mid-20th century. It is a traditional song in both the United States and Canada. It is popular to sing on long trips, as it has a very repetitive format which is easy to memorize, and can take a long time to sing.”
Well, yeah. Some say that there are numerous problems with this song, but that, although a funny read, is not the subject of this post. (By the way, the song has appeared many times in popular culture as well: maybe most notably, at least for some, in the game Monkey Island.)
Anyway, here is how the song goes:
99 bottles of beer on the wall, 99 bottles of beer.
Take one down and pass it around, 98 bottles of beer on the wall.
98 bottles of beer on the wall, 98 bottles of beer.
Take one down and pass it around, 97 bottles of beer on the wall.
97 bottles of beer on the wall, 97 bottles of beer.
Take one down and pass it around, 96 bottles of beer on the wall.
…
1 bottle of beer on the wall, 1 bottle of beer.
Take one down and pass it around, 0 bottles of beer on the wall.
The full lyrics can be found here, but I think you get the idea.
So how can we solve the task above with RapidMiner?
Let’s start with a screenshot of the solution first:
We start with the operator “Generate Data” and generate a random data set with only 1 column and 100 examples (set this in the parameters of the operator). This is maybe not the most elegant way, but it is one of the easiest ways in RapidMiner to get a data set with a specific structure and size. As a next step, we need the numbers from 1 to 100 in an extra column. Again, there are multiple ways to achieve this, but the simplest is probably the operator “Generate ID”, which does exactly that. We can now use “Select Attributes” to remove the columns originally generated by “Generate Data”, i.e. we only keep our new “id” column. The result is a data set with 100 rows and the numbers 1 to 100 in one column named “id”.
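If it helps to see these three steps outside of RapidMiner, here is a rough Python sketch of the same idea. It is only an illustration and not how RapidMiner works internally; the column name "att1" and the variable name rows are assumptions made for this sketch.

```python
import random

# Stand-in for "Generate Data": 100 rows with one random column.
# The column name "att1" is an assumption for this sketch.
rows = [{"att1": random.random()} for _ in range(100)]

# Stand-in for "Generate ID": number the rows, starting at 1.
for number, row in enumerate(rows, start=1):
    row["id"] = number

# Stand-in for "Select Attributes": keep only the new "id" column.
rows = [{"id": row["id"]} for row in rows]
```

The random values are thrown away again, which is exactly the point: they only exist to give us 100 rows to number.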
Now all the logic happens in the next operator: “Generate Attributes”. The main problem to solve is how to transform the sequence of numbers from 1 to 100 into a sequence from 99 to 0. Well, that is easy: we generate a new value by subtracting the current “id” in each row from 100. The number after “Take one down” has to be one lower, so it is computed as 99 minus “id”. At the same time, we add the rest of the lyrics around those numbers. Here is how you need to set the parameters of “Generate Attributes” to achieve this:
Now you could even concatenate all these new columns into a single one if you want to. I leave it to you to figure out how. The final result after executing the process then looks like the following screenshot (only showing the beginning):
If you run the process yourself, check out the last line as well. I admit that we could handle this a bit better, since the created lyrics end with: “0 bottles of beer on the wall, 0 bottles of beer. Take one down and pass it around. -1 bottles of beer on the wall.” Well, there is nothing wrong with -1 bottles of beer for mathematicians and physicists, but some IT systems might not like negative numbers of objects.
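For readers who think more easily in code, the logic of “Generate Attributes” can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function name build_lyrics is made up for this sketch, and the six columns are simply joined into one string per row. The expressions (100-id, 99-id) and the text fragments are taken from the process XML at the end of this post.

```python
def build_lyrics(n_rows=100):
    """Mirror the six generated columns c1, l1, c2, l2, c3, l3 for ids 1..n_rows."""
    l1 = " bottles of beer on the wall "
    l2 = " bottles of beer. Take one down and pass it around "
    l3 = " bottles of beer on the wall."
    lines = []
    for row_id in range(1, n_rows + 1):
        c1 = 100 - row_id  # counts down from 99 to 0
        c2 = 100 - row_id  # same number, repeated in the first sentence
        c3 = 99 - row_id   # one bottle fewer; reaches -1 in the last row
        # Joining the columns into one string is an assumption of this sketch.
        lines.append(f"{c1}{l1}{c2}{l2}{c3}{l3}")
    return lines


lyrics = build_lyrics()
print(lyrics[0])
print(lyrics[-1])
```

The last printed line shows the same -1 bottles as the RapidMiner result. One possible way around it in this sketch would be to call build_lyrics(99), so that the song stops when the wall is empty.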
Using RapidMiner for tasks like this is of course a bit, well, strange. But it also shows how flexible and powerful the visual approach of RapidMiner actually is. Others have created solutions in practically every programming language on earth, some shorter and some longer than others. But I would always prefer the RapidMiner solution over the code of most of them.
Below is the XML of the complete process. You can save it into an arbitrary file on your system and use “File -> Import Process…” to get it into RapidMiner.
Have fun trying this out!
XML of the Process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.0.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="30">
<parameter key="number_of_attributes" value="1"/>
</operator>
<operator activated="true" class="generate_id" compatibility="7.0.000" expanded="true" height="82" name="Generate ID" width="90" x="179" y="30"/>
<operator activated="true" class="select_attributes" compatibility="7.0.000" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="30">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="id"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="30">
<list key="function_descriptions">
<parameter key="c1" value="100-id"/>
<parameter key="l1" value="&quot; bottles of beer on the wall &quot;"/>
<parameter key="c2" value="100-id"/>
<parameter key="l2" value="&quot; bottles of beer. Take one down and pass it around &quot;"/>
<parameter key="c3" value="99-id"/>
<parameter key="l3" value="&quot; bottles of beer on the wall.&quot;"/>
</list>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>