How to write an Example Set from the local repository to Amazon S3
I've reformatted a large data set in RapidMiner and now want to write the result to S3. However, connecting the dataset (via the Retrieve operator) to the input port of the Write Amazon S3 operator results in the error: "Wrong connection - Your connection is producing the wrong type of data. Try changing the starting point of the connection."
The Write Amazon S3 operator only seems to work when I feed it a file from my computer via the Open File operator. But the data I need is stored as binary .ioo and .md files in my local repository, and when I upload either of these to S3 and then read them back, the contents are nonsense.
Could anyone suggest anything? I've also tried writing to Redshift with the Write Database operator, but it runs extremely slowly, to the point of crashing RapidMiner. I know my upload speed isn't the problem, as I'm running RapidMiner on a server with a 700MB upload speed. Many thanks in advance!
Best Answer
JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
One thing you might want to do is write the data into a format such as CSV before uploading it to S3. For example:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
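<!-- Read Amazon S3 downloads the stored object and outputs a file object, not an Example Set -->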
<operator activated="true" class="cloud_connectivity:read_amazons3" compatibility="8.2.000" expanded="true" height="68" name="Read Amazon S3" width="90" x="45" y="85">
<parameter key="connection" value="myConnection"/>
<parameter key="file" value="myFile"/>
</operator>
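<!-- Read CSV parses that file object into an Example Set for further processing -->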
<operator activated="true" class="read_csv" compatibility="8.2.000" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
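<!-- Write CSV serializes the Example Set back to CSV; its "file" output port carries a file object that Write Amazon S3 accepts -->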
<operator activated="true" class="write_csv" compatibility="8.2.000" expanded="true" height="82" name="Write CSV" width="90" x="313" y="34"/>
<operator activated="true" class="cloud_connectivity:write_amazons3" compatibility="8.2.000" expanded="true" height="68" name="Write Amazon S3" width="90" x="514" y="85">
<parameter key="connection" value="myConnection"/>
<parameter key="file" value="myFile"/>
</operator>
<connect from_op="Read Amazon S3" from_port="file" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_op="Write CSV" to_port="input"/>
<connect from_op="Write CSV" from_port="file" to_op="Write Amazon S3" to_port="file"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
Also, I haven't tested the Redshift upload & download speed via JDBC, but let's assume there is some bottleneck that makes both the process above and the Redshift write run slowly.
In that case you can spin up a small EMR cluster and connect to it with RapidMiner Radoop. Then, inside Radoop, use a Read CSV operator to stream your data into AWS, and finally use a Write Database or Store in Hive operator to write it to S3.
See here for an article on Store in Hive with custom storage handlers:
Custom storage handlers on Hadoop when using Radoop "Store in Hive"
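Coming back to the original question (writing an Example Set from the local repository straight to S3, without reading anything back first), a minimal sketch of that direction might look like the process below. It is untested, and the repository path, connection name, and target file name are placeholders you would swap for your own: Retrieve feeds Write CSV, whose file output port carries the file object that Write Amazon S3 expects.
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<!-- Retrieve loads the reformatted Example Set from the local repository (placeholder path) -->
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Local Repository/myDataSet"/>
</operator>
<!-- Write CSV turns the Example Set into a CSV file object on its "file" port -->
<operator activated="true" class="write_csv" compatibility="8.2.000" expanded="true" height="82" name="Write CSV" width="90" x="179" y="34"/>
<!-- Write Amazon S3 uploads that file object to the bucket (placeholder connection and key) -->
<operator activated="true" class="cloud_connectivity:write_amazons3" compatibility="8.2.000" expanded="true" height="68" name="Write Amazon S3" width="90" x="313" y="34">
<parameter key="connection" value="myConnection"/>
<parameter key="file" value="myFile.csv"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Write CSV" to_port="input"/>
<connect from_op="Write CSV" from_port="file" to_op="Write Amazon S3" to_port="file"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>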
Answers
Thank you @JEdward! Of course, it seems obvious now that I needed to feed a CSV file to the Write Amazon S3 operator rather than the Example Set.