How to Properly Use Loop Amazon S3
Community,
I am trying to extract data from S3 using the "Loop Amazon S3" operator. It is Twitter data and the data files are nested pretty deeply - for example: raw_data/2016/10/11/16/file_1.txt
I must not have it configured correctly, because RM tells me "Input Missing .... previous operator did not return any output". If I point the operator to a higher directory like "10", the process runs a long time before erroring. If I point it to a directory like "16" (i.e. the directory where all my files are located), it still gives an error.
I suspect I need to customize the "macro" fields, but the descriptions of the fields don't really make sense to me. Right now the "file name", "file path" and "parent path" macro fields contain the default values.
My layout goes like: [Loop Amazon S3] -> [Read Document] -> [JSON to Data] -> results
Thanks for your help!
Best Answer
mmichel Employee-RapidMiner, Member Posts: 129 RM Engineering
Hi AustinT,
'Loop Amazon S3' is a meta operator: it runs its inner subprocess once for each file it finds, so the operators that handle a file have to be placed inside the operator itself.
To do that, double-click the operator and move the other operators (Read Document and JSON to Data) inside the 'Loop Amazon S3' operator.
You should end up with something like this:
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="cloud_connectivity:loop_amazons3" compatibility="7.2.000" expanded="true" height="82" name="Loop Amazon S3" width="90" x="45" y="34">
        <parameter key="connection" value="AmazonS3"/>
        <parameter key="folder" value="/someFolder/someSubfolder"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="7.2.001-SNAPSHOT" expanded="true" height="68" name="Read Document" width="90" x="112" y="34"/>
          <connect from_port="file object" to_op="Read Document" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_port="out 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Loop Amazon S3" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Cheers,
Marcel
Answers
Thank you for the quick response, Marcel. Here's what the subprocess within the Loop Amazon S3 operator looks like. I have chosen a directory very close to the "node" (so to speak), so I don't expect the operator to run very long. It is still running, so I will check back when I have some results. Thanks again.
EDIT: Although it ran for a while, it worked very nicely! The next things to troubleshoot are text encoding and combining the results into one dataset. I'm a beginner! Thanks again.
Hi AustinT,
glad to hear that your process is working. Depending on the number of files and your internet connection, it may take some time to complete.
Just a quick tip for the process design phase: you don't want to execute the Loop Amazon S3 operator every time you change something in the process. Run it once and save its results with the Store operator. After that you can load the results with the Retrieve operator, so while designing, just use Retrieve in place of Loop Amazon S3. Otherwise you will be wasting a lot of time ;-)
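A rough sketch of the Store step, in case it helps. This is not exported from a real process: the operator class `store`, its `repository_entry` parameter, the port names `input` and `through`, and the repository path `//Local Repository/data/s3_documents` are written from memory or made up as placeholders, so check them against your own installation.

```xml
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
  <operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="cloud_connectivity:loop_amazons3" compatibility="7.2.000" expanded="true" name="Loop Amazon S3">
        <parameter key="connection" value="AmazonS3"/>
        <parameter key="folder" value="/someFolder/someSubfolder"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="7.2.001-SNAPSHOT" expanded="true" name="Read Document"/>
          <connect from_port="file object" to_op="Read Document" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_port="out 1"/>
        </process>
      </operator>
      <!-- Run this once: Store writes the loop output to the repository. -->
      <!-- The entry path below is a placeholder, not a real location. -->
      <operator activated="true" class="store" compatibility="7.2.003" expanded="true" name="Store">
        <parameter key="repository_entry" value="//Local Repository/data/s3_documents"/>
      </operator>
      <connect from_op="Loop Amazon S3" from_port="out 1" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_port="result 1"/>
    </process>
  </operator>
</process>
```

Once this has run successfully, you would swap Loop Amazon S3 and Store for a single Retrieve operator pointing at the same repository entry, and build the rest of the process on its output.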