"Read URL (CSV)
Hi,
I'm trying to use the Read URL-operator to get some financial data. This data is real-time, and I need to re-load it every now and then without having to manually download and edit the data. Example:
https://chartapi.finance.yahoo.com/instrument/1.0/AAPL/chartdata;type=quote;range=1d/csv
The Read URL fails because there are some lines on the top of the document (17 lines) that are only informational and not useful for the data set. I need to remove these lines.
Is there a way to remove or ignore these lines using RapidMiner operators?
Best Answer
-
Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
Are these streaming delayed quotes? I usually download the historical quotes with no problem. I wonder if it has to do with the data feed itself.
So this works but you'd need to rename the columns and verify that's correct.
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="open_file" compatibility="7.5.001" expanded="true" height="68" name="Open File" width="90" x="112" y="34">
<parameter key="resource_type" value="URL"/>
<parameter key="url" value="https://www.google.com/finance/getprices?q=AAPL&i=300&i=300&p=100d&f=d,c,h,l,o,v"/>
</operator>
<operator activated="true" class="read_csv" compatibility="7.5.001" expanded="true" height="68" name="Read CSV" width="90" x="246" y="34">
<parameter key="csv_file" value="C:\Users\THOMAS~1\AppData\Local\Temp\rm_file_4131079488985567346.dump"/>
<parameter key="column_separators" value="[0-9]+,[0-9]+,[0-9]+,[0-9]+,[0-9]+,[0-9]+"/>
<parameter key="comment_characters" value="\"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Comment"/>
<parameter key="1" value="Comment"/>
<parameter key="2" value="Comment"/>
<parameter key="3" value="Comment"/>
<parameter key="4" value="Comment"/>
<parameter key="5" value="Comment"/>
<parameter key="6" value="Comment"/>
<parameter key="7" value="Unit"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="att1.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="split" compatibility="7.5.001" expanded="true" height="82" name="Split" width="90" x="380" y="34"/>
<connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>1
Answers
In this particular case I would use an Open File operator and a Read CSV operator. The Read CSV allows you to comment out the lines you don't want. See below:
Hi Thomas,
I didn't realise I can read an online CSV directly with this operator - that simplifies the process.
The process works except it only returns the volume column (there are also close, high, low, open and volume).
I tried to add the rest of the columns in the data set meta information, but I get the error "the columns does not exist in input data." It must have something to do with how RapidMiner understands the file. Does RapidMiner require column names separated by commas above the example set? I can't find a way to define the columns.
From what I see if I paste the URL into the browser is that it shows the time series data in CSV format but when selected the comma as delimeter it only shows the Volume. Hmm you might have to use the REGEX option. What does the API documentaiton say about importing this data?
Is an XML data export option available? You might want to try that if it's available. If yes, try the Read XML operator then.
Ok Thomas,
The data is available as xml: https://chartapi.finance.yahoo.com/instrument/1.0/AAPL/chartdata;type=quote;range=1d/xml
I will look into it.
Could there be an issue with how RapidMiner handles data using the Open File (URL) / Read CSV?
I tried using another data source:
http://www.google.com/finance/getprices?q=AAPL&i=300&i=300&p=100d&f=d,c,h,l,o,v
But I have the same problem with only the first column being selected while the others are left out and not recognized.
It seems like even though I rule out the first rows using annotations, RapidMiner still remembers that the first row has only one column, and sticks with that
Thanks Thomas, that works great!
According to my sources these are live data. I checked the timestamp and it is todays date. Counting minutes from the timestamp it looks like live data to me.
This is the link for minute data (i=xx is seconds):
http://www.google.com/finance/getprices?q=AAPL&i=60&p=1d&f=d,c,h,l,o,v
The previous one was 5 minute intervals.
Thanks again!
Yeah Google and Yahoo get cranky when you start extracting live data from them so they make it hard to get it. Good luck!