"Importing a time series from CSV - lots of funky problems"

EWH · August 2011

Importing a time series from CSV - lots of funky problems
I have a moderately large foreign exchange time series .csv file that I am trying to import. I had gotten frustrated trying to do this in RM 5.0 beta, but I decided to give it another try, updated to 5.1 with the time series extension and used the old repository. This is a EURUSD file with 1 minute data going back about 10 years (over 3.4 million records, 135MB with 7 fields in the standard format: date, time, open, high, low, close and volume). The date and time were nominal type. I tried to pre-process the data, converting the first two columns to date and time type. RM for some reason decided it needed over 10x the actual file size in RAM, hung, crashed, and so forth. (How are people with big time series such as hours of 96kHz vibration data expected to use this? In my admittedly slight experience R is a lot better with memory and orders of magnitude faster.) So I tried using a "cut last" as the first step (100000) and it gave me an error about the simple example set format being incompatible with time series (even with breakpoint on the cut last function), and a "WARNING: Using deprecated example set stream version 1"

So I decided to re-import using the wizard, though I couldn't see why or even how the file formats could affect such a simple case.
The .csv import does not seem to work properly. The default seems to be semicolon separators, which is a strange choice for a comma delimited format, but that's no big deal. Trying to get the date to import as a date in format "yyyy.MM.dd", it turns dates into date-times, all with 00:00:00 EDT. One can only set one format, so there is no way to specify the time column's format once the date format is set. . If the time format is specified instead (first setting date back to nominal), the times are all also converted to date-times, with a "Thu Jan 01 HH:mm:ss EST 1970" format, when HH:mm is specified as the Date format. There seems to be no capability to combine the date and time into one temporal index, at least without using a different program to preprocess the data. After trying out a few type options, all the data disappears from the wizard preview and one must start over. (I can't seem to get it to do it a fourth time. This program is clearly trying to drive me insane. AHA - it happens after trying to reload the data after an error, described below.)

The initial guess is that the date column is polynominal and the time is binominal. With the 100 record preview, that works (though it's certainly strange that a program such as this can't guess date and time formats), but loading more data throws an error - apparently 01:04 occurs more than once in the column, so it just displays all "?" for that column. Now why having the same value twice should be a problem, I don't know, nor why it hit on 01:04 as an example when 01:02 is the first repeat, but the next reload (more than 100 rows) causes all the preview data to disappear. ???

Added:
(The forum software has some issues, too - I push prieview, my login was timed out. It tried to eat my post, but I re-login, go back, there's my text. Now it will preview, but not post, and doesn't prompt for a login. Copy all, go back, restart topic from scratch... sigh. Edit: and now I find it did post the first time, but just wouldn't admit it. And I can't delete it despite there being a button supposedly for that purpose. Double sigh.)

awchisholm · August 2011

Hello EWH

Importing can indeed sometimes be "fun". Could you post a sufficiently large snapshot of the data that exhibits the problem as well as the XML of your process? I can't promise success but I might be able to make a suggestion or two.

regards

Andrew

wessel · August 2011

Hey,

I managed to load 10GB (and larger) time series datasets in Rapd Miner.

Here is a tip that may help you:
Find your data attribute in "Read CSV" > 'data set meta data information'
Change the type to "date_time", not "date".

Best regards,

Wessel

EWH · August 2011

Thanks, Wessel, I'll try that.

Andrew:
The problem with the import process using the old data already in a repository was I think mostly due to the large file, and I have been trying other things with the import process so that original XML is gone, anyway.

The .csv wizard import problem happens with a smaller set and isn't dependent on any xml, so I have attached the first 4096 records from the csv file. The time format has changed due to using OpenOffice Calc to trim the file down. See data below.

Steps to reproduce:
Try to import EURUSD1a.csv with the wizard.
2nd screen - Column separator = comma
uncheck "ignore errors"
skip 3rd screen (annotation)
4th screen:
date format = "yyy.MM.dd"
set 1st column to date type
reload data
set 2nd column to time type
reload data
all data has disappeared. The wizard must be restarted. If you continue to try to export without restarting, the repository file will only be a few hundred bytes with no real data.

2nd problem:
restart the wizard
perform the same steps until you get to the fourth screen of the wizard.
uncheck "Preview uses only first 100 rows"
uncheck "ignore errors"
Reload data
(close error popup)
reload data again
All data has disappeared.

***
As far as my plan, If I can't make Wessel's suggestion work for me, I plan to use OpenOffice Calc to split the file up into 65536 record chunks, change the file extensions to .txt, use the regexp search and replace in OO Writer to merge the date and time fields, rename to *.csv, then import the files individually (over 2 months of data in a file isn't too bad, and they'll load at least 50x faster.)

***
Well, I don't see an option to attach files. [glow=red,2,300]Glow[/glow] and [move]scrolling marquees[/move] insert Flash, lots of animated emoticons ::) ... if I had a server, I suppose the www or ftp links would work, but it sure seems like a fundamental oversight not to provide any way to attach a file. Maybe I'm missing something.

These edited lines should reproduce the problem:

"1999.10.01","01:02","1.0679","1.0680","1.0679","1.0680","2"
"1999.10.01","01:03","1.0682","1.0682","1.0681","1.0681","2"
"1999.10.01","01:04","1.0683","1.0684","1.0681","1.0681","4"
"1999.10.01","01:05","1.0682","1.0682","1.0682","1.0682","1"
"1999.10.01","01:06","1.0680","1.0680","1.0680","1.0680","1"
"1999.10.01","01:09","1.0681","1.0681","1.0680","1.0680","2"
"1999.10.01","01:17","1.0682","1.0682","1.0682","1.0682","1"
"1999.10.01","01:18","1.0680","1.0680","1.0680","1.0680","1"
"1999.10.01","01:24","1.0681","1.0681","1.0681","1.0681","1"
"1999.10.01","01:25","1.0679","1.0681","1.0679","1.0681","3"
"1999.10.01","01:30","1.0678","1.0678","1.0676","1.0677","4"
"1999.10.01","01:31","1.0678","1.0678","1.0678","1.0678","1"
"1999.10.01","01:34","1.0677","1.0679","1.0677","1.0679","3"
"1999.10.01","01:35","1.0678","1.0678","1.0678","1.0678","1"
"1999.10.01","01:36","1.0679","1.0679","1.0679","1.0679","1"
"1999.10.01","01:37","1.0676","1.0676","1.0676","1.0676","1"
"1999.10.01","01:38","1.0675","1.0677","1.0675","1.0677","3"
"1999.10.01","01:39","1.0676","1.0676","1.0673","1.0674","4"
"1999.10.01","01:42","1.0673","1.0673","1.0673","1.0673","1"
"1999.10.01","01:43","1.0674","1.0674","1.0673","1.0673","2"
"1999.10.01","01:46","1.0674","1.0674","1.0674","1.0674","1"
"1999.10.01","01:49","1.0673","1.0673","1.0673","1.0673","1"
"1999.10.01","01:51","1.0672","1.0672","1.0672","1.0672","1"
"1999.10.01","01:52","1.0671","1.0672","1.0671","1.0672","2"
"1999.10.01","01:54","1.0671","1.0671","1.0671","1.0671","1"
"1999.10.01","01:59","1.0666","1.0671","1.0637","1.0638","30"
"1999.10.01","02:00","1.0637","1.0638","1.0636","1.0636","4"
"1999.10.01","02:01","1.0638","1.0638","1.0638","1.0638","1"
"1999.10.01","02:25","1.0670","1.0673","1.0670","1.0671","4"
"1999.10.01","02:26","1.0669","1.0671","1.0669","1.0671","3"
"1999.10.01","02:28","1.0672","1.0672","1.0672","1.0672","1"
"1999.10.01","02:30","1.0673","1.0673","1.0673","1.0673","1"
"1999.10.01","02:31","1.0670","1.0670","1.0670","1.0670","1"
"1999.10.01","02:32","1.0672","1.0672","1.0669","1.0670","5"
"1999.10.01","02:33","1.0670","1.0670","1.0670","1.0670","1"
"1999.10.01","02:35","1.0669","1.0669","1.0669","1.0669","1"
"1999.10.01","02:38","1.0670","1.0671","1.0670","1.0670","4"
"1999.10.01","02:39","1.0671","1.0671","1.0671","1.0671","1"
"1999.10.01","02:40","1.0672","1.0672","1.0670","1.0670","3"
"1999.10.01","02:41","1.0671","1.0671","1.0671","1.0671","1"
"1999.10.01","02:42","1.0671","1.0671","1.0670","1.0670","2"
"1999.10.01","02:43","1.0670","1.0670","1.0670","1.0670","1"
"1999.10.01","02:48","1.0668","1.0668","1.0668","1.0668","1"
"1999.10.01","02:49","1.0670","1.0670","1.0670","1.0670","1"
"1999.10.01","02:51","1.0670","1.0671","1.0670","1.0671","2"
"1999.10.01","02:52","1.0669","1.0671","1.0669","1.0670","4"
"1999.10.01","03:00","1.0669","1.0670","1.0669","1.0670","2"
"1999.10.01","03:04","1.0672","1.0674","1.0672","1.0674","3"
"1999.10.01","03:05","1.0673","1.0675","1.0673","1.0673","5"
"1999.10.01","03:09","1.0673","1.0673","1.0673","1.0673","1"
"1999.10.04","01:01","1.0714","1.0714","1.0712","1.0714","5"
"1999.10.04","01:02","1.0713","1.0713","1.0712","1.0712","2"
"1999.10.04","01:04","1.0713","1.0713","1.0713","1.0713","1"
"1999.10.04","01:15","1.0712","1.0712","1.0712","1.0712","1"
"1999.10.04","01:21","1.0711","1.0711","1.0711","1.0711","1"
"1999.10.04","01:26","1.0712","1.0712","1.0712","1.0712","1"
"1999.10.04","01:29","1.0713","1.0713","1.0713","1.0713","1"
"1999.10.04","01:33","1.0715","1.0715","1.0713","1.0714","4"
"1999.10.04","01:34","1.0714","1.0714","1.0713","1.0714","3"
"1999.10.04","01:36","1.0715","1.0715","1.0714","1.0714","2"
"1999.10.04","01:37","1.0715","1.0717","1.0715","1.0717","4"

JEdward · August 2011

It does look as though it's a bug in the wizard.
To save you doing the conversion in OpenOffice you could do it like this.
I've added more operators than is probably necessary, but hey I like to complicate things. ;D

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
    <process expanded="true" height="460" width="614">
      <operator activated="true" class="read_csv" compatibility="5.1.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="75">
        <parameter key="csv_file" value="C:\Desktop\EURUSD1a.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <parameter key="encoding" value="UTF-8"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="att1.true.attribute_value.attribute"/>
          <parameter key="1" value="att2.true.attribute_value.attribute"/>
          <parameter key="2" value="att3.true.real.attribute"/>
          <parameter key="3" value="att4.true.real.attribute"/>
          <parameter key="4" value="att5.true.real.attribute"/>
          <parameter key="5" value="att6.true.real.attribute"/>
          <parameter key="6" value="att7.true.integer.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="store" compatibility="5.1.008" expanded="true" height="60" name="Store" width="90" x="179" y="30">
        <parameter key="repository_entry" value="CSVStore"/>
      </operator>
      <operator activated="true" class="free_memory" compatibility="5.1.008" expanded="true" height="76" name="Free Memory" width="90" x="313" y="30"/>
      <operator activated="true" class="retrieve" compatibility="5.1.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="165">
        <parameter key="repository_entry" value="CSVStore"/>
      </operator>
      <operator activated="true" class="nominal_to_date" compatibility="5.1.008" expanded="true" height="76" name="Nominal to Date" width="90" x="179" y="120">
        <parameter key="attribute_name" value="att1"/>
        <parameter key="date_format" value="yyyy.MM.dd"/>
        <parameter key="locale" value="English (United Kingdom)"/>
      </operator>
      <operator activated="true" class="nominal_to_date" compatibility="5.1.008" expanded="true" height="76" name="Nominal to Date (2)" width="90" x="313" y="165">
        <parameter key="attribute_name" value="att2"/>
        <parameter key="date_type" value="time"/>
        <parameter key="date_format" value="HH:mm"/>
        <parameter key="locale" value="English (United Kingdom)"/>
      </operator>
      <operator activated="true" class="store" compatibility="5.1.008" expanded="true" height="60" name="Store (2)" width="90" x="45" y="255">
        <parameter key="repository_entry" value="CSVStore"/>
      </operator>
      <operator activated="true" class="free_memory" compatibility="5.1.008" expanded="true" height="76" name="Free Memory (2)" width="90" x="179" y="255"/>
      <operator activated="true" class="retrieve" compatibility="5.1.008" expanded="true" height="60" name="Retrieve (2)" width="90" x="313" y="345">
        <parameter key="repository_entry" value="CSVStore"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_op="Free Memory" to_port="through 1"/>
      <connect from_op="Retrieve" from_port="output" to_op="Nominal to Date" to_port="example set input"/>
      <connect from_op="Nominal to Date" from_port="example set output" to_op="Nominal to Date (2)" to_port="example set input"/>
      <connect from_op="Nominal to Date (2)" from_port="example set output" to_op="Store (2)" to_port="input"/>
      <connect from_op="Store (2)" from_port="through" to_op="Free Memory (2)" to_port="through 1"/>
      <connect from_op="Retrieve (2)" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

awchisholm · August 2011

I did something similar using generate attributes...


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
    <process expanded="true" height="296" width="681">
      <operator activated="true" class="read_csv" compatibility="5.1.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="csv_file" value="C:\temp\wiz.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <parameter key="locale" value="English (United Kingdom)"/>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="att1.true.text.attribute"/>
          <parameter key="1" value="att2.true.text.attribute"/>
          <parameter key="2" value="att3.true.real.attribute"/>
          <parameter key="3" value="att4.true.real.attribute"/>
          <parameter key="4" value="att5.true.real.attribute"/>
          <parameter key="5" value="att6.true.real.attribute"/>
          <parameter key="6" value="att7.true.integer.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="5.1.008" expanded="true" height="76" name="Generate Attributes" width="90" x="179" y="30">
        <list key="function_descriptions">
          <parameter key="dateTime" value="date_parse_custom(att1+&quot; &quot;+att2, &quot;yyyy.MM.dd HH:mm&quot;, &quot;gb&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.1.008" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|att2|att1"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

regards

Andrew

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

"Importing a time series from CSV - lots of funky problems"

Answers