Loading large data
Hello,
An image dataset.
1250 features, 2000 positive, 2000 negative examples.
In .mat format 32MB.
In ASCII .csv format 58MB
Every time I start my RM process, this dataset takes 23 seconds to load.
Is there any way to keep the dataset cached?
Also I'd like to cache my PCA.
Or cache the transformed dataset.
PCA in Matlab takes about 30 seconds; PCA in RapidMiner takes about 3 minutes.
That is a factor of 6.
Why is Matlab faster?
Regards,
Wessel Luijben
Answers
There are several possible reasons why the Matlab format is smaller: it might store the data in a binary format, or it might use a less precise decimal representation. If you want a small RapidMiner format, you could store it in binary using the IOObjectWriter, or wait for RapidMiner 5.
No, there is currently no way to cache the data. Although caching sounds easy, it is not: the data is modified during the process, so the cached version might get corrupted. To avoid this, you would need to keep a complete copy in memory, which would cause memory problems especially on larger data sets, where a cache would be most helpful.
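The trade-off can be sketched in a few lines. This is a hypothetical Python illustration, not RapidMiner code; `load_cached` and the loader callback are made-up names for the sake of the example:

```python
import copy

_cache = {}

def load_cached(path, loader):
    """Return a dataset from the cache, loading it on first access.

    A deep copy is handed out on every call, so downstream steps
    can modify their copy without corrupting the cached master.
    This is exactly the memory cost described above: the master
    copy stays resident in RAM for the lifetime of the cache.
    """
    if path not in _cache:
        _cache[path] = loader(path)
    return copy.deepcopy(_cache[path])
```

The first call pays the full load cost; later calls pay only the copy cost, in exchange for permanently holding the master copy in memory.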
To speed up loading, there are two possibilities: work only on a subset of the data during process design and use the full data only for the final run, or simply use RapidMiner's binary format, which is not guaranteed to be compatible with future versions but should be faster and is nice for temporary copies.
Matlab is written in C as far as I know, which gives it a fair performance boost over a non-natively-compiled Java program like RapidMiner. On the other hand, Java runs on nearly every platform available, even on my mobile phone...
Besides this, there are many different algorithms for calculating the eigenvectors and eigenvalues needed for PCA. Chances are that Matlab uses a highly tuned and optimized algorithm. Feel invited to adapt such an algorithm for RapidMiner; we would gratefully include it in the core.
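For reference, the algorithmic core both tools share is an eigendecomposition of the covariance matrix. A minimal NumPy sketch (assuming examples in rows and features in columns; the choice of eigensolver is exactly where the speed differences come from):

```python
import numpy as np

def pca_transform(X, k):
    # Center each feature
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the features
    cov = np.cov(Xc, rowvar=False)
    # eigh exploits the symmetry of the covariance matrix;
    # a highly tuned solver at this step is what makes one
    # implementation faster than another
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the k components with the largest eigenvalues
    top = np.argsort(eigvals)[::-1][:k]
    return Xc @ eigvecs[:, top]
```

For 1250 features the covariance matrix is 1250 x 1250, so the eigensolver dominates the runtime regardless of the number of examples.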
Greetings,
Sebastian
That is the opposite result
No, I did not make that mistake; I did use output_type = Binary!
This is the dataset I used: http://77.93.77.78/download/MilkDataJoosten.csv
MilkDataJoosten.csv 54.4 MB (57.064.974 bytes) <-- load time 15s
big.ioo 137 MB (144.474.351 bytes) <-- load time 6s
Surprisingly, big.ioo does load faster!
<operator name="Root" class="Process" expanded="yes">
<operator name="CSVExampleSource" class="CSVExampleSource">
<parameter key="filename" value="D:\wessel\Desktop\MilkDataJoosten.csv"/>
<parameter key="column_separators" value=";"/>
</operator>
<operator name="IOObjectWriter" class="IOObjectWriter">
<parameter key="object_file" value="D:\wessel\Desktop\big.ioo"/>
<parameter key="io_object" value="ExampleSet"/>
<parameter key="output_type" value="Binary"/>
</operator>
<operator name="IOObjectReader" class="IOObjectReader" breakpoints="after">
<parameter key="object_file" value="D:\wessel\Desktop\big.ioo"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
</operator>
The result isn't very surprising if you take the internal encoding into account. If you create a data table in RapidMiner, its values are stored in double arrays, even the nominal ones, which are mapped from an index to a string. By the way, whether double, float or integer is used can be set with the appropriate parameter of the loading operator.
Your CSV file contains mainly integer values, for example "43". This is represented by two characters plus one separator character, hence 3 bytes. Each double consumes 8 bytes, so this increases the needed memory. Additionally, you have several missing values, which take up 1 byte in the CSV but are likewise represented by 8 bytes. More memory is used for holding all the examples together and for storing additional information like statistics and so on. All of this is written out when you select the binary format.
But why does it load faster? Simply because Java only has to read the file and put it directly into memory. No parsing, interpreting, object creation, or repeated memory allocation is needed.
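Both effects, the larger on-disk size and the faster read, can be seen in a small Python illustration. Here `struct` merely stands in for RapidMiner's binary dump; the 8-byte double matches Java's:

```python
import struct

values = [43.0, 7.0, 128.5]

# Text encoding: "43" is two characters plus a separator,
# about 3 bytes, but reading it back means parsing every field.
text = ";".join(f"{v:g}" for v in values)
parsed = [float(s) for s in text.split(";")]

# Binary encoding: a fixed 8 bytes per double (as in Java),
# read back by a plain memory copy with no parsing at all.
blob = struct.pack(f"{len(values)}d", *values)
restored = list(struct.unpack(f"{len(blob) // 8}d", blob))
```

So a file full of small integers grows on disk when stored as raw doubles (3 bytes of text vs. 8 bytes of double per value), yet still loads faster, which matches the MilkDataJoosten.csv vs. big.ioo measurement above.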
Greetings,
Sebastian
Your explanation is 100% clear.
Let me just add this:
Reading text files like CSV is always a pain, and using Java serialization or XML serialization is even worse. That is what happens when you use the IOObjectWriter in 4.x.
In 5, there is a custom serialization method for example sets, which should speed this up significantly.
Cheers,
Simon
In RM 5, it appears that Process / Validate Automatically is enabled by default and cannot be configured to stay off (at least I haven't found how). I have a source with relatively slow DB queries, and the validation process seems to run them. And I have a short memory for switching auto-validation off before it hits me...
Greetings - Stefan
I fixed this. Validate Automatically now remembers its state.
Cheers,
Simon
Why did you choose this format? It seems that repositories cannot hold e.g. CSV files.
What is the meaning of the CONTENT files in a repository folder?
How can I write such files with an external program? Can you provide documentation for that format?