Loading large data
Hello,
An image dataset.
1250 features, 2000 positive, 2000 negative examples.
In .mat format 32MB.
In ASCII .csv format 58MB
Every time I start my RM process, this dataset takes 23 seconds to load.
Is there any way to keep the dataset cached?
Also I'd like to cache my PCA.
Or cache the transformed dataset.
PCA in Matlab takes about 30 seconds; PCA in RapidMiner takes about 3 minutes.
That is a factor of 6.
Why is Matlab faster?
Regards,
Wessel Luijben
Answers
There are several possible reasons why the Matlab format is smaller: it might store the data in a binary format, or it might use a less precise decimal representation. If you want a small RapidMiner format, you could store it in binary using the IOObjectWriter, or wait for RapidMiner 5.
No, there is currently no way to cache the data. Although caching sounds easy, it is not: the data is modified during the process, so the cached version might get corrupted. To avoid this, you would need to keep a complete copy in memory, which would cause memory problems especially on larger data sets, where a cache would be most helpful.
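The trade-off can be sketched in a few lines. This is a hypothetical Python illustration, not RapidMiner code; `load_cached` and the loader callback are made-up names for the sake of the example:

```python
import copy

_cache = {}

def load_cached(path, loader):
    """Return a dataset from the cache, loading it on first access.

    A deep copy is handed out on every call, so downstream steps
    can modify their copy without corrupting the cached master.
    This is exactly the memory cost described above: the master
    copy stays resident in RAM for the lifetime of the cache.
    """
    if path not in _cache:
        _cache[path] = loader(path)
    return copy.deepcopy(_cache[path])
```

The first call pays the full load cost; later calls pay only the copy cost, in exchange for permanently holding the master copy in memory.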
To speed up loading, there are two possibilities: work only on a subset of the data during process design and use the full data only for the final run, or simply use RapidMiner's binary format, which is not guaranteed to be compatible with future versions but should be faster and is nice for temporary copies.
Matlab is written in C as far as I know, which gives it a fair performance boost over a non-natively-compiled Java program like RapidMiner. On the other hand, Java runs on nearly every platform available, even on my mobile phone...
Besides this, there are many different algorithms for calculating the eigenvectors and eigenvalues needed for PCA. Chances are that Matlab uses a highly tuned and optimized algorithm. Feel invited to adapt such an algorithm for RapidMiner; we would gratefully include it in the core.
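For reference, the algorithmic core both tools share is an eigendecomposition of the covariance matrix. A minimal NumPy sketch (assuming examples in rows and features in columns; the choice of eigensolver is exactly where the speed differences come from):

```python
import numpy as np

def pca_transform(X, k):
    # Center each feature
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the features
    cov = np.cov(Xc, rowvar=False)
    # eigh exploits the symmetry of the covariance matrix;
    # a highly tuned solver at this step is what makes one
    # implementation faster than another
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the k components with the largest eigenvalues
    top = np.argsort(eigvals)[::-1][:k]
    return Xc @ eigvecs[:, top]
```

For 1250 features the covariance matrix is 1250 x 1250, so the eigensolver dominates the runtime regardless of the number of examples.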
Greetings,
Sebastian
That is the opposite result
No, I did not make that mistake; I did use output_type = Binary!
This is the dataset I used: http://77.93.77.78/download/MilkDataJoosten.csv
MilkDataJoosten.csv 54.4 MB (57.064.974 bytes) <-- load time 15s
big.ioo 137 MB (144.474.351 bytes) <-- load time 6s
Surprisingly, big.ioo does load faster!
<operator name="Root" class="Process" expanded="yes">
<operator name="CSVExampleSource" class="CSVExampleSource">
<parameter key="filename" value="D:\wessel\Desktop\MilkDataJoosten.csv"/>
<parameter key="column_separators" value=";"/>
</operator>
<operator name="IOObjectWriter" class="IOObjectWriter">
<parameter key="object_file" value="D:\wessel\Desktop\big.ioo"/>
<parameter key="io_object" value="ExampleSet"/>
<parameter key="output_type" value="Binary"/>
</operator>
<operator name="IOObjectReader" class="IOObjectReader" breakpoints="after">
<parameter key="object_file" value="D:\wessel\Desktop\big.ioo"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
</operator>
The result isn't very surprising if you take the internal encoding into account. If you create a data table in RapidMiner, its values are stored in double arrays, even the nominal ones, which are mapped from an index to a string. By the way, whether double, float or integer is used can be set with the appropriate parameter of the loading operator.
Your CSV file contains mainly integer values, for example "43". This is represented by two characters plus one separator character, hence 3 bytes. Each double consumes 8 bytes, so this increases the needed memory. Additionally, you have several missing values, which take up 1 byte in the CSV but are likewise represented by 8 bytes. More memory is used for holding all the examples together and for storing additional information like statistics and so on. All of this is written out when you select the binary format.
But why does it load faster? Simply because Java only has to read the file and put it directly into memory. No parsing, interpreting, object creation, or repeated memory allocation is needed.
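Both effects, the larger on-disk size and the faster read, can be seen in a small Python illustration. Here `struct` merely stands in for RapidMiner's binary dump; the 8-byte double matches Java's:

```python
import struct

values = [43.0, 7.0, 128.5]

# Text encoding: "43" is two characters plus a separator,
# about 3 bytes, but reading it back means parsing every field.
text = ";".join(f"{v:g}" for v in values)
parsed = [float(s) for s in text.split(";")]

# Binary encoding: a fixed 8 bytes per double (as in Java),
# read back by a plain memory copy with no parsing at all.
blob = struct.pack(f"{len(values)}d", *values)
restored = list(struct.unpack(f"{len(blob) // 8}d", blob))
```

So a file full of small integers grows on disk when stored as raw doubles (3 bytes of text vs. 8 bytes of double per value), yet still loads faster, which matches the MilkDataJoosten.csv vs. big.ioo measurement above.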
Greetings,
Sebastian
Your explanation is 100% clear.
Let me just add this:
Reading text files like CSV is always a pain, and using Java serialization or XML serialization is even worse. That is what happens when you use the IOObjectWriter in 4.x.
In 5, there is a custom serialization method for example sets, which should speed this up significantly.
Cheers,
Simon
In RM 5, it appears that Process / Validate Automatically is enabled by default and cannot be configured to stay off (at least I haven't found how). I have a source with relatively slow DB queries, and the validation process seems to run them. And I have a short memory for switching auto-validation off before it hits me...
Greetings - Stefan
I fixed this. Validate Automatically now remembers its state.
Cheers,
Simon
Why did you choose this format? It seems that repositories cannot hold e.g. CSV files.
What is the meaning of the CONTENT files in a repository folder?
How can I write such files with an external program? Can you provide documentation for that format?