Working with large data sets
I have to work with a large data set, but I'm not sure whether RapidMiner supports very large amounts of data (say, more than a few hundred thousand records). Is there anywhere I can find a reference on performance and the maximum number of records the tool can handle?
Also, is there any data input format that is particularly recommended for large data sets?
Thanks.
Answers
Try playing around with the example generator operators, like the massive data generator. It is probably also worth pointing out that other things affect performance, like the OS, memory, and of course what you plan to do with the data; this also means that finding "a reference on performance and the maximum number of records the tool can handle" may be impossible. After all, how long is the longest piece of string?
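To put that in concrete terms, here is a back-of-the-envelope sketch of why main memory, rather than a fixed record limit, is the real constraint. The 8 bytes per numeric cell and the per-row overhead are assumptions for illustration, not RapidMiner internals:

```python
# Rough estimate of the memory a dense in-memory table needs.
# bytes_per_cell (double precision) and row_overhead are assumed
# values for illustration, not RapidMiner internals.

def estimate_memory_mb(rows: int, attributes: int,
                       bytes_per_cell: int = 8,
                       row_overhead: int = 16) -> float:
    """Rough lower bound on memory for a dense numeric example set."""
    total_bytes = rows * (attributes * bytes_per_cell + row_overhead)
    return total_bytes / (1024 ** 2)

# A few hundred thousand records with 50 numeric attributes:
print(f"{estimate_memory_mb(500_000, 50):.0f} MB")  # ~198 MB
```

So a few hundred thousand records is usually well within reach of an ordinary machine; what you then do with the data (joins, model training, cross-validation) is what drives the actual footprint up.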
haddock is absolutely right: there is no general answer to the maximum amount of data RapidMiner can handle. Since many data mining model calculations are done in memory, the amount of main memory is one of the most important factors restricting the amount of data available for modeling. However, for certain model types, as well as for most preprocessing tasks, you can process the data in batches, and then you will have hardly any limitation at all as long as the data is stored in a database. Just to give you an idea: I am currently working on a project with about 700,000 items, and we perform a lot of preprocessing there and the modeling also runs smoothly. The largest database I remember working on in a customer project contained more than 30 million records, and everything worked well in that project too, at least if you know what you are doing.

As for a recommended input format, that can easily be answered: read your data from databases (real relational ones, not Access etc.) and store the results there as well. That way you can work in batches that fit into memory during preprocessing, and for certain models even during modeling; a minimal sketch of the batch pattern follows below. This is, by the way, always possible for scoring, i.e. applying a prediction model to large amounts of data.
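To make the batch idea concrete, here is a minimal Python sketch of chunk-wise preprocessing straight from a relational database, so only one chunk is ever held in memory. The SQLite file, table, and column names are hypothetical, and pandas stands in for whichever tool actually does the work; the same pattern applies to any DB-API connection:

```python
import sqlite3

import numpy as np
import pandas as pd

conn = sqlite3.connect("warehouse.db")  # hypothetical database

# Stream the table in chunks of 100,000 rows instead of loading it whole.
for chunk in pd.read_sql_query("SELECT * FROM transactions", conn,
                               chunksize=100_000):
    # Row-wise transforms are safe per batch; global statistics
    # (means, standard deviations, ...) would need an extra pass first.
    chunk["amount_log"] = np.log1p(chunk["amount"])
    chunk.to_sql("transactions_prepped", conn,
                 if_exists="append", index=False)

conn.close()
```

The design point is that each batch is read, transformed, and written back before the next one is fetched, so memory use stays constant no matter how large the table grows.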
Cheers,
Ingo
Since I had no idea how the tool processed the data, I was just looking for a general idea of whether the robustness of the tool could be a limiting factor.
But you both had some good points there and you were very helpful. Thanks a lot!
So the conclusion could be to leave out those algorithms from RapidMiner that cannot cope with massive amounts of data and keep only those which can, just to make things more robust. We do not like that, simply because not every data set has an enormous size, and why should we restrict ourselves to a stripped-down set of the data mining algorithms we already have? So we decided on an everything-goes policy and moved the decision about the correct and robust analysis process from the tool to the user, which in my opinion is the only place it belongs. The only thing we can do is try to support users in those decisions, for which the quick fixes in RapidMiner 5 are a first step and which will certainly be extended in future versions.
Another interesting side note: especially in classification settings, the amount of unlabeled data is most often very large, while the amount of labeled data used for modeling is much smaller. Thanks to the preprocessing models and the looping operators of RapidMiner, the preprocessing and model application (scoring) of the unlabeled data can be done in batches, and there is no problem at all; a minimal sketch of this batch-scoring pattern follows below. And even if the amount of labeled data is large, it is most often not the wisest thing to use all of it. At least on sufficiently powerful compute servers with a nice amount of memory, it is running time that starts to restrict applicability, no longer memory or robustness. And again, I would argue that such problems are rooted in the design of the analysis process and are not actually a problem of the selected tool (RapidMiner or otherwise).
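As a minimal sketch of that batch-scoring pattern (all names hypothetical; scikit-learn stands in for whichever modeling tool is used), assuming a classifier was already trained on the smaller labeled sample and saved to disk:

```python
import sqlite3

import pandas as pd
from joblib import load

# Hypothetical model, trained earlier on the labeled sample.
model = load("churn_model.joblib")
conn = sqlite3.connect("warehouse.db")

# Score the (large) unlabeled table chunk by chunk and write results back,
# so memory use stays constant no matter how many records there are.
for batch in pd.read_sql_query("SELECT id, f1, f2, f3 FROM unlabeled", conn,
                               chunksize=50_000):
    proba = model.predict_proba(batch[["f1", "f2", "f3"]])[:, 1]
    pd.DataFrame({"id": batch["id"], "score": proba}).to_sql(
        "scores", conn, if_exists="append", index=False)

conn.close()
```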
Just some additional thoughts some of you might find interesting.
Cheers,
Ingo