What is the maximum number of instances handled by Rapidminer?
Hi all,
I have been using RapidMiner on Windows for 2-3 months now and have been very happy with the features and analysis tools it provides. Until now I have been using the feature selection operator on a dataset with 300 instances and ~300-400 variables, and it gives me good results. Recently I increased my dataset to 1000 instances, but since then I have been getting "Out of memory" errors and the process stops. I am at the last leg of my analysis, so it's kind of a dampener to get these errors now.
I even tried increasing the memory for Java using the -Xmx option, but no success, so if anyone has ideas or suggestions to solve my problem, please let me know.
Thanks,
Emma
Answers
I already worked with RapidMiner with over 8000 attributes and 25000 examples without getting out of memory. I have to admit that it needed 8 GB of RAM, but it worked flawlessly under XP 64 using an x64 Java. So I'm a little bit surprised by your problem.
Did the memory monitor reflect the changes of the -Xmx parameter? Did you have more memory available before the exception?
Do you use any other memory-consuming operators within your process, like SVMs or PCA?
Greetings,
Sebastian
There is no upper bound on the number of instances - at least not in principle, i.e. if the data storage is done appropriately. We often work with databases having hundreds of millions of tuples without any problem (but of course this will not work for all processes - feature selection might be a problem here).
Could you please post your process (from the XML tab) here? I could probably give some suggestions on how to tune your feature selection process so that it works.
Cheers,
Ingo
The Windows computer in my lab has 1 GB RAM, and I used the -Xmx512m and -Xmx1024m options. In both cases the memory monitor reflected the changes. Also, the error messages included comments like "exceeded maximum heap size".
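For a sense of scale, the raw table itself is probably not the problem. A minimal back-of-the-envelope sketch (assuming each cell is stored as an 8-byte double, which understates RapidMiner's real per-cell overhead) puts the 1000 x 400 example set at only a few megabytes, so the memory pressure more likely comes from the feature selection loop itself: many candidate attribute sets, cross-validated models, and the complete models kept by create_complete_model. Note also that -Xmx1024m on a 1 GB machine leaves nothing for the OS, so the JVM may never actually obtain that heap.

```python
def dataset_mb(instances, attributes, bytes_per_value=8):
    """Rough lower bound on an example set's memory footprint in MB,
    assuming plain 8-byte doubles per cell (real overhead is higher)."""
    return instances * attributes * bytes_per_value / 1024 ** 2

print(dataset_mb(1000, 400))  # ~3 MB: the table itself is tiny
```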
As for the feature selection process I am using, the final goal is to attain a binary classification based on a decision tree. The process used is as follows:
<operator name="Root" class="Process" expanded="yes">
  <operator name="ExampleSource" class="ExampleSource">
    <parameter key="attributes" value="C:\Program Files\Rapid-I\RapidMiner-4.2\info_gain"/>
  </operator>
  <operator name="FeatureSelection" class="FeatureSelection" expanded="yes">
    <operator name="FSChain" class="OperatorChain" expanded="yes">
      <operator name="XValidation" class="XValidation" breakpoints="after" expanded="yes">
        <parameter key="average_performances_only" value="false"/>
        <parameter key="create_complete_model" value="true"/>
        <operator name="DecisionTree" class="DecisionTree">
        </operator>
        <operator name="ApplierChain" class="OperatorChain" expanded="yes">
          <operator name="Applier" class="ModelApplier">
            <list key="application_parameters">
            </list>
          </operator>
          <operator name="Evaluator" class="Performance">
          </operator>
        </operator>
      </operator>
      <operator name="ProcessLog" class="ProcessLog">
        <parameter key="filename" value="C:\Documents and Settings\emma\My Documents\rm_workspace\error.log"/>
        <list key="log">
          <parameter key="generation" value="operator.FS.value.generation"/>
          <parameter key="performance" value="operator.FS.value.performance"/>
        </list>
      </operator>
    </operator>
  </operator>
</operator>
Thanks again,
Emma
Thanks for your comments.
Ignacio
@Emma => One idea from my side: since you are using DecisionTree with InformationGain as the splitting criterion, I suggest using the operator "InfoGainWeighting". This one calculates the weight of each feature according to information gain, as if that feature were the first one used for splitting.
Then you can either use...
- WeightGuidedFeatureSelection instead of FeatureSelection
- AttributeWeightsSelection if you want to preselect some attributes. In this case I recommend keeping the attributes at the upper end by using top k
Hope this was helpful,
Steffen
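For intuition, here is a minimal Python sketch of what an InfoGainWeighting-style weighting followed by a top-k selection does. It handles categorical features and labels only; the function names and toy data are illustrative assumptions, not RapidMiner's API:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain of one categorical feature w.r.t. the labels,
    i.e. its worth as the first split of a decision tree."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def top_k_features(table, labels, k):
    """Weight every feature by information gain and keep the k best,
    mimicking InfoGainWeighting + a top-k AttributeWeightsSelection."""
    weights = {name: info_gain(col, labels) for name, col in table.items()}
    return sorted(weights, key=weights.get, reverse=True)[:k]

labels = [0, 0, 1, 1]
table = {"informative": [0, 0, 1, 1], "noise": [0, 1, 0, 1]}
print(top_k_features(table, labels, 1))  # ['informative']
```

Because each feature is weighted independently in a single pass, this avoids the combinatorial search over attribute subsets that makes the full FeatureSelection operator so memory-hungry.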
In fact, I will be working with around 38 million instances and 40 attributes. Any comments on using RapidMiner on such a huge database are welcome.
Ignacio
@Ignacio:
As I stated before, we have already successfully worked with much larger data sets (several hundreds of millions of tuples) in RapidMiner - the important thing is that not every operator / every process can be applied to such large data sets. But if you know what you are doing, or if you can live with some trial and error, this is certainly possible. Although 40 million instances with 40 attributes might still fit in memory (at least on a 16 GB machine), it is probably better to work on the database as long as possible. The trick here is to use the CachedDatabaseExampleSource operator and use only the results of aggregations, samples, filtered sets, one-pass models, etc. in memory, leaving the original data in the database.
Cheers,
Ingo
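The "leave the data in the database" trick can be sketched in a few lines of Python with SQLite: the database computes the aggregates, and only the small result set ever reaches application memory. The table and column names here are made up for illustration; they are not part of any RapidMiner process.

```python
import sqlite3

# In-memory database standing in for the real (huge) events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# The database does the heavy lifting; only one row per customer
# (not one row per event) is materialized in Python's memory.
rows = conn.execute(
    "SELECT customer_id, COUNT(*), SUM(amount) "
    "FROM events GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(rows)  # [(1, 2, 15.0), (2, 1, 7.5)]
```

The same idea scales to tens of millions of rows: aggregation, sampling, and filtering happen server-side, and the learner only ever sees the reduced result.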
It would be a huge help to have the help documentation indicate whether an operator is memory-bound or not, single-pass, etc., as well as to have a number of examples of working with arbitrarily large N & M datasets.
Great software though! I was very impressed with the responsiveness when we discussed multivariate series to windows way back when.
Jay
Nice to hear from you again. I have added extending the documentation with this type of information to our TODO list.
Cheers,
Ingo