Standard Data Sets - memory issue
svendeswan
Member Posts: 8 Contributor II
Hi,
I am trying to run some standard data sets like 20 Newsgroups or Reuters-21578, but unfortunately I run into memory problems. The Reuters set could be used for Nearest Neighbors but nothing else; the 20 Newsgroups set didn't run at all... Maybe I am doing something wrong?!
I use RapidMiner 4.5.
Do you have some hints for me?
Thanks,
Sven
Answers
I guess you are using the Text plugin, correct? Did you switch to sparse ExampleSet storage? It would help a lot if you could paste the process below.
And another thing: how much memory does your RapidMiner use? Please take a look at the memory monitor and tell us...
Greetings,
Sebastian
Thank you for the answer. I did not quite understand how to set up the process for sparse storage.
The memory usage is about 1.9 GB...
Best,
Sven
And here is the process:
<?xml version="1.0" encoding="MacRoman"?>
<process version="4.5">
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Learning and storing a text classifier#ylt#/h3#ygt##ylt#p#ygt#This experiment shows how to learn and store a model on a set of texts.#ylt#/p#ygt##ylt#p#ygt#Most important to notice here is that the list of words used for learning must be stored if the model should be applied to new texts. This will ensure that new texts will be represented in exactly the same way as the texts used during training. #ylt#/p#ygt#"/>
<parameter key="logverbosity" value="error"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<operator name="TextInput" class="TextInput" expanded="yes">
<list key="texts">
<parameter key="acq" value="/20news-bydate/20news-bydate-train/alt.atheism"/>
<parameter key="alum" value="/20news-bydate/20news-bydate-train/comp.graphics"/>
<parameter key="bop" value="/20news-bydate/20news-bydate-train/comp.os.ms-windows.misc"/>
<parameter key="carcass" value="/20news-bydate/20news-bydate-train/comp.sys.ibm.pc.hardware"/>
<parameter key="cocoa" value="/20news-bydate/20news-bydate-train/comp.sys.mac.hardware"/>
<parameter key="coffee" value="/20news-bydate/20news-bydate-train/comp.windows.x"/>
<parameter key="copper" value="/20news-bydate/20news-bydate-train/misc.forsale"/>
<parameter key="cotton" value="/20news-bydate/20news-bydate-train/rec.autos"/>
<parameter key="cpi" value="/20news-bydate/20news-bydate-train/rec.motorcycles"/>
<parameter key="cpu" value="/20news-bydate/20news-bydate-train/rec.sport.baseball"/>
<parameter key="crude" value="/20news-bydate/20news-bydate-train/rec.sport.hockey"/>
<parameter key="dlr" value="/20news-bydate/20news-bydate-train/sci.crypt"/>
<parameter key="dmk" value="/20news-bydate/20news-bydate-train/sci.electronics"/>
<parameter key="earn" value="/20news-bydate/20news-bydate-train/sci.crypt"/>
<parameter key="fuel" value="/20news-bydate/20news-bydate-train/sci.electronics"/>
<parameter key="gas" value="/20news-bydate/20news-bydate-train/sci.med"/>
<parameter key="gnp" value="/20news-bydate/20news-bydate-train/sci.space"/>
<parameter key="gold" value="/20news-bydate/20news-bydate-train/soc.religion.christian"/>
<parameter key="grain" value="/20news-bydate/20news-bydate-train/talk.politics.guns"/>
<parameter key="heat" value="/20news-bydate/20news-bydate-train/talk.politics.mideast"/>
<parameter key="housing" value="/20news-bydate/20news-bydate-train/talk.politics.misc"/>
<parameter key="income" value="/20news-bydate/20news-bydate-train/talk.religion.misc"/>
</list>
<parameter key="default_content_type" value=""/>
<parameter key="default_content_encoding" value=""/>
<parameter key="default_content_language" value=""/>
<parameter key="prune_below" value="-1"/>
<parameter key="prune_above" value="-1"/>
<parameter key="vector_creation" value="TFIDF"/>
<parameter key="use_content_attributes" value="false"/>
<parameter key="use_given_word_list" value="false"/>
<parameter key="return_word_list" value="false"/>
<parameter key="output_word_list" value="/RapidMinerWordProject/Traningsdaten/wordvectorList.txt"/>
<parameter key="id_attribute_type" value="number"/>
<list key="namespaces">
</list>
<parameter key="create_text_visualizer" value="false"/>
<parameter key="on_the_fly_pruning" value="-1"/>
<parameter key="extend_exampleset" value="false"/>
<operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
</operator>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
<operator name="NearestNeighbors" class="NearestNeighbors">
<parameter key="keep_example_set" value="false"/>
<parameter key="k" value="1"/>
<parameter key="weighted_vote" value="false"/>
<parameter key="measure_types" value="MixedMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value="/RapidMinerWordProject/NearestNeighbor.mod"/>
<parameter key="overwrite_existing_file" value="true"/>
<parameter key="output_type" value="Binary"/>
</operator>
</operator>
</process>
The TextInput operator always creates a sparse example set if you don't switch on extend_exampleset. Then it would depend on the input example set.
I have downloaded the data set and will try it myself. But I think I already know what the problem is: unlike the data set, kNN does not store the data in a sparse format. That causes the memory consumption to explode. Just think of a matrix of 45000x10000 entries at 4 bytes each to get an impression of how much data would have to be stored. Nearest Neighbors isn't a good idea on text data at all, especially with so many examples, and it becomes completely worthless if you don't switch the distance measure to cosine similarity.
SVMs or Naive Bayes should cope with this amount of data much better and will give better performance anyway.
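The dense-vs-sparse arithmetic can be checked directly. A minimal sketch (Python, purely illustrative; the ~100 non-zeros per row is an assumed typical value for a text corpus, not something from this thread):

```python
# Dense vs. sparse storage for a 45,000 x 10,000 term matrix.
rows, cols = 45_000, 10_000
bytes_per_value = 4  # single-precision float, as in the estimate above

dense_bytes = rows * cols * bytes_per_value
print(f"dense:  {dense_bytes / 1e9:.1f} GB")  # 1.8 GB

# Assumption: a typical document touches ~100 distinct terms, so a
# CSR-style layout needs one value plus one column index per non-zero.
nnz_per_row = 100
index_bytes = 4
sparse_bytes = rows * nnz_per_row * (bytes_per_value + index_bytes)
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")  # 36 MB
```

Two orders of magnitude separate the two layouts, which is why a learner that materializes a dense matrix fails while the sparse ExampleSet loads fine.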
Greetings,
Sebastian
Thank you for your hints. Maybe I'll wait for your results :-) I am trying to build some kind of performance matrix for this data set (and the Reuters one too) using different learners and preprocessing steps. In my experience kNN has worked well for big data sets in the past, but I never tried it with RapidMiner. Maybe you could paste the process then, so I have the chance to build my matrix by simply exchanging the learner operators :-)
Best,
Sven
It just finished loading the data. The results are somewhat overwhelming: around 46,000 examples with 120,000 attributes. If stored in a standard, non-sparse array, this would consume around 36 GB of RAM. The standard kNN will not work on this. Never. Even if it stored the data in a sparse array, it would have to look through every one of the training examples to classify ONE new example, and each time it would have to compute the distance over all these attributes...
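To make that cost concrete, here is a minimal brute-force 1-NN sketch with cosine similarity over sparse term-weight dictionaries (Python rather than RapidMiner, purely for illustration; the terms and weights are invented). Every prediction is a full pass over the training set:

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    if len(a) > len(b):          # iterate the smaller dict
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, train, k=1):
    """Brute force: score the query against ALL training examples."""
    scored = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)
    labels = [label for _, label in scored[:k]]
    return max(set(labels), key=labels.count)   # majority vote

train = [({"ball": 1.0, "game": 1.0}, "sport"),
         ({"crypt": 1.0, "key": 1.0}, "sci")]
print(knn_predict({"ball": 1.0}, train))        # sport
```

With 46,000 training documents, that sort alone means 46,000 similarity computations per query, which is exactly the cost Sebastian describes.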
So you should simply replace the kNN in the operator tree of your process with the LibSVM or the NaiveBayes operator. That should work...
Greetings,
Sebastian