"Performance Issues"
svendeswan
Member Posts: 8 Contributor II
Hi,
I am currently working on supervised machine learning. The basis is a dataset containing 75 classes and about 2700 documents. The documents themselves are newsgroup entries with 2 to 10 lines of plain text. I use the Text Plugin to create the feature vectors. I tried several learners and several preprocessors such as String Tokenization or nGrams.
The results are a little embarrassing, so I think I did something wrong. For example:
Using String Tokenization and the NearestNeighbor learner, the system needs about 1.5 GB of RAM and more than 1 hour for learning. In contrast, my own implementation of the NearestNeighbor learner (including the text-to-feature-vector conversion) needs 40 seconds for the whole process and consumes about 50 MB of memory.
Has anyone experienced similar problems and possibly found a solution?
Thanks in advance,
Sven
Answers
Which RapidMiner version are you using?
How are you reading the texts? Are you using a sparse data representation?
Please post your processes here.
Best,
Simon
I am using version 4.4 (under Mac OS X). I currently do not use a sparse data representation, and I saw that the log output might be the reason for the reduced speed. Anyway, here is my configuration XML:
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Learning and storing a text classifier#ylt#/h3#ygt##ylt#p#ygt#This experiments shows how to lear and store a model o a set of texts.#ylt#/p#ygt##ylt#p#ygt#Most important to notice here is, that the list of words used for learning must be stored, if the model should be applied to new texts. This will ensurethat new texts will be represented exactly in the same way then then the texts used during training. #ylt#/p#ygt#"/>
<parameter key="logverbosity" value="almost_none"/>
<operator name="TextInput" class="TextInput" expanded="yes">
<list key="texts">
<parameter key="3" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/3"/>
<parameter key="4" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/4"/>
<parameter key="5" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/5"/>
<parameter key="6" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/6"/>
<parameter key="7" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/7"/>
<parameter key="8" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/8"/>
<parameter key="9" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/9"/>
<parameter key="10" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/10"/>
<parameter key="11" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/11"/>
<parameter key="12" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/12"/>
<parameter key="13" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/13"/>
<parameter key="14" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/14"/>
<parameter key="15" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/15"/>
<parameter key="16" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/16"/>
<parameter key="17" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/17"/>
<parameter key="18" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/18"/>
<parameter key="19" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/19"/>
<parameter key="20" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/20"/>
<parameter key="21" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/21"/>
<parameter key="22" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/22"/>
<parameter key="23" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/23"/>
<parameter key="24" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/24"/>
<parameter key="25" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/25"/>
<parameter key="26" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/26"/>
<parameter key="27" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/27"/>
<parameter key="28" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/28"/>
<parameter key="29" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/29"/>
<parameter key="30" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/30"/>
<parameter key="31" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/31"/>
<parameter key="32" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/32"/>
<parameter key="33" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/33"/>
<parameter key="34" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/34"/>
<parameter key="35" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/35"/>
<parameter key="36" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/36"/>
<parameter key="37" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/37"/>
<parameter key="38" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/38"/>
<parameter key="39" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/39"/>
<parameter key="40" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/40"/>
<parameter key="41" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/41"/>
<parameter key="42" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/42"/>
<parameter key="43" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/43"/>
<parameter key="44" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/44"/>
<parameter key="45" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/45"/>
<parameter key="46" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/46"/>
<parameter key="47" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/47"/>
<parameter key="48" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/48"/>
<parameter key="49" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/49"/>
<parameter key="50" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/50"/>
<parameter key="51" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/51"/>
<parameter key="52" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/52"/>
<parameter key="53" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/53"/>
<parameter key="54" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/54"/>
<parameter key="55" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/55"/>
<parameter key="56" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/56"/>
<parameter key="57" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/57"/>
<parameter key="58" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/58"/>
<parameter key="59" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/59"/>
<parameter key="60" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/60"/>
<parameter key="61" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/61"/>
<parameter key="62" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/62"/>
<parameter key="63" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/63"/>
<parameter key="64" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/64"/>
<parameter key="65" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/65"/>
<parameter key="66" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/66"/>
<parameter key="67" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/67"/>
<parameter key="68" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/68"/>
<parameter key="69" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/69"/>
<parameter key="70" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/70"/>
<parameter key="71" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/71"/>
<parameter key="72" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/72"/>
<parameter key="73" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/73"/>
<parameter key="74" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/74"/>
<parameter key="75" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/75"/>
</list>
<parameter key="output_word_list" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/Traningsdaten/wordvectorList.txt"/>
<list key="namespaces">
</list>
<operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
</operator>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
<operator name="NearestNeighbors" class="NearestNeighbors">
<parameter key="k" value="3"/>
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value="/Users/sven/Documents/Projekte/dfki/dilia/Clustering/RapidMinerWordProject/NearestNeighbor_Modell.mod"/>
</operator>
</operator>
Thank you for your help,
Sven
Your assumption is correct: if you set the log verbosity to Error in the process root operator, the speed will increase by a factor of around 1000. Please don't ask why each text is logged as a warning. There is simply no reason for doing that, besides the fact that there once was a non-GUI version... Unfortunately, we did not have access to the lines of code producing the log messages. But the next version of RapidMiner will solve this issue.
Greetings,
Sebastian
Indeed, the performance is much better now (about 15 minutes for the whole process). There are still some questions left:
- Why does the process consume that much memory? Is it because of the GUI?
- I applied the learnt model to 124 test documents. The model applier needs about 22 minutes to perform the classification (TextInput needed about 30 seconds, ModelLoader about 9 minutes). So the question is whether this is just an artifact of the test setup, or whether classifying 1 document with NearestNeighbor will take roughly 8 seconds in a later application?
Thanks again,
Sven
This sounds like it is still too slow.
Can you please try with 4.5? Some operators had performance issues in 4.4 when there were many attributes. This was fixed in 4.5.
For the ModelLoader, I would guess that you are using the XML format. This is pretty slow, but we cannot do anything about that; it's in an external library. You can try the binary serialization, which is faster.
Cheers,
Simon
I updated to version 4.5 and changed to the binary model format, which helped a lot with storing and reading time. Thank you for your advice. Unfortunately, the classification time did not change... Is there a way to tune the Text Plugin to produce sparse vectors? This should improve speed to a large extent.
Cheers,
Sven
The TextInput always uses a sparse representation.
But the use of the kNN learner slows things down a lot. Did you try an SVM with an RBF kernel? It's much the same, but should be faster.
By the way: for texts, the cosine similarity usually yields better results.
Greetings,
Sebastian
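To make the two suggestions above concrete (sparse vectors and cosine similarity), here is a minimal Python sketch of a k-nearest-neighbour text classifier. It is not the Text Plugin's or RapidMiner's implementation; the function names (`tokenize`, `to_sparse_vector`, `cosine_similarity`, `knn_classify`) and the tiny training set are hypothetical and only illustrate the idea. Each document is stored as a dictionary of term counts, so the similarity computation only touches terms that actually occur in the shorter document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_sparse_vector(text):
    """Map a document to a sparse term-frequency vector (term -> count)."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector only
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Return the majority label among the k most similar training documents."""
    query_vec = to_sparse_vector(query)
    scored = sorted(
        ((cosine_similarity(query_vec, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, only for illustration.
training = [
    (to_sparse_vector("the printer driver crashes on startup"), "hardware"),
    (to_sparse_vector("printer prints blank pages after driver update"), "hardware"),
    (to_sparse_vector("how do I cook rice without a rice cooker"), "cooking"),
    (to_sparse_vector("best recipe for fried rice"), "cooking"),
]

print(knn_classify("my printer driver will not install", training, k=3))  # prints: hardware
```

With short documents such as newsgroup entries of 2 to 10 lines, each dictionary holds only a few dozen entries, so a brute-force comparison against a few thousand training documents stays cheap in this representation.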
Yesterday I used the fast large margin learner and wrapped it with the Binary2MultiClass learner.
The result was that I ran out of memory (I had allocated 2.1 GB of heap memory)...
Do you have a hint for another SVM learner in your system that does not consume too much memory?
I also tried libsvm, but the learning time was pretty long as well (I let it run overnight, so I don't know the actual learning time).
Thank you,
Sven
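One factor that may contribute to the memory use is the number of binary sub-problems a multi-class wrapper has to train. The thread does not say which decomposition Binary2MultiClass was configured with, so the sketch below only assumes the two common strategies (one-vs-all and one-vs-one) and counts the binary models each would need for the 75 classes mentioned above. The function names are hypothetical.

```python
def one_vs_all_models(num_classes):
    """One binary model per class: this class against all others."""
    return num_classes


def one_vs_one_models(num_classes):
    """One binary model per unordered pair of classes."""
    return num_classes * (num_classes - 1) // 2


classes = 75  # number of classes stated in the original post
print(one_vs_all_models(classes))  # prints: 75
print(one_vs_one_models(classes))  # prints: 2775
```

Under these assumptions, a one-vs-one decomposition trains far more models than one-vs-all, although each of them sees only the documents of two classes.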