Time Optimization
Hi,
I am running KMedoids clustering on 1.7 MB of text data, but it has been running for the last three and a half days. The other operators took only 10 minutes; KMedoids accounts for all the remaining time. Is there any way to optimize the process? The process is shown below.
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Optimizing vector creation for text classification#ylt#/h3#ygt##ylt#p#ygt#This experiments shows how to apply a cross validation to a classifier that learns to separate two sets of texts.#ylt#/p#ygt#"/>
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\data1.xls"/>
<parameter key="first_row_as_names" value="true"/>
</operator>
<operator name="Nominal2String" class="Nominal2String">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<list key="namespaces">
</list>
<operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
</operator>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
</operator>
</operator>
<operator name="KMedoids" class="KMedoids">
<parameter key="k" value="25"/>
</operator>
<operator name="AttributeFilter" class="AttributeFilter">
<parameter key="condition_class" value="is_nominal"/>
</operator>
<operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
<parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\cluster1.xls"/>
</operator>
</operator>
Thanks
Ratheesan
Answers
Unfortunately, it takes time to calculate all the distances needed. One hint: it might be useful to switch to CosineSimilarity, which is more suitable for text mining than Euclidean distance.
Greetings,
Sebastian
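As a rough illustration of the two points above (this is not RapidMiner's implementation): medoid-based clustering needs distances between many pairs of documents, and the number of pairs grows quadratically with the number of documents. The sketch below, in plain Python with made-up term counts, counts the pairs and shows why cosine similarity is often preferred for text: it ignores document length, whereas Euclidean distance does not. All function names and data here are hypothetical.

```python
import math

def pair_count(n):
    """Number of distinct document pairs among n documents."""
    return n * (n - 1) // 2

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Assumes both vectors are non-zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical term counts over the vocabulary ["cluster", "text", "price"].
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same topic as short_doc, ten times longer
other_doc = [0, 1, 3]    # different topic, similar length to short_doc

print("pairs among 1000 documents:", pair_count(1000))
print("Euclidean:", euclidean_distance(short_doc, long_doc),
      euclidean_distance(short_doc, other_doc))
print("cosine:   ", cosine_similarity(short_doc, long_doc),
      cosine_similarity(short_doc, other_doc))
```

With Euclidean distance the short document looks closer to the unrelated document than to the longer one on the same topic; with cosine similarity the two same-topic documents are the most similar pair.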
If I were using the RM Enterprise Edition, would it take the same amount of time as with the RM Community version?
Thanks
Ratheesan
We have parallelized many important operators for the Enterprise Edition, but KMedoids is not one of them. For the price of an Enterprise Edition, however, we could write you a parallelized KMedoids. One could even think about optimizing the operator for small example sets with many attributes, as is common in text mining tasks.
Greetings,
Sebastian
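To make the last remark concrete: when there are few examples but very many attributes, each single distance is expensive, so one conceivable optimization is to compute every pairwise distance once and let the clustering loop work on lookups only. The following Python sketch shows that idea with a simple alternating k-medoids variant. It is an assumption about what such an optimization could look like, not the RapidMiner KMedoids operator; the function names, the starting medoids, and the toy data are all hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(vectors, dist):
    """Compute each pairwise distance exactly once and store it symmetrically."""
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(vectors[i], vectors[j])
    return d

def assign(d, medoids):
    """Label each point with the index of its nearest medoid (lookups only)."""
    return [min(range(len(medoids)), key=lambda c: d[i][medoids[c]])
            for i in range(len(d))]

def k_medoids(d, initial_medoids, max_steps=10):
    """Alternate assignment and medoid update on a precomputed matrix d."""
    medoids = list(initial_medoids)
    for _ in range(max_steps):
        labels = assign(d, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if its cluster became empty.
                new_medoids.append(old)
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            new_medoids.append(min(members, key=lambda m: sum(d[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(d, medoids)

# Toy data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
d = distance_matrix(points, euclidean)
medoids, labels = k_medoids(d, initial_medoids=[0, 1])
print(medoids, labels)
```

The distance function is called only inside `distance_matrix`, so swapping in another measure or splitting the matrix computation across threads would not touch the clustering loop. The matrix needs memory quadratic in the number of examples, which is why this only suits small example sets.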
I have tried the above process with cosine similarity, but I always get the message "There is no obvious error, check the log file". Before applying KMedoids, I used the AttributeFilter operator to select the numeric attributes, because in KMedoids only the numerical measures provide cosine similarity.
Thanks
Ratheesan
Please send me your process. I will check whether there is a bug.
Greetings,
Sebastian
Thanks for your valuable help. This is my process
<operator name="Root" class="Process" expanded="yes">
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\data1.xls"/>
<parameter key="first_row_as_names" value="true"/>
</operator>
<operator name="Nominal2String" class="Nominal2String">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<list key="namespaces">
</list>
<operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
</operator>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
</operator>
<operator name="PorterStemmer" class="PorterStemmer">
</operator>
</operator>
<operator name="AttributeFilter" class="AttributeFilter">
<parameter key="condition_class" value="is_numerical"/>
<parameter key="parameter_string" value="sample"/>
<parameter key="apply_on_special" value="true"/>
</operator>
<operator name="KMedoids" class="KMedoids">
<parameter key="k" value="3"/>
<parameter key="max_runs" value="5"/>
<parameter key="max_optimization_steps" value="10"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
<parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\modelcluster.xls"/>
</operator>
</operator>
With up to 250 records it works properly, but with more than 250 records I get the above message.
Thanks
Ratheesan.
The process runs fine here. I used 722 texts, and there was no error, at least not in the first few minutes of the KMedoids run.
Of course I don't have exactly the same setup, because I'm using different texts. I suggest you switch RapidMiner to debug mode so that you can post the detailed error message. Go to the Tools menu and select Preferences, then enable the rapidminer.general.debugmode checkbox in the General tab.
Then please re-execute the process and send me the error message.
Greetings,
Sebastian
I re-executed the process after switching to debug mode. The error message is attached below.
Root[1] (Process)
+- ExcelExampleSource[1] (ExcelExampleSource)
+- Nominal2String[1] (Nominal2String)
+- StringTextInput[1] (StringTextInput)
| +- ToLowerCaseConverter[600] (ToLowerCaseConverter)
| +- StringTokenizer[600] (StringTokenizer)
| +- EnglishStopwordFilter[600] (EnglishStopwordFilter)
| +- TokenLengthFilter[600] (TokenLengthFilter)
+- AttributeFilter (2)[1] (AttributeFilter)
here ==> +- KMedoids[1] (KMedoids)
java.lang.NullPointerException
at com.rapidminer.operator.clustering.clusterer.KMedoids.generateClusterModel(KMedoids.java:176)
at com.rapidminer.operator.clustering.clusterer.AbstractClusterer.apply(AbstractClusterer.java:60)
at com.rapidminer.operator.Operator.apply(Operator.java:671)
at com.rapidminer.operator.OperatorChain.apply(OperatorChain.java:424)
at com.rapidminer.operator.Operator.apply(Operator.java:671)
at com.rapidminer.Process.run(Process.java:735)
at com.rapidminer.Process.run(Process.java:704)
at com.rapidminer.Process.run(Process.java:694)
at com.rapidminer.gui.ProcessThread.run(ProcessThread.java:59)
Thanks
Ratheesan.
That's quite strange. The distance measure seems to return NaN; that's the only way this could happen.
Unfortunately, I cannot debug this in more detail because I can't reproduce the error. Do you have any missing values in your data?
Greetings,
Sebastian
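One possible source of such a NaN, stated here only as an assumption and not confirmed anywhere in this thread: if a text loses all of its tokens during stop-word and token-length filtering, its word vector is all zeros, and cosine similarity then divides zero by zero. In Java floating-point arithmetic, 0.0 / 0.0 evaluates to NaN. The Python sketch below uses made-up data and hypothetical helper names to list such all-zero rows so they can be inspected or removed before clustering.

```python
import math

def zero_norm_rows(vectors):
    """Return the indices of rows whose Euclidean norm is zero."""
    return [i for i, row in enumerate(vectors)
            if math.sqrt(sum(x * x for x in row)) == 0.0]

def cosine_similarity(a, b):
    """Cosine similarity that yields NaN when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Python raises ZeroDivisionError for 0.0 / 0.0, so NaN is returned
    # explicitly here to mimic the IEEE 754 result seen in Java.
    return dot / norm if norm != 0.0 else math.nan

# Hypothetical word vectors; the second document lost all its tokens.
vectors = [[1, 0, 2], [0, 0, 0], [0, 3, 1]]

print("all-zero rows:", zero_norm_rows(vectors))
print("similarity to an empty document:", cosine_similarity(vectors[0], vectors[1]))
```

If a check like this finds empty rows in the real data, filtering them out before KMedoids would be a cheap way to test the assumption.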
I have no missing values. However, I do get output when using Dice similarity. Is it meaningful to use Dice similarity in text mining?
Thanks
Ratheesan.
This is a forum; it is neither a consulting service nor a course. I cannot answer EACH question regarding this or that algorithm or measure. Just try it out yourself. In fact, nobody can say in general what a good measure or algorithm is, because that always depends on the data, on your data, which I don't have.
Greetings,
Sebastian