Time Optimization

ratheesan · December 2009

Hi,
I am running KMedoids clustering on 1.7 MB of text data, but it has been running for the last three and a half days. The other operators took only 10 minutes; KMedoids accounts for all of the remaining time. Is there any way to optimize the process? The process is shown below.

<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#Optimizing vector creation for text classification#ylt#/h3#ygt##ylt#p#ygt#This experiments shows how to apply a cross validation to a classifier that learns to separate two sets of texts.#ylt#/p#ygt#"/>
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\data1.xls"/>
        <parameter key="first_row_as_names" value="true"/>
    </operator>
    <operator name="Nominal2String" class="Nominal2String">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <list key="namespaces">
        </list>
        <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
        </operator>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
    </operator>
    <operator name="KMedoids" class="KMedoids">
        <parameter key="k" value="25"/>
    </operator>
    <operator name="AttributeFilter" class="AttributeFilter">
        <parameter key="condition_class" value="is_nominal"/>
    </operator>
    <operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
        <parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\cluster1.xls"/>
    </operator>
</operator>

Thanks
Ratheesan

land · December 2009

Hi,
unfortunately it takes time to calculate all the distances needed. One hint: it might be useful to switch to CosineSimilarity, which is more suitable for text mining than Euclidean distance.

Greetings,
Sebastian
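
To illustrate the hint about CosineSimilarity, here is a minimal sketch in plain Python (not RapidMiner code). The three term-count vectors are made up for illustration: `long_doc` uses the same words as `short_doc` in the same proportions but is three times longer, while `other_doc` shares no words with either. Under these assumptions, Euclidean distance is dominated by document length, whereas cosine similarity only looks at the direction of the vectors.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equally long term vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term counts over a four-word vocabulary.
short_doc = [1, 2, 0, 1]
long_doc = [3, 6, 0, 3]   # same word proportions as short_doc, three times longer
other_doc = [0, 0, 2, 0]  # shares no words with short_doc

print("Euclidean short/long: ", euclidean_distance(short_doc, long_doc))
print("Euclidean short/other:", euclidean_distance(short_doc, other_doc))
print("Cosine short/long:    ", cosine_similarity(short_doc, long_doc))
print("Cosine short/other:   ", cosine_similarity(short_doc, other_doc))
```

In this toy data, Euclidean distance places `short_doc` closer to the unrelated `other_doc` than to its longer counterpart, while cosine similarity rates the two related documents as practically identical and the unrelated one as completely dissimilar.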

ratheesan · December 2009

Thanks Sebastian,
Suppose I use the RM Enterprise Edition: will it take the same amount of time as the RM Community version?

Thanks
Ratheesan

land · December 2009

Hi,
we have parallelized many important operators for the Enterprise Edition, but KMedoids is not one of them. For the price of an Enterprise Edition, though, we could write you a parallelized KMedoids. One could even think about optimizing the operator for small example sets with many attributes, as are frequent in text mining tasks.

Greetings,
Sebastian
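
As a rough idea of what an optimization for "small example sets with many attributes" could look like, the sketch below stores each text as a sparse mapping from term to weight and only touches the terms that actually occur. This is an assumption about one possible approach, written in plain Python; it is not how RapidMiner implements KMedoids or its measures, and `sparse_cosine` and the two sample documents are hypothetical names and data.

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Iterate over the smaller dict so the work depends on the number of
    # terms present in the shorter text, not on the vocabulary size.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Design choice of this sketch: treat an empty text as dissimilar
        # to everything instead of dividing by zero.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents; absent terms are simply not stored.
doc1 = {"cluster": 2.0, "text": 1.0}
doc2 = {"text": 3.0, "mining": 4.0}

print(sparse_cosine(doc1, doc2))
```

With a vocabulary of many thousands of terms and only a handful of terms per text, a dense vector comparison spends most of its time multiplying zeros, which the sparse form skips.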

ratheesan · December 2009

Hi Sebastian,

I have tried the above process with cosine similarity, but I always get the message "There is no obvious error, check the log file". Before applying KMedoids I used the AttributeFilter operator to select the numeric attributes, because in KMedoids only the numerical measures provide cosine similarity.

Thanks
Ratheesan

land · December 2009

Hi,
please send me your process. I will check if there's a bug.

Greetings,
Sebastian

ratheesan · December 2009

Hi Sebastian,

Thanks for your valuable help. This is my process:

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\data1.xls"/>
        <parameter key="first_row_as_names" value="true"/>
    </operator>
    <operator name="Nominal2String" class="Nominal2String">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <list key="namespaces">
        </list>
        <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
        </operator>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
    </operator>
    <operator name="AttributeFilter" class="AttributeFilter">
        <parameter key="condition_class" value="is_numerical"/>
        <parameter key="parameter_string" value="sample"/>
        <parameter key="apply_on_special" value="true"/>
    </operator>
    <operator name="KMedoids" class="KMedoids">
        <parameter key="k" value="3"/>
        <parameter key="max_runs" value="5"/>
        <parameter key="max_optimization_steps" value="10"/>
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
        <parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\modelcluster.xls"/>
    </operator>
</operator>

With up to 250 records it works properly, but with more than 250 records I get the above message.

Thanks
Ratheesan.

land · December 2009

Hi,
the process runs fine here. I used 722 texts and there was no error, at least not in the first few minutes of the KMedoids run.

Of course I don't have exactly the same setup, because I'm using different texts. I suggest you switch RapidMiner to debug mode so that you can post the detailed error message. Go to the Tools menu and select Preferences, then enable the rapidminer.general.debugmode checkbox in the General tab.
Then please re-execute the process and send me the error message.

Greetings,
Sebastian

ratheesan · December 2009

Hi Sebastian,

I re-executed the process after switching to debug mode. Here is the error message:

Root[1] (Process)
+- ExcelExampleSource[1] (ExcelExampleSource)
+- Nominal2String[1] (Nominal2String)
+- StringTextInput[1] (StringTextInput)
| +- ToLowerCaseConverter[600] (ToLowerCaseConverter)
| +- StringTokenizer[600] (StringTokenizer)
| +- EnglishStopwordFilter[600] (EnglishStopwordFilter)
| +- TokenLengthFilter[600] (TokenLengthFilter)
+- AttributeFilter (2)[1] (AttributeFilter)
here ==> +- KMedoids[1] (KMedoids)
java.lang.NullPointerException
at com.rapidminer.operator.clustering.clusterer.KMedoids.generateClusterModel(KMedoids.java:176)
at com.rapidminer.operator.clustering.clusterer.AbstractClusterer.apply(AbstractClusterer.java:60)
at com.rapidminer.operator.Operator.apply(Operator.java:671)
at com.rapidminer.operator.OperatorChain.apply(OperatorChain.java:424)
at com.rapidminer.operator.Operator.apply(Operator.java:671)
at com.rapidminer.Process.run(Process.java:735)
at com.rapidminer.Process.run(Process.java:704)
at com.rapidminer.Process.run(Process.java:694)
at com.rapidminer.gui.ProcessThread.run(ProcessThread.java:59)

Thanks
Ratheesan.

land · December 2009

Hi,
that's quite strange. The distance measure seems to return NaN; that's the only way this could happen.
Unfortunately I cannot debug this in more detail, because I can't reproduce the error. Do you have any missing values in your data?

Greetings,
Sebastian
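
One way a similarity measure can yield NaN even without missing values is a text whose vector is all zeros, for example because every token was removed by the stopword or length filter. Whether that is what happened in this process is not confirmed in the thread; the sketch below only illustrates the mechanism in plain Python. The functions `closest_medoid` and `zero_rows` and the sample data are hypothetical and are not taken from the RapidMiner source. In Java, `0.0 / 0.0` evaluates to NaN, so the sketch returns NaN explicitly in that case to mimic it.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity that mimics Java's 0.0 / 0.0 == NaN behaviour."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_product = (math.sqrt(sum(x * x for x in a))
                    * math.sqrt(sum(y * y for y in b)))
    if norm_product == 0.0:
        # Python would raise ZeroDivisionError here; Java silently yields NaN.
        return math.nan
    return dot / norm_product

def closest_medoid(example, medoids):
    """Index of the most similar medoid, or None if no comparison succeeds."""
    best_index, best_similarity = None, -math.inf
    for index, medoid in enumerate(medoids):
        similarity = cosine_similarity(example, medoid)
        # Any comparison with NaN is False, so a NaN never becomes "best".
        if similarity > best_similarity:
            best_index, best_similarity = index, similarity
    return best_index

def zero_rows(examples):
    """Indices of examples whose attribute values are all zero."""
    return [i for i, row in enumerate(examples) if not any(row)]

# Hypothetical data: the second example lost all its tokens during filtering.
examples = [[1, 0, 2], [0, 0, 0], [0, 3, 1]]
medoids = [[1, 0, 1], [0, 1, 1]]

for example in examples:
    print(example, "->", closest_medoid(example, medoids))
print("all-zero examples:", zero_rows(examples))
```

An example that ends up without any assigned medoid is the kind of unset value that could later surface as a null reference. Checking the example set for all-zero rows before clustering is a cheap way to rule this out.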

ratheesan · December 2009

Hi Sebastian,

I have no missing values. However, I do get output when using Dice similarity. Is Dice similarity meaningful for text mining?

Thanks
Ratheesan.

land · December 2009

Hi,
this is a forum; it is neither consulting nor a course. I cannot answer EACH question regarding this or that algorithm or measure. Just try it out yourself. In fact, nobody can say in general what a good measure or algorithm is, because that always depends on the data, and I don't have your data.

Greetings,
Sebastian
