How to reuse preprocessing results across a range of k-means clustering runs
Hi all,
I am running a k-means clustering analysis on several groups of documents, and I would like to evaluate the clustering performance for different values of K (K = 4 to 20) by comparing their respective Davies-Bouldin indexes.
Before the clustering algorithm, I apply a set of preprocessing tasks (transform cases, tokenize, filter stopwords, stemming... creating a TF-IDF vector). The output of these preprocessing tasks is always the same for a given group of texts (attached is the general view of the process).
Now I am running the process once for each value of K, but I would like to avoid repeating the preprocessing, which is the same for each group of texts, every time I run the clustering and calculate the Davies-Bouldin index, mainly to save a lot of time.
Thank you very much in advance
Alberto
Best Answers
Telcontar120
Just add a loop after the preprocessing steps that runs k-means and saves the output you want, then cycle through the k values you would like using a loop macro.
An alternative would be to Store the results after pre-processing them and then create a separate process that starts by Retrieving that dataset before each run of the clustering (also within a loop). Either approach should work.
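For readers who think in code, the "preprocess once, then loop over k" idea can be sketched outside RapidMiner in plain Python. This is only an illustration under assumptions: the helper names (preprocess, kmeans, davies_bouldin), the toy documents, and the term-count vectors standing in for TF-IDF are all made up here and are not RapidMiner operators. The toy set has only nine documents, so the loop covers k = 2 to 4 rather than 4 to 20.

```python
import math
import random

def preprocess(docs):
    # Hypothetical stand-in for the expensive text pipeline:
    # lowercase, tokenize, and build one term-count vector per document.
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    return [[float(t.count(w)) for w in vocab] for t in tokens]

def dist(a, b):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    # Component-wise mean of a non-empty list of vectors.
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(X, k, iters=50, seed=0):
    # Plain Lloyd's algorithm; returns one cluster label per row of X.
    centers = random.Random(seed).sample(X, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(x, centers[j])) for x in X]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [x for x, l in zip(X, labels) if l == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = centroid(members)
    return labels

def davies_bouldin(X, labels):
    # Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance.
    # Lower values indicate more compact, better separated clusters.
    ids = sorted(set(labels))
    groups = {j: [x for x, l in zip(X, labels) if l == j] for j in ids}
    cents = {j: centroid(groups[j]) for j in ids}
    scatter = {j: sum(dist(x, cents[j]) for x in groups[j]) / len(groups[j]) for j in ids}
    worst = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j]) for j in ids if j != i)
        for i in ids
    ]
    return sum(worst) / len(ids)

docs = [
    "apple banana fruit", "banana fruit salad", "apple fruit juice",
    "engine car wheel", "car wheel road", "engine road car",
    "python code bug", "code bug fix", "python code test",
]

X = preprocess(docs)  # done once, outside the loop

scores = {}
for k in range(2, 5):  # only the clustering and the index are repeated
    scores[k] = davies_bouldin(X, kmeans(X, k))

for k, score in scores.items():
    print(k, round(score, 3))
```

The Store/Retrieve alternative corresponds to writing X to disk once (for example with pickle.dump) and loading it at the start of each separate run (pickle.load) instead of recomputing it.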
nmahesh
Hi Alberto,
Have you tried using the store operator for the pre-processing? I would then create different processes to try out different parameter changes to your clustering and performance.
Best,
Nithin Mahesh
Answers
Thank you Brian,
I'm a beginner with RapidMiner and I had not considered the option of storing/retrieving the output of the preprocessing tasks. It is a very good option and I'm sure it will save me a lot of time.
I wouldn't like to take up much of your time. I have already considered using a loop to try different values of K, but I have not found the right way to implement it. Could you provide an example? I tried placing the cluster loop operator between the Retrieve operator and the clustering operator, but I don't know how to change k.
Thanks again
alberto
Thank you Nithin, both Brian's and your proposals about storing/retrieving the output of the preprocessing tasks have been very useful.
Alberto
Sure, here's a sample process with k-means clustering and the Loop Parameters operator.
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="85">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="loop_parameters" compatibility="7.5.003" expanded="true" height="103" name="Loop Parameters" width="90" x="246" y="85">
        <list key="parameters">
          <parameter key="Clustering.k" value="[2.0;10;8;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="k_means" compatibility="7.5.003" expanded="true" height="82" name="Clustering" width="90" x="313" y="85">
            <parameter key="k" value="10"/>
          </operator>
          <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Sonar" from_port="output" to_op="Loop Parameters" to_port="input 1"/>
      <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
      <connect from_op="Loop Parameters" from_port="result 2" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
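A note on the value "[2.0;10;8;linear]" set for Clustering.k in the process above: as far as I understand the grid format, the fields are minimum; maximum; number of steps; scale, and a linear grid with n steps yields n + 1 evenly spaced values, so this range should try k = 2, 3, ..., 10. The Python sketch below only illustrates that reading. expand_grid is a hypothetical helper, not part of RapidMiner, and the rounding to whole numbers is an assumption for an integer parameter such as k.

```python
def expand_grid(spec):
    # Assumed reading of "[min;max;steps;scale]": steps + 1 evenly spaced values.
    lo, hi, steps, scale = spec.strip("[]").split(";")
    if scale != "linear":
        raise ValueError("this sketch only handles the linear scale")
    lo, hi, steps = float(lo), float(hi), int(steps)
    # Round because k must be a whole number (assumption for integer parameters).
    return [round(lo + i * (hi - lo) / steps) for i in range(steps + 1)]

print(expand_grid("[2.0;10;8;linear]"))
# [2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Under the same reading, covering K = 4 to 20 from the original question would correspond to "[4;20;16;linear]".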
Thank you Telcontar120, I will try this. It is very useful, and I really appreciate your help!