How to reuse preprocessing results in a range of k-means clustering

albertoarenal · July 2017

Hi all,

I am running a k-means clustering analysis on several groups of documents, and I would like to evaluate the clustering performance for different values of K (K = 4 to 20) by comparing their respective Davies-Bouldin indexes.

Before the clustering algorithm, I apply a set of preprocessing tasks (transform cases, tokenize, filter stopwords, stemming, ... to create a TF-IDF vector). The output of these preprocessing tasks is always the same for each group of texts (a general view of the process is attached).

Right now I run the whole process for each value of K, but I would like to avoid repeating the preprocessing tasks, which are the same for each group of texts, every time I run the clustering and calculate the Davies-Bouldin index, mainly to save a lot of time.

Thank you very much in advance

Alberto

Telcontar120 · July 2017

Just add a loop after the preprocessing steps that runs k-means and saves the output you want, then cycle through the k values you would like using a loop macro.

An alternative would be to Store the results after pre-processing them and then create a separate process that starts by Retrieving that dataset before each run of the clustering (also within a loop). Either approach should work.
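For readers who think more easily in code, the first approach (preprocess once, then loop over k) can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: `preprocess`, `kmeans`, `davies_bouldin` and `scan_k` are hypothetical helper names, the term-count vectors stand in for the real TF-IDF pipeline, and the k-means is a bare-bones Lloyd's algorithm rather than RapidMiner's operator.

```python
import math
import random

def preprocess(docs):
    """Stand-in for the expensive text pipeline: lowercase, tokenize, count terms."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    return [tuple(float(doc.count(t)) for t in vocab) for doc in tokenized]

def kmeans(points, k, iters=50, seed=0):
    """Bare-bones Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def davies_bouldin(points, centroids, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / separation_ij."""
    k = len(centroids)
    scatter = []
    for c in range(k):
        d = [math.dist(p, centroids[c]) for p, l in zip(points, labels) if l == c]
        scatter.append(sum(d) / len(d) if d else 0.0)
    total = 0.0
    for i in range(k):
        ratios = []
        for j in range(k):
            sep = math.dist(centroids[i], centroids[j])
            if j != i and sep > 0:  # skip coincident centroids to avoid dividing by zero
                ratios.append((scatter[i] + scatter[j]) / sep)
        total += max(ratios, default=0.0)
    return total / k

def scan_k(vectors, k_values):
    """Cluster the same precomputed vectors once per k; a lower index is better."""
    scores = {}
    for k in k_values:
        centroids, labels = kmeans(vectors, k)
        scores[k] = davies_bouldin(vectors, centroids, labels)
    return scores

docs = [
    "apple banana apple", "banana apple fruit",
    "car engine wheel", "engine car road",
    "python code bug", "code python test",
]
vectors = preprocess(docs)             # the expensive step runs exactly once
scores = scan_k(vectors, range(2, 4))  # only the clustering is repeated
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

The point of the sketch is the structure: `preprocess` sits outside the loop, so its cost is paid once no matter how many k values are scanned.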

nmahesh · July 2017

Hi Alberto,

Have you tried using the store operator for the pre-processing? I would then create different processes to try out different parameter changes to your clustering and performance.

Best,

Nithin Mahesh
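As a rough analogy to the Store and Retrieve operators, here is a minimal Python sketch that writes preprocessed vectors to disk once and reads them back in a later run. The helper names `store` and `retrieve`, the file name, and the toy vectors are all made up for illustration; RapidMiner itself stores to a repository entry, not a pickle file.

```python
import os
import pickle
import tempfile

def store(obj, path):
    """Persist an object so that later runs can skip recomputing it."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)

def retrieve(path):
    """Load a previously stored object."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Hypothetical location and toy data, purely for illustration.
path = os.path.join(tempfile.gettempdir(), "preprocessed_vectors.pkl")
vectors = [(1.0, 0.0), (0.0, 1.0)]

store(vectors, path)       # done once, at the end of preprocessing
restored = retrieve(path)  # done at the start of each clustering run
```

Each separate clustering experiment can then start from `retrieve(path)` instead of repeating the preprocessing.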

albertoarenal · July 2017

Thank you Brian,

I'm a beginner with RapidMiner and I had not considered the option of storing/retrieving the output of the preprocessing tasks. It is a very good option and I'm sure it will save me a lot of time.

I don't want to take up much of your time, but I had already considered using a loop to try different values of K and have not found the right way to implement it. Could you provide an example? I tried the cluster loop operator between the Retrieve operator and the clustering operator, but I don't know how to change k.

Thanks again
alberto

albertoarenal · July 2017

Thank you Nithin. Both Brian's proposal and yours about storing/retrieving the output of the preprocessing tasks have been very useful.

Alberto

Telcontar120 · July 2017

Sure, here's a sample process with k-means clustering and the Loop Parameters operator.

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="85">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="loop_parameters" compatibility="7.5.003" expanded="true" height="103" name="Loop Parameters" width="90" x="246" y="85">
        <list key="parameters">
          <parameter key="Clustering.k" value="[2.0;10;8;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="k_means" compatibility="7.5.003" expanded="true" height="82" name="Clustering" width="90" x="313" y="85">
            <parameter key="k" value="10"/>
          </operator>
          <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Sonar" from_port="output" to_op="Loop Parameters" to_port="input 1"/>
      <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
      <connect from_op="Loop Parameters" from_port="result 2" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
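A note on the `Clustering.k` entry `[2.0;10;8;linear]` in the process above: it appears to read as minimum, maximum, number of steps, and scale. The small Python sketch below expands such a grid under the assumption that "8 steps" on a linear scale means 9 evenly spaced values including both ends, rounded because k is an integer parameter. The function name `linear_grid` is hypothetical, and the expansion rule is my reading of the grid syntax, so it is worth checking against the values the loop actually logs.

```python
def linear_grid(lo, hi, steps):
    """Assumed reading of "[lo;hi;steps;linear]": steps + 1 evenly spaced values."""
    return [round(lo + i * (hi - lo) / steps) for i in range(steps + 1)]

print(linear_grid(2, 10, 8))   # [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(linear_grid(4, 20, 16))  # every integer from 4 to 20
```

Under the same assumption, scanning K = 4 to 20 as in the original question would correspond to a grid entry of `[4;20;16;linear]`.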

albertoarenal · July 2017

Thank you Telcontar120, I will try this. It is very useful, and I really appreciate your help!
