X means always uses minimum cluster amount
Hi, I am pretty new to RapidMiner, so I apologize if this question is trivial. I am trying to use X-Means to cluster some text files. At first I was using K-Means, but I didn't know how many clusters to use, so I decided to try X-Means instead. However, the X-Means operator always picks the minimum number of clusters in the given range. This doesn't seem correct to me, so I'm wondering whether I have some settings wrong. Here are the settings I am using:
add cluster attribute is checked
k min: 2
k max: 60
measure types: NumericalMeasures
numerical measure: CosineSimilarity
clustering algorithm: KMeans
max runs: 100
max optimization steps: 100
I have 150 text files that I am trying to cluster; maybe I am not using enough? Any thoughts and tips would be greatly appreciated!
Answers
hello @lizzie_a_martin - welcome to the community. Can you please post your XML in this thread so we can see what you are doing? Instructions are on the right (see "Read Before Posting" #2).
Scott
Lindon Ventures
Thank you for the response! I'll try with more text files now. Here is my XML as requested. I'm not sure I'm doing what you're saying about normalizing; I assume that's another operator that I need?
[code]
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files (2)" width="90" x="112" y="34">
<list key="text_directories">
<parameter key="SampleSet" value="C:\Users\Lizzi\Desktop\Sample Data"/>
</list>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (3)" width="90" x="179" y="85"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases (3)" width="90" x="179" y="187"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="179" y="289"/>
<operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="179" y="391"/>
<operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="85">
<parameter key="min_chars" value="2"/>
<parameter key="max_chars" value="999"/>
</operator>
<connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
<connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="112" y="136"/>
<operator activated="true" class="x_means" compatibility="7.6.001" expanded="true" height="82" name="X-Means" width="90" x="313" y="187">
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="max_runs" value="100"/>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="7.6.001" expanded="true" height="82" name="Data to Similarity" width="90" x="313" y="85">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Process Documents from Files (2)" from_port="word list" to_port="result 2"/>
<connect from_op="Multiply" from_port="output 1" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/>
<connect from_op="X-Means" from_port="cluster model" to_port="result 3"/>
<connect from_op="X-Means" from_port="clustered set" to_port="result 4"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
[/code]
hello @lizzie_a_martin - thanks for posting your XML. It's hard to see exactly what you have without your test data file ("Sample Data") but I get the general idea.
What @Telcontar120 is saying is that, in order to treat all your attributes equally, you need to ensure that each attribute is on the same scale. If you had ages (say a range from 10-99) and word vectors (range 0-1), the ages would carry far more weight than the words. If you normalize the ages, they end up on a scale comparable to the others: z-scores (the usual choice) give each attribute a mean of 0 and a standard deviation of 1, while a range transformation maps it onto 0-1. The operator in RapidMiner is called "Normalize".
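To make the scaling point concrete, here is a small Python sketch (outside RapidMiner, standard library only) of the two common ways to rescale an attribute. The `ages` and `word_weights` values are made up for illustration, the helper names are hypothetical, and the use of the population standard deviation is an assumption; the Normalize operator may compute its statistics differently.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population std dev; an assumption for this sketch
    if sigma == 0:
        # A constant attribute carries no information; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def range_transform(values, lo=0.0, hi=1.0):
    """Rescale values linearly so the minimum maps to lo and the maximum to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


# Hypothetical example data: one attribute on a 10-99 scale, one on a 0-1 scale.
ages = [10, 25, 40, 99]
word_weights = [0.0, 0.2, 0.5, 1.0]

ages_z = z_transform(ages)        # centered on 0, contains negative values
ages_01 = range_transform(ages)   # squeezed into the same 0-1 range as word_weights

print("z-scores:", [round(v, 3) for v in ages_z])
print("0-1 range:", [round(v, 3) for v in ages_01])
```

Either transformation keeps the ages from dominating a distance calculation; which one fits better depends on the data and on the distance measure being used.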
As for your concern about k=2 being chosen as optimal, that does not surprise me at all.
Scott