X means always uses minimum cluster amount

lizzie_a_martin · November 2017

Hi, I am pretty new to RapidMiner, so I apologize if this question is trivial. I am trying to use X means to cluster some text files. At first I was using K means, but I didn't know how many clusters to use, so I decided to try X means instead. However, the X means operator always picks the minimum number of clusters in the given range. This doesn't seem correct to me, so I'm wondering whether some of my settings are wrong. Here are the settings I am using:

add cluster attribute is checked

k min: 2

k max: 60

measure types: NumericalMeasures

numerical measure: CosineSimilarity

clustering algorithm: KMeans

max runs: 100

max optimization steps: 100

I have 150 text files that I am trying to cluster; maybe that is not enough? Any thoughts and tips would be greatly appreciated!

sgenzer · November 2017

hello @lizzie_a_martin - welcome to the community. Can you please post your XML in this thread so we can see what you are doing? Instructions are on the right (see "Read Before Posting" #2).

Scott

Telcontar120 · November 2017

Assuming there isn't a problem with your process, it's probably because you don't have very many examples to cluster, or because they are simply too similar to one another, so X-means always falls back to the simplest clustering scheme. But you should also make sure that you've normalized the data beforehand, because clustering is sensitive to the absolute ranges of the attributes that go into the distance calculation. If you have any other attributes (other than the word vector created by TF-IDF), then differences in scale could be distorting the algorithm as well.

lizzie_a_martin · November 2017

Thank you for the response! I'll try with more text files now. Here is my XML code as asked. I'm not sure I'm doing what you're saying about normalizing; I assume that's another operator that I need?

[code]

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files (2)" width="90" x="112" y="34">
<list key="text_directories">
<parameter key="SampleSet" value="C:\Users\Lizzi\Desktop\Sample Data"/>
</list>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (3)" width="90" x="179" y="85"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases (3)" width="90" x="179" y="187"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="179" y="289"/>
<operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="179" y="391"/>
<operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="85">
<parameter key="min_chars" value="2"/>
<parameter key="max_chars" value="999"/>
</operator>
<connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
<connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="112" y="136"/>
<operator activated="true" class="x_means" compatibility="7.6.001" expanded="true" height="82" name="X-Means" width="90" x="313" y="187">
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="max_runs" value="100"/>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="7.6.001" expanded="true" height="82" name="Data to Similarity" width="90" x="313" y="85">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Process Documents from Files (2)" from_port="word list" to_port="result 2"/>
<connect from_op="Multiply" from_port="output 1" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/>
<connect from_op="X-Means" from_port="cluster model" to_port="result 3"/>
<connect from_op="X-Means" from_port="clustered set" to_port="result 4"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>

[/code]

sgenzer · November 2017

hello @lizzie_a_martin - thanks for posting your XML. It's hard to see exactly what you have without your test data file ("Sample Data") but I get the general idea.

What @Telcontar120 is saying is that, in order to look at all your attributes equally, you need to ensure that each attribute has the same "scale". If you had ages (say a range from 10-99) and then word vectors (range 0-1), then the ages would carry far more weight in the distance calculation than the words. But if you convert the ages to a normalized scale, they become comparable to the others. The usual choice is z-scores, which give each attribute a mean of 0 and a standard deviation of 1; a range transformation would instead map the values onto 0-1. The operator in RapidMiner is called "Normalize".
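
For reference, here is a minimal sketch of where Normalize could sit in the process posted above, between Process Documents from Files (2) and Multiply. The class name `normalize`, the `method` value, and the port names `example set input` / `example set output` are written from memory of RapidMiner 7.x, so treat them as assumptions to check against your own Studio version; the layout coordinates are arbitrary. The two connections shown would replace the existing direct connection from Process Documents from Files (2) to Multiply.

```xml
<!-- Sketch only: operator class, parameter key/value and port names are assumptions -->
<operator activated="true" class="normalize" compatibility="7.6.001" expanded="true" height="103" name="Normalize" width="90" x="112" y="85">
  <!-- Z-transformation rescales each attribute to mean 0 and standard deviation 1 -->
  <parameter key="method" value="Z-transformation"/>
</operator>
<!-- These two connections replace the direct link from the document operator to Multiply -->
<connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Multiply" to_port="input"/>
```

The safest way to get the exact XML is probably to drag Normalize onto the canvas in Studio, wire it up there, and then look at the XML panel, which shows the real parameter and port names.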

As for your concern about k=2 being chosen as optimal, it does not shock me at all.

Scott
