"Clustering - how to determine

radone · March 2011

Hello,
Could anyone point me to a way of doing unsupervised clustering when I am not sure how many clusters are present in the data (i.e. how to determine k for, e.g., k-means)?
Or is the best possible way to determine k visually (I have 13 attributes and the data might be quite noisy)?

Thanks for any suggestion,
radone

awchisholm · March 2011

Hello

Clustering always requires a human to look at and interpret the results, but the various cluster performance operators can give a helping hand.

Here's an example showing the Cluster Distance Performance operator producing measures for "average within centroid distance" and Davies-Bouldin as k is varied in a k-means clustering experiment. The example data in this case contains 1000 examples grouped into 8 neat clusters in a three-dimensional space. At the end of the experiment, look at the Log tab in the results and plot the two recorded measures as a function of k; you should see that something interesting happens at k = 8.

Fortunately, this corresponds to the "correct" answer, but in real life it won't be as easy. The characteristics of the input data, such as cluster shape, noise and data size, will determine which clustering approach to use and which performance measure is appropriate. Guidance is hard to give because a) it depends on the data and b) I probably don't know.

regards,

Andrew

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.004">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
    <parameter key="random_seed" value="-1"/>
    <process expanded="true" height="665" width="710">
      <operator activated="true" class="generate_data" compatibility="5.1.004" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
        <parameter key="target_function" value="gaussian mixture clusters"/>
        <parameter key="number_examples" value="1000"/>
        <parameter key="number_of_attributes" value="3"/>
      </operator>
      <operator activated="true" class="loop_parameters" compatibility="5.1.004" expanded="true" height="76" name="Loop Parameters" width="90" x="246" y="30">
        <list key="parameters">
          <parameter key="Clustering.k" value="[2.0;20;19;linear]"/>
        </list>
        <process expanded="true" height="665" width="710">
          <operator activated="true" class="k_means" compatibility="5.1.004" expanded="true" height="76" name="Clustering" width="90" x="45" y="30">
            <parameter key="k" value="20"/>
            <parameter key="max_runs" value="1000"/>
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="local_random_seed" value="2"/>
          </operator>
          <operator activated="true" class="cluster_distance_performance" compatibility="5.1.004" expanded="true" height="94" name="Performance" width="90" x="246" y="30">
            <parameter key="normalize" value="true"/>
          </operator>
          <operator activated="true" class="log" compatibility="5.1.004" expanded="true" height="76" name="Log" width="90" x="447" y="30">
            <list key="log">
              <parameter key="DaviesBouldin" value="operator.Performance.value.DaviesBouldin"/>
              <parameter key="avgWithinDistance" value="operator.Performance.value.avg_within_distance"/>
              <parameter key="k" value="operator.Clustering.parameter.k"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
          <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Loop Parameters" to_port="input 1"/>
      <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
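
For anyone who wants to try the same idea outside RapidMiner, here is a rough Python sketch of the sweep the process performs: cluster with k-means for a range of k and record the average within-centroid distance and the Davies-Bouldin index for each k. It is only an illustration under stated assumptions, not what the operators do internally. The data generator (blobs around the 8 corners of a cube), the function names, the k-means++ style seeding and the number of restarts are all made up for this example, and the measures follow the textbook definitions, so their sign and scaling may differ from what Cluster Distance Performance reports. It also uses 500 points rather than 1000 to keep pure Python quick.

```python
import math
import random

def make_blobs(rng, n=500, sigma=1.0):
    """Hypothetical stand-in data: n points around the 8 corners of a cube."""
    corners = [(x, y, z) for x in (0, 10) for y in (0, 10) for z in (0, 10)]
    return [tuple(rng.gauss(c, sigma) for c in rng.choice(corners)) for _ in range(n)]

def mean_point(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, rng, iters=50):
    # k-means++ style seeding: later centroids are drawn with probability
    # proportional to the squared distance from the nearest chosen centroid.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(math.dist(p, c) for c in centroids) ** 2 for p in points]
        centroids.append(rng.choices(points, weights=weights)[0])
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so we have converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = mean_point(members)
    return centroids, labels

def measures(points, centroids, labels):
    k = len(centroids)
    sums, counts = [0.0] * k, [0] * k
    for p, l in zip(points, labels):
        sums[l] += math.dist(p, centroids[l])
        counts[l] += 1
    # s[j]: mean distance of cluster j's members to its centroid
    s = [sums[j] / counts[j] if counts[j] else 0.0 for j in range(k)]
    avg_within = sum(sums) / len(points)
    # Davies-Bouldin: for each cluster take the worst (scatter sum / centroid
    # separation) ratio against any other cluster, then average over clusters.
    db = sum(
        max((s[i] + s[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k
    return avg_within, db

def sweep(points, ks, rng, restarts=3):
    results = {}
    for k in ks:
        runs = [measures(points, *kmeans(points, k, rng)) for _ in range(restarts)]
        results[k] = min(runs)  # keep the run with the smallest within-distance
    return results

rng = random.Random(2)
points = make_blobs(rng)
results = sweep(points, range(2, 13), rng)
for k, (avg_within, db) in results.items():
    print(f"k={k:2d}  avg_within={avg_within:.3f}  davies_bouldin={db:.3f}")
```

On clean, well-separated data like this, plotting either column against k should show a knee in the within-centroid distance and a dip in Davies-Bouldin near the true number of clusters, which is the same thing to look for in the Log tab. On noisy 13-attribute data the picture will likely be much less clear-cut.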
