Clustering - weird salary values

tibor_sebok · April 2017

Hi guys,

I am new to Rapidminer, my first little project is clustering some of our customers. Attributes are like age, salary, job position and the amount of credit they have. In order to have a faster process, I have "translated" nominal data (like employement, channel) to numerical data.

I have run the clustering model, and except salary I got seemingly coherent data.

Here is my centroid table:

Attributes	cluster_0	cluster_1	cluster_2	cluster_3	cluster_4	cluster_5
Channel	1.0	0.0	1.0	1.0	3.0	0.0
Accomodation	1.0	2.0	2.0	2.0	3.0	1.0
Education	1.0	1.0	1.0	1.0	1.0	1.0
Employement	2.0	2.0	3.0	1.0	1.0	2.0
Credit (pcs)	4.0	1.0	2.0	1.0	2.0	2.0
Age	36.0	40.0	44.0	65.0	78.0	41.0
Salary	3417.0	4174.0	3100.0	79.0	226.0	8601.0
Credit (vol)	1181014.0	0.0	8690185.0	0.0	362750.0	3622658.0

What surprised me was the values in the Salary row - they should be much higher than that, the average salary in my data table for these customers are well above 188 k.

My question is: is there somethig I am missing or am I interpreteing the data wrong?

Thanks for the answers!

Tibor

Thomas_Ott · April 2017

Well yes, something sounds weird but without seeing your data and process it's hard to guess here.

Question: how did you convert your nominal values into numerical? I'm assuming you're using K-means. If you select mixed measures you can use the Nominal and Numerical values without modifying them.

tibor_sebok · April 2017

Hi T-Bone,

First of all, thanks for the quick answer. The conversion of nominal data to numerical was inspired by the fact that in my experience it took longer for Rapidminer to run the process, but next time I will try your suggestion.

The xml of my process is down below:

?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="701" width="955">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="48" y="422">
<parameter key="repository_entry" value="Elemzés - RCC KHR adatok 201704 adatok"/>
</operator>
<operator activated="false" class="sample" expanded="true" height="76" name="Sample" width="90" x="179" y="480">
<parameter key="sample_size" value="21000"/>
</operator>
<operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="390">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="JOVEDELEM|LAKHELY_ERTEK|ISKOLAZOTTSAG_ERTEK|ELETKOR|ALKALMAZOTTI_ERTEK|KULSO_HITEL_OSSZESEN|KULSO_HITEL_DB|CSATORNA"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="75">
<list key="columns"/>
</operator>
<operator activated="true" class="k_medoids" expanded="true" height="76" name="Clustering" width="90" x="581" y="255">
<parameter key="k" value="6"/>
<parameter key="max_runs" value="2"/>
<parameter key="max_optimization_steps" value="5"/>
</operator>
<operator activated="true" class="extract_prototypes" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="514" y="75"/>
<connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
<connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
<connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

And I have attached an xlsx file with sample data.

Thanks!

Tibor

tibor_sebok · April 2017

Hi T-Bone!

Thanks for the quick answer! Here is the script of my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="701" width="955">
<operator activated="false" class="sample" expanded="true" height="76" name="Sample" width="90" x="179" y="480">
<parameter key="sample_size" value="21000"/>
</operator>
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="390">
<parameter key="repository_entry" value="Data- RapidMiner"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="75">
<list key="columns"/>
</operator>
<operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="390">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Channel|Salary|Accomodation|Education|Age|Employment|Credit vol|Credit (pcs)"/>
</operator>
<operator activated="true" class="k_medoids" expanded="true" height="76" name="Clustering" width="90" x="581" y="255">
<parameter key="k" value="6"/>
<parameter key="max_runs" value="2"/>
<parameter key="max_optimization_steps" value="5"/>
</operator>
<operator activated="true" class="extract_prototypes" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="514" y="75"/>
<connect from_op="Retrieve (2)" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
<connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
<connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

And I've attached a CSV file with my data.

Thanks!

Tibor

Thomas_Ott · April 2017

Ah, you have a problem in your salary data. It comes in as Polynominal data format not Numeric. When I tried to parse the numbers, I see that there are errant "," in them. So I did a FE to filter out those errors and then a Parse #.

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="false" class="sample" compatibility="5.0.000" expanded="true" height="82" name="Sample" width="90" x="179" y="480">
        <parameter key="sample_size" value="21000"/>
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Data - Rapidminer" width="90" x="112" y="34">
        <parameter key="repository_entry" value="../data/Data - Rapidminer"/>
      </operator>
      <operator activated="true" class="replace_missing_values" compatibility="5.0.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="246" y="34">
        <list key="columns"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.0.000" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Channel|Salary|Accomodation|Education|Age|Employment|Credit vol|Credit (pcs)"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.4.000" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="34">
        <parameter key="invert_filter" value="true"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Salary.contains.,"/>
        </list>
      </operator>
      <operator activated="true" class="parse_numbers" compatibility="7.4.000" expanded="true" height="82" name="Parse Numbers" width="90" x="648" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Salary"/>
      </operator>
      <operator activated="true" class="k_medoids" compatibility="5.0.000" expanded="true" height="82" name="Clustering" width="90" x="782" y="34">
        <parameter key="k" value="6"/>
        <parameter key="max_runs" value="2"/>
        <parameter key="max_optimization_steps" value="5"/>
      </operator>
      <operator activated="true" class="extract_prototypes" compatibility="5.0.000" expanded="true" height="82" name="Extract Cluster Prototypes" width="90" x="916" y="34"/>
      <connect from_op="Retrieve Data - Rapidminer" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/>
      <connect from_op="Parse Numbers" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
      <connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
      <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Telcontar120 · April 2017

@tibor_sebok keep in mind that you probably want to add a "normalize" operator to this process as well. The K-means algorithm family is sensitive to the absolute scale of the attributes, and with mixed measures in particular, you have multiple attributes with very different ranges of raw data. Currently it will be dominated by the attributes with the highest absolute variance.

Thomas_Ott · April 2017

Yep @Telcontar120 is right, you might want to use a Normalize operator. On seperate note, you might want to try a Sample operator. I just realized that crunching through your entire dataset is taking some time.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Clustering - weird salary values

Answers