Clustering - weird salary values
Hi guys,
I am new to Rapidminer, my first little project is clustering some of our customers. Attributes are like age, salary, job position and the amount of credit they have. In order to have a faster process, I have "translated" nominal data (like employement, channel) to numerical data.
I have run the clustering model, and except salary I got seemingly coherent data.
Here is my centroid table:
Attributes | cluster_0 | cluster_1 | cluster_2 | cluster_3 | cluster_4 | cluster_5 |
Channel | 1.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 |
Accomodation | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 1.0 |
Education | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
Employement | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
Credit (pcs) | 4.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 |
Age | 36.0 | 40.0 | 44.0 | 65.0 | 78.0 | 41.0 |
Salary | 3417.0 | 4174.0 | 3100.0 | 79.0 | 226.0 | 8601.0 |
Credit (vol) | 1181014.0 | 0.0 | 8690185.0 | 0.0 | 362750.0 | 3622658.0 |
What surprised me was the values in the Salary row - they should be much higher than that, the average salary in my data table for these customers are well above 188 k.
My question is: is there somethig I am missing or am I interpreteing the data wrong?
Thanks for the answers!
Tibor
Answers
Well yes, something sounds weird but without seeing your data and process it's hard to guess here.
Question: how did you convert your nominal values into numerical? I'm assuming you're using K-means. If you select mixed measures you can use the Nominal and Numerical values without modifying them.
Hi T-Bone,
First of all, thanks for the quick answer. The conversion of nominal data to numerical was inspired by the fact that in my experience it took longer for Rapidminer to run the process, but next time I will try your suggestion.
The xml of my process is down below:
?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="701" width="955">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="48" y="422">
<parameter key="repository_entry" value="Elemzés - RCC KHR adatok 201704 adatok"/>
</operator>
<operator activated="false" class="sample" expanded="true" height="76" name="Sample" width="90" x="179" y="480">
<parameter key="sample_size" value="21000"/>
</operator>
<operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="390">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="JOVEDELEM|LAKHELY_ERTEK|ISKOLAZOTTSAG_ERTEK|ELETKOR|ALKALMAZOTTI_ERTEK|KULSO_HITEL_OSSZESEN|KULSO_HITEL_DB|CSATORNA"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="75">
<list key="columns"/>
</operator>
<operator activated="true" class="k_medoids" expanded="true" height="76" name="Clustering" width="90" x="581" y="255">
<parameter key="k" value="6"/>
<parameter key="max_runs" value="2"/>
<parameter key="max_optimization_steps" value="5"/>
</operator>
<operator activated="true" class="extract_prototypes" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="514" y="75"/>
<connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
<connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
<connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
And I have attached an xlsx file with sample data.
Thanks!
Tibor
Hi T-Bone!
Thanks for the quick answer! Here is the script of my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="701" width="955">
<operator activated="false" class="sample" expanded="true" height="76" name="Sample" width="90" x="179" y="480">
<parameter key="sample_size" value="21000"/>
</operator>
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="390">
<parameter key="repository_entry" value="Data- RapidMiner"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="75">
<list key="columns"/>
</operator>
<operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="390">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Channel|Salary|Accomodation|Education|Age|Employment|Credit vol|Credit (pcs)"/>
</operator>
<operator activated="true" class="k_medoids" expanded="true" height="76" name="Clustering" width="90" x="581" y="255">
<parameter key="k" value="6"/>
<parameter key="max_runs" value="2"/>
<parameter key="max_optimization_steps" value="5"/>
</operator>
<operator activated="true" class="extract_prototypes" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="514" y="75"/>
<connect from_op="Retrieve (2)" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
<connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
<connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
And I've attached a CSV file with my data.
Thanks!
Tibor
Ah, you have a problem in your salary data. It comes in as Polynominal data format not Numeric. When I tried to parse the numbers, I see that there are errant "," in them. So I did a FE to filter out those errors and then a Parse #.
@tibor_sebok keep in mind that you probably want to add a "normalize" operator to this process as well. The K-means algorithm family is sensitive to the absolute scale of the attributes, and with mixed measures in particular, you have multiple attributes with very different ranges of raw data. Currently it will be dominated by the attributes with the highest absolute variance.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Yep @Telcontar120 is right, you might want to use a Normalize operator. On seperate note, you might want to try a Sample operator. I just realized that crunching through your entire dataset is taking some time.