Cross Distances operator: weird results
Hi,
I am opening a dedicated topic for a question that went unanswered in a previous topic.
In that previous topic, the goal was to calculate the similarity between "employee characteristics" and "a position".
I decided to use the Cross Distances operator, but I got weird results:
The calculated similarity is always the same, regardless of the "position" and the "employee characteristics".
I ran several tests without success, and the problem keeps running through my mind.
NB: I used the Read Excel operator to import my example sets.
You can find my process here:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Employees" width="90" x="45" y="85">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\HR_Sourcing\Employees.xlsx"/>
<parameter key="imported_cell_range" value="A1:F5"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Id_employee.true.integer.id"/>
<parameter key="1" value="name.true.polynominal.attribute"/>
<parameter key="2" value="skills.true.polynominal.attribute"/>
<parameter key="3" value="department.true.polynominal.attribute"/>
<parameter key="4" value="language.true.polynominal.attribute"/>
<parameter key="5" value="experience.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="85"/>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Id_employee|department|experience|language|skills"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="313" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="name|Id_employee"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Position" width="90" x="45" y="238">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\HR_Sourcing\Employees.xlsx"/>
<parameter key="sheet_number" value="2"/>
<parameter key="imported_cell_range" value="A1:E2"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Id_position.true.integer.id"/>
<parameter key="1" value="skills.true.polynominal.attribute"/>
<parameter key="2" value="department.true.polynominal.attribute"/>
<parameter key="3" value="language.true.polynominal.attribute"/>
<parameter key="4" value="experience.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="179" y="238">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="department|experience|language|skills|Id_position"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" breakpoints="before" class="cross_distances" compatibility="8.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="447" y="85">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="compute_similarities" value="true"/>
</operator>
<operator activated="true" class="rename" compatibility="8.0.001" expanded="true" height="82" name="Rename" width="90" x="581" y="85">
<parameter key="old_name" value="document"/>
<parameter key="new_name" value="Employee"/>
<list key="rename_additional_attributes">
<parameter key="request" value="position"/>
<parameter key="distance" value="similarity"/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role (3)" width="90" x="715" y="85">
<parameter key="attribute_name" value="Employee"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="join" compatibility="8.0.001" expanded="true" height="82" name="Join" width="90" x="849" y="136">
<list key="key_attributes"/>
</operator>
<connect from_op="Employees" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (3)" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Select Attributes (3)" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Position" from_port="output" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Cross Distances" from_port="result set" to_op="Rename" to_port="example set input"/>
<connect from_op="Cross Distances" from_port="request set" to_port="result 3"/>
<connect from_op="Cross Distances" from_port="reference set" to_port="result 1"/>
<connect from_op="Rename" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
My (fictitious) example sets can be downloaded by following this link:
https://drive.google.com/open?id=18JFovsp_pk7l-1SNx-oeywdwzVSeG-r0
Is it a bug? If not, can you tell me what I missed or forgot?
Thank you for your responses,
Regards,
Lionel
Answers
I wasn't able to retrieve your dataset to check this, but if your attributes are a mix of nominal and numerical, then the distance metric will be "Mixed Euclidean", which sets the difference for a nominal attribute to 1 if the two values are not the same and 0 if they are the same. That can often lead to identical distances regardless of the specific values involved.
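To make the description above concrete, here is a minimal Python sketch of a mixed Euclidean distance. It is only an illustration of the rule as described (0/1 for nominal attributes, squared difference for numerical ones), not RapidMiner's actual implementation, and the records are hypothetical values that do not come from the original spreadsheets.

```python
import math

def mixed_euclidean(a, b):
    """Distance between two records given as dicts of attribute -> value.

    Assumption: a nominal (string) attribute contributes 0 when both values
    are equal and 1 otherwise; a numerical attribute contributes its
    squared difference.
    """
    total = 0.0
    for key in a:
        x, y = a[key], b[key]
        if isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0
        else:
            total += (x - y) ** 2
    return math.sqrt(total)

# Hypothetical records, not taken from the original example sets.
position = {"skills": "python", "department": "engineering",
            "language": "english", "experience": 3}
same = dict(position)
other_a = {"skills": "java", "department": "sales",
           "language": "french", "experience": 3}
other_b = {"skills": "sql", "department": "finance",
           "language": "german", "experience": 3}

print(mixed_euclidean(position, same))     # identical records
print(mixed_euclidean(position, other_a))  # all three nominals differ
print(mixed_euclidean(position, other_b))  # other values, same distance
```

Under these assumptions, two identical records are at distance 0, while any two records that differ on the same number of nominal attributes end up at the same distance, whatever the actual values are.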
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi,
Thank you for your feedback @Telcontar120. Indeed, my four attributes are a mix of nominal (3) and numerical (1):
- skills, department and language: nominal
- experience: numerical
As proposed, I used "MixedEuclideanDistance". However, when a position and an employee's characteristics are strictly the same, the distance is different from 0 (here, Id_employee = 3): it seems that RapidMiner doesn't detect that the nominal attributes are equal in the position and in the employee characteristics.
Here are the employee characteristics from my example set:
Here is the position:
And here are the results:
NB: My nominal attributes are imported as "Nominal" via the Read Excel operator.
What have I missed or forgotten?
Thank you,
Regards,
Lionel
OK, agreed, that looks unusual! Did you make sure to "Trim" your nominals? It could be that errant (and invisible) leading or trailing spaces are causing a mismatch where the values look like they should match. Another point is to make sure the spelling is exactly the same on the nominal attributes (I noticed, for example, that "engineering" is misspelled in the examples you have shown, but maybe it is not misspelled everywhere?). Other than that, I have no idea why you would get the results you are seeing. Maybe @mschmitz has an idea?
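As a small illustration of the whitespace point, the Python sketch below shows how a trailing space makes two nominal values compare as different even though they look the same on screen. The values are hypothetical, and `trim` is just a helper name chosen for this example; it is not the RapidMiner Trim operator.

```python
def trim(value):
    """Remove leading and trailing whitespace from a nominal value."""
    return value.strip()

# Hypothetical values: the first one carries an invisible trailing space.
employee_department = "engineering "
position_department = "engineering"

print(employee_department == position_department)              # raw comparison
print(trim(employee_department) == trim(position_department))  # after trimming
print(repr(employee_department))  # repr() makes the stray space visible
```

Printing values with `repr()` (or an equivalent) is a quick way to check whether such stray characters are present at all.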
Hi,
Thank you for your feedback @Telcontar120.
1. I didn't know the Trim operator, but unfortunately it does not change the results of the process.
2. The (mis)spelling (I don't speak English fluently...) is strictly the same on the nominal attributes in the position and in the employee characteristics: I copied and pasted between the two example sets.
Best regards,
Lionel
Hi,
I took a look at your process from above, where the chosen Similarity Measure is "Cosine Similarity", which is a plain numeric measure.
So, RapidMiner would be right to compute the similarity using only the single numerical attribute. However, that doesn't match what we see there, and the cosine would also not really differ if we have just one axis.
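The remark about a single axis can be checked with a few lines of Python. This is a generic cosine similarity written for illustration, with hypothetical experience values; it says nothing about how Cross Distances selects its attributes internally.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical one-dimensional vectors: only "experience" is numerical.
position_experience = [3]
for years in (1, 2, 5, 10):
    print(years, cosine_similarity([years], position_experience))
```

With one positive attribute, every vector points in the same direction, so the similarity is 1 for every pair and carries no information about how close the experience values are.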
To do it correctly you will need to change the nominal attributes into numerical ones. Use Dummy Encoding if you want to use Cosine Similarity.
You can try with mixed Euclidean as well; the experience attribute might then dominate the distance, as its possible distances range from 0 to 4 while all the others range from 0 to 1.
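To see how the numerical attribute can dominate, the sketch below lists the squared contribution of each attribute to a mixed Euclidean distance. It assumes the 0/1 rule for nominal attributes and the raw squared difference for numerical ones; the records are hypothetical and the function name is chosen for this example only.

```python
def squared_contributions(a, b):
    """Per-attribute squared contribution to a mixed Euclidean distance.

    Assumption: nominal (string) values contribute 0 or 1, numerical
    values contribute their squared difference without any scaling.
    """
    contributions = {}
    for key in a:
        x, y = a[key], b[key]
        if isinstance(x, str):
            contributions[key] = 0.0 if x == y else 1.0
        else:
            contributions[key] = float((x - y) ** 2)
    return contributions

# Hypothetical records.
position = {"skills": "python", "department": "engineering",
            "language": "english", "experience": 1}
good_profile = {"skills": "python", "department": "engineering",
                "language": "english", "experience": 5}
poor_profile = {"skills": "java", "department": "sales",
                "language": "french", "experience": 1}

print(squared_contributions(position, good_profile))
print(squared_contributions(position, poor_profile))
```

Here the employee who matches on all three nominal attributes but differs by 4 years of experience accumulates 16, while the employee who matches on nothing except experience accumulates only 3, so the second one would be ranked as closer.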
Greetings,
Sebastian
Hi,
Thank you for your feedback @land.
I tried the process with dummy encoding of the nominal attributes.
But RapidMiner doesn't compute the distances/similarities.
I think it's because the number of attributes differs between the Employee characteristics example set and the Position example set:
1. Here is the "dummy encoded" Position example set:
2. And here is the "dummy encoded" Employee characteristics example set:
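The mismatch in the number of attributes can be reproduced outside RapidMiner. The Python sketch below dummy-encodes two hypothetical example sets, first separately and then against a shared list of categories. The function names and data are assumptions made for this illustration; how to achieve the same alignment inside a RapidMiner process is not covered here.

```python
def collect_categories(row_sets, nominal_attributes):
    """Sorted union of the values seen per attribute over all row sets."""
    return {
        attr: sorted({row[attr] for rows in row_sets for row in rows})
        for attr in nominal_attributes
    }

def dummy_encode(rows, nominal_attributes, categories):
    """One-hot encode rows using a fixed category list per attribute."""
    encoded = []
    for row in rows:
        vector = {}
        for attr in nominal_attributes:
            for cat in categories[attr]:
                vector[f"{attr} = {cat}"] = 1 if row[attr] == cat else 0
        encoded.append(vector)
    return encoded

nominal = ["skills", "department", "language"]
# Hypothetical data, not the original spreadsheets.
employees = [
    {"skills": "python", "department": "engineering", "language": "english"},
    {"skills": "java", "department": "sales", "language": "french"},
]
positions = [
    {"skills": "python", "department": "engineering", "language": "english"},
]

# Encoded separately, each set only knows its own categories.
emp_alone = dummy_encode(employees, nominal, collect_categories([employees], nominal))
pos_alone = dummy_encode(positions, nominal, collect_categories([positions], nominal))
print(len(emp_alone[0]), len(pos_alone[0]))  # column counts differ

# Encoded against the union of categories, both sets share the same columns.
shared = collect_categories([employees, positions], nominal)
emp_enc = dummy_encode(employees, nominal, shared)
pos_enc = dummy_encode(positions, nominal, shared)
print(list(emp_enc[0]) == list(pos_enc[0]))
```

The idea is simply that both example sets need the same columns in the same order before any numerical measure can compare them row by row.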
What do you think?
Concerning mixed Euclidean, I tried it and, as I said in the previous post, I don't understand why, for a Position and employee characteristics that are strictly the same, the associated distance is different from "0".
Best regards,
Lionel
@land I agree with @lionelderkrikor here. After looking at his examples, regardless of the distance metric used, I cannot understand why the cross-distance would be greater than 0 if all the attributes have the same values. Can you clarify? Or perhaps @sgenzer can ask one of the developers to take a look at this in more detail?
hi all -
so I played with this a bit and there is something wonky with the two excel docs coming in. If you simply multiply one sheet and then filter out one row as reference, it works just fine:
Scott
Hi @sgenzer,
Thank you for your feedback.
Unfortunately, the problem does not come from the Excel files.
Indeed, with CSV files (see attached files), the results of the process are strictly the same as with the Excel files.
But thanks to your test, we can tentatively conclude that the problem comes from the Ids in the files.
Indeed, that's the only difference between your test process and my process (and the only difference between the Employee example set and the Position example set in my process).
So is there any possibility that the Ids are taken into account in the calculation of the similarity/distances?
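To illustrate what this hypothesis would imply, here is a small Python sketch with hypothetical values. It only shows the arithmetic effect of treating an Id as a regular attribute; whether Cross Distances actually does so is not established in this thread.

```python
import math

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hypothetical rows: identical feature values, different Ids.
employee = {"id": 3, "features": [1, 0, 1, 4]}
position = {"id": 1, "features": [1, 0, 1, 4]}

without_id = euclidean(employee["features"], position["features"])
with_id = euclidean([employee["id"]] + employee["features"],
                    [position["id"]] + position["features"])

print(without_id)  # 0.0
print(with_id)     # 2.0
```

If the Ids were part of the computation, two otherwise identical rows would get a non-zero distance equal to the difference between their Ids, which is one way to test the hypothesis on the real data.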
Thank you for your response.
Best regards,
Lionel