Re: Cross Distances operator : Weird results
Hi Guys,
The fact that it works if the data comes from two different sources strongly suggests a problem with the internal representation of nominal values as numerical ids. Since this is a serious problem and we need to be sure that distance calculation works as expected, I took a look at the code of version 7.5, which I have at hand, and I can confirm that the problem persists in 8.0.
The bottom line is: the numerical distance measures are broken because they are no longer initialized correctly. Their init method is never called any more, so they treat every single attribute as numerical. As a result, they also calculate a cosine similarity on the nominal attributes, using the internal ids of the nominal values.
As these ids are arbitrary and, in particular, can change when another data set is loaded, the results can be arbitrary as well. The original init method checked that there are no nominal attributes and otherwise raised a UserError, aborting the process. This check was lost when a new init method was written in a super class that no longer calls this part of the original code.
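To make the effect concrete, here is a minimal, self-contained sketch (plain Java, not RapidMiner code). It assumes that nominal values get their ids in order of first appearance and, for simplicity, that both attributes share one value set; the class and method names are made up for illustration. The same nominal row is encoded with two different id mappings and then compared with cosine similarity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NominalIdCosineDemo {

    // Assumption: ids are assigned in order of first appearance (0, 1, 2, ...).
    static Map<String, Integer> buildMapping(List<String> valuesInLoadOrder) {
        Map<String, Integer> mapping = new LinkedHashMap<>();
        for (String value : valuesInLoadOrder) {
            mapping.putIfAbsent(value, mapping.size());
        }
        return mapping;
    }

    // Replaces each nominal value by its internal id, as the broken measure effectively does.
    static double[] encode(String[] row, Map<String, Integer> mapping) {
        double[] encoded = new double[row.length];
        for (int i = 0; i < row.length; i++) {
            encoded[i] = mapping.get(row[i]);
        }
        return encoded;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        String[] row = {"red", "blue"};
        Map<String, Integer> mappingA = buildMapping(List.of("red", "blue")); // red=0, blue=1
        Map<String, Integer> mappingB = buildMapping(List.of("blue", "red")); // blue=0, red=1

        // Identical row, identical mapping: similarity 1.0
        System.out.println(cosine(encode(row, mappingA), encode(row, mappingA)));
        // Identical row, different mapping: similarity 0.0
        System.out.println(cosine(encode(row, mappingA), encode(row, mappingB)));
    }
}
```

The two rows are the same in terms of their nominal values, yet the similarity drops from 1.0 to 0.0 only because the ids differ, which is the kind of arbitrary result described above.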
I would recommend a quick fix from the RapidMiner side, as this produces WRONG results, which is even worse than an exception. @sgenzer This would even be worth a hot fix 8.1.001, what do you think?
It simply requires that the new method:
public DistanceMeasureConfig init(Attributes firstSetAttributes, Attributes secondSetAttributes)
calls the old method or does what the old method does, which correctly does the checks:
public void init(ExampleSet exampleSet) throws OperatorException
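As a rough sketch of that idea, assuming nothing about the real classes: the types below (Attr, NominalAttributeException) are hypothetical stand-ins, not RapidMiner's Attributes, ExampleSet or UserError, and the real fix has to be made in the actual super class. The point is only that the two-set init runs the nominal check on both attribute sets before anything else.

```java
import java.util.List;

class InitGuardSketch {

    // Hypothetical stand-in for an attribute; not RapidMiner's Attribute class.
    static final class Attr {
        final String name;
        final boolean nominal;

        Attr(String name, boolean nominal) {
            this.name = name;
            this.nominal = nominal;
        }
    }

    // Hypothetical stand-in for the UserError raised by the old init method.
    static final class NominalAttributeException extends Exception {
        NominalAttributeException(String attributeName) {
            super("Numerical measure cannot handle nominal attribute: " + attributeName);
        }
    }

    // Plays the role of the check the old init(ExampleSet) performed.
    static void checkAllNumerical(List<Attr> attributes) throws NominalAttributeException {
        for (Attr attribute : attributes) {
            if (attribute.nominal) {
                throw new NominalAttributeException(attribute.name);
            }
        }
    }

    // Plays the role of the new two-set init: validate both sets first.
    static void init(List<Attr> firstSet, List<Attr> secondSet) throws NominalAttributeException {
        checkAllNumerical(firstSet);
        checkAllNumerical(secondSet);
        // ... the measure-specific configuration would follow here ...
    }

    public static void main(String[] args) {
        List<Attr> numerical = List.of(new Attr("a1", false), new Attr("a2", false));
        List<Attr> mixed = List.of(new Attr("a1", false), new Attr("color", true));
        try {
            init(numerical, mixed);
        } catch (NominalAttributeException e) {
            // The process is aborted instead of silently producing wrong distances.
            System.out.println(e.getMessage());
        }
    }
}
```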
A simple process showing that nominal values are still treated as numerical ones:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_nominal_data" compatibility="8.0.001" expanded="true" height="68" name="Generate Nominal Data" width="90" x="112" y="34"/>
<operator activated="true" class="generate_nominal_data" compatibility="8.0.001" expanded="true" height="68" name="Generate Nominal Data (2)" width="90" x="112" y="136"/>
<operator activated="true" class="cross_distances" compatibility="8.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="514" y="85">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="only_top_k" value="true"/>
<parameter key="k" value="1"/>
<parameter key="compute_similarities" value="true"/>
</operator>
<connect from_op="Generate Nominal Data" from_port="output" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Generate Nominal Data (2)" from_port="output" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
How you can work around the problem for now: remove any nominal attributes before calculating a numerical distance measure. If you need to incorporate them, first transform them into a dummy encoding by applying the Nominal to Numerical operator to the large (reference) data set. Then apply the resulting preprocessing model (3rd purple port) to the request data set using Apply Model.
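The reason for building the encoding on the reference set and reusing it on the request set can be sketched in plain Java. This is an illustration of the principle only, not of what the Nominal to Numerical operator does internally; the names fit and apply are made up, and mapping unseen values to all zeros is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class DummyEncodingSketch {

    // "Fit" on the reference set: collect the distinct values in a fixed (sorted) order.
    static List<String> fit(List<String> referenceColumn) {
        return new ArrayList<>(new TreeSet<>(referenceColumn));
    }

    // "Apply" to any data set: one 0/1 column per value learned from the reference set.
    // Assumption of this sketch: a value not seen in the reference set becomes all zeros.
    static double[] apply(List<String> categories, String value) {
        double[] row = new double[categories.size()];
        int index = categories.indexOf(value);
        if (index >= 0) {
            row[index] = 1.0;
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> categories = fit(List.of("red", "blue", "red", "green"));
        System.out.println(categories); // [blue, green, red]

        // Both sets use the same columns, so "red" means the same thing in each.
        System.out.println(Arrays.toString(apply(categories, "red")));    // [0.0, 0.0, 1.0]
        System.out.println(Arrays.toString(apply(categories, "purple"))); // [0.0, 0.0, 0.0]
    }
}
```

Because the columns are fixed once from the reference set, the encoded vectors no longer depend on whatever internal ids each data set happened to receive.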
Greetings,
Sebastian
Comments
thanks, @land. Much appreciated. Pushing to Product Feedback.
Scott
Pushed to Dev Team.
rest of request is here: https://community.rapidminer.com/t5/Product-Feedback/Re-Cross-Distances-operator-Weird-results/idc-p/47368#M232
dev team Jira ticket RM-3522 created. Will update when available.
After a little test, it seems that the bug is fixed (here in RM 9.4):
The process is in the attached file.
Thanks to @land and the dev team for solving this issue.
Regards,
Lionel