Reverse map a nominal to numerical transform
I am using K-means to cluster my data. To do so, I transformed my nominal values into numerical ones using the Nominal to Numerical operator, with the coding type parameter set to "unique integers." How do I reverse this transformation so that the output shows what these values were before they were transformed? For example, if "sandwich" gets mapped to 0, I would like to map 0 back to "sandwich".
Best Answer
FBT Member Posts: 106 Unicorn
It may not be the most elegant solution, but what you could do is the following:
Multiply your example set prior to the type conversion. Connect the first output of the Multiply operator to your current process, then add a Join operator after it and connect the resulting example set to the left port. Connect the second output of Multiply to the right port of the Join.
You will need an ID attribute on which to make the join, and you may want to do some pre-processing (renaming attributes, etc.).
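For readers who find it easier to follow outside the process view, here is a rough sketch of the same idea in plain Python. It is not RapidMiner code. The attribute names (`food`, `food_code`, `cluster`) and the cluster labels are made up for illustration, and the integer coding simply numbers values in order of first appearance, which is an assumption about how "unique integers" behaves.

```python
# Original example set: each row has an id and a nominal attribute.
original = [
    {"id": 1, "food": "sandwich"},
    {"id": 2, "food": "bread"},
    {"id": 3, "food": "sandwich"},
    {"id": 4, "food": "butter"},
]

# Branch 1 (stands in for Nominal to Numerical with unique integers):
# give each distinct value an integer in order of first appearance.
codes = {}
encoded = []
for row in original:
    code = codes.setdefault(row["food"], len(codes))
    encoded.append({"id": row["id"], "food_code": code})

# Placeholder for the clustering step: hypothetical labels keyed by id.
clusters = {1: "cluster_0", 2: "cluster_1", 3: "cluster_0", 4: "cluster_1"}
clustered = [dict(row, cluster=clusters[row["id"]]) for row in encoded]

# Branch 2 is the untouched copy from Multiply; join it back on id
# so the original nominal value sits next to the cluster label.
by_id = {row["id"]: row for row in original}
joined = [dict(row, food=by_id[row["id"]]["food"]) for row in clustered]

# Alternative without a join: invert the code table directly.
reverse = {code: value for value, code in codes.items()}

for row in joined:
    print(row["id"], row["cluster"], row["food_code"], row["food"])
```

The join is the more robust of the two, since it does not depend on knowing how the integers were assigned. The inverted table only works if you can get at the mapping that was actually used.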
Answers
Thanks, that works. I would never have thought of it.
Be very careful with "unique integers" mapping if your nominal categories are not inherently ordinal. For example, if you have sandwich, bread, and butter mapped as 1, 2, and 3, then k-means thinks that the distance between 1 and 3 is larger than the distance between 1 and 2 or 2 and 3. But for non-ordered categories, this doesn't make any sense and can lead to strange and distorted results when clustering. If your nominal categories are not ordered, you are better off with numerical dummy coding or simply using mixed Euclidean distance (which assumes a distance of 1 between all nominal values that are not the same, precisely to avoid this problem).
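To make the distortion concrete, here is a small Python sketch comparing the three options on the sandwich, bread, and butter example. The function names are hypothetical, and `mixed_euclidean` is only an illustration of the rule described above (a contribution of 1 for nominal values that differ), not RapidMiner's actual implementation.

```python
import math

# Integer coding from the example above: sandwich=1, bread=2, butter=3.
codes = {"sandwich": 1, "bread": 2, "butter": 3}

def integer_distance(a, b):
    """Distance k-means effectively sees under unique-integer coding."""
    return abs(codes[a] - codes[b])

def dummy_vector(value):
    """One 0/1 column per category (dummy coding)."""
    return [1.0 if value == name else 0.0 for name in codes]

def dummy_distance(a, b):
    """Euclidean distance between two dummy-coded values."""
    va, vb = dummy_vector(a), dummy_vector(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

def mixed_euclidean(x, y):
    """Sketch of a mixed distance: numerical attributes contribute their
    squared difference, nominal (string) attributes contribute 0 if equal
    and 1 otherwise."""
    total = 0.0
    for u, v in zip(x, y):
        if isinstance(u, str) or isinstance(v, str):
            total += 0.0 if u == v else 1.0
        else:
            total += (u - v) ** 2
    return math.sqrt(total)

# Integer coding: sandwich looks twice as far from butter as from bread.
print(integer_distance("sandwich", "bread"), integer_distance("sandwich", "butter"))

# Dummy coding: every pair of different categories is equally far apart.
print(dummy_distance("sandwich", "bread"), dummy_distance("sandwich", "butter"))

# Mixed distance: same property, without adding extra columns.
print(mixed_euclidean(("sandwich", 3.0), ("butter", 3.0)))
```

With dummy coding and with the mixed distance, any two different categories are the same distance apart, which is what you want for unordered values.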
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thanks. I originally used dummy coding, but it blows up the record, as I have lots of unordered nominal values. I will try using mixed Euclidean distance. How does one use this?
You could use effect coding too, assuming you don't have too many nominal values per attribute.
Never mind, I figured out how to use mixed Euclidean distance.
Thanks!
Is there a currently accepted solution in the latest version of the program?
How can this be done in 2020?
Does the methodology mentioned above still work?
If possible, please provide the diagram!