"Clustering and Normalization"
Dear All,
I have a dataset which consists of 20 numeric variables.
I would like to apply a z-score transformation to all variables: I use the Normalization operator and everything is fine up to this point.
The problem now is that I want to de-normalize the values of all 20 fields back to the original values, so that the cluster values make sense.
1) Is there a node to do this for all 20 fields?
2) If not, can someone provide an example of how to do it for a single field only?
Thanks!
Answers
The only hint I can give you is to use AttributeConstruction. Unfortunately you have to include the mean and stdev manually.
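For what it's worth, the arithmetic you would put into AttributeConstruction is just the inverse of the z-score transform. A minimal sketch of that math in plain Python (not RM syntax, and with a made-up column), just to show the formula:

```python
import numpy as np

# toy stand-in for one of the 20 numeric attributes
x = np.array([4.0, 7.0, 1.0, 9.0, 5.0])

mu, sigma = x.mean(), x.std()   # the mean and stdev you have to supply manually
z = (x - mu) / sigma            # z-score normalization
x_back = z * sigma + mu         # de-normalization: value = z * stdev + mean

assert np.allclose(x, x_back)   # round-trips back to the original values
```

So per attribute the expression is simply `value * stdev + mean`.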
regards,
Steffen
The nice thing about RM is that you can do things in many different ways... Bit of a mess, because normalization seems to hit objects even if you store them away, but it does the job... I think.
PS Can someone prod Ingo towards his PM box here, thanks.
However, I do not understand the example given: where is the de-normalization happening for every attribute?
Thanks again!
Really interesting method!
The problem, though, is that the clustering output still does not show you the de-normalized values, such as in:
Cluster 0 :
attr1 : x
attr2 : y
attr3 : z
with x,y,z being DE-normalized
Perhaps a DE-normalize operator would be useful!?
The original problem was... The method shows the original values, or do you not agree? If by "DE-normalized" you mean something other than the "original" values then perhaps so, but that was not the question.
In short, I disagree that a de-normalizer operator is necessary, because you can always just keep the originals!
First of all: ***Thanks for your help***, I do not mean to sound rude :-)
However, the *full* quote was:
Notice that the last part says: "so that cluster values make sense"
Unfortunately this is not the case with your solution. Again, I do not want to appear rude; I am just giving my opinion that perhaps an operator would prove helpful. Just trying to add my 2 cents...
Thanks!
I'm always amused by posts that start "i do not mean to sound rude".
Versions one and two of the code did the job. Did you run them? Version three was only put in to make things clearer for you. Something got flipped and the clusters got lost. So I'll edit version three out.
Maybe you'll want to edit your last post as well.
So that means there can be an output like the one I explained? To have the numbers in the cluster model as they were prior to the normalization? I would sure like to see how this is possible, because this is actually what I wanted originally. And sure, if you explain why I should, I will edit my post, no problem!
The point is that your solution does NOT output a ***Clustering Model window*** with de-normalized values! The sequence should be the following (there is a rough sketch of it at the end of this post):
1) Get the unnormalized values
2) Normalize them
3) Run the clustering model using the normalized values
4) Show the CLUSTERING MODEL'S RESULTS DE-NORMALIZED. I do *not* want every row's associated de-normalized value!!
Your solution does not do step (4); it writes the de-normalized values to a table! Do you understand the difference, Haddock??
Please try to understand what is sought here...
From what I can tell (as Steffen said), there is no way to do this automatically in RM. If someone else can help with this, please do so.
Thanks!
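Just to pin down what step (4) should produce, here is a rough sketch of that whole sequence outside RM (Python/scikit-learn, with made-up data and attribute counts), where it is the *model's* centroids that come back on the original scale, not each row:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# made-up data standing in for the numeric attributes
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))

scaler = StandardScaler()
Z = scaler.fit_transform(X)                                   # steps (1)-(2): normalize

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)   # step (3): cluster on z-scores

centroids = scaler.inverse_transform(km.cluster_centers_)     # step (4): de-normalize the MODEL
print(centroids)   # one row per cluster, in the original units
```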
He's not talking about having an ExampleSet that contains both the raw values and the normalized values for each data point. He wants to describe the clusters on the data's natural scale. This would help, for example, in explaining what the clusters are to other people, or even just in better interpreting the model himself.
If my reading of the problem is correct, then the following discussion may be helpful...
You'd need to know the mean and standard deviation of each attribute in the original data to convert the normalized centroid values to original scale values (i.e. "denormalize"). While RM computes the sum and std dev as part of the meta data view of an ExampleSet, I'm not sure there's a way to get to those values. If you're reading data from a database, you might be able to have a second DatabaseExampleSource with a query that returns the mean and std dev for each attribute.
Once you have the mean and std dev, you need to get the centroid values into an example set. I haven't worked with clustering models, so I don't know how this would be done in RM. But once you have both the mean+stddev and the centroid values, you can probably use one of the Join operators to match up the clusters with their mean+stdev, and then use AttributeConstruction (as steffen mentioned in the first reply to this thread) to build the centroid values on the original data's scale.
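To make the shape of that construction step concrete, here is a rough equivalent in pandas rather than RM operators (the frame names and numbers are made up, and whether the std uses n or n-1 should match whatever the Normalization operator used):

```python
import pandas as pd

# original (un-normalized) data and the centroid table on the z-score scale
df = pd.DataFrame({"attr1": [4.0, 7.0, 1.0], "attr2": [10.0, 30.0, 20.0]})
centroids_z = pd.DataFrame({"attr1": [-0.5, 0.9], "attr2": [1.1, -0.3]},
                           index=["cluster_0", "cluster_1"])

stats = df.agg(["mean", "std"])                  # mean and std dev per attribute
centroids_original = centroids_z * stats.loc["std"] + stats.loc["mean"]
print(centroids_original)                        # centroids back on the original scale
```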
Hopefully this doesn't add further confusion to the situation...
Keith
Now that I do understand (and curiously, he'll still need the original/raw data), I think this does the necessary, and it works out the average for each cluster - I just added a change of role on the cluster attribute and an OLAP operator to my original offering. Thanks again for bringing clarity to the question; how we were meant to get that from the original question remains a mystery to me.
This is what I am talking about, and Steffen understood what I meant right from my 1st post.
So by using attribute construction it can be done, but imagine building new attributes for 60 input variables! So the question is whether some node can be used to calculate all this information for, say, 60 attributes, and I guess this cannot happen (?) as Steffen originally said.
@haddock
It appears that you still don't get it, but maybe I am wrong... Can you do the same example that you last posted for 60 input variables? How much time would it take you to do it? Let alone also having to apply a log transformation to each of the 60 variables to fix their skewed distributions...
@haddock:
I did not know the MovingAverage operator yet... really nice. However, it seems the calculation of the stdev is messed up, isn't it?
@hgwelec:
haddock's second process does exactly what you want. He calculates the cluster centroids from the de-normalized (i.e. not normalized) values and hence gets the de-normalized cluster centers (this is only correct if the cluster centroids of the clustering operator are calculated as means, which is the case for KMeans). The issue of scalability remains, but: either you add an entry for each attribute in the aggregation operator manually, OR you use a loop... in JAVA, which means hacking an operator yourself. I do not see another option.
Again we have faced an example of the law of leaky abstraction ...
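Just to illustrate the aggregation idea itself (outside RM, so it does not answer the operator question): with K-Means centroids being means, the de-normalized centers are simply the per-cluster means of the original values, however many attributes there are. A quick pandas sketch with made-up data, assuming the labels come from the clustering run:

```python
import pandas as pd

# un-normalized data; any number of attribute columns works the same way
df_original = pd.DataFrame({
    "attr1": [4.0, 7.0, 1.0, 9.0],
    "attr2": [10.0, 30.0, 20.0, 40.0],
})
df_original["cluster"] = ["c0", "c1", "c0", "c1"]   # labels from the K-Means run

denormalized_centroids = df_original.groupby("cluster").mean()
print(denormalized_centroids)   # one row per cluster, in the original units
```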
kind regards,
Steffen
PS: the process of haddock is ok, but I did not check the calculation of the values by an example (just to be sure) .. my head is a little fuzzy today...
Reminds me of an old Oxford philosophy exam story.....
Is this a question?
Yes, if this is an answer.
If it was possible to access the centroid values directly and apply the mean/stdev calculations from your first code sample, that would probably be a more scalable solution than joining the data to itself and computing the sum/stdev across the entire data set (depends on how many rows he's dealing with). It would also (I think) handle the case where the cluster centers are calculated by something other than mean (as steffen alludes to). But what you presented certainly solves the problem as presented. Thanks, I learned something today.
That's what's great about having a forum where you get many eyeballs looking at a question. For example, to me, when I read: ... it was pretty quickly apparent that, even if he didn't have the terminology quite right, he was talking about data that describe the clusters ("cluster values" a.k.a. centroids), and meant "original scale" rather than "original values". But I never would have come up with the solution haddock did.
Despite the frustrations expressed on this thread, this forum is still a friendlier place for earnest newbies (which I was not that long ago) to learn RapidMiner than the R-help list is for R, and is one of the many things I think is great about RM.
Keith
Both you and Steffen come out of this episode as very solid citizens who deserve the respect you get, so many thanks to you both on behalf of all Rapido heads. I've learnt from two sources, Ralf's most excellent course, and trying to answer the puzzles set right here, so absolutely spot on, my friend, spot on.
The below is a tricky (in fact, a very tricky) way of extracting the centroid values directly from the model.
A Note:
1. This method can be applied even for KMedoids... I mean to say, it also avoids the issue of "What if the cluster centers are not the mean?".
2. The centroid values are accurate to three decimal places, because the centroid values are read as-is from the "Text View" of the model. If the "Text View" gave, say, five digits after the decimal point, then the same would be the result in the ExampleSet produced.
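For anyone curious about the idea, here is a toy sketch of the parsing step in Python rather than RM; the layout of the Text View shown below is only an assumption, so the pattern would need adjusting to whatever the model actually prints:

```python
import re

# assumed (not verified) layout of the model's Text View
model_text = """Cluster 0
attr1 = 0.482
attr2 = -1.137
Cluster 1
attr1 = -0.654
attr2 = 0.991
"""

centroids, current = {}, None
for line in model_text.splitlines():
    if line.startswith("Cluster"):
        current = line.strip()
        centroids[current] = {}
    else:
        m = re.match(r"\s*(\w+)\s*=\s*(-?\d+(?:\.\d+)?)", line)
        if m and current is not None:
            centroids[current][m.group(1)] = float(m.group(2))

print(centroids)   # values are only as precise as the text view prints them
```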
Best,
Shubha Karanth
I think there is a problem with your first example, because it only covers the case where there are two clusters, and with the second there is no data by the time of the first split, so I'm not sure why it is here at all. Bemused readers should run to the break, like this (I've just removed the drive letter and put in a break)... Perhaps you could explain what I've missed? ;D
Good weekend!
I have seen other users express that my terminology was not correct; I have no reason to think otherwise, and so I have to agree. It wasn't.
But since the essence of discussions in this forum is both to solve our problems *and* to draw some insights as to how RM can become better, I feel that even though JAVA code could be a solution (when the dataset contains MANY attributes), for users who do not have the necessary programming skills the problem cannot be easily fixed.
Since normalization prior to any clustering process is usually required, perhaps a De-Normalize node would prove to be very useful.
Many Thanks!