Normalization between two different data sets
hi all,
I've understood that the 'Normalize' operator normalizes "within" the attributes of a particular data set.
However, I have a case:
I've trained and tested a classification model with a particular data set (A).
When deploying with a fresh data set (B), the attributes are not given on the same scale as in data set (A).
E.g.: attribute 'X' in data set 'A' is on the scale 0 to 100,
attribute 'X' in data set 'B' is on the scale 0 to 350.
My question is:
Does RapidMiner have any operator to normalize "between" the two different data sets, or do we have to do it manually before feeding the data in?
Kindly let me know. Thanks.
thiru
Ive understood the 'Normalize" operator is to normalize " within " the attributes of a particular data set.
However, I have a case :
Ive trained and tested the classification model with a particular data set (A)
while deploying with new fresh data set (B) - the attributes are given not in same scale as the above data set (A) .
eg: attribute 'X' in data set 'A' is in scale : 0 to 100
attribute 'X' in data set 'B' is in scale: 0 to 350
My qn. is:
Does rapid miner have any operator to normalize 'Between' the two different data sets? or do we have to do manually before feed in.
Kindly let me know. thanks.
thiru
Answers
This is a conceptual question.
What does it mean for a model that attribute X has a value of 30 (normalized, e.g., to -0.2)? Should a value of 30 in example set B be handled by the model in the same way?
RapidMiner lets you store the "preprocessing model" from the Normalization and apply it (Retrieve = > Apply Model) on the new data. That would make sure that the predictive model sees the same normalized input from identical numbers. (In your case, the normalization model from A will assign a high value to B if X = 350, but that's the correct approach.)
It's even more elegant to build one stacked model from the normalization and the predictive model using Group Models. (The example process in the operator's help illustrates the concept.) You would do this inside a cross validation on the left side, and then just apply the grouped model on the right side. This is the conceptually correct approach.
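(Again as a hedged code-level sketch: grouping the normalization with the learner, as Group Models does, corresponds roughly to a scikit-learn Pipeline. The normalization is then re-fit inside each cross-validation fold and the whole group is applied as one unit to new data. The toy data and the choice of classifier below are placeholders, not anything from the original process.)

```python
# Sketch: one "grouped" object = normalization + classifier, cross-validated together
# and then applied as a single model to new data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(200, 1))   # attribute on a 0-100 scale (data set A)
y_train = (X_train[:, 0] > 50).astype(int)     # toy label for illustration

grouped = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Cross-validation: the scaler is re-fit on each training fold only,
# mirroring "Group Models inside a Cross Validation" in RapidMiner.
print(cross_val_score(grouped, X_train, y_train, cv=5).mean())

# Deployment: fit once on all of A, then apply the whole group to B-scale values.
grouped.fit(X_train, y_train)
X_new = np.array([[30.0], [350.0]])
print(grouped.predict(X_new))
```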
The process from @rfuentealba is correct in the generic case. However, what does it mean for your data and your model to normalize in a different way? How would you normalize *one* example later, if you're applying the model to single examples? Unless you have very good reasons to normalize differently, you should keep the normalization parameters used for the model, via one of the methods described above.
Regards,
Balázs