"Is possibile (and correct) to replace missing values keeping the same distribution of values?"
Hi, I have some attributes with missing values and I want to find the best way to replace them.
Usually you can replace them with the "average" (or the most frequent value), but is it possible in RapidMiner (and, more importantly, is it correct) to replace them so that the distribution of the non-missing values is preserved?
Let me explain with an example:
Let's say I have an attribute "Nationality" with this distribution of values:
ENG: 50%
ITA: 22%
DEU: 20%
FRA: 8%
I would like to replace the missing values so that 50% of them become "ENG", 22% become "ITA", and so on.
Note that I have no other attributes that would tell me more about it and that I could use to estimate the nationality better.
What do you think? Do you have any suggestions or better ways to do it?
Thank you in advance
Best Answer
BalazsBarany (Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert)
Hi!
It might be possible (e.g. by generating a random number for each example and using Generate Attributes to assign a value depending on whether the number falls between 0.0 and 0.5, between 0.5 and 0.72, etc.), but it's certainly not correct.
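To make the threshold idea concrete, here is a minimal sketch in plain Python (not a RapidMiner process). It assumes missing values are represented as `None` and uses the percentages from the question; the names `pick_value` and `fill_randomly` are made up for this illustration. It only shows that the mechanics are simple, not that the result is statistically sound.

```python
import bisect
import random
from itertools import accumulate

# Distribution of the non-missing values, taken from the question's example.
DISTRIBUTION = {"ENG": 0.50, "ITA": 0.22, "DEU": 0.20, "FRA": 0.08}


def pick_value(u, distribution=DISTRIBUTION):
    """Map a uniform draw u in [0, 1) to a category via cumulative thresholds."""
    labels = list(distribution)
    # Cumulative upper bounds, approximately 0.50, 0.72, 0.92, 1.00.
    thresholds = list(accumulate(distribution.values()))
    index = bisect.bisect_right(thresholds, u)
    # Guard against floating-point rounding at the very top of the range.
    return labels[min(index, len(labels) - 1)]


def fill_randomly(values, seed=0):
    """Replace each None with a random category; keep known values untouched."""
    rng = random.Random(seed)
    return [pick_value(rng.random()) if v is None else v for v in values]


if __name__ == "__main__":
    column = ["ENG", None, "ITA", None, "DEU", None]
    print(fill_randomly(column))
```

A draw below 0.5 yields "ENG", a draw between 0.5 and 0.72 yields "ITA", and so on, which is exactly the interval scheme described above. The filled-in values are still guesses about individual examples.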
You have data with a known value (people with the attribute value ENG) and data with a missing value. If you randomly assign someone the value ENG without knowing if it is right, you'll get a worse model.
What to do depends on several things. Is a large percentage of the values missing? Then it might be better to just drop the attribute. Might the "missingness" of the value have a meaning of its own? Then you might want to change "missing" to another value like "MISSING Nationality" (if your model requires data without missing values). Are there very few missing nationalities? Then you might build the model without those examples (if you can accept a model that won't work on new examples with a missing nationality).
These are correct approaches. Filling missing values with random data is no better than randomly changing non-missing data. (Which might be a sensible thing to do in some circumstances, for example if you'd like to test the robustness of your model. But that would happen in a later phase.)
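As a rough illustration outside RapidMiner, the three options (drop the attribute, mark missingness as its own value, drop the affected examples) could look like this in plain Python. The sketch assumes the data is a list of dicts with `None` for missing values; all function names are hypothetical helpers, and where to draw the line between "a large percentage" and "very few" is left to you.

```python
MISSING_LABEL = "MISSING Nationality"  # placeholder value suggested above


def missing_fraction(rows, attribute):
    """Share of examples whose value for the attribute is missing (None)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(attribute) is None) / len(rows)


def drop_attribute(rows, attribute):
    """Option 1: remove the attribute entirely (many values missing)."""
    return [{k: v for k, v in row.items() if k != attribute} for row in rows]


def flag_missing(rows, attribute, label=MISSING_LABEL):
    """Option 2: turn 'missing' into its own category."""
    return [
        {**row, attribute: label if row.get(attribute) is None else row[attribute]}
        for row in rows
    ]


def drop_incomplete_rows(rows, attribute):
    """Option 3: keep only examples where the attribute is known."""
    return [row for row in rows if row.get(attribute) is not None]
```

You would typically look at `missing_fraction(rows, "Nationality")` first and then apply exactly one of the three functions. Note that with option 3 the resulting model cannot score new examples that lack a nationality.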
Regards,
Balázs
Answers
OK, thank you. I was fairly sure it was not correct to do that, which is why I asked. Since the number of missing values is small, I will just exclude these records from the model.