Choosing the best approach to impute numerical missing values
Hello everybody,
as part of a scientific project I have to develop a data preprocessing model for the university. Currently I am struggling with missing values.
I have a data set with exclusively numerical attributes, in which numerous values are missing. Now I would like to implement the following in RM:
- For each attribute I would like to use 2-3 different methods (e.g. linear interpolation, quadratic interpolation, cubic interpolation, the kNN algorithm; other algorithms that can be used to impute missing values are also welcome) to replace the missing values with statistically calculated values.
- Then I want to calculate the performance of each method for each attribute and at the end select the best method for imputing missing values for each attribute.
It would be great if someone could help me.
Many thanks in advance
Moritz
Answers
Sure, you can measure the performance of an imputation. Here is an example validation of the kNN imputation method applied to the missing age values in the Titanic data. To compute a regression performance we need two columns: a "ground truth" column and the estimate from kNN.
The "Generate Attributes" operator creates a new column new_age for the missing data imputation, plus a label that marks each row as part of the train or test set. 80% of the non-missing age values are kept before imputation; you can change the split ratio in "Split Data". We pretend the remaining 20% of the age values are missing and use kNN to impute them.
To impute the missing values in new_age, I dropped the ground truth "age", and skipped imputation for life boat and cabin.
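For anyone who wants to see the same validation idea outside RapidMiner, here is a rough Python sketch using only the standard library. It is not the process from the XML: the function names `knn_impute` and `holdout_rmse`, the toy data, and the plain unweighted Euclidean kNN are assumptions made for illustration, and it assumes that only the target column contains missing values. It hides 20% of the known values, imputes them with kNN, and compares the estimates with the hidden ground truth via RMSE.

```python
import math
import random


def knn_impute(rows, target, k=3):
    """Fill None in column `target` with the mean of the k nearest complete rows.

    Distance is Euclidean over all columns except `target`.
    """
    known = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(list(r))
            continue

        def dist(other, row=r):
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(row, other))
                                 if i != target))

        nearest = sorted(known, key=dist)[:k]
        new_row = list(r)
        new_row[target] = sum(n[target] for n in nearest) / len(nearest)
        filled.append(new_row)
    return filled


def rmse(truth, estimate):
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


def holdout_rmse(rows, target, test_ratio=0.2, k=3, seed=0):
    """Hide a share of the known target values, impute them, return the RMSE."""
    rng = random.Random(seed)
    known_idx = [i for i, r in enumerate(rows) if r[target] is not None]
    test_idx = rng.sample(known_idx, max(1, round(len(known_idx) * test_ratio)))
    masked = [list(r) for r in rows]
    for i in test_idx:
        masked[i][target] = None  # pretend the value is missing
    imputed = knn_impute(masked, target, k)
    truth = [rows[i][target] for i in test_idx]
    estimate = [imputed[i][target] for i in test_idx]
    return rmse(truth, estimate)


# Toy data (hypothetical): column 0 is a predictor, column 1 is the target.
rows = [[float(x), 2.0 * x + 1.0] for x in range(20)]
rows[3][1] = None   # genuinely missing values stay missing during validation
rows[11][1] = None
score = holdout_rmse(rows, target=1)
print("hold-out RMSE of kNN imputation:", score)
```

The split ratio and `k` are ordinary parameters here, so the same function can be called with other settings to see how sensitive the RMSE is.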
YY
many thanks for your answer.
Correct. With your XML code I now have one way to replace missing values, along with an evaluation (RMSE). Now I would like to try two more methods to see whether they perform better than the kNN algorithm. At the end I want to select the method with the best RMSE value to derive the missing values. I don't currently know how to do this.
Can you help me?
BR
Moritz
You can easily replicate the imputation by other machine learning algorithms. Check out this
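To make the "pick the best method per attribute by RMSE" idea concrete, here is a minimal Python sketch (standard library only), not a RapidMiner process. Everything in it is an assumption for illustration: the helper names, the three candidate methods (mean, median, and linear interpolation in row order), and the toy table. For each attribute it hides 20% of the known values, lets every method fill them in, and keeps the method with the lowest RMSE.

```python
import math
import random
import statistics


def impute_mean(values):
    fill = statistics.fmean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_median(values):
    fill = statistics.median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_linear(values):
    """Interpolate linearly between the nearest known neighbours in row order.

    Gaps at the start or end are filled with the nearest known value.
    Assumes at least one known value.
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out


METHODS = {"mean": impute_mean, "median": impute_median, "linear": impute_linear}


def best_method_per_attribute(table, test_ratio=0.2, seed=0):
    """Return {attribute: (best_method_name, {method_name: rmse})}."""
    rng = random.Random(seed)
    result = {}
    for attr, values in table.items():
        known = [i for i, v in enumerate(values) if v is not None]
        test = set(rng.sample(known, max(1, round(len(known) * test_ratio))))
        masked = [None if i in test else v for i, v in enumerate(values)]
        scores = {}
        for name, impute in METHODS.items():
            filled = impute(masked)
            sse = sum((values[i] - filled[i]) ** 2 for i in test)
            scores[name] = math.sqrt(sse / len(test))
        result[attr] = (min(scores, key=scores.get), scores)
    return result


# Toy table (hypothetical): one attribute with a trend, one that alternates.
table = {
    "trend": [float(i) for i in range(30)],
    "zigzag": [10.0 if i % 2 else 0.0 for i in range(30)],
}
table["trend"][7] = None    # genuinely missing values are never used as truth
table["zigzag"][12] = None

for attr, (best, scores) in best_method_per_attribute(table).items():
    print(attr, "-> best:", best, scores)
```

Further candidates (for example a kNN or other model-based imputer) could be added to `METHODS` as long as they take a list with `None` entries and return a filled list. Note that interpolation only makes sense when the row order is meaningful, such as in a time series.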
thank you very much for your answers.
@yyhuang: Thanks for your work. Even though your approach doesn't do everything I would like to do, it helps me a lot. With my dataset the GBL, GLM and DL algorithms do not work. I always get the error message shown in the screenshot. Can you tell me what the problem is? (To be honest, I don't know exactly how the algorithms work; I hope that's not mandatory.) Do you have any idea how I can integrate linear interpolation or quadratic interpolation into the system?
To everybody:
BG
Moritz
Thanks for sharing the screenshot. If the input contains any nominal ID-like attribute, the H2O models will throw errors. That's why I dropped some ID-like columns with "Set Role" before the imputation. You can apply invert selection to some attributes to troubleshoot. A quick check of the learner's input can also be useful: right-click the input node of the Deep Learning learner and view the example set. You can also share your data and process here so that we can investigate further.
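If it helps to picture what "ID-like" means, here is a small Python sketch of one possible check (standard library only). The heuristic is my own assumption for illustration, not how RapidMiner or H2O decide anything: it flags text columns in which nearly every value is distinct. The function name `id_like_columns`, the 0.95 threshold, and the toy table are all hypothetical.

```python
def id_like_columns(table, threshold=0.95):
    """Return the names of nominal (string) columns whose values are almost all distinct."""
    flagged = []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        # Only consider columns that consist entirely of text values.
        if not present or not all(isinstance(v, str) for v in present):
            continue
        if len(set(present)) / len(present) >= threshold:
            flagged.append(name)
    return flagged


# Toy table (hypothetical), loosely modelled on the Titanic columns.
table = {
    "name": ["Allen", "Braund", "Cumings", "Futrelle", "Heikkinen", "Moran"],
    "sex": ["f", "m", "f", "f", "f", "m"],
    "age": [29.0, 22.0, None, 35.0, 26.0, None],
}

to_drop = id_like_columns(table)
model_input = {k: v for k, v in table.items() if k not in to_drop}
print("dropped before training:", to_drop)
print("columns passed to the learner:", list(model_input))
```

On very small tables this check can flag legitimate nominal columns by accident, so treat the result as a list of candidates to inspect rather than something to drop blindly.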
YY