Replace missing values with average in each cluster

painfulover · January 2022

Hello,
I'm new to Rapidminer and I would like to replace missing values based on clustering, which means I have used k-means on columns which have no missing values and divide the original exampleset into 5 clusters. Now I would like know how to replace each row's missing values by the averages of the cluster it belongs to instead of the averages of whole attributes. I can only find the way to do the latter by the operator [replace missing values].
Thank you very much.

BalazsBarany · January 2022

Hi!

This process is a bit involved. You get a "cluster model" from the clustering operator that you can apply to the data with missing values. However, you need to choose an operator that can work with missing values itself. Then you would aggregate the clustered original data (the non-missing data), grouping by the cluster to get the averages. You can join the result with the missing values and use e. g. Loop Attributes to fill in the missing values using Generate Attributes with a formula like if(missing(%{attr}), eval("average(" + %{attr} + ")"), %{attr}).

It is much easier to use the Impute Missing Values operator that automates the selection of missing values, building a model for predicting them (you can select the model type) and putting the predicted values into the missing cells. There is an example process in the operator help that shows you how to use it, with k-NN as the example learning algorithm.

Regards,
Balázs

MartinLiebig · January 2022

Hi,

I would just use Group into Collection and create a collection (list) of example sets, where each example set only contains one cluster. You can then use Loop Collection and in there use Replace Missing to replace it with the respective means. Afterwards you just Append the resulting collection again.

Cheers,

Martin

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Replace missing values with average in each cluster

Best Answers