How to handle empty fields (not missing data) in a data set
Hello guys.
I have a data set that I collected from 35 companies. One of my attributes is "do they have this type of plan", with the values "Yes" and "No". My second attribute is "how much is the price of this plan". For companies whose first attribute is "Yes", the value is a number such as 30 euros, but for companies whose first attribute is "No", this field is empty.
I want to do clustering, but because of the empty fields I can't proceed. I don't want to remove this attribute or any example, or fill these fields using a missing-data technique, because the values are not missing.
Is there any technique in RapidMiner to specify that, if the first attribute is "No", the second attribute should be ignored for that example?
Thank you very much
Best Answer
jacobcybulski (Member, University Professor)
I agree with @David_A. You can replace those missing values with something meaningful, e.g. 0 for missing (but meaningful) numerical values (I assume that if a price is not there it can be interpreted as zero) and "undefined" for nominal attributes (so that you can treat these in a special way). If you are concerned that those extra zeroes will upset your statistics, e.g. during your cluster analysis, this means that in your mind you want these cases to be treated separately.
If that is the case and you want to do segmentation analysis, conduct your clustering in two different processes (filtering these examples out in one and in for the other) and interpret each separately. If you want to use the cluster attribute for building a predictive model, you could then name these cluster attributes C1 and C2 (creating a dummy attribute C2 or C1 in each process with some specific value, in a sense putting all of the other examples in a separate cluster) and append all examples back together, which gives you two extra columns for further processing.
Jacob
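To make the split-and-append idea above concrete, here is a minimal sketch in plain Python rather than as a RapidMiner process. Everything in it is made up for illustration: the six companies, the attribute names `has_plan`, `price` and `employees`, the dummy value "none", and the tiny one-dimensional `kmeans_1d` helper. It clusters the "Yes" companies on price and the "No" companies on another attribute, then appends both groups back with the two extra columns C1 and C2.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means for illustration; needs k >= 2 and at least k values."""
    ordered = sorted(values)
    # Spread the starting centroids over the sorted values (deterministic).
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


# Hypothetical data: price is empty (None) when the company has no plan.
companies = [
    {"name": "A", "has_plan": "Yes", "price": 30.0, "employees": 15},
    {"name": "B", "has_plan": "Yes", "price": 32.0, "employees": 300},
    {"name": "C", "has_plan": "Yes", "price": 80.0, "employees": 40},
    {"name": "D", "has_plan": "No", "price": None, "employees": 10},
    {"name": "E", "has_plan": "No", "price": None, "employees": 200},
    {"name": "F", "has_plan": "No", "price": None, "employees": 12},
]

# Process 1: only companies with the plan, clustered on price.
with_plan = [c for c in companies if c["has_plan"] == "Yes"]
for company, label in zip(with_plan, kmeans_1d([c["price"] for c in with_plan])):
    company["C1"] = f"c1_{label}"
    company["C2"] = "none"  # dummy value: not part of the second clustering

# Process 2: only companies without the plan; price is never looked at.
without_plan = [c for c in companies if c["has_plan"] == "No"]
for company, label in zip(without_plan, kmeans_1d([c["employees"] for c in without_plan])):
    company["C1"] = "none"  # dummy value: not part of the first clustering
    company["C2"] = f"c2_{label}"

# Append both groups back together; every company is kept.
combined = with_plan + without_plan
for row in combined:
    print(row["name"], row["C1"], row["C2"])
```

No company is dropped and no price is invented for the "No" group. With about 30 attributes you would cluster each group on all the attributes that apply to it, not on a single column as in this sketch.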
Answers
David
Thank you very much for your quick response. Actually, I have around 30 different attributes of 35 companies and I want to cluster these companies based on their features.
1- Replace Missing Values: I don't want to replace any value in these fields since they are not missing. They have no value because the company does not have this type of plan, and I think replacing them with a value like 0 or the average can affect the clustering process.
2- Filter Examples: I don't want to filter out any examples, because my examples are my companies and my main goal is clustering them, so I need them all.
Do you have any other ideas?
Thank you in advance.
Masoud
If you need to remove all your missing values in order to run the clustering algorithm you want, then you can populate them appropriately with a two-step process. First use Generate Attributes and an expression to say something like PricePlan=if(HavePlan="Yes",PricePlan,"N/A"). This keeps whatever value is in the price attribute if the company answered "Yes" to having the plan, and otherwise sets the price to "N/A" (or whatever placeholder you prefer). Then you can run a subsequent Replace Missing Values and decide how to represent the prices that are still missing for companies that answered "Yes" (for example, with the average price).
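The two steps above can be sketched in plain Python as follows. This is only an illustration of the logic, not RapidMiner syntax: the four records and the names `have_plan` and `price_plan` are hypothetical, and taking the average over the "Yes" companies with a known price is an assumption of this sketch.

```python
# Hypothetical records; None marks an empty price field.
companies = [
    {"have_plan": "Yes", "price_plan": 30.0},
    {"have_plan": "Yes", "price_plan": None},  # has the plan, price not recorded
    {"have_plan": "No", "price_plan": None},   # no plan, so no price applies
    {"have_plan": "Yes", "price_plan": 50.0},
]

PLACEHOLDER = "N/A"  # assumption: any marker you like for "no plan"

# Step 1 (the Generate Attributes idea): keep the price when the company
# has the plan, otherwise write the placeholder.
for company in companies:
    if company["have_plan"] != "Yes":
        company["price_plan"] = PLACEHOLDER

# Step 2 (the Replace Missing Values idea): what is still empty now belongs
# to companies that do have the plan, so fill it with the average known price.
known = [
    c["price_plan"]
    for c in companies
    if c["have_plan"] == "Yes" and c["price_plan"] is not None
]
average_price = sum(known) / len(known)
for company in companies:
    if company["price_plan"] is None:
        company["price_plan"] = average_price

print([c["price_plan"] for c in companies])
```

After step 1 the only empty prices left are genuinely missing ones, which is why it is safe to impute them in step 2. Note that a text placeholder such as "N/A" makes the column non-numeric, so for a distance-based clustering you would probably pick a numeric placeholder or handle those companies separately.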
If the fields are not technically missing but simply populated by a space or similar, then you should be fine.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts