Duplicate data but different values in target
Hi All,
I am working with a small data set of 120 rows and 5 features, with a binary target (Valid or Not Valid). Some rows are duplicates where all the input features are the same but the target value is different, as you can see below (this is sample data, not the original data). How will the model treat those rows? Is this ambiguous data? I ran the model and it was not able to classify the Not Valid cases. Only 32 of the 120 cases are Not Valid, and most of them have a duplicate with the same inputs and a Valid result. What should I do?
Att1 Att2 Att3 Target
F3 G929 P2 Valid
F3 G929 P2 Not Valid
F2 G929 P3 Not Valid
F2 G929 P3 Valid
Regards,
Vishnu
Best Answers
Knut-RM Administrator, Employee-RapidMiner, Member, University Professor Posts: 113 Administrator
Given that you have both valid and invalid flags for the same combination of attribute values, how can you expect the model to learn and consequently identify those?
The model needs to find patterns in order to make a prediction. If you are not providing a pattern, then there is no real result to be expected. You should go through the data and make sure each combination of attribute values has only one label. So you want to use Remove Duplicates. You probably need to sort the data first so that the filtering keeps the "right" record (Valid/Invalid), or you can do it manually given your small data set.
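Before deciding which record to keep, it can help to list the attribute combinations that carry more than one label. The following is a minimal sketch in plain Python rather than a RapidMiner process; the rows are the sample data from the question, and the function name `find_conflicts` is a hypothetical helper introduced only for illustration.

```python
from collections import defaultdict

# Sample rows from the question: (Att1, Att2, Att3, Target)
rows = [
    ("F3", "G929", "P2", "Valid"),
    ("F3", "G929", "P2", "Not Valid"),
    ("F2", "G929", "P3", "Not Valid"),
    ("F2", "G929", "P3", "Valid"),
]


def find_conflicts(rows):
    """Return feature combinations that appear with more than one distinct target."""
    labels_by_features = defaultdict(set)
    for *features, target in rows:
        labels_by_features[tuple(features)].add(target)
    return {
        features: labels
        for features, labels in labels_by_features.items()
        if len(labels) > 1
    }


conflicts = find_conflicts(rows)
for features, labels in conflicts.items():
    print(features, sorted(labels))
```

In the sample data both attribute combinations are flagged, which is why no model can separate them on these inputs alone.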
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
Sorry, this is a community support forum but not an academic research journal! And I'm an experienced data scientist but not an academic myself--so this type of thinking is actually somewhat mystifying to me. There is much about current best practice in data science that you would have a hard time finding specific academic references to substantiate.
Answers
Actually, in these ambiguous cases, you might be better off removing BOTH of the conflicting input records. It somewhat depends on the data and the use case, but the consequence of removing only one duplicate and leaving the other in is that you are teaching the model to associate a particular pattern with one particular outcome when that pattern is actually ambiguous in real life. If one outcome is much more important to you than the other, this may be sensible (e.g., in fraud detection), but for other types of outcomes it may lead to undesirable results. So if you have a large enough sample and your misclassification costs are somewhat symmetrical, I would recommend omitting them all.
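As a rough illustration of omitting them all, here is a minimal Python sketch (not a RapidMiner process). The helper name `drop_ambiguous` and the third, unambiguous row are assumptions made up for the example; only the first two rows come from the sample data in the question.

```python
from collections import defaultdict


def drop_ambiguous(rows):
    """Remove every row whose feature combination occurs with more than one target."""
    labels_by_features = defaultdict(set)
    for *features, target in rows:
        labels_by_features[tuple(features)].add(target)
    # Keep a row only if its inputs always map to the same label
    return [row for row in rows if len(labels_by_features[tuple(row[:-1])]) == 1]


rows = [
    ("F3", "G929", "P2", "Valid"),
    ("F3", "G929", "P2", "Not Valid"),
    ("F1", "G100", "P1", "Valid"),  # hypothetical unambiguous row, not from the thread
]

cleaned = drop_ambiguous(rows)
print(cleaned)
```

Note that this keeps exact duplicates that agree on the label; whether to collapse those as well is a separate decision.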
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
@Telcontar120 @Knut-RM
Hi All,
I just want to confirm one thing regarding the duplicates. Suppose I have 10 records that are all duplicates, and 9 of them have the target label pass and 1 has fail. In this case, if I remove the duplicates, I will end up with 2 records where all input features are the same but the target is different (one pass and one fail), which is ambiguous. If I don't remove those duplicates, am I giving more weight to those 9 records than to the last record? Is that correct?
Regards,
Vishnu
Correct. And if you remove all the ambiguous records (per my suggestion) then you are not giving weight to either side.
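The 10-record case can be checked with a few lines of Python. This is only a sketch of the arithmetic: the feature values `("A", "B", "C")` are hypothetical placeholders, and the 9 pass / 1 fail split is taken from the question.

```python
from collections import Counter

# Hypothetical feature values; 9 "pass" and 1 "fail" as described in the question
features = ("A", "B", "C")
rows = [features + ("pass",)] * 9 + [features + ("fail",)]

# Exact-row de-duplication: two rows survive, with the same inputs but different targets
deduplicated = sorted(set(rows))
print(deduplicated)

# Keeping the duplicates: the label counts act as implicit weights (9 to 1 here)
counts = Counter(target for *_, target in rows)
pass_share = counts["pass"] / len(rows)
print(counts, pass_share)

# Removing all ambiguous records: nothing is left for this feature combination
distinct_targets = {target for *_, target in rows}
remaining = rows if len(distinct_targets) == 1 else []
print(len(remaining))
```

So the three choices give this pattern a 1:1 weighting, a 9:1 weighting, or no weight at all.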
@Telcontar120 Is there any official page or book where this is mentioned? My manager asked me to show a proper reference for this explanation.
Regards,
Vishnu