Duplicate Data but different value in target

k_vishnu772 · August 2018

Hi All,

I am working with a small data set of 120 rows and 5 features, with a binary target (Valid or Not Valid). Some rows are duplicates in which all the input features are the same but the target value is different, as you can see below (this is sample data, not the original data). How will the model treat those rows? Is this ambiguous data? I ran the model and it was not able to classify the Not Valid cases. Only 32 of the 120 cases are Not Valid, and most of them have a duplicate with the same inputs and a Valid result. What should I do?

Att1  Att2  Att3  Target
F3    G929  P2    Valid
F3    G929  P2    Not Valid
F2    G929  P3    Not Valid
F2    G929  P3    Valid

Regards,

Vishnu

Knut-RM · August 2018

Given that you have both Valid and Not Valid flags for the same combination of attribute values, how can you expect the model to learn and consequently identify those?

The model needs to find patterns in order to make a prediction. If you are not providing a pattern, then there is no real result to be expected. You should go through the data and make sure you have only one label for each combination of attribute values. So you want to use a remove-duplicates step. You probably need to sort the rows first in order to keep the "right" one (Valid or Not Valid) after the filtering, or you can do it manually given your small data set.
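For readers working outside RapidMiner, the sort-then-deduplicate idea can be sketched in pandas. This is only an illustration on the sample rows from the question. The `priority` mapping, which treats Not Valid as the label to keep when labels conflict, is an assumption made for the example and not something stated in the thread.

```python
import pandas as pd

# Sample rows from the original post (not the real data).
df = pd.DataFrame(
    {
        "Att1": ["F3", "F3", "F2", "F2"],
        "Att2": ["G929", "G929", "G929", "G929"],
        "Att3": ["P2", "P2", "P3", "P3"],
        "Target": ["Valid", "Not Valid", "Not Valid", "Valid"],
    }
)
features = ["Att1", "Att2", "Att3"]

# Assumption for illustration: on conflict, "Not Valid" is the label to keep.
priority = {"Not Valid": 0, "Valid": 1}

deduped = (
    df.assign(_rank=df["Target"].map(priority))
    .sort_values(features + ["_rank"])               # preferred label sorts first
    .drop_duplicates(subset=features, keep="first")  # one row per feature combination
    .drop(columns="_rank")
    .reset_index(drop=True)
)
print(deduped)
```

Which label counts as the "right" one is a business decision, so the mapping should be set deliberately rather than left to the incidental row order.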

Telcontar120 · October 2018

Sorry, this is a community support forum but not an academic research journal! And I'm an experienced data scientist but not an academic myself--so this type of thinking is actually somewhat mystifying to me. There is much about current best practice in data science that you would have a hard time finding specific academic references to substantiate.

Telcontar120 · August 2018

Actually, in these ambiguous cases, you might be better off removing BOTH of the conflicting input records. It somewhat depends on the data and the use case, but the consequence of removing only one duplicate and leaving the other in is that you are teaching the model to associate a particular pattern with one particular outcome that is actually ambiguous in real life. If one outcome is much more important to you than another, this may be sensible (e.g., in fraud detection), but for other types of outcomes it may lead to undesirable results. So if you have a large enough sample and your misclassification costs are roughly symmetrical, I would recommend omitting them all.
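As a rough pandas sketch of this "omit them all" approach (not a RapidMiner process): count the distinct labels within each feature combination and keep only the combinations that have exactly one. The first four rows are the sample from the question. The last two rows (F1, G100, P1) are hypothetical and were added only so that something survives the filter.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Att1": ["F3", "F3", "F2", "F2", "F1", "F1"],
        "Att2": ["G929", "G929", "G929", "G929", "G100", "G100"],
        "Att3": ["P2", "P2", "P3", "P3", "P1", "P1"],
        "Target": ["Valid", "Not Valid", "Not Valid", "Valid", "Valid", "Valid"],
    }
)
features = ["Att1", "Att2", "Att3"]

# Number of distinct labels seen for each feature combination, aligned to rows.
n_labels = df.groupby(features)["Target"].transform("nunique")

# Keep only rows whose feature combination always has the same label.
unambiguous = df[n_labels == 1].reset_index(drop=True)
print(unambiguous)
```

Consistent duplicates (same features, same label) are left in place here. Whether to collapse those as well is a separate choice.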

k_vishnu772 · October 2018

@Telcontar120 @Knut-RM

Hi All,

I just want to confirm one thing regarding the duplicates. Suppose I have 10 records that are all duplicates, and 9 of them have the target label pass and 1 has fail. In this case, if I remove the duplicates, I will end up with 2 records whose input features are all the same but whose targets differ (one pass and one fail), which is ambiguous. If I don't remove the duplicates, I am giving more weight to the 9 pass records than to the single fail record. Is that correct?

Regards,

Vishnu

Telcontar120 · October 2018

Correct. And if you remove all the ambiguous records (per my suggestion) then you are not giving weight to either side.
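The three options discussed for the 10-record example can be compared in a small pandas sketch. The attribute values are placeholders borrowed from the sample table in the first post, and only the 9 pass / 1 fail split comes from the question.

```python
import pandas as pd

# Ten rows with identical inputs: 9 labelled pass, 1 labelled fail.
df = pd.DataFrame(
    {
        "Att1": ["F3"] * 10,
        "Att2": ["G929"] * 10,
        "Att3": ["P2"] * 10,
        "Target": ["pass"] * 9 + ["fail"],
    }
)
features = ["Att1", "Att2", "Att3"]

# Option 1: keep every row. The pattern is weighted 9 to 1 towards pass.
kept_all = df["Target"].value_counts().to_dict()

# Option 2: drop exact duplicates (features and target together).
# One pass row and one fail row remain, so the pattern is weighted 1 to 1.
exact_dedup = df.drop_duplicates()["Target"].value_counts().to_dict()

# Option 3: drop every feature combination that has conflicting labels.
n_labels = df.groupby(features)["Target"].transform("nunique")
no_conflicts = df[n_labels == 1]

print(kept_all, exact_dedup, len(no_conflicts))
```

In this example, option 3 leaves no rows for the pattern at all, which is what "not giving weight to either side" amounts to.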

k_vishnu772 · October 2018

@Telcontar120 Is there any official page or book where the same information is mentioned? My manager asked me to show a proper reference for this explanation.

Regards,

Vishnu
