Newbie question: expected performance output after using the Sample operator
Hi, sorry for the beginner's question... I have a data set with 30,000 rows. The target variable is imbalanced: 24,000 false / 6,000 true. So I used the Sample operator to balance it (1,000 of each). At the end, the Performance (Classification) operator gives a confusion matrix with only 2,000 results (from the sample). I was expecting the evaluation (totals for TP/TN/FP/FN) to be based on the entire dataset (30,000 rows in total), so that I could also evaluate costs (with the Performance (Costs) operator). What have I missed? Maybe the issue is in the wrong ports used for the input/output connections? Any tips on where it could go wrong? I have tried many ways... Thanks in advance for your help!
Best Answers
jacobcybulski Member, University Professor Posts: 391 Unicorn

As you selected only 2,000 examples for model building and validation, that is what you get in the confusion matrix. However, since you use cost as a method of model evaluation, you can also use a cost-sensitive model to deal with class imbalance, e.g. a decision tree. I assume the cost of misclassifying the minority class is high (e.g. the positive case, when it represents fraud) and the cost of misclassifying the majority class is low (the negative case). When the cost structure is set up this way, the importance of the majority class can be weighed down in favour of the minority class during model training, thus overcoming the problem of class imbalance.
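(Editor's note: the thread is about RapidMiner's visual operators, but the cost-sensitive idea above can be sketched in code. The following is a hypothetical scikit-learn analogue, not the poster's actual process; the 30,000-row synthetic data, the 4:1 cost ratio, and the `class_weight` values are all illustrative assumptions.)

```python
# Cost-sensitive decision tree: train on ALL rows (imbalanced), but weight
# the minority class more heavily instead of undersampling to 2,000 rows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the question's data: ~80% negative / ~20% positive
X, y = make_classification(n_samples=30000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Misclassifying the minority (positive) class is assumed to cost 4x as much,
# so it gets 4x the weight during training -- the cost-sensitive part.
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 4}, max_depth=5,
                              random_state=42)
tree.fit(X_train, y_train)

# The confusion matrix now covers the whole test partition, not a small sample.
print(confusion_matrix(y_test, tree.predict(X_test)))
```

The same effect is available in RapidMiner via cost-related parameters on cost-aware learners; the sketch just makes the weighting mechanism concrete.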
BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

Another way to solve this is to move the sampling *into* the training phase of the cross validation. That way, you're building balanced models, but still validating on all the data.
Also, sampling before the validation creates additional "knowledge" for the modeling process that you won't have later when applying the model.
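(Editor's note: a hypothetical code sketch of this suggestion, in scikit-learn rather than RapidMiner; the data sizes, fold count, and manual undersampling are illustrative assumptions. The key point is that sampling happens only on each training fold, never on the validation fold.)

```python
# Sampling INSIDE cross validation: balance each training fold, but score
# on the untouched, imbalanced test fold -- so the summed confusion matrix
# covers the entire dataset and there is no leakage from pre-sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=3000, weights=[0.8, 0.2], random_state=0)

rng = np.random.default_rng(0)
total_cm = np.zeros((2, 2), dtype=int)

for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Balance ONLY the training fold: keep every minority-class row and
    # downsample the majority class to the same size.
    pos = train_idx[y[train_idx] == 1]
    neg = train_idx[y[train_idx] == 0]
    neg_sample = rng.choice(neg, size=len(pos), replace=False)
    balanced = np.concatenate([pos, neg_sample])

    model = DecisionTreeClassifier(random_state=0).fit(X[balanced], y[balanced])

    # Validate on the full, imbalanced test fold -- no sampling here.
    total_cm += confusion_matrix(y[test_idx], model.predict(X[test_idx]))

# total_cm now accounts for all 3,000 rows, unlike a pre-sampled evaluation.
print(total_cm)
```

In RapidMiner terms, this corresponds to placing the Sample operator inside the training subprocess of the Cross Validation operator.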
Regards,
Balázs
AmsDani Member Posts: 3 Contributor I

Thanks for your answers! I will try it the way you proposed, Balázs!