How to resolve 100% accuracy in RapidMiner? [Urgent]
StudentNeedsHelp
Member Posts: 2 Learner I
Hello everyone,
The aim is to catch and predict fraud cases with optimal accuracy based on the dataset provided. For example, cases that are flagged as fraudulent but turn out to be non-fraudulent are not as critical as cases that are predicted to be non-fraudulent but turn out to be fraud.
For this, I wanted to use Logistic Regression, Neural Net, and Decision Tree models for comparison (my work is attached). Whenever I run the models, the accuracy is always near 100%; surely this is not correct.
I am new to RapidMiner and data preprocessing. Could someone advise me on which direction I should be heading?
Answers
Your dataset is highly imbalanced (there are many more "non-fraudulent" than "fraudulent" cases in your dataset), so the model has difficulty establishing the relationship between your features and the minority class of your label ("fraudulent"). In the end, the model classifies all of your transactions as "non-fraudulent", which is why you get an accuracy near 100%.
I think that in your case a better performance indicator is the "class recall": your priority is to correctly predict the fraudulent cases, isn't it?
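To illustrate the point, here is a minimal sketch in Python/scikit-learn (rather than RapidMiner operators) showing how a model that always predicts "non-fraudulent" reaches near-100% accuracy while its recall on the fraud class is zero. The 1% fraud rate is an assumption for illustration, not taken from your dataset.

```python
# Minimal sketch: accuracy vs. class recall on an imbalanced label.
# The 1% fraud rate is an illustrative assumption, not the poster's data.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # 1 = fraud, 0 = non-fraud

# A "model" that always predicts the majority class (non-fraud)
y_pred = np.zeros_like(y_true)

print("accuracy:    ", accuracy_score(y_true, y_pred))             # ~0.99
print("fraud recall:", recall_score(y_true, y_pred, pos_label=1))  # 0.0
```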
To achieve that, you have to upsample your initial dataset by increasing the number of "fraudulent" examples, for example with the SMOTE Upsampling operator. This way, you will increase the class recall of the fraudulent cases.
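If it helps, here is a minimal sketch of the same upsampling idea using the imbalanced-learn library in Python; inside a RapidMiner process, the SMOTE Upsampling operator plays this role. The synthetic dataset is only a stand-in for your data.

```python
# Minimal sketch of SMOTE upsampling on a synthetic imbalanced dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.99, 0.01], random_state=42)
print("before:", Counter(y))   # roughly 99% class 0, 1% class 1

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))  # both classes balanced

# Important: apply SMOTE only to the training split, never to the test split,
# otherwise the reported performance will be optimistically biased.
```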
Ideally, you can use Auto-Model after the upsampling operator and define the cost matrix on the "prepare target" screen (typically, you quantify the cost of a "false negative" misclassification and the cost of a "false positive" misclassification). Auto-Model will then be executed to minimize the cost of a misclassification and, ultimately, to maximize the gain.
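As a rough illustration of the cost-matrix idea outside Auto-Model, here is a Python sketch that assigns a hypothetical cost of 100 to a false negative (a missed fraud) and 1 to a false positive, then picks the decision threshold with the lowest total cost. The costs, model, and data are assumptions for illustration only.

```python
# Minimal sketch of cost-sensitive threshold selection; the costs (100 for a
# missed fraud, 1 for a false alarm) and the synthetic data are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

COST_FN, COST_FP = 100, 1  # hypothetical business costs

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # estimated probability of fraud

best_t, best_cost = 0.5, float("inf")
for t in np.linspace(0.05, 0.95, 19):
    tn, fp, fn, tp = confusion_matrix(y_te, (proba >= t).astype(int)).ravel()
    cost = fn * COST_FN + fp * COST_FP
    if cost < best_cost:
        best_t, best_cost = t, cost

print(f"lowest-cost threshold: {best_t:.2f} (total cost {best_cost})")
```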
Hope this helps,
Regards,
Lionel
thanks
@StudentNeedsHelp
Yes, without Auto-Model you can use the Performance (Costs) operator: first quantify the cost of a false negative (FN) and the cost of a false positive (FP), then calculate the final misclassification cost.
Please take a look at the process in the attached file, which uses your data, to experiment with and understand this approach.
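For reference, the calculation behind such a cost-based performance measure is essentially the element-wise product of the confusion matrix and a cost matrix. A small Python sketch of that idea (with placeholder costs, not the operator itself):

```python
# Minimal sketch: total misclassification cost from a confusion matrix and a
# cost matrix. The cost values are placeholders, not real business figures.
import numpy as np
from sklearn.metrics import confusion_matrix

# rows = true class (0 = non-fraud, 1 = fraud), columns = predicted class
cost_matrix = np.array([[0,   1],    # true non-fraud: correct = 0, FP = 1
                        [100, 0]])   # true fraud:     FN = 100, correct = 0

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)       # counts per (true, predicted) cell
total_cost = int((cm * cost_matrix).sum())  # 1 FP * 1 + 2 FN * 100 = 201
print("total misclassification cost:", total_cost)
```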
Hope this helps,
Regards,
Lionel