Used the template to learn about outliers for credit card fraud detection
With a few changes to the template, this is my process.
I used X-Means and Detect Outlier (LOF) to detect possible fraud. The original data set contains over 284,000 rows. I selected the first 3,000 rows for my first try.
These are the results, shown as a left half and a right half, sorted by Outlier score from high to low.
In the right half, I see Class = 1 only in rows 2 and 5. I would guess those are outliers.
Row 2 Outlier = 12.559. Row 5 Outlier = 8.030. Nearby rows have higher Outlier scores. Since both of these have Class = 1, can I assume they are probably instances of fraud?
To compare, I selected 5,000 rows for a bigger data set. Detect Outlier (LOF) took longer to run, but I got results. The process remained the same; the retrieved data set now has 5,000 rows.
This time Class = 1 again occurs twice, with Outlier scores of 16.921 and 10.364, which are not high on the list sorted from high to low.
Where Class = 1 (fraud?), shouldn't the Outlier scores be higher?
What am I possibly missing here?
Thanks for your time.
Tony
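For readers wondering what a LOF score actually measures, here is a minimal pure-Python sketch of the standard Local Outlier Factor calculation. It is not the implementation behind RapidMiner's Detect Outlier (LOF) operator; the function names, the choice of k = 3, and the toy points are all assumptions made for illustration. The point it tries to show is that LOF compares a row's local density with the density of its nearest neighbours, so a score near 1 means "about as dense as its neighbours" and a much larger score means "far sparser than its neighbours".

```python
import math

def knn(points, i, k):
    """Return the k nearest neighbours of point i as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    return dists[:k]

def lof_scores(points, k=3):
    """Local Outlier Factor for every point.

    Simplified sketch: ties at the k-distance are cut off at k neighbours,
    and duplicate points (zero distances) are not handled.
    """
    n = len(points)
    neighbours = [knn(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neighbours]  # distance to the k-th neighbour

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (k * lrd[i])
        for i in range(n)
    ]

# Toy data (hypothetical): a tight 3x3 grid plus one far-away point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
scores = lof_scores(points, k=3)
```

In this toy example the isolated point gets by far the largest score, while the grid points stay close to 1. Because the score is relative to each row's neighbourhood, it flags rows that are unusual, which is not necessarily the same thing as rows labeled Class = 1.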
Best Answer
Telcontar120 (RapidMiner Certified Analyst, RapidMiner Certified Expert)
It is hard to say for sure because I am not familiar with the details of your dataset. But it means, technically speaking, that these 4 are least like the other observations in their respective clusters. So probably that does mean that these are most likely to be fraudulent, but you should review the details of those individual cases to confirm that.
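One way to start that review is to check where the labeled rows fall in the overall ranking of outlier scores. The following is a minimal Python sketch, assuming the results have been exported as a list of rows with the `Class` and `Outlier` columns mentioned in the question. The helper name `rank_labeled_rows` is hypothetical, and only the two Class = 1 scores are taken from the post; the other values are placeholders.

```python
def rank_labeled_rows(rows, label_key="Class", score_key="Outlier"):
    """Sort rows by outlier score (high to low) and return the
    1-based ranks of the rows whose label equals 1."""
    ordered = sorted(rows, key=lambda r: r[score_key], reverse=True)
    return [rank for rank, r in enumerate(ordered, start=1) if r[label_key] == 1]

# Hypothetical rows: only the two Class = 1 scores come from the post,
# the Class = 0 scores are made-up placeholders.
rows = [
    {"Class": 0, "Outlier": 14.2},
    {"Class": 1, "Outlier": 12.559},
    {"Class": 0, "Outlier": 9.7},
    {"Class": 0, "Outlier": 8.5},
    {"Class": 1, "Outlier": 8.030},
    {"Class": 0, "Outlier": 1.1},
]
fraud_ranks = rank_labeled_rows(rows)
```

If the labeled rows sit near the top of the ranking, the outlier score is doing a reasonable job of surfacing them; if they are scattered, the individual cases and the features used for clustering probably deserve a closer look.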
Answers
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Please see my screenshot, where the Outlier scores are sorted from high to low. Four of the highest are in clusters 0 and 1. Does this not mean that the higher the Outlier score, the farther out the outlier is, and therefore that fraud is more likely in those first four rows?
Thanks once more.
Tony