Predicting Unknowns from Knowns via Supervised Learning
Hi,
We are trying to build a revenue assurance predictive model to identify possible electricity theft. Our approach is to take the already-known theft meters (hourly reads) and predict whether any other meters follow similar usage patterns (anomaly detection and pattern matching against known fraud).
The ratio is roughly 400 known theft meters to 110k unknown, so we have a very small set of knowns to match against the unknowns (example set). I have tried KNN, GBT, and Naive Bayes, tracking performance with the "Performance (Binominal Classification)" operator (i.e. LABEL=FRAUD, TRUE/FALSE). I also tried SVM, as recommended by most research papers, and its performance was terrible; I am now trying parameter optimization and it has been running for 2 days :-(
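One way to soften a 400-vs-110k imbalance like the one described above is to weight examples inversely to their class frequency. A minimal sketch in scikit-learn terms (the data, feature count, and parameter values here are synthetic placeholders, not the poster's actual setup):

```python
# Hypothetical sketch: class-weighted training under heavy imbalance.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(1100, 24))   # stand-in for 24 hourly-read features
y = np.zeros(1100, dtype=int)
y[:4] = 1                         # tiny minority: known theft meters

# Weight each example inversely to its class frequency, so the rare
# theft class is not drowned out during training.
weights = compute_sample_weight("balanced", y)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=weights)
scores = clf.predict_proba(X)[:, 1]   # ranked theft likelihood, not hard labels
```

Ranking meters by score and investigating the top few is usually more workable than hard TRUE/FALSE predictions at this imbalance.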
Below are my questions
(1) What would be the best supervised machine learning algorithms for this kind of prediction/classification?
(2) Also, how do we feed the confirmed false-positive meters back to the model as not-theft, so that the model refines itself, starts treating these as not theft, and yields a better prediction? I would appreciate it if you could share a sample process showing how to feed results back to a model.
Thx for the valuable input.
Answers
You may want to try the one-class SVM approach instead and focus on the characteristics of the known fraud cases. There is a related thread you should review, with a link to a sample process: https://community.rapidminer.com/t5/Getting-Started-Forum/One-class-label-learning/m-p/44038#M1350
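For intuition, the one-class idea can be sketched outside RapidMiner in scikit-learn terms: fit a boundary around the known theft profiles only, then flag unknown meters by how far they fall from that boundary. Data and parameter values below are synthetic assumptions for illustration:

```python
# Hedged sketch of one-class SVM: train on the known class only.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
known_theft = rng.normal(loc=0.0, scale=0.5, size=(400, 24))  # ~400 known cases
unknown = rng.normal(loc=0.0, scale=2.0, size=(1000, 24))     # subset of the 110k

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(known_theft)                  # the "unknown" class is never seen

# decision_function > 0 means "inside" the learned region (theft-like);
# < 0 means "outside" (does not resemble the known theft profiles).
dist = ocsvm.decision_function(unknown)
theft_like = unknown[dist > 0]
```

Because training touches only the ~400 knowns, fitting is fast regardless of how many unknowns are later scored.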
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thank you. How different is this one-class approach as opposed to C-SVC or radial? The current problem with the other SVM types is that they are terribly slow..
I suspect the reason the current SVM is so slow is because of the large number of examples of the "unknown" class. If you are using only the "known" class, which is much smaller, then the SVM algorithm will be much faster.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
What @Telcontar120 said. Focus on training the 'knowns' and go from there.
Thank you guys. I liked Rumsfeld analogy :-)
I trained the "knowns" (trues) with C-SVC and then tested with the "unknowns" (falses), and it just predicted everything as true. Misery..
I wanted to try "one-class", but the SVM operator complains that binominal (True/False) and numerical (1/0) labels are not supported.
How do we define a label as "one class"? See my attached process.
Attached sample data
@sunnyal Loading in your sample data, you can do something like this. With the "one class" application you just train the model on the knowns and exclude the other class completely. Then when it scores, it reports how far inside or outside each example falls relative to what the model was trained on.
Note this is just a sample template; I think you're going to have to do some feature generation to make it better. Just make sure to set your Meters attribute to an ID role.
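The feature-generation step mentioned above could look something like this: collapse each meter's hourly reads into a few summary features before any model sees them. Column names and thresholds here are made up for illustration:

```python
# Hypothetical feature generation from hourly meter reads.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
reads = pd.DataFrame({
    "meter_id": np.repeat(["M1", "M2", "M3"], 24),
    "hour": np.tile(np.arange(24), 3),
    "kwh": rng.gamma(2.0, 1.0, size=72),
})

def meter_features(g):
    kwh = g["kwh"]
    night = g.loc[g["hour"].between(0, 5), "kwh"].mean()
    day = g.loc[g["hour"].between(9, 17), "kwh"].mean()
    return pd.Series({
        "mean_kwh": kwh.mean(),
        "std_kwh": kwh.std(),
        # share of near-identical consecutive reads -> flat-line indicator
        "flat_frac": (kwh.diff().abs() < 0.05).mean(),
        "night_day_ratio": night / day,
    })

# meter_id stays as the row index, i.e. the ID role.
features = reads.groupby("meter_id")[["hour", "kwh"]].apply(meter_features)
```

One row per meter with a handful of behavioral features tends to work far better for fraud pattern matching than feeding raw hourly reads directly.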
Tom,
Thank you. After modifying my design as per the sample, I get all 400k examples treated as "outside". I guess SVM isn't doing the right thing for me. When I use Naive Bayes or GBT I do get some predictions, but way too many false positives.
To further refine my other working models, is there a way to feed the confirmed false-positive meters back to the model as additional input data (not theft / false positive), so that the model refines itself, starts treating these as not theft, and yields a better prediction?
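One simple way to realize this feedback idea, sketched in scikit-learn terms (synthetic data; the actual process would use the investigated meters): append the field-confirmed clean meters as explicit negative examples and retrain, so the next model run stops scoring those patterns as fraud.

```python
# Hedged sketch of the false-positive feedback loop.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(7)
X_train = rng.normal(size=(500, 10))
y_train = (rng.random(500) < 0.1).astype(int)   # 1 = theft

model = GaussianNB().fit(X_train, y_train)

# Field investigation confirms these flagged meters are actually clean:
X_confirmed_fp = rng.normal(size=(20, 10))
y_confirmed_fp = np.zeros(20, dtype=int)        # hard label: not theft

# Fold the confirmed negatives back into the training data and refit.
X_new = np.vstack([X_train, X_confirmed_fp])
y_new = np.concatenate([y_train, y_confirmed_fp])
model = GaussianNB().fit(X_new, y_new)
```

Each investigation round enlarges the labeled negative class, which is exactly the scarce information this problem lacks at the start.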
Thx
Hi,
What you describe is boosting. This is the technique GBTs use internally.
Did you run a Grid optimize for GBT and SVMs? What kernels did you try?
Best,
Martin
Dortmund, Germany
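For reference, a grid optimize over GBT parameters could be sketched like this outside RapidMiner (parameter names and ranges are illustrative assumptions, not tuned recommendations):

```python
# Hypothetical grid search over common GBT parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced stand-in for the meter data.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],
        "learning_rate": [0.05, 0.1],
        "max_depth": [2, 3],
    },
    scoring="roc_auc",   # AUC copes better with imbalance than accuracy
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_
```

Number of trees, learning rate, and tree depth are the usual first knobs for a GBT; scoring by AUC matters here because plain accuracy is nearly meaningless at a 400:110k ratio.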
Hi Martin,
Thanks for your note.
Yes, I tried optimizing parameters for the SVM and it didn't yield much benefit. I used the RBF kernel and optimized for gamma and C, but the optimization ran for 2 days and was still going. I tried limiting the example set and optimizing on only the actual known thefts, and the results were still terrible. I also tried GBT, with no better results. Can you suggest which parameters, and what value ranges, one should optimize for GBT?

Naive Bayes yielded a better result than any other learner: it flagged a few flat-line power consumption profiles (which are plausible candidates). However, all of them turned out to be false positives when we actually investigated those homes. As such, is there any way we can feed these false positives back to the NB or GBT model so that it stops treating these meters as positives?
Thanks for your support