Anomaly detection in RapidMiner with one Yes/No label column
Indhumathi
Member Posts: 3 Learner I
in Help
Hello All,
I have a dataset with 1000 rows in which one column contains Yes/No values identifying each row as an anomaly. I want to use this dataset to train a model. Which supervised model should I use, and how can I design a process with two inputs: a training set with the label and another set without it?
Any sample process would be very helpful.
Thanks,
Indhumathi
Answers
I advise two things:
- Use AutoModel with your labelled dataset: AutoModel will automatically find the best classifier model(s) for you.
- To learn how to train a model on a labelled dataset and then score an unlabelled dataset with the trained model, take a look at the sample processes provided by RapidMiner. More generally, the RapidMiner Academy videos will help you get familiar with the workflow of a data science project:
https://academy.rapidminer.com/
Hope this helps,
Regards,
Lionel
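For readers who want to see the train-then-score workflow outside the RapidMiner GUI, here is a minimal sketch in plain Python. It assumes a toy k-nearest-neighbour classifier as a stand-in for whichever model AutoModel would select; the function names `train` and `score`, the data values, and `k=3` are illustrative assumptions, not part of RapidMiner.

```python
import math
from collections import Counter


def train(rows, labels):
    """'Training' for k-NN simply stores the labelled examples."""
    return list(zip(rows, labels))


def score(model, rows, k=3):
    """Label each unlabelled row by majority vote of its k nearest labelled rows."""
    predictions = []
    for row in rows:
        nearest = sorted(model, key=lambda example: math.dist(example[0], row))[:k]
        votes = Counter(label for _, label in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Hypothetical labelled training set: two numeric attributes plus a Yes/No anomaly label.
train_rows = [(1.0, 0.0), (1.2, 0.1), (0.9, 0.2), (1.1, 0.0),
              (8.0, 5.0), (7.5, 5.5), (8.2, 4.8)]
train_labels = ["No", "No", "No", "No", "Yes", "Yes", "Yes"]

# Hypothetical second input: rows with the same attributes but no label.
unlabelled_rows = [(1.05, 0.1), (7.9, 5.1)]

model = train(train_rows, train_labels)
predictions = score(model, unlabelled_rows)
print(predictions)
```

The shape is the same in any tool: one input carries the label and is used to fit the model, the other has the same attributes without the label and only receives predictions.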
A     B  Anomaly
1000  0  0 (No)
50    0  1 (Yes)
40    1  0 (No)
23    1  0 (No)
0 0 (No)
Now I want to know whether any other columns affect the Anomaly flag. That is, instead of my telling the model that the flag is based on only these 2 columns, the system should tell me whether other columns such as C, D and E could also affect it.
To achieve this I have tried the 2 methods below:
1) Built an unsupervised LOF model. I don't know which column it bases the outlier score on.
2) Fed the LOF output column "outlier score" as the label into a Decision Tree in AutoModel to check which attributes contribute to the score. In the Predictions tab I can see attributes highlighted in various depths of red (contradicts) and green (supports). But I am sure that the green-highlighted columns should not cause an anomaly. How can I change that?
I also want to provide a solution based on the pattern in the anomalies. How can I achieve that with models?
Thanks,
Indhumathi
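On the question of which column LOF bases its score on: LOF is distance-based, so all numeric attributes passed to it contribute jointly and no single column is singled out. One rough way to probe a given row, offered as a sketch and not as RapidMiner's own method, is a leave-one-attribute-out check: recompute the LOF score with each attribute removed and see how much the row's score drops. The pure-Python LOF below, the helper names `lof_scores` and `attribute_influence`, the toy data, and `k=3` are all assumptions made for illustration.

```python
import math


def lof_scores(points, k=3):
    """Local Outlier Factor for every point (scores well above 1 suggest outliers)."""
    n = len(points)
    dists = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neighbours = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dists[i][j])
        kd = dists[i][order[k - 1]]          # distance to the k-th nearest neighbour
        k_dist.append(kd)
        neighbours.append([j for j in order if dists[i][j] <= kd])
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dists[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    return [sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
            for i in range(n)]


def attribute_influence(points, names, target, k=3):
    """Drop in the target row's LOF score when each attribute is left out."""
    full = lof_scores(points, k)[target]
    influence = {}
    for col, name in enumerate(names):
        reduced = [tuple(v for c, v in enumerate(p) if c != col) for p in points]
        influence[name] = full - lof_scores(reduced, k)[target]
    return influence


# Hypothetical data: the last row is unusual in attribute A only.
names = ["A", "B", "C"]
rows = [(1.0, 10.0, 5.0), (1.1, 10.2, 5.1), (0.9, 9.9, 4.9), (1.2, 10.1, 5.2),
        (0.8, 9.8, 4.8), (1.05, 10.3, 5.3), (6.0, 10.05, 5.05)]

scores = lof_scores(rows)
influence = attribute_influence(rows, names, target=6)
print(scores)
print(influence)
```

A large drop for one attribute means the row stops looking like an outlier once that attribute is ignored. Because LOF relies on distances, attributes on very different scales should be normalised first, otherwise the largest-valued column dominates the score.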
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thank you for your suggestions.
Yes, I have analysed the step 2 output, i.e. the Decision Tree predictions/Simulator, and could see the set of attributes affecting the score. When I fed the same LOF output into a Random Forest model, I saw a different set of attributes affecting the score. Neither the Decision Tree nor the Random Forest predictions are very close to the original LOF outlier score. So which method should I prefer?
1) How can I compare which method is predicting correctly?
2) If the anomaly is based on a particular set of attributes (A, B), then I need to provide a solution such as "attributes A and B must be configured properly in the system". If it is based on C, D, then the correct threshold should be set to avoid overbooking.
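On question 1, one common way to compare the two models is to measure, on rows that were held out from training, how far each model's predicted score is from the original LOF outlier score. The sketch below uses RMSE and R² in plain Python; the numbers are made-up placeholders and the names `lof_score`, `tree_pred` and `forest_pred` are hypothetical, so substitute the values exported from your own process.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: lower means predictions are closer to the target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Share of the target's variance explained by the predictions (1.0 is perfect)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Placeholder values for held-out rows (not real results).
lof_score = [1.0, 1.1, 0.9, 3.0, 1.0]      # original LOF outlier score
tree_pred = [1.0, 1.0, 1.0, 2.0, 1.0]      # Decision Tree prediction of that score
forest_pred = [1.0, 1.1, 1.0, 2.8, 1.0]    # Random Forest prediction of that score

results = {
    "decision_tree": (rmse(lof_score, tree_pred), r_squared(lof_score, tree_pred)),
    "random_forest": (rmse(lof_score, forest_pred), r_squared(lof_score, forest_pred)),
}
best = min(results, key=lambda name: results[name][0])
print(results)
print("closest to LOF score:", best)
```

Whichever model tracks the LOF score more closely on held-out rows is the more trustworthy one to read attribute influence from. If neither is close, the attributes it highlights say little about what actually drives the outlier score.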