
Anomaly detection in RapidMiner with one Yes/No label column

Indhumathi Member Posts: 3 Learner I
Hello All,

I have a dataset with 1000 rows in which one column contains Yes/No to mark each row as an anomaly. I want to use this dataset to train a model. Which supervised model should I use, and how can I design my process so that it has two inputs: a training set with the label and another set without the label?

Any sample process will be very helpful.

Thanks,
Indhumathi

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hello @Indhumathi,

    I advise two things:

     - Use AutoModel with your labeled dataset: AutoModel will automatically find the best classifier model(s) for you.
     - To learn how to train a model with a labeled dataset and then score an unlabeled dataset with the trained model, take a look at the sample processes provided by RapidMiner. More generally, the RapidMiner Academy videos will help you get familiar with the workflow of a data science project:
    https://academy.rapidminer.com/

    Hope this helps,

    Regards,

    Lionel
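
For readers who want to see the train-then-score pattern outside RapidMiner, here is a minimal Python sketch. It is only an illustration: a 1-nearest-neighbour classifier stands in for whatever model AutoModel selects, the function names `fit_1nn` and `apply_model` are hypothetical, the labelled rows are the four complete rows from the A/B/Anomaly table posted later in this thread, and the unlabelled rows are made up.

```python
import math

def fit_1nn(rows, labels):
    """'Training' a 1-nearest-neighbour model just stores the labelled examples."""
    return list(zip(rows, labels))

def apply_model(model, rows):
    """Score unlabelled rows: each one gets the label of its closest training row."""
    return [min(model, key=lambda ex: math.dist(ex[0], row))[1] for row in rows]

# Input 1: training set with the label (columns A, B; label 1 = anomaly).
train_rows = [(1000, 0), (50, 0), (40, 1), (23, 1)]
train_labels = [0, 1, 0, 0]

# Input 2: set without the label (made-up rows for illustration).
unlabelled = [(60, 0), (30, 1), (900, 0)]

model = fit_1nn(train_rows, train_labels)
predictions = apply_model(model, unlabelled)
print(predictions)
```

The point is the shape of the workflow rather than the model: one step learns from the labelled input, a separate step applies the result to the unlabelled input.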
  • Indhumathi Member Posts: 3 Learner I
    @lionelderkrikor, I have used AutoModel with Random Forest to train the model and then used Apply Model to score the test set. It is working fine now. I created the Anomaly flag column manually, based on the values of two columns, as shown below:

    A       B    Anomaly
    1000    0    0 (No)
    50      0    1 (Yes)
    40      1    0 (No)
    23      1    0 (No)
            0    0 (No)

    Now I want to know whether any other columns affect the Anomaly flag. That is, instead of me telling the model that the flag is based on only two columns, the system should tell me whether other columns such as C, D, and E also affect the Anomaly flag, since these are possibilities too.
    To achieve this, I have tried the two methods below:

    1) Built an unsupervised LOF model. I don't know which columns it bases the outlier score on.
    2) Fed the LOF output column "outlier score" as the label into a Decision Tree in AutoModel to check which attributes contribute to the score. In the Predictions tab I can see various shades of red (contradicts) and green (supports). But I am sure that the green-highlighted columns should not cause an anomaly. How can I change that?

    Also, I want to provide a solution for pattern in the anomaly. How can I achieve that with models?

    Thanks,
    Indhumathi
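
On the question of which column LOF bases its score on: in the standard definition, the local outlier factor is computed from distances between whole rows, so every attribute passed in contributes to the score, and no single column is chosen. The sketch below implements that textbook definition in plain Python, assuming Euclidean distance and no duplicate rows. It is not RapidMiner's operator, and the function name `lof_scores` and the sample points are made up for illustration.

```python
import math

def lof_scores(points, k=2):
    """Textbook local outlier factor; assumes Euclidean distance and no duplicate rows."""
    n = len(points)
    # Distances use every attribute of a row at once.
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])                # the k nearest rows
        k_distance.append(dist[i][order[k - 1]])    # distance to the k-th nearest row
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average density of the neighbours divided by the row's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Four rows in a tight cluster and one row far away (illustrative values).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
scores = lof_scores(points, k=2)
print(scores)
```

Scores near 1 mean a row is about as dense as its neighbours, and clearly larger scores mark outliers. Because the score is distance-based, attributes with large numeric ranges dominate unless the data is normalized first.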
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    What do you mean by "provide a solution for pattern in the anomaly"?  If you are talking about describing the relationship between individual attributes and the outcome, take a look at the operators "Explain Predictions" and "Model Simulator".  These allow you to look at how changes in independent variables affect your predictions based on the selected model, even when it is very complex.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
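
The idea of watching how a prediction moves when one independent variable changes can be sketched in a few lines of Python. This is only a rough analogue of that idea, not how "Explain Predictions" or "Model Simulator" work internally. The helper `what_if` is hypothetical, and `toy_predict` is a made-up stand-in for a trained model (its rule and the threshold of 100 are assumptions chosen to agree with the A/B table above).

```python
def what_if(predict, row, attribute, candidate_values):
    """Return (value, prediction) pairs, changing only `attribute` in a copy of `row`."""
    results = []
    for value in candidate_values:
        changed = dict(row, **{attribute: value})
        results.append((value, predict(changed)))
    return results

def toy_predict(row):
    """Hypothetical model: flags an anomaly when B is 0 and A is below 100."""
    return 1 if row["B"] == 0 and row["A"] < 100 else 0

base = {"A": 50, "B": 0, "C": 7}

# Varying A flips the prediction, so A matters for this row.
effect_of_a = what_if(toy_predict, base, "A", [23, 50, 1000])
# Varying C never changes the prediction, so C does not matter to this model.
effect_of_c = what_if(toy_predict, base, "C", [0, 7, 99])
print(effect_of_a, effect_of_c)
```

An attribute whose values never move the prediction, like C here, is one the model is not using, which is the kind of evidence needed to decide whether columns beyond A and B affect the flag.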
  • Indhumathi Member Posts: 3 Learner I
    Hello,

    Thank you for your suggestions.

    Yes, I have analysed the step 2 output, i.e. the Decision Tree predictions/Simulator, and could see the set of attributes affecting the score. When I fed the same LOF output into a Random Forest model, I saw a different set of attributes affecting the score. Neither the Decision Tree nor the Random Forest predictions are very close to the original LOF outlier score. So which method should I prefer?

    1) How can I compare which method is predicting correctly?

    2) I mean that if the anomaly is based on a particular set of attributes (A, B), then I need to provide a solution such as: attributes A and B should be properly configured in the system. If it is based on C and D, then the correct threshold should be set to avoid overbooking.
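
One common way to approach question 1 is to measure how far each model's predictions are from the original LOF outlier score on rows that were held out of training, and prefer the model with the smaller error. The sketch below uses root mean squared error. All numbers are made up purely to show the calculation, and the variable names are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long lists of numbers."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Illustrative values only: LOF scores for five held-out rows and the
# corresponding predictions from two surrogate models.
lof_score = [1.0, 1.1, 0.9, 3.5, 1.2]
tree_pred = [1.0, 1.0, 1.0, 2.0, 1.0]
forest_pred = [1.1, 1.1, 1.0, 3.0, 1.1]

errors = {
    "decision tree": rmse(lof_score, tree_pred),
    "random forest": rmse(lof_score, forest_pred),
}
best = min(errors, key=errors.get)
print(errors, best)
```

If both errors are large relative to the spread of the LOF scores, neither surrogate reproduces the score well, and the attribute rankings from either model should be read with caution.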