
Evaluating Anomaly and Signature Detection Methods

fwood201 Member Posts: 13 Contributor I
edited November 2018 in Help

Hi, 

 

I am a 4th-year student trying to do an experiment comparing signature-based and anomaly-based detection methods. I would like to do this using decision tree and random forest algorithms. The end goal is to measure the rate of false positives in both methods and to conclude which is better for deterring cyber attacks.

 

I am not too sure how I am going to undertake this experiment, but RM looks very helpful. I have publicly available security logs to use as data, have downloaded the Anomaly Detection and Text Mining extensions, and have set up an IDS in Security Onion to monitor my network.

 

Any advice/solutions would be much appreciated. Apologies if I've not made what I'm doing clear enough; I am far from an expert on the subject.

Cheers

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    So the first thing I would ask is whether those logs you have are labeled. Do they have a tag for "intrusion" or "no intrusion"?

     

    I would then load the data, use a Process Documents from Data operator with Tokenize/Transform Cases/etc. embedded inside, create the TF-IDF word vectors, and run a Cross Validation with, say, a Naive Bayes algorithm inside. You would have to output the per port to see how well the model classifies this data (a confusion matrix will be generated).
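
    For readers working outside RapidMiner, a rough Python/scikit-learn equivalent of that flow might look as follows; the file name and column names here are placeholders, not anything from the thread:

    # Text pipeline: tokenize/lowercase via TfidfVectorizer, classify with
    # Naive Bayes, evaluate with 10-fold cross-validation.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    df = pd.read_csv("security_logs.csv")    # hypothetical log export
    X, y = df["log_text"], df["label"]       # raw log lines, intrusion tags

    model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())

    # Cross-validated predictions for every example, then the confusion matrix
    pred = cross_val_predict(model, X, y, cv=10)
    print(confusion_matrix(y, pred, labels=["normal", "attack"]))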

  • fwood201 Member Posts: 13 Contributor I

    Yes, I have my data labelled as 'attack' or 'normal'. Will cross-validation with Naive Bayes give me the desired result of determining the rate of false positives, classification accuracy, etc.? Additionally, I have a signature database containing a list of known attacks; do I have to train the model on it manually, or does RM handle that?

  • fwood201 Member Posts: 13 Contributor I

    Furthermore, do I have to feed two sets of data into the Cross Validation operator, one for training and one for testing, or just one set which the operator splits up itself?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    The Cross Validation operator will automatically handle splitting the data into training and test sets, based on the number of folds (k) and how you want to sample them.
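
    A sketch of the same mechanics in scikit-learn, if it helps to see what happens under the hood (the toy data and fold count are assumptions):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # Toy stand-ins for the feature matrix and the attack/normal tags
    X = np.random.rand(100, 5)
    y = np.array(["normal"] * 80 + ["attack"] * 20)

    # 10 folds, stratified sampling: each fold keeps the 80/20 class ratio
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(X, y):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # fit on the training fold, score on the held-out fold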

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I should move your thread to the Studio forum; hardly anyone comes to this place.

  • fwood201 Member Posts: 13 Contributor I

    My data is the labelled NSL-KDD dataset for intrusion detection. I want to show the number of false positives as well, but the only performance operator that shows this requires a binominal label, which the data doesn't have, because each row is labelled with the specific attack rather than 'normal' or 'attack'. Is it possible to make RM treat it as just one or the other so the label can be set as binominal?

     

    Cheers

    F

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

    can you simply generate a new attribute with attack/noAttack from your given signature and use it as the label for comparison?

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • fwood201 Member Posts: 13 Contributor I

    Could I do that inside RM, or would I have to do it on the original dataset? I mean, I know the Generate Attributes operator, but honestly I'm not sure how it works.

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Of course,

    Generate Attributes is the way to go. It would be something like

     

    if(contains(signature,"attack"),"attack","noAttack")

     

    or something.

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • fwood201 Member Posts: 13 Contributor I

    That's pretty cool! So here's my dataset: as you can see, some rows are labelled 'normal', others with the name of the attack. Would the Generate Attributes operator recognise the difference between the attacks and the normal ones? How could I put this into an intrusion detection process?

     

    Cheers

    F

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

     

    looks like an expression like:

    if(label!="normal","attack","normal")

    would do the trick.
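
    Outside RapidMiner the same mapping is a one-liner; a minimal pandas sketch (the five sample labels are made up, the real NSL-KDD set has many more attack names):

    import pandas as pd

    df = pd.DataFrame({"label": ["normal", "neptune", "smurf", "normal", "teardrop"]})

    # Keep "normal" as-is; collapse every other label to "attack"
    df["binary_label"] = df["label"].where(df["label"] == "normal", "attack")
    print(df)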

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • fwood201 Member Posts: 13 Contributor I

    Thanks, that worked perfectly! My last question: do you think it would be possible to somehow feed the model a novel attack that isn't in the signature database, to demonstrate that the IDS won't detect them if the signature pattern isn't there?

     

    F.

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

     

    arguably yes. In the area of fraud you often run into a trade-off between being good at detecting known patterns (a low false-positive rate) and being good at also finding unknown patterns.

    I usually work in environments where these fraud detections are handed to a human expert who decides. In this case I recommend populating the review list from three different sources of potential frauds/attacks:

    1. A list generated by a supervised algorithm, which is very good at finding "known" (= already seen) patterns

    2. A list generated by an unsupervised algorithm, which is also good at finding new patterns

    3. Randomly selected instances as a control group

    The results of 2 and 3 then serve as new tagged examples for 1 in a feedback loop.
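
    A minimal Python sketch of this triage setup, under loose assumptions (random stand-in data, with scikit-learn's RandomForestClassifier and IsolationForest standing in for the supervised and unsupervised detectors):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, IsolationForest

    rng = np.random.default_rng(0)
    X = rng.random((500, 8))                                  # labeled history
    y = np.where(rng.random(500) < 0.1, "attack", "normal")
    new_X = rng.random((200, 8))                              # fresh traffic to triage

    # 1. Supervised list: top instances by predicted attack probability
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    attack_col = list(clf.classes_).index("attack")
    sup_list = np.argsort(clf.predict_proba(new_X)[:, attack_col])[-20:]

    # 2. Unsupervised list: most anomalous instances
    iso = IsolationForest(random_state=0).fit(X)
    unsup_list = np.argsort(iso.score_samples(new_X))[:20]    # lowest = strangest

    # 3. Random control group
    ctrl_list = rng.choice(len(new_X), size=20, replace=False)

    # Analysts review the union; their verdicts on lists 2 and 3 become
    # new labeled examples for retraining the supervised model (the loop).
    review = sorted(set(sup_list) | set(unsup_list) | set(ctrl_list))
    print(f"{len(review)} instances queued for review")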

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • fwood201 Member Posts: 13 Contributor I

    Can you suggest a process for implementing this?

     

    F.

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

     

    I think the parts themselves are fairly straightforward to build. It's what you already do, plus a standard supervised method like the one we explain in Getting Started. The tricky part is how to merge the results and how to build the feedback loop; this is usually customer-dependent.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • yyhuang Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist

    Dear @fwood201,

     

    I fully agree with @mschmitz on:

    "1. A list generated by a supervised algorithm, which is very good at finding 'known' (= already seen) patterns

    2. A list generated by an unsupervised algorithm, which is also good at finding new patterns

    3. Randomly selected instances as a control group

    The results of 2 and 3 then serve as new tagged examples for 1 in a feedback loop."

    Some fraud detection templates are built into RapidMiner Studio. You can run one for a quick demo of a supervised algorithm (Martin's list #1) on medical fraud instances. I also installed the Anomaly Detection extension from the Marketplace and ran HBOS, for instance, to get risk scores from an unsupervised algorithm (Martin's list #2). The input example set has negative (non-fraud control group) data randomly selected from a big population, so we can later look into the prediction results for false positives (pre-labelled as 'false' but predicted as 'true' fraud) and manually correct (feed back) the labels after some investigation.
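
    If you want to reproduce the HBOS scoring outside the extension, the PyOD library ships an HBOS implementation; a minimal sketch on stand-in data (the bin count and contamination rate are assumptions):

    import numpy as np
    from pyod.models.hbos import HBOS   # pip install pyod

    rng = np.random.default_rng(1)
    X = rng.random((1000, 10))          # replace with your numeric NSL-KDD features

    # Histogram-Based Outlier Score; flag the top 5% as outliers
    hbos = HBOS(n_bins=10, contamination=0.05)
    hbos.fit(X)

    risk_scores = hbos.decision_scores_   # higher = more anomalous
    flags = hbos.labels_                  # 1 = flagged outlier, 0 = inlier
    print(f"{flags.sum()} records flagged; max risk {risk_scores.max():.3f}")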

    [Attachments: template.png, addHBOS.png]
