What model to use?
Backstory: I just started using RapidMiner and I'm working with a system where a node will get pinged randomly throughout the day. From this I'm given a timestamp and I've also managed to split that timestamp up to give me month, day, hour, day of the week, and frequency per hour (not really sure if any of these features are actually significant). I'm trying to use RapidMiner to predict when a node goes 'missing'.
I want RapidMiner to take in all of this info and then spit out how confident it is that a node is missing/not missing based on how long it's been since the last ping vs. the frequency that the node has gotten in similar situations (ex. same day of week, same hour in previous days, etc). I'd be very thankful if anyone could point out some viable data models for me. If it changes anything, I also have a pretty large amount of data (been running my app for over 3 months).
Answers
An advantage of decision trees is that you can work straight with nominal attributes. You have the same advantage with an anomaly detection operator such as k-NN Global Anomaly Score. You can go either way.
Internally, dates are stored as numbers counting from 1970, and some algorithms from the anomaly detection extension do treat them as that raw number. My personal tip would be to use Date to Numerical first and translate the date into something useful, e.g. weeks since 1970.
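To illustrate the "weeks since 1970" idea outside RapidMiner, here is a minimal Python sketch. It is not what Date to Numerical does internally; the function names `weeks_since_epoch` and `ping_features` are made up for this example, and the timestamps are assumed to be timezone-aware UTC values.

```python
from datetime import datetime, timedelta, timezone

# Reference point: 1970-01-01 00:00 UTC (the Unix epoch).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def weeks_since_epoch(ts: datetime) -> int:
    """Whole weeks elapsed between the epoch and ts (ts must be timezone-aware)."""
    # Floor-dividing one timedelta by another yields an exact integer.
    return (ts - EPOCH) // timedelta(weeks=1)


def ping_features(ts: datetime) -> dict:
    """Hypothetical feature row for one ping timestamp."""
    return {
        "weeks_since_1970": weeks_since_epoch(ts),
        "day_of_week": ts.weekday(),  # 0 = Monday ... 6 = Sunday
        "hour": ts.hour,
    }


if __name__ == "__main__":
    print(ping_features(datetime(2024, 10, 28, 14, 30, tzinfo=timezone.utc)))
```

A coarse number like this keeps the long-term trend available to the model, while day of week and hour carry the repeating pattern.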
Another point is that you seem to have a very imbalanced problem, meaning you have far more "not missing" points than "missing" ones. You should consider either downsampling (Sample operator) or using weights (Generate Weights (Stratification) operator). Also be sure to use an appropriate performance measure, since plain accuracy is misleading when one class dominates.
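For anyone who wants to see what the two options amount to, here is a rough Python sketch. It is only an illustration under my own assumptions, not the exact scheme the Sample or Generate Weights (Stratification) operators use; `stratification_weights` and `downsample` are hypothetical helper names.

```python
import random
from collections import Counter


def stratification_weights(labels):
    """Give each example a weight so that every class has the same total weight."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    # Rare classes get large per-example weights, common classes small ones.
    return [total / (n_classes * counts[y]) for y in labels]


def downsample(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


if __name__ == "__main__":
    labels = ["ok"] * 8 + ["missing"] * 2
    print(stratification_weights(labels))
    print(downsample(list(range(10)), labels))
```

Downsampling throws data away but keeps training fast; weighting keeps every row but only helps if the learner actually supports example weights.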
While I generally agree that decision trees are a fine way to start, I would recommend trying a Random Forest as a second step. It is usually stronger than a single decision tree.
And a point on k-NN Global Anomaly Score: consider using LOF (Local Outlier Factor) instead. It is a bit stronger in my eyes.
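To make the difference between the two scores concrete, here is a small pure-Python sketch of both. It is a simplified textbook version, not the code of the RapidMiner anomaly detection extension: it takes exactly k neighbours (ignoring ties), assumes the points are distinct, and uses made-up example data where each point is the gap in minutes between two pings.

```python
import math


def _knn(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return dists[:k]


def knn_global_score(points, k):
    """Global score: mean distance from each point to its k nearest neighbours."""
    return [sum(d for d, _ in _knn(points, i, k)) / k for i in range(len(points))]


def lof_scores(points, k):
    """Local Outlier Factor: values near 1 are normal, much larger ones are outliers."""
    n = len(points)
    neigh = [_knn(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neigh]  # distance to the k-th neighbour
    lrd = []  # local reachability density of each point
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))  # assumes distinct points (sum > 0)
    # Compare each point's density with the density of its neighbours.
    return [sum(lrd[j] for _, j in neigh[i]) / (k * lrd[i]) for i in range(n)]


if __name__ == "__main__":
    # Hypothetical gaps between pings in minutes; the last one is unusually long.
    gaps = [(5.0,), (6.0,), (5.5,), (4.5,), (6.5,), (60.0,)]
    print(knn_global_score(gaps, k=2))
    print(lof_scores(gaps, k=2))
```

The global score is just an average distance, so it depends on the scale of the data, while LOF compares each point's density with that of its neighbours and can be read relative to 1.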
~Martin
Dortmund, Germany