Rolling features (Rolling mean, max, min, sum, ...) would be nice
Whenever I am doing Data Science and try to predict a target variable, I almost always include the past of the target variable. For example: when predicting how long some process will take, I would almost always include 'how long does it usually take' as a feature or even as a baseline model. One can compute this 'how long does it usually take' in different ways. For example: for every different process, one could just take the average over the whole training set. However, this could be a bad idea because the length of the process may depend on seasonalities or other mechanisms in the training data. That is why I prefer rolling window functions for this, i.e.
rollingMean((15,17,12,11,19,25,27,30,28), 3) would be something like (14.66667, 13.33333, 14.00000, 18.33333, 23.66667, 27.33333, 28.33333)
This is not yet included in RM at all, although it is a rather common thing to do in the DS business.
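A minimal sketch of the rolling mean described above, in plain Python rather than RapidMiner. It assumes that only full windows are averaged, so nine inputs with a window of 3 give seven values. The function name rolling_mean is only illustrative.

```python
def rolling_mean(values, window):
    """Return the mean of every full window of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # One output per full window; inputs shorter than the window give [].
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


durations = [15, 17, 12, 11, 19, 25, 27, 30, 28]
means = rolling_mean(durations, 3)
print([round(m, 5) for m in means])
# [14.66667, 13.33333, 14.0, 18.33333, 23.66667, 27.33333, 28.33333]
```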
Comments
Hi,
this feature already exists in RapidMiner. If you install the free Series extension, there's a moving average operator that does exactly what you want. It aggregates over a fixed window length, moving this window over the dataset. You can select the usual aggregation functions, so you can also compute the standard deviation of a window, which can also be helpful.
Greetings,
Sebastian
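To illustrate the idea of moving a fixed window over the data with a selectable aggregation function, here is a small Python sketch. It is not the operator's implementation; rolling_apply is a hypothetical helper, and the use of the sample standard deviation (statistics.stdev) is an assumption made only for this example.

```python
import statistics


def rolling_apply(values, window, aggregate):
    """Apply `aggregate` to every full window of `window` consecutive values."""
    return [
        aggregate(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


values = [15, 17, 12, 11, 19, 25, 27, 30, 28]
rolling_max = rolling_apply(values, 3, max)
# Sample standard deviation per window (assumption for this sketch).
rolling_std = rolling_apply(values, 3, statistics.stdev)

print(rolling_max)  # [17, 17, 19, 25, 27, 30, 30]
print([round(s, 4) for s in rolling_std])
```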
@land, may I ask a follow-up to @fryasdf's question on rolling features?
Thank you for the moving average operator you pointed out. It does not solve all my problems though. I would still like to do the following:
1) I want to be able to restrict the recalculation of the moving average to an index, say id, so that the window never spans two different ids. For instance, with a window of 2, the operator currently does
id... x... MovingAverage...
1... 3... ?
1... 2... 2.5
1... 4... 3
2... 1... 2.5
2... 3... 2
but what I want is:
id... x... MovingAverage...
1... 3... ?
1... 2... 2.5
1... 4... 3
2... 1... ?
2... 3... 2
2) I want to be able to tell the operator to use the corresponding x value wherever the moving average is missing. In my example above, I would want the result to finally look like:
id... x... MovingAverage...
1... 3... 3...
1... 2... 2.5...
1... 4... 3...
2... 1... 1...
2... 3... 2...
Can you help?
Hi,
sure, this is our everyday work...
1) Put the current process into a Loop Groups operator of the Jackhammer Extension. Select id as the attribute in the Loop Groups operator, so that its subprocess handles all rows with the same id value at one time. Then append the results again.
2) Simply use a Generate Attributes operator afterwards, where you test with if(missing([average(x)]),x,[average(x)])
Hope that helps! If you have such problems more often, you might want to consider Old World Computing's support services ;-)
Greetings,
Sebastian
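For readers who want to check the expected numbers outside RapidMiner, the two steps above (restart the window per id, then fall back to x where the average is missing) can be sketched in plain Python. This is only an illustration under the assumption of a trailing window over rows in their given order; grouped_moving_average is a hypothetical name, not an operator or API of any extension.

```python
from collections import defaultdict, deque


def grouped_moving_average(rows, window):
    """rows: iterable of (id, x) pairs. Returns (id, x, average) tuples.

    The window restarts for every id; while a group has fewer than
    `window` values, x itself is used instead of a missing average.
    """
    recent = defaultdict(lambda: deque(maxlen=window))
    result = []
    for group_id, x in rows:
        history = recent[group_id]
        history.append(x)  # deque drops the oldest value automatically
        if len(history) == window:
            average = sum(history) / window
        else:
            average = x  # fallback, like if(missing(avg), x, avg)
        result.append((group_id, x, average))
    return result


rows = [(1, 3), (1, 2), (1, 4), (2, 1), (2, 3)]
averaged = grouped_moving_average(rows, 2)
for row in averaged:
    print(row)
```

With a window of 2 this gives the averages 3, 2.5, 3, 1, 2 for the five rows, matching the table requested in the question.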
Time Series Extension
Great news!