Should I use a Red Status Highly Correlated Attribute in Auto Model?
Hi there,
The results of my Linear Model seem too good to be true: I got a 0.3% Relative Error. Can I conclude that this happened because I've included (red status) attributes that are too closely correlated to the Label (the Dow Jones closing price)? Should I trust the result?
"High Correlation: a correlation of more than 40% may be an indicator for information you don't have at prediction time. In that case, you should remove this column. Sometimes, however, the prediction problem is simple, and you will get a better model when the column is included."
For example, I included a red-status 2-day moving average that is 99% correlated to the closing price (0.7 weight). If a simple indicator like this (which is effective when day-trading spot forex) is a good predictor, as also confirmed by my Explain Predictions Random Forest model, should I include it? Why is RM Auto Model saying not to use it? I get the concept that RM is looking for patterns and for "underlying reasons" to explain the Label.
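For what it's worth, here is a quick pandas sketch of why such a column gets flagged. The prices are synthetic and the "Close" column name is a placeholder (not my actual Excel data, and not RapidMiner's actual check); the point is just that a 2-day moving average that includes today's close is almost a copy of the label:

```python
import numpy as np
import pandas as pd

# Sketch only -- synthetic prices, not the real Excel file, not RapidMiner's code.
rng = np.random.default_rng(42)
close = pd.Series(27000 + rng.normal(0, 100, 500).cumsum(), name="Close")

# 2-day moving average that includes today's close
ma2 = close.rolling(window=2).mean()

# Pearson correlation with the label (the close itself) is close to 0.99,
# well above the 40% threshold the help text warns about, hence the red status.
print("corr(MA2, Close):", round(close.corr(ma2), 4))
```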
Also, the Auto Model help notes state:
"The performance is calculated on a 40% hold out set which has not been used for any of the performed model optimisations. This hold-out set is then used as input for a multi-hold-out-set validation where we calculate the performance for 7 disjoint subsets. The largest and the highest performance are removed and the average of the remaining 5 performances is reported here."
Is it because of this disjoint-subset testing that the RM closing prices don't match the closing prices in Column E of the Excel file, and is that also why no dates are shown in the RM Auto Model results? (The closing price in Excel row 5186 is 27686, not 27386 as shown in RM row 2074.)
Lastly, why is the Simulator's predicted price so far from the actual current closing price? The Dow Jones is currently at 27778; how should I interpret the 14466 result?
Cheers for any insights,
Answers
Jacob
Hi @jacobcybulski,
Thanks for the reply.
In your experience, what would you consider a good Relative Error rate for Random Forests or ARIMA on time series, and why does ARIMA assess the predictability of indicators differently to Random Forests?
Moving averages aren’t predictive indicators and only represent past values.
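Just to check my understanding, this is how I picture it in pandas (synthetic prices and an assumed "Close" column name): a 2-day moving average only works as a prediction-time feature if it is built from past closes, e.g. by shifting before averaging.

```python
import numpy as np
import pandas as pd

# Synthetic prices; the 'Close' column name is just an assumption.
rng = np.random.default_rng(1)
close = pd.Series(27000 + rng.normal(0, 100, 8).cumsum(), name="Close")

# Leaky: includes today's close, so it partly *is* the label
ma2_incl_today = close.rolling(2).mean().rename("MA2_incl_today")

# Past-only: shift first, so the MA uses only yesterday's and the prior
# day's closes -- values actually known at prediction time
ma2_past_only = close.shift(1).rolling(2).mean().rename("MA2_past_only")

print(pd.concat([close, ma2_incl_today, ma2_past_only], axis=1).round(1))
```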
Do you know how I can get dates to show in the Auto Model results column?
Cheers,
Jacob