Is auto model showing test or train error?
Reposting this as a new thread, but my basic question is: is Auto Model showing a testing accuracy or a training accuracy in the results view? I ran a GBT in Auto Model on 4,500 rows of data with 15 features and got an "accuracy" of 90% and an f-measure of 84%, but when I applied the model to new, unseen data (which I purposely held out from the training and cross-validation process), the accuracy dropped to well below 50%. So I am not sure whether I am running the validation process incorrectly, or perhaps not understanding what the cross-validation results are telling me - I had expected Auto Model to produce an accuracy rate reflective of how well the model will perform in the future (i.e. testing error), especially since Auto Model uses cross validation inside Optimize Parameters. I am also concerned that the Split Data operator that occurred before the cross validation is perhaps causing an issue for me. Appreciate any thoughts.
Answers
Hi @tkaiser
All performance measures shown by Auto Model are calculated on testing data, of course.
But you should also note that it uses a random stratified 80/20 split into training and test sets. This means the class distribution is preserved in both subsets, so each of them is representative of the full dataset (see the sketch below).
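Just to illustrate what that kind of split looks like in code (not Auto Model itself, only a minimal sketch assuming scikit-learn, a pandas DataFrame, and a hypothetical file and label column name):

```python
# Sketch of a random stratified 80/20 split (hypothetical file and column names).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")     # hypothetical dataset
X = df.drop(columns=["label"])           # "label" is an assumed column name
y = df["label"]

# stratify=y keeps the class proportions identical in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```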
So one guess is that your problem might depend on how exactly you took out your additional testing subset from the data, and, more importantly, on whether there is any time series component in the data. It can happen, for a number of reasons, that the distribution of the features in this subset differs significantly from the rest of the data. This is what is called dataset shift in machine learning.
Pretty often I use this technique with transactional or payments data. For example, if I have one year of transactions, I might train and validate the model on data from January to November but test it separately on December alone (see the sketch below). This often leads to lower performance metrics, but for certain applications it shows the true model performance ("What would have happened if this model had been deployed since December?").
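For illustration, a rough sketch of that kind of time-based holdout in pandas (hypothetical file, column names and cut-off date):

```python
# Sketch of a time-based holdout: train/validate on Jan-Nov, test on December only.
import pandas as pd

# hypothetical file and date column
df = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])

train_val = df[df["transaction_date"] < "2018-12-01"]   # January..November
december  = df[df["transaction_date"] >= "2018-12-01"]  # held-out "future" month

# Fit and cross-validate on train_val only, then score on december to estimate
# "what would have happened if the model had been deployed since December".
```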
So again, a lot depends on the splitting you use and on the specifics of your data.
Vladimir
http://whatthefraud.wtf
Hi @kypexin,
Thank you, that's helpful. So if the problem is higher variance in the new data, is there any way to fix it? For example, I could adjust the original data to measure it in different ways while still keeping it meaningful. Maybe reduce the variance of the entire dataset? Or bin certain attributes? Or change some of the data types to characters?
Or is this a situation where I just need to keep updating and maintaining the model to keep up with changes in the data over time?
Thanks
Tripp
Hi @tkaiser
Honestly, it can be hard to come up with a definite recipe to cure this problem. There is even a whole book (!) dedicated explicitly to it: http://www.acad.bg/ebook/ml/The.MIT.Press.Dataset.Shift.in.Machine.Learning.Feb.2009.eBook-DDU.pdf -- maybe it will give you some bright ideas.
In short, I would say that modifying the initial dataset in any way might not help much, for one simple reason: in reality you don't know for sure how the structure of your data will change next time (or when that will happen).
I can give you a good example I saw once. I was working on a credit risk model and used a few months of historic data to optimize it, so the performance was pretty decent, and it worked well on current data too. Then, suddenly, one day a marketing mailing went out that was botched a bit and reached a much higher number of recipients than initially planned. As a result, in a single day a huge number of low-profile, high-risk customers were drawn in (which was not intended for that particular credit product), and the model failed for exactly that reason: it had never seen this particular customer segment before, nor their key features and behaviour.
So the approach I would choose here is constant monitoring of the model performance, regular updates, and keeping an eye on sudden spikes, so that any anomaly can be found as soon as possible. But in many domains like credit risk or payments this can be hard, because you cannot notice the predicted metric changing significantly (and thus the model performance dropping) until a certain time has passed (sometimes months!). To deal with this, some types of anomaly detection algorithms can be used, just to spot changes in the data structure in time. In some cases statistical tests can also be useful (see http://www.statisticshowto.com/homogeneity-homogeneous/ for example, and the sketch below).
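As one illustration of the statistical-test idea, here is a minimal sketch that compares feature distributions between the training data and newly arriving data with a two-sample Kolmogorov-Smirnov test (scipy; file names, columns and threshold are all hypothetical):

```python
# Sketch: flag numeric features whose distribution has shifted between
# the training data and new incoming data (two-sample KS test).
import pandas as pd
from scipy.stats import ks_2samp

train  = pd.read_csv("train.csv")        # hypothetical files
recent = pd.read_csv("last_month.csv")

numeric_cols = train.select_dtypes("number").columns
for col in numeric_cols:
    stat, p_value = ks_2samp(train[col].dropna(), recent[col].dropna())
    if p_value < 0.01:                   # arbitrary threshold, tune per use case
        print(f"Possible shift in '{col}': KS={stat:.3f}, p={p_value:.4f}")
```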
But in the end, of course, everything depends on your industry and the type of data.
Vladimir
http://whatthefraud.wtf
Vladimir is probably on the right track here. However, with a drop that drastic it might also be a different problem: did you make 100% sure that the new validation set went through EXACTLY the same data preprocessing before you applied the model? Auto Model actually performs a lot of preprocessing as well, and this has a massive effect on model performance - which is why we do it, of course :-)
You probably know this already, but I thought it might be worth pointing out as another likely source for such a performance drop. The easiest way to ensure the exact same treatment, BTW, is to keep the validation set as part of the input data but set its labels to missing. You will then get the predictions at the end and can compare them to the true labels you still have elsewhere. This is the first thing I would test, since it is the simplest thing to do. If the performance no longer drops (that drastically), the cause is likely a difference in data prep between the training data and your validation data.
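Outside of Auto Model, the same idea (never preprocess the holdout separately) roughly corresponds to fitting a single pipeline on the training data and reusing that fitted pipeline unchanged for scoring. Just a sketch with scikit-learn and stand-in data, not what Auto Model does internally:

```python
# Sketch: fit preprocessing + model as ONE pipeline on the training data only,
# then apply that exact fitted pipeline (unchanged) to the held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=4500, n_features=15, random_state=0)  # stand-in data
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier())
pipe.fit(X_train, y_train)               # preprocessing parameters are learned here only

holdout_pred = pipe.predict(X_holdout)   # the SAME fitted preprocessing is re-applied
print("holdout accuracy:", accuracy_score(y_holdout, holdout_pred))
```

The point of the single pipeline is that the holdout never gets its own, slightly different preprocessing, which mirrors the "keep the validation rows in the input data with missing labels" trick above.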
Hope this helps,
Ingo