Using the full feature set, I trained a GBT that is about 65% accurate in training:

The testing performance, however, is really bad, at less than 50% accuracy:

Answers
Is it possible that the time series aspect of this data set (or the way I structured my process in terms of the GBT and sliding validation) is contributing to the disconnect between training and testing performance?
Thanks,
Noel
( @IngoRM, @yyhuang, @varunm1, @hughesfleming68, @mschmitz, @sgenzer )
I took a step back this weekend and tried to enumerate all the moving parts in my analysis:
1. Label creation (criteria, related calculations, *alignment*; see the sketch after this list)
2. Matters relating to the time series aspect of my data (aggregation periods and types, window size, validation methodology)
3. GBT tuning (both trees in general and boosting specifically: max depth, num trees, num bins, learning rate, min split improvement, etc.)
4. Feature creation (some overlap with the time series aggregations) and selection
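For #1, here is roughly what I mean by alignment, sketched in pandas with a made-up daily series and an illustrative 5-day-ahead label (none of the numbers come from my actual process):

```python
import numpy as np
import pandas as pd

# Made-up daily series standing in for the real data.
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(0, 1, 500).cumsum() + 100,
                   index=pd.date_range("2018-01-01", periods=500, freq="D"))

horizon = 5  # purely illustrative: label = "does the value rise over the next 5 days?"
future_change = series.shift(-horizon) / series - 1
label = (future_change > 0).astype(int)

# Drop the tail where the future is unknown, so the features at time t are
# only ever paired with a label computed from t+1 .. t+horizon.
df = pd.DataFrame({"value": series, "label": label}).iloc[:-horizon]
```

The point is simply that the label for a row must be built from information strictly after that row; if the label and the window drift out of alignment, the model trains on leaked or shifted targets.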
I read a bunch of posts in the community and came away thinking that it's best to configure the GBT (thank you, @mschmitz) and be sure to have a solid validation approach in place (thank you, @Telcontar120) before focusing on feature weighting, creation, selection, etc. So I covered much of #2 and #3 (see below). If anyone has any suggestions for other GBT and time series tweaks, please let me know.
At this point, is it all about the features? Current results; training on top, testing on bottom (process and data attached):
Thanks,
Noel
-----
TimeSeries: I went with the basic aggregations to start (mean, median, max, min, stdev) and looked at aggregation periods and window sizes:
Aggregation period: 6, Window size: 5
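Outside RapidMiner, those rolling aggregations come down to something like this in pandas (assuming a window size of 5 means 5 rows; the data here is random filler):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
series = pd.Series(rng.normal(size=200))

window = 5  # mirrors the window size used in the process
roll = series.rolling(window)
features = pd.DataFrame({
    "mean":   roll.mean(),
    "median": roll.median(),
    "max":    roll.max(),
    "min":    roll.min(),
    "stdev":  roll.std(),
}).dropna()  # the first window-1 rows have no complete window
```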
I looked carefully at the Sliding Window Validation operator. I had been using training and testing windows of 100 with step sizes of their combined width. I came across @sgenzer's time series challenge and tried the validation settings discussed therein: cumulative training, single-period test windows, multiple iterations. None of it seemed to have any impact:
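For reference, this is roughly what the sliding validation amounts to, sketched with scikit-learn stand-ins rather than my actual process (X, y, and the widths are placeholders):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Toy stand-ins; in practice X and y come from the windowed feature table.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

train_width, test_width = 100, 100
step = train_width + test_width  # step size = combined width, as in my setup
scores = []
for start in range(0, len(X) - step + 1, step):
    train = slice(start, start + train_width)  # slice(0, start + train_width) for cumulative training
    test = slice(start + train_width, start + step)
    model = GradientBoostingClassifier().fit(X[train], y[train])
    scores.append(accuracy_score(y[test], model.predict(X[test])))
print(f"mean windowed accuracy: {np.mean(scores):.3f}")
```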
I also did my best to nail down the GBT parameters:
Num trees vs Depth for three learning rates (0.09, 0.10, 0.11)
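The sweep itself is conceptually just a grid over those three parameters; a scikit-learn sketch on toy data (the grid values mirror what I tried, everything else is illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 10))
y = rng.integers(0, 2, size=600)

param_grid = {
    "n_estimators":  [50, 100, 200],      # num trees
    "max_depth":     [2, 3, 4],           # depth
    "learning_rate": [0.09, 0.10, 0.11],  # the three rates I compared
}
search = GridSearchCV(GradientBoostingClassifier(),
                      param_grid,
                      cv=TimeSeriesSplit(n_splits=5),  # keeps folds in time order
                      scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```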
I'll have a look at that. Great suggestion.
Let me know if I am at least reading the right files. Usually, if you feel that something really should work better, the problem is most likely some transformation on your attributes that is killing your signal by mistake.
Any kind of feature selection risks overfitting the training data, especially when the signal-to-noise ratio is low. It can certainly make a good base model better, but watch out if it is making a really big difference. You may have to shift your data a few times to see if there is consistency in which attributes are being thrown out.
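Roughly what I have in mind, sketched in Python with a plain importance threshold standing in for whatever selector you are using (the data and the 0.05 cutoff are made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(800, 12)),
                 columns=[f"att_{i}" for i in range(12)])
y = pd.Series(rng.integers(0, 2, size=800))

# Repeat the selection on several shifted windows and compare which
# attributes survive; unstable picks suggest the selection is fitting
# noise rather than signal.
window = 500
for shift in (0, 100, 200, 300):
    Xw, yw = X.iloc[shift:shift + window], y.iloc[shift:shift + window]
    model = GradientBoostingClassifier().fit(Xw, yw)
    kept = list(Xw.columns[model.feature_importances_ > 0.05])
    print(f"shift {shift}: {kept}")
```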
What is really jumping out at me is that you are sampling down your training set to 1000 before automatic feature selection. I wouldn't do this. Try to keep the sequences intact and remove any randomness. Try using the last 1000 samples instead.
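In pandas terms, the difference is just this (df stands in for your example set, assumed already sorted by date):

```python
import numpy as np
import pandas as pd

# df stands in for the full example set, assumed already sorted by date.
df = pd.DataFrame({"value": np.arange(5000)},
                  index=pd.date_range("2010-01-01", periods=5000, freq="D"))

random_subset = df.sample(1000, random_state=42)  # shuffles away the time order
recent_subset = df.tail(1000)                     # keeps the sequences intact
```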
Your process is complicated but still a lot easier than digging through code. I see that you are downsampling a couple of times in your other processes and you are not using a local random seed. My fear of this may be unjustified; it might be fine to do this. I don't, but that is just me. I am actually curious what other people think. Anyone?
Alex
Thanks
It comes from the windowing operator. Substituting the value series windowing operator fixes the problem. We can continue this via private mail if you wish. I just want to make sure that I am seeing what you are seeing.
I also had to adjust the filter examples attribute names for the data attribute.
When it runs, I get this. Using GLM is slightly better.
Check my adjusted version to see.
( @IngoRM, @yyhuang, @varunm1, @hughesfleming68, @mschmitz, @sgenzer @CraigBostonUSA @Pavithra_Rao )
Sorry for not responding earlier. This seems to be solved, right? I just skimmed through the thread. There seemed to be an issue with the Windowing operator and the GBT; I think @hughesfleming68 reported on this. Is this still an issue?
Best regards,
Fabian
Alex
There are two issues. The first has to do with GBTs and time series data: for daily data, is there a "right" amount of training that is sufficient for the task but avoids overfitting and the divergence between training and testing performance?
The second issue, I think, has to do with the core Windowing operator's behavior in 9.4. It seems to change all the labels to a single value, which leads to the GBT complaining during validation that the response is constant (the error @hughesfleming68 reported).
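As a quick sanity check, I now look at the label distribution of the windowed data before validation, along the lines of this sketch (check_label_variety is just an illustrative helper, not part of my process):

```python
import pandas as pd

def check_label_variety(windowed: pd.DataFrame, label_col: str = "label") -> None:
    """Fail fast if windowing collapsed the label to a single value."""
    counts = windowed[label_col].value_counts()
    print(counts)
    if len(counts) < 2:
        raise ValueError(f"'{label_col}' is constant after windowing; "
                         "the GBT will refuse to train on it during validation.")

# A healthy label has at least two values:
check_label_variety(pd.DataFrame({"label": [0, 1, 0, 1]}))
```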
Thanks,
Noel