Gradient Boosted Tree and performance
Dear community,
I want to understand my GBT algorithm. I trained it and validated it on new data with quite good results. Now I would like to understand the model and find out which attributes were the most decisive, but here I fail. For example, my Tree 1 is described as:
ch1 in {1009351207,1047831207,... (46 more)}: 0.013 {}
ch1 not in {1009351207,1047831207,... (46 more)}
| ch1 in {1009351207,1000751092,... (49 more)}: -0.009 {}
| ch1 not in {1009351207,1000751092,... (49 more)}: -0.027 {}
Could you please explain where I can find these 46 more attributes? Or 49 more attributes?
Thanks a lot.
Best Answer
BalazsBarany:
Hi @Barborka,
if you're looking at the description of one tree and it only contains ch1, then it only considers ch1. Other trees might consider different attributes. The weights output of the entire model shows the summary - single trees are not that relevant.
I couldn't find a way to extract the whole list of values going into the rules. There are some promising operators like Tree to Rules and DecisionTree to ExampleSet (in the Converters extension), but these don't work with GBT, only with single trees.
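For illustration only, here is a minimal sketch of the idea outside RapidMiner, using scikit-learn's GradientBoostingClassifier rather than RapidMiner's H2O-based operator: each boosting stage is a small tree whose rules can be printed. Note that scikit-learn splits numerically encoded attributes on thresholds rather than on sets of nominal values, so the output will not look exactly like the ch1 in {...} rules above; the attribute names ch1..ch5 are made up for the example.

```python
# Minimal sketch (not RapidMiner): print the split rules of every tree
# in a scikit-learn gradient boosted model. scikit-learn splits numeric
# encodings on thresholds, while RapidMiner/H2O GBT splits nominal
# attributes on sets of values, so the rule format differs.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
feature_names = [f"ch{i}" for i in range(1, 6)]  # hypothetical attribute names

gbt = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42)
gbt.fit(X, y)

# estimators_ holds one small regression tree per boosting stage
for i, stage in enumerate(gbt.estimators_[:, 0], start=1):
    print(f"--- Tree {i} ---")
    print(export_text(stage, feature_names=feature_names))
```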
Regards,
Balázs
Answers
With a complex model like GBT it's very complicated to derive the attribute importance directly from the model.
In your example, ch1 is the attribute name; the 1009... (46 more) entries are different values (data in the ch1 column).
So in this example only the attribute ch1 is relevant at all.
The Gradient Boosted Trees operator has an output called "wei". These are the attribute weights calculated by the model. Higher values in this table mark the more important attributes for predicting the label.
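As an illustration of what such a weights table summarizes, here is a minimal sketch using scikit-learn rather than the RapidMiner output port itself (the attribute names ch1..ch5 are made up): feature_importances_ aggregates each attribute's contribution over all trees, similar in spirit to the weights output.

```python
# Minimal sketch (not the RapidMiner "wei" port itself): attribute
# importances aggregated over all trees of a gradient boosted model.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
feature_names = [f"ch{i}" for i in range(1, 6)]  # hypothetical attribute names

gbt = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X, y)

weights = (pd.Series(gbt.feature_importances_, index=feature_names)
             .sort_values(ascending=False))
print(weights)  # higher value = more important attribute overall
```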
If I saw a model like this, I would suspect that these are IDs and the model is just learning them. This would mean that the model is overfitted. I hope this is not the case with your data, but you should check.
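A minimal sketch of two quick checks for that suspicion, assuming a pandas DataFrame loaded from a placeholder file with hypothetical "label" and "ch1" columns: how many distinct values ch1 has relative to the number of rows, and the gap between training and hold-out accuracy.

```python
# Minimal sketch of two quick checks for "the model just memorizes IDs":
# (1) the share of distinct values in the suspect attribute, and
# (2) the gap between training and hold-out accuracy.
# The file name and the column names "ch1" / "label" are placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("your_data.csv")                # placeholder path
ratio = df["ch1"].nunique() / len(df)
print(f"ch1 has {df['ch1'].nunique()} distinct values "
      f"({ratio:.1%} of the rows); close to 100% suggests an ID column")

X = pd.get_dummies(df.drop(columns=["label"]))   # "label" is a placeholder
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gbt = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", gbt.score(X_train, y_train))
print("test accuracy: ", gbt.score(X_test, y_test))
# a large gap between the two scores indicates overfitting
```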
Regards,
Balázs