Weight/select attributes used by model or model ensemble?

mafern76 Member Posts: 45 Contributor II
edited November 2018 in Help
I made a bagging ensemble of different trees, each using different samples and therefore different attributes.

I got about 100 models' worth of attributes. I would like to easily obtain a list of the attributes used without having to scroll manually through every model.

I haven't found a way to do so. The closest thing I found is Weight by Tree Importance, which could work for me, but I didn't use Random Forest because I needed different samples for each iteration.

Thanks for your help!

Best regards.

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    I am not aware of any prebuilt operator for that. I guess you need to use an Execute Script for that.
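
    Something along these lines might work - an untested Groovy sketch for the Execute Script operator, assuming the bagged model arrives as a MetaModel whose sub-models are TreeModels (the class and method names follow older RapidMiner APIs and may differ in your version):

    ```groovy
    // Untested sketch for an Execute Script operator placed after Bagging.
    // ASSUMPTION: the ensemble is a MetaModel of TreeModels; method names
    // (getModels, getRoot, childIterator, getCondition) are taken from older
    // RapidMiner APIs and may differ in your version.
    import com.rapidminer.operator.learner.meta.MetaModel
    import com.rapidminer.operator.learner.tree.Tree
    import com.rapidminer.operator.learner.tree.TreeModel

    Set<String> used = new TreeSet<String>()

    // recursively collect every attribute that appears in a split condition
    def collect
    collect = { Tree node ->
        if (!node.isLeaf()) {
            for (edge in node.childIterator()) {
                used.add(edge.getCondition().getAttributeName())
                collect(edge.getChild())
            }
        }
    }

    MetaModel ensemble = input[0]        // the bagged model on the first port
    for (model in ensemble.getModels()) {
        if (model instanceof TreeModel) {
            collect(((TreeModel) model).getRoot())
        }
    }

    operator.logNote("Attributes used by the ensemble: " + used)
    return input[0]
    ```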

    By the way - the RF also uses a bagged example set per tree. But I guess you need specific examples per tree?

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mafern76 Member Posts: 45 Contributor II
    Hi Martin, thanks for your answer. I'll look into Execute Script.

    About the process - what do you mean by "a bagged example set"? I meant I needed different samples of cases, not attributes. I left feature selection to the DT itself.

    I had a very imbalanced data set; for speed I tuned parameters using a 3/1 false/true ratio, when originally it was about 19/1.

    I noticed some variability in results with different samples, so I thought about using a "random sample forest" to smooth out the edges due to that factor. I simply put the sampling operator inside a Bagging node with a 1.0 subset ratio, so in each iteration I would get a different sample of false cases. The sampling operator was set to balance the false cases but leave the true cases at a 1.0 ratio.

    Unfortunately, I only realized afterwards, as you can see in my other thread, that this variability was also due to the Decision Tree (Parallel) algorithm itself. Even though that variability may have helped the forest in the end, I wanted to at least know what was going on, exactly what determined that randomness.

    Thanks!

    Best regards.
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    the random forest (by Breiman et al.) has two different sources of randomness (sketched below):

    1. At each split (not once per tree!) you have a different subset of attributes "visible" for the split
    2. Each tree is built on 90% of the data, which is generated via bootstrapping (->Bagging)
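
    To make the two sources concrete, here is a plain-Groovy sketch (illustrative data only, not RapidMiner API):

    ```groovy
    // Illustration of the two sources of randomness described above.
    def rnd = new Random(42)
    def attributes = ['age', 'income', 'balance', 'tenure']  // made-up names
    def rows = (1..10).toList()

    // 1. at each split, only a random subset of the attributes is "visible"
    3.times { split ->
        def candidates = new ArrayList(attributes)
        Collections.shuffle(candidates, rnd)
        println "split ${split}: visible attributes = ${candidates.take(2)}"
    }

    // 2. each tree trains on a bootstrap sample, drawn with replacement,
    //    so duplicates appear and some rows are left out
    def bootstrap = (1..rows.size()).collect { rows[rnd.nextInt(rows.size())] }
    println "bootstrap sample: ${bootstrap.sort()}"
    ```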

    Have you thought about using weights to balance the classes? Using weights is different from sampling for a tree-based algorithm, because the tree can be built way deeper.
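
    Roughly, weight-based balancing keeps every example and only rescales the classes. A plain-Groovy sketch of the arithmetic (what an operator like Generate Weight (Stratification) computes, not its actual code):

    ```groovy
    // Inverse-frequency class weighting: every class ends up with the same
    // total weight, so no examples are discarded the way downsampling does.
    def labels = ['false'] * 19 + ['true']            // a 19/1 imbalance
    def counts = labels.countBy { it }                // [false:19, true:1]
    def perClass = counts.collectEntries { cls, n ->
        [(cls): labels.size() / (counts.size() * n)]  // n_total / (k * n_class)
    }
    def weights = labels.collect { perClass[it] }
    println perClass   // false ~0.53, true 10.0 -> both classes total 10
    ```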

    Another tip: I would personally not prune random forests unless I need to because of overtraining. I do not like the RM preset here; it goes a bit too far in protecting against overtraining.
    And: Do you know the WEKA plugin?

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mafern76 Member Posts: 45 Contributor II
    Hi Martin! Thanks for your reply.

    Thanks for the clarification on the Random Forests.

    "2. Each tree is built on 90% of the data, which is generated via bootstrapping (->Bagging)"

    By bootstrapping, do you mean resampling with replacement? When trying to find the right operators for the process I described, I did some breaking-point tests using IDs to check whether a 1.0 subset ratio on the Bagging operator would give me resampling or not, and apparently not: if I had 100 records, every single record would come up, rather than a bootstrap resample of 100 containing duplicates.
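
    In plain Groovy terms, the distinction I was testing for looks like this (illustrative sketch, not the actual process): a true bootstrap of 100 IDs should contain duplicates (about 63 unique on average), while sampling without replacement at a 1.0 ratio returns all 100.

    ```groovy
    // Bootstrap (with replacement) vs. sampling without replacement at 1.0.
    def rnd = new Random()
    def ids = (1..100).toList()

    def bootstrap = (1..100).collect { ids[rnd.nextInt(ids.size())] }
    def noReplacement = new ArrayList(ids)   // every ID exactly once

    println "bootstrap unique IDs: ${bootstrap.toSet().size()}"          // ~63
    println "no-replacement unique IDs: ${noReplacement.toSet().size()}" // 100
    ```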

    I did use weights (I always do), but for speed I used only a 3/1 false/true ratio. Another option would have been to also optimize the ratio, but I opted to keep 3/1 and try what I described before, to avoid having to optimize the DT parameters again. I was optimizing gain and leaf size, and somewhere down the road I think I concluded that different ratios would have different optimal parameters, but I may be wrong. I think I simply tried a full-sample tree with the 3/1-ratio parameters and it performed worse than the 3/1 tree.

    Thanks for the tip on pruning; I'm already not doing it, and I agree with you. Pruning was usually too harsh for me, so I only keep prepruning to control leaf size.

    About WEKA: I looked into the WEKA ensembles, but none of them have connectors (inner ports) inside (???). I haven't looked into the trees though...

    Best regards, thanks for your insight!