RM Decision Trees, Adaboost
Legacy User
Random question on how decision trees work in RapidMiner. I'm running a decision tree for a predictive model, and at the moment I'm just splitting my dataset into 80% train / 20% test. It's a polynominal (multi-class) classification problem with numerical and nominal attributes. Two questions:
1) When I run a single decision tree with the Split Validation operator, why does it run the decision tree training twice? Looking at the log, it runs once, and then I see validation still running and a "[2] Decision Tree" entry in the log.
2) When I use AdaBoost to boost the decision trees, the run time and memory usage increase exponentially with each iteration, e.g. 30 minutes for the first, then 1 hour, then 2 hours, etc. Obviously I can't run a model with this kind of resource usage, but why is this the case? I've tried boosting methods in other programs and have not run into exponentially increasing run times. Do I have a parameter set wrong?
Thanks!
Answers
I noticed this when I analyzed the execution times in the logs. I also checked whether the process status bar shows it, and there is indeed a modeling operator (Neural Net or SVM) with an index of [2]. So the training phase runs twice.
Edit: I investigated the issue using breakpoints after the Neural Net operator. The first time, it uses only 70% of the examples to train the network, but the second time, training was executed on the entire dataset.
Edit 2: After investigating further, I think I figured out why the Split Validation operator behaves like this. Its main steps are:
1) Runs the training subprocess on the training set, which is 70% of the entire sample by default. It stores the resulting model (let's call it model1) for later use in the testing subprocess. The performance of this model (if it is measured) is later delivered as an output of the Split Validation operator on one of the corresponding ave ports.
2) Runs the training subprocess again on the entire sample (100%). The resulting model (let's call it model2) is later delivered as the output of the Split Validation operator on the mod output port.
3) Runs the testing subprocess on the remaining portion of the sample (30% by default). The inner mod input port of the testing subprocess delivers model1 for testing. The performance of this model (if it is measured) is later delivered as an output of the Split Validation operator on one of the corresponding ave ports.
So this behavior is intentional, but it would be better if I could turn off the training on the entire dataset with a parameter while I am searching for the best parameter combination. That could cut the search time in half.
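To make the three steps above concrete, here is a minimal Python sketch of the behavior as I understand it. It is not RapidMiner code: `split_validation`, `train_majority`, and the `train_on_full` flag are hypothetical names made up for illustration, and the "learner" is a toy majority-class predictor standing in for a Decision Tree or Neural Net. The `train_on_full` flag only illustrates the switch wished for above; it is not an existing RapidMiner parameter.

```python
import random
from collections import Counter


def train_majority(rows, log):
    """Toy learner (stand-in for a real model): always predicts the most common label.

    Appends the training-set size to `log` so each training run is visible.
    """
    log.append(len(rows))
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: majority


def split_validation(rows, train, log, split_ratio=0.7, train_on_full=True, seed=0):
    """Sketch of the described steps; returns (model2, accuracy of model1)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * split_ratio))
    train_part, test_part = shuffled[:cut], shuffled[cut:]

    # Step 1: train on the training portion (70% by default) -> model1
    model1 = train(train_part, log)

    # Step 2: train again on 100% of the data -> model2 (the delivered model).
    # Hypothetical switch: skip this run when the final model is not needed.
    model2 = train(shuffled, log) if train_on_full else None

    # Step 3: measure performance of model1 on the held-out portion
    correct = sum(model1(features) == label for features, label in test_part)
    return model2, correct / len(test_part)


if __name__ == "__main__":
    data = [((i,), "a" if i % 3 else "b") for i in range(100)]

    log = []
    split_validation(data, train_majority, log)
    print(log)  # [70, 100]: two training runs, as seen in the log

    log = []
    split_validation(data, train_majority, log, train_on_full=False)
    print(log)  # [70]: only the validation model is trained
```

In both cases the reported performance comes from model1 only, so skipping the second run would not change the numbers used to compare parameter combinations.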
If the model output of the validation operator is not connected, it shouldn't run the model building twice.
Regards,
Balázs
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts