Classification model and preparing data
Hello everyone,
I want to build different classification models. I have two questions.
1) First, I want to build a decision tree, so I have to convert the numeric values into nominal ones. I can do this with the discretizing operator, but my numeric attributes all have different distributions. Do you know of any literature that recommends the best discretization method for each case? I also read that I could do it with k-means clustering, but that doesn't work with missing values.
2) I often read that I have to split my dataset into a training part and a testing part, which I can do with the splitting operator. I don't understand why I should split into only two parts and not three. What about my non-classified observations? Are they included in both the training and the testing part? In my opinion I have to split into a training part, a testing part and a real prediction part.
Thank you very much.
Regards
Best Answers
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi @keb1811,
for 1) RapidMiner's Decision Tree can cope with numerical values, so you do not need to convert them. Discretizing may sometimes help, but there is no single "best thing to do".
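If you do want to try discretizing, a small sketch may show why no one method suits every distribution. This is plain Python rather than a RapidMiner process, and the attribute values and helper names (`equal_width_edges`, `equal_frequency_edges`, `discretize`) are made up for illustration. It compares equal-width and equal-frequency binning on a skewed attribute and leaves missing values as missing:

```python
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Cut points that divide [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def equal_frequency_edges(values, n_bins):
    """Cut points chosen so that each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def discretize(value, edges):
    """Return the bin index for a value; a missing value (None) stays missing."""
    if value is None:
        return None
    return bisect_right(edges, value)

# Hypothetical right-skewed attribute with one missing value.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 40, 100, None]
present = [v for v in raw if v is not None]

width_edges = equal_width_edges(present, 3)
freq_edges = equal_frequency_edges(present, 3)

width_bins = [discretize(v, width_edges) for v in raw]
freq_bins = [discretize(v, freq_edges) for v in raw]

print(width_bins)  # almost everything falls into the first bin
print(freq_bins)   # the bins are roughly balanced
```

On skewed data like this, equal-width binning puts nearly all values into one bin, while equal-frequency binning spreads them out. Which one helps a given learner still has to be checked empirically.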
for 2) You are basically right: in the literature, the real prediction set (you may rather call it the application data set) is often neglected. You basically separate it off first (using the Filter Examples operator) and then do your splitting.
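The order of the two steps can be sketched outside RapidMiner as well. The following is a minimal plain-Python illustration, not the Filter Examples or Split Data operators themselves. The row layout, the `label` key, the 30% test fraction and the function names are assumptions made for the example:

```python
import random

def split_labeled_unlabeled(rows, label_key="label"):
    """Step 1: set aside the rows without a label (the application data set)."""
    labeled = [r for r in rows if r.get(label_key) is not None]
    unlabeled = [r for r in rows if r.get(label_key) is None]
    return labeled, unlabeled

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Step 2: shuffle the labeled rows and cut them into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data: ids 0..9 carry a label, ids 10..13 are not classified yet.
rows = [
    {"id": i, "x": float(i), "label": ("yes" if i % 2 else "no") if i < 10 else None}
    for i in range(14)
]

labeled, application = split_labeled_unlabeled(rows)
train, test = train_test_split(labeled)

print(len(train), len(test), len(application))
```

The unlabeled rows never take part in training or testing because they have no label to learn from or to check against. They only receive predictions once the model is built.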
Cheers,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
MartinLiebig
Hi,
You got the idea right, great! You can basically replace the Decision Tree operator with your Cross Validation. You will receive the model on the mod port at the top.
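To make the data flow concrete, here is a rough plain-Python sketch of the same idea. It is not RapidMiner code. It assumes that the model delivered by a cross validation is one trained on all labeled examples, while the folds only provide the performance estimate. The one-feature "stump" learner, the hard-coded classes "no"/"yes" and all names are invented for illustration:

```python
def fit_stump(rows):
    """Depth-1 decision tree on feature 'x': keep the threshold with the fewest
    training errors. Assumes exactly two classes named 'no' and 'yes'."""
    best = None
    for t in sorted({r["x"] for r in rows}):
        for left, right in (("no", "yes"), ("yes", "no")):
            errors = sum((left if r["x"] <= t else right) != r["label"] for r in rows)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return {"threshold": t, "left": left, "right": right}

def predict(model, row):
    """Predict the class of a single row with a fitted stump."""
    return model["left"] if row["x"] <= model["threshold"] else model["right"]

def cross_validate(rows, k=5):
    """Mean accuracy over k folds; row i is assigned to fold i % k."""
    accuracies = []
    for fold in range(k):
        test = [r for i, r in enumerate(rows) if i % k == fold]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        model = fit_stump(train)
        correct = sum(predict(model, r) == r["label"] for r in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Hypothetical labeled data and two not-yet-classified rows.
labeled = [{"x": float(i), "label": "no" if i < 5 else "yes"} for i in range(10)]
application = [{"x": 1.5}, {"x": 7.5}]

estimate = cross_validate(labeled, k=5)   # performance estimate from the folds
final_model = fit_stump(labeled)          # model trained on all labeled rows
predictions = [predict(final_model, r) for r in application]

print(estimate, predictions)
```

The cross validation answers "how good is this kind of model?", and the final model is what gets connected to the application data.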
Best,
Martin
Answers
Now I tried it without discretizing.
And to the second question:
Do you mean it like this? In the beginning I separated all classified examples from all non-classified ones (Filter 3 and 4). Then I tried to train the model in the upper part, but I don't know how to implement the cross validation when I also want a connection to the bottom "Apply Model". The bottom one should apply the model to the non-classified data.
Thanks again for your answer. Is it implemented correctly, as you indicated? (see picture)
And one other question: Can I use this "layout" for other classification algorithms such as KNN, Naive Bayes, Neural Nets and SVM if I only change the operator (Decision Tree) inside the Cross Validation process?
I think the different algorithms will need different preparation. Does it work if I do the preparation after the "Set Role" operator, or do I have to insert it before the "Multiply" operator (because of "Filter Examples 4")?
Thank you very much!