Build a decision tree using Python and embed it in RapidMiner
Hi guys,
I am doing a project where I need to create a decision tree using Python and then embed it in RapidMiner using the Execute Python operator.
These are screenshots of my process:
Subprocess in Cross Validation
This is my code for the decision tree:
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

# rm_main is a mandatory function,
# the number of arguments has to be the number of input ports (can be none)
def rm_main(data):
    # import data
    file = '04_Class_4.1_german-credit-decoded.xlsx'
    xl = pd.ExcelFile(file)
    print(xl.sheet_names)
    # load a sheet into a DataFrame
    gr_raw = xl.parse('RapidMiner Data')
    # create arrays for the features, X, and response, y, variable
    y = gr_raw['Credit Rating=Good'].values
    X = gr_raw.drop('Credit Rating=Good', axis=1).values
    # split data into training and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)
    # build decision tree classifier using gini index
    clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)
    return clf_gini
When executed, it gives me an error. I am not sure which part of this code I should remove for a successful execution.
Would appreciate any advice or help on this!
Thank you.
Regards,
Azmir F
Best Answer
Thanks guys for the solutions you have provided. I have managed to come up with my own solution.
I did not know that Python needs numerical data to apply the model. So I have modified my process and used the Execute Python operator twice, once in Training and once in Testing. I used the Numerical to Binominal operator after the second Execute Python operator.
Note that I have renamed the two operators to Build Model and Apply Model.
This is my updated process:
Cross Validation Subprocess
My Python script for Build Model is as below:
from sklearn.tree import DecisionTreeClassifier

def rm_main(data):
    # build decision tree
    X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income', 'Credit Amount', 'Present residence since', 'Number of existing credits', 'Number of dependents']]
    y = data[['Credit Rating']]
    clf = DecisionTreeClassifier(min_samples_split=20, max_depth=10, random_state=99)
    clf.fit(X, y)
    return clf

My Python script for Apply Model is as below:
from sklearn.tree import DecisionTreeClassifier

def rm_main(model, data):
    X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income', 'Credit Amount', 'Present residence since', 'Number of existing credits', 'Number of dependents']]
    data['prediction'] = model.predict(X)
    # set role of prediction attribute to prediction
    data.rm_metadata['prediction'] = (None, 'prediction')
    return data

Let me know if you have another relevant solution or a better script to produce a more stable model.
Thank you.
Regards,
Azmir F
Answers
Hi Azmir
1. I think it is not possible to build only the model in Python inside the "Cross Validation" operator, because the "Apply Model" operator (in the testing part) expects a RapidMiner model as input but receives a Python object, so the process fails.
Maybe someone has a solution to this problem (if not, see point 2). However, I have corrected some points in the process (I worked with the same dataset a few weeks ago):
- added a "Nominal to Numerical" operator (Python needs numerical values to build the model)
- built the model with the entire dataset (you performed a split validation inside a cross validation, which in my view is not needed)
- removed the data import from your "Execute Python" script (the "data" parameter of the Python function is in fact the dataset that enters the Python operator).
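The three corrections above could be sketched in a single Execute Python script roughly as follows. This is only a minimal sketch and not the attached process: the label column name `Credit Rating` and the tree parameters are taken from this thread, while the use of `pandas.get_dummies` in place of the Nominal to Numerical operator and the tiny made-up frame at the bottom are assumptions for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

LABEL = 'Credit Rating'  # label column name as used in this thread


def rm_main(data):
    # 'data' is already the dataset arriving at the input port,
    # so no file needs to be read inside the script.
    y = data[LABEL]
    # Assumption: one-hot encode nominal columns in pandas instead of
    # using the Nominal to Numerical operator upstream.
    X = pd.get_dummies(data.drop(columns=[LABEL]))
    # Fit on the whole incoming set; no extra train/test split here.
    clf = DecisionTreeClassifier(criterion='gini', max_depth=10,
                                 min_samples_leaf=5, random_state=50)
    clf.fit(X, y)
    return clf


# Hypothetical toy data, only to show the call outside RapidMiner.
toy = pd.DataFrame({
    'Age': [25, 40, 33, 51, 29, 46],
    'Purpose': ['car', 'tv', 'car', 'tv', 'car', 'tv'],
    LABEL: ['Good', 'Bad', 'Good', 'Bad', 'Good', 'Bad'],
})
model = rm_main(toy)
```

Inside Cross Validation the outer operator already provides the training fold, which is why the sketch fits on everything it receives.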
Here is this process:
2. I think the solution is to perform the whole subprocess (building/applying/cross-validation/performance) with "Execute Python" operators
(only the data preprocessing is done with RapidMiner operators).
In the process below, in addition to the modifications described in point 1, I have created an applying/cross-validation/performance "Execute Python" operator that outputs:
- the y_prediction (from applying the decision tree model to the training dataset), which is added to the dataset as the last column
- the associated accuracy (~70%)
- the feature importance
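An operator script with those three outputs might look like the sketch below. It is not the process attached to this post: the function body, the three-output layout, and the toy data are assumptions, and only `rm_main`, the model and data inputs, and the `prediction` column name come from this thread.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

LABEL = 'Credit Rating'  # label column name as used in this thread


def rm_main(model, data):
    X = data.drop(columns=[LABEL])
    # Apply the fitted tree to the same (training) data.
    scored = data.copy()
    scored['prediction'] = model.predict(X)  # appended as last column
    # Accuracy of the predictions against the true label.
    accuracy = pd.DataFrame(
        {'accuracy': [accuracy_score(data[LABEL], scored['prediction'])]})
    # One row per feature, most important first.
    importance = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_,
    }).sort_values('importance', ascending=False)
    # Three return values, intended for three output ports.
    return scored, accuracy, importance


# Hypothetical numeric toy data, only to exercise the function.
toy = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46, 56],
    'Duration in month': [6, 12, 48, 36, 24, 60],
    LABEL: [1, 1, 0, 0, 0, 0],
})
tree = DecisionTreeClassifier(random_state=99).fit(
    toy.drop(columns=[LABEL]), toy[LABEL])
scored, accuracy, importance = rm_main(tree, toy)
```

Because the model is scored on the data it was trained on, the accuracy from a script like this is optimistic compared with a cross-validated estimate.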
Here is this process:
I hope this will be helpful,
Regards,
Lionel
Here's the building block I use for XValidation with Python. I have one that also works with the Compare Models operator, but that is very complex.
I think the process is correct; there were similar processes with R in the forum.
As a side note, may I ask why you need to use the Python decision tree? By using the Execute Python operator several times (2 times per CV fold) you are generating a huge overhead and also interfering with the parallelization features of RapidMiner. I would say that the smarter thing to do would be to use the Decision Tree operator or to do the CV inside the Execute Python operator.
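For the second option, doing the cross-validation inside a single Execute Python operator, a minimal sketch could use scikit-learn's `cross_val_score`. The 10 folds, the tree parameters (borrowed from the Build Model script earlier in this thread), and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

LABEL = 'Credit Rating'  # label column name as used in this thread


def rm_main(data):
    X = data.drop(columns=[LABEL])
    y = data[LABEL]
    clf = DecisionTreeClassifier(min_samples_split=20, max_depth=10,
                                 random_state=99)
    # cross_val_score refits a fresh copy of the tree on each fold,
    # so the whole validation happens in one Python call.
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    # Return per-fold accuracies as a table for the output port.
    return pd.DataFrame({'fold': np.arange(1, len(scores) + 1),
                         'accuracy': scores})


# Synthetic numeric data, only to exercise the function.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'Age': rng.integers(18, 75, size=200),
                    'Credit Amount': rng.integers(250, 20000, size=200)})
toy[LABEL] = (toy['Age'] > 40).astype(int)
result = rm_main(toy)
```

With this layout Python is started once per run instead of twice per fold, which is the overhead the note above refers to.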
It is for our assignment, which is to introduce the functionality of Execute Python in RapidMiner.
Thanks for the info!
@JEdward Thanks for sharing, your sample code is going to be a life saver for me!!