Build decision tree using Python and embed in Rapid Miner

10383721 · December 2017

Hi guys,

I am doing a project where I need to create a decision tree using Python and then embed it in RapidMiner using the Execute Python operator.

These are screenshots of my process:

[Screenshot: main process (Screen Shot 2017-12-12 at 11.14.02.png)]

[Screenshot: subprocess in Cross Validation (Screen Shot 2017-12-12 at 11.14.16.png)]

This is my code for the decision tree:

import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

# rm_main is a mandatory function, 
# the number of arguments has to be the number of input ports (can be none)
def rm_main(data):
	#import data
	file = '04_Class_4.1_german-credit-decoded.xlsx'
	xl = pd.ExcelFile(file)
	print(xl.sheet_names)

	#load a sheet into a DataFrame 
	gr_raw = xl.parse('RapidMiner Data')

	#create arrays for the features, X, and response, y, variable
	y = gr_raw['Credit Rating=Good'].values
	X = gr_raw.drop('Credit Rating=Good', axis=1).values

	#split data into training and testing set
	X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

	#build decision tree classifier using gini index
	clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)
	clf_gini.fit(X_train, y_train)

	return clf_gini

When executed, it gives me an error. I am not sure which part of this code I should remove for a successful execution.

Would appreciate any advice or help on this!

Thank you.

Regards,

Azmir F

10383721 · December 2017

Thanks guys for the solutions you have provided. I have managed to come up with my own solution.

I did not know that Python needs numerical data to apply the model. So I modified my process and used the Execute Python operator twice, once in Training and once in Testing. I used the Numerical to Binominal operator after the second Execute Python operator.

Note that I have renamed the two operators to Build Model and Apply Model.
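
For anyone hitting the same problem, nominal columns can also be converted inside the script itself. The following is only a minimal sketch using pandas one-hot encoding and is not part of the process described here; the helper name `encode_nominals` and the toy values are made up for illustration.

```python
import pandas as pd


def encode_nominals(df, label):
    """Return (X, y) with every non-numeric feature column one-hot encoded."""
    features = df.drop(columns=[label])
    # Treat every column that is neither numeric nor boolean as nominal.
    nominal = features.select_dtypes(exclude=["number", "bool"]).columns
    X = pd.get_dummies(features, columns=list(nominal))
    return X, df[label]


# Toy rows only; inside RapidMiner the data arrives through the input port.
toy = pd.DataFrame({
    "Age": [30, 45, 22],
    "Housing": ["own", "rent", "own"],
    "Credit Rating": [1, 0, 1],
})
X, y = encode_nominals(toy, "Credit Rating")
print(list(X.columns))
```

One-hot encoding adds a column per nominal value, so the same encoding has to be applied to the training and the test data for the columns to line up.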

This is my updated process:

[Screenshot: updated process (Screen Shot 2017-12-14 at 14.42.32.png)]

[Screenshot: Cross Validation subprocess (Screen Shot 2017-12-14 at 14.42.45.png)]

My Python script for Build Model is as below:

from sklearn.tree import DecisionTreeClassifier

def rm_main(data):
    # build decision tree on the numerical attributes
    X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income','Credit Amount', 'Present residence since', 'Number of existing credits', 'Number of dependents']]
    y = data[['Credit Rating']]
    clf = DecisionTreeClassifier(min_samples_split=20, max_depth=10, random_state=99)
    clf.fit(X, y)

    return clf

My Python script for Apply model is as below:

from sklearn.tree import DecisionTreeClassifier
def rm_main(model, data):
    X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income','Credit Amount', 'Present residence since', 'Number of existing credits', 'Number of dependents']]
    data['prediction'] = model.predict(X)

    #set role of prediction attribute to prediction
    data.rm_metadata['prediction']=(None,'prediction')
    return data

Let me know if you have another relevant solution or a better script to produce a more stable model.

Thank you.

Regards,

Azmir F

lionelderkrikor · December 2017

Hi Azmir

1. I think it is impossible to build only the model in Python inside the "Cross Validation" operator, because the "Apply Model" operator (in the testing part) expects a RapidMiner model as input but receives a Python object, so the process fails.

Maybe someone has a solution to this problem (if not, see point 2). However, I have corrected some points in the process (I worked with the same dataset a few weeks ago):

- added a "Nominal to Numerical" operator (Python needs numerical values to build the model)

- built the model on the entire dataset (you performed a split validation inside a cross validation, which I don't think is relevant)

- removed the data import from your "Execute Python" operator (the "data" parameter of the Python function is in fact the dataset that enters the Python operator).
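
Put together, these changes leave a training script roughly like the sketch below. It assumes the label column is called 'Credit Rating' and that every other column is already numeric. Nothing is imported from `sklearn.cross_validation`, since that module no longer exists in newer scikit-learn releases (its contents moved to `sklearn.model_selection`) and the split is not needed here anyway.

```python
from sklearn.tree import DecisionTreeClassifier

LABEL = "Credit Rating"  # assumed name of the label column


def rm_main(data):
    # "data" is the example set arriving at the operator's first input port,
    # so nothing is read from disk inside the script.
    y = data[LABEL]
    X = data.drop(columns=[LABEL])

    # Fit on everything the operator receives; any splitting happens outside.
    clf = DecisionTreeClassifier(criterion="gini", random_state=50,
                                 max_depth=10, min_samples_leaf=5)
    clf.fit(X, y)
    return clf
```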

Here is the process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="6.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="30">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-raw.xlsx"/>
        <parameter key="imported_cell_range" value="A1:U1001"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Checking Account Status.true.polynominal.attribute"/>
          <parameter key="1" value="Duration in month.true.integer.attribute"/>
          <parameter key="2" value="Credit History.true.polynominal.attribute"/>
          <parameter key="3" value="Purpose.true.polynominal.attribute"/>
          <parameter key="4" value="Credit Amount.true.integer.attribute"/>
          <parameter key="5" value="Savings Account/Bonds.true.polynominal.attribute"/>
          <parameter key="6" value="Present Employment since.true.polynominal.attribute"/>
          <parameter key="7" value="Installment rate in % of disposable income.true.integer.attribute"/>
          <parameter key="8" value="Personal Status.true.polynominal.attribute"/>
          <parameter key="9" value="Other debtors.true.polynominal.attribute"/>
          <parameter key="10" value="Present residence since.true.integer.attribute"/>
          <parameter key="11" value="Property.true.polynominal.attribute"/>
          <parameter key="12" value="Age.true.integer.attribute"/>
          <parameter key="13" value="Other installment plans.true.polynominal.attribute"/>
          <parameter key="14" value="Housing.true.polynominal.attribute"/>
          <parameter key="15" value="Number of existing credits.true.integer.attribute"/>
          <parameter key="16" value="Job type.true.polynominal.attribute"/>
          <parameter key="17" value="Number of dependents.true.integer.attribute"/>
          <parameter key="18" value="Telephone.true.binominal.attribute"/>
          <parameter key="19" value="Foreign worker.true.binominal.attribute"/>
          <parameter key="20" value="Credit Rating.true.integer.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="120">
        <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="OldValue.true.polynominal.attribute"/>
          <parameter key="1" value="NewValue.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (Dictionary)" width="90" x="179" y="75">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Checking Account Status"/>
        <parameter key="attributes" value="|Property|Other installment plans"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="from_attribute" value="OldValue"/>
        <parameter key="to_attribute" value="NewValue"/>
      </operator>
      <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV (2)" width="90" x="45" y="255">
        <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification-chk-acc.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="OldValue.true.polynominal.attribute"/>
          <parameter key="1" value="NewValue.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (2)" width="90" x="179" y="300">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Checking Account Status"/>
        <parameter key="attributes" value="|Property|Other installment plans"/>
        <parameter key="from_attribute" value="OldValue"/>
        <parameter key="to_attribute" value="NewValue"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="345">
        <parameter key="attribute_name" value="Credit Rating"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="6.0.003" expanded="true" height="82" name="Numerical to Binominal" width="90" x="447" y="345">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Credit Rating"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="min" value="1.0"/>
        <parameter key="max" value="1.0"/>
      </operator>
      <operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Credit Rating"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
        <list key="comparison_groups"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="8.0.001" expanded="true" height="145" name="Cross Validation" width="90" x="514" y="34">
        <process expanded="true">
          <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
            <parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.cross_validation import train_test_split&#10;from sklearn.tree import DecisionTreeClassifier&#10;from sklearn.metrics import accuracy_score&#10;from sklearn import tree&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#9;&#10;&#9;#create arrays for the features, X, and response, y, variable&#10;&#9;y = data['Credit Rating'].values&#10;&#9;X = data.iloc[:,1:]&#10;&#10;&#9;#split data into training and testing set&#10;&#9;#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)&#10;&#10;&#9;#build decision tree classifier using gini index&#10;&#9;clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)&#10;&#9;#clf_gini.fit(X_train, y_train)&#10;&#9;clf_gini.fit(X, y)&#10;&#10;&#9;return clf_gini"/>
          </operator>
          <connect from_port="training set" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Replace (Dictionary)" to_port="example set input"/>
      <connect from_op="Read CSV" from_port="output" to_op="Replace (Dictionary)" to_port="dictionary"/>
      <connect from_op="Replace (Dictionary)" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
      <connect from_op="Read CSV (2)" from_port="output" to_op="Replace (2)" to_port="dictionary"/>
      <connect from_op="Replace (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
      <connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

2. I think the solution is to perform the whole subprocess (building/applying/cross-validation/performance) with "Execute Python" operators

(only the data preprocessing is done with RapidMiner operators).

In the process below, in addition to the modifications described in 1., I have created an applying/cross-validation/performance "Execute Python" operator that outputs:

- the y_prediction (the decision tree model applied to the training dataset), which is added to the dataset as the last column

- the associated accuracy (~70%)

- the feature importance
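
The three outputs listed above can be computed along these lines. This is a sketch rather than the exact script embedded in the process: it imports `cross_val_score` from `sklearn.model_selection` (its current location) and calls it once, and the function name `evaluate` and the output column names are chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, data, label="Credit Rating", folds=10):
    """Return the scored data, cross-validated accuracy and feature importances."""
    y = data[label]
    X = data.drop(columns=[label])

    # cross_val_score refits a fresh copy of the model on each fold,
    # so this accuracy is cross-validated.
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=folds)
    accuracy = pd.DataFrame({"accuracy_mean": [100 * scores.mean()],
                             "accuracy_std": [100 * scores.std()]})

    # The prediction column comes from the already fitted model, i.e. it is
    # a prediction on data the model may have seen during training.
    scored = data.copy()
    scored[label + " (prediction)"] = model.predict(X)

    importances = pd.DataFrame({"feature": X.columns,
                                "importance": model.feature_importances_})
    return scored, accuracy, importances
```

Inside an Execute Python operator this would be the body of `rm_main(model, data)`, with one output port per returned table.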

Here is the process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="6.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="30">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-raw.xlsx"/>
        <parameter key="imported_cell_range" value="A1:U1001"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Checking Account Status.true.polynominal.attribute"/>
          <parameter key="1" value="Duration in month.true.integer.attribute"/>
          <parameter key="2" value="Credit History.true.polynominal.attribute"/>
          <parameter key="3" value="Purpose.true.polynominal.attribute"/>
          <parameter key="4" value="Credit Amount.true.integer.attribute"/>
          <parameter key="5" value="Savings Account/Bonds.true.polynominal.attribute"/>
          <parameter key="6" value="Present Employment since.true.polynominal.attribute"/>
          <parameter key="7" value="Installment rate in % of disposable income.true.integer.attribute"/>
          <parameter key="8" value="Personal Status.true.polynominal.attribute"/>
          <parameter key="9" value="Other debtors.true.polynominal.attribute"/>
          <parameter key="10" value="Present residence since.true.integer.attribute"/>
          <parameter key="11" value="Property.true.polynominal.attribute"/>
          <parameter key="12" value="Age.true.integer.attribute"/>
          <parameter key="13" value="Other installment plans.true.polynominal.attribute"/>
          <parameter key="14" value="Housing.true.polynominal.attribute"/>
          <parameter key="15" value="Number of existing credits.true.integer.attribute"/>
          <parameter key="16" value="Job type.true.polynominal.attribute"/>
          <parameter key="17" value="Number of dependents.true.integer.attribute"/>
          <parameter key="18" value="Telephone.true.binominal.attribute"/>
          <parameter key="19" value="Foreign worker.true.binominal.attribute"/>
          <parameter key="20" value="Credit Rating.true.integer.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="120">
        <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="OldValue.true.polynominal.attribute"/>
          <parameter key="1" value="NewValue.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (Dictionary)" width="90" x="179" y="75">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Checking Account Status"/>
        <parameter key="attributes" value="|Property|Other installment plans"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="from_attribute" value="OldValue"/>
        <parameter key="to_attribute" value="NewValue"/>
      </operator>
      <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV (2)" width="90" x="45" y="255">
        <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification-chk-acc.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="OldValue.true.polynominal.attribute"/>
          <parameter key="1" value="NewValue.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (2)" width="90" x="179" y="300">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Checking Account Status"/>
        <parameter key="attributes" value="|Property|Other installment plans"/>
        <parameter key="from_attribute" value="OldValue"/>
        <parameter key="to_attribute" value="NewValue"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="340">
        <parameter key="attribute_name" value="Credit Rating"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="6.0.003" expanded="true" height="82" name="Numerical to Binominal" width="90" x="447" y="345">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Credit Rating"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="min" value="1.0"/>
        <parameter key="max" value="1.0"/>
      </operator>
      <operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Credit Rating"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="coding_type" value="unique integers"/>
        <list key="comparison_groups"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="581" y="187"/>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Build model" width="90" x="782" y="34">
        <parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.cross_validation import train_test_split&#10;from sklearn.tree import DecisionTreeClassifier&#10;from sklearn.metrics import accuracy_score&#10;from sklearn import tree&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#9;&#10;&#9;#create arrays for the features, X, and response, y, variable&#10;&#9;y = data['Credit Rating']&#10;&#9;X = data.drop('Credit Rating', axis=1)&#10;&#10;&#9;#split data into training and testing set&#10;&#9;#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)&#10;&#10;&#9;#build decision tree classifier using gini index&#10;&#9;clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)&#10;&#9;#clf_gini.fit(X_train, y_train)&#10;&#9;clf_gini.fit(X, y)&#10;&#10;&#9;return clf_gini"/>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="145" name="Apply/cross_validation/performance" width="90" x="849" y="289">
        <parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.cross_validation import cross_val_score&#10;from sklearn.tree import DecisionTreeClassifier&#10;from sklearn.metrics import accuracy_score&#10;from sklearn import tree&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(model,data):&#10;&#9;&#10;&#9;#create arrays for the features, X, and response, y, variable (the same as  the training set)&#10;  #y = data['Credit Rating']&#10;  y = data['Credit Rating']&#10;  X = data.drop('Credit Rating', axis=1)&#10;&#10;  feature = list(X)&#10;&#10;&#9;#Apply the model&#10;  y_pred = model.predict(X)&#10;&#10;&#9;#perform the cross validation and calculate the mean and std of accuracy&#10;  accuracy_mean = 100*(cross_val_score(model,X,y,scoring = 'accuracy',cv = 10)).mean()&#10;  accuracy_std = 100*(cross_val_score(model,X,y,scoring = 'accuracy',cv = 10)).std()&#10;  accuracy = str(accuracy_mean) + &quot; +/- &quot; + str(accuracy_std)&#10;&#10;&#9;#Calculation of feature importance&#10;&#10;  feat_importance = model.feature_importances_&#9;&#10;&#9;&#10;&#9;#Write the results&#10;&#10;  accuracy = pd.DataFrame(data = [accuracy],columns = ['accuracy'])&#10;  y_prediction = pd.DataFrame(data = y_pred,columns = ['Credit Rating (prediction)']) &#10;  feature_importance = pd.DataFrame(data = feat_importance,columns = ['feature importances']) &#10;  features = pd.DataFrame(data = feature,columns = ['features'])&#10;  &#10;  data = data.join(y_prediction)&#10;  features = features.join( feature_importance)&#10;&#10;&#9;&#10;  return data,accuracy,feature_importance,features "/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Replace (Dictionary)" to_port="example set input"/>
      <connect from_op="Read CSV" from_port="output" to_op="Replace (Dictionary)" to_port="dictionary"/>
      <connect from_op="Replace (Dictionary)" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
      <connect from_op="Read CSV (2)" from_port="output" to_op="Replace (2)" to_port="dictionary"/>
      <connect from_op="Replace (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Build model" to_port="input 1"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Apply/cross_validation/performance" to_port="input 2"/>
      <connect from_op="Build model" from_port="output 1" to_op="Apply/cross_validation/performance" to_port="input 1"/>
      <connect from_op="Apply/cross_validation/performance" from_port="output 1" to_port="result 1"/>
      <connect from_op="Apply/cross_validation/performance" from_port="output 2" to_port="result 2"/>
      <connect from_op="Apply/cross_validation/performance" from_port="output 3" to_port="result 3"/>
      <connect from_op="Apply/cross_validation/performance" from_port="output 4" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

I hope this will be helpful,

Regards,

Lionel

JEdward · December 2017

Here's the building block I use for XValidation with Python. I have one that also works with the Compare Models operator, but that is very complex.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="x_validation" compatibility="7.6.001" expanded="true" height="124" name="Validation" width="90" x="380" y="34">
        <process expanded="true">
          <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="BDT (sklearn)" width="90" x="112" y="34">
            <parameter key="script" value="&#10;import pandas as pd&#10;from sklearn.ensemble import GradientBoostingClassifier&#10;&#10;# This script creates a GradientBoostingClassifier from SKLearn on RM data&#10;# It can be used as a generic template for other sklearn classifiers or regressors&#10;&#10;# Author: mschmitz&#10;&#10;def rm_main(data):&#10;    metadata =  data.rm_metadata&#10;&#10;    # Get the list of regular attributes and the label&#10;    &#10;    df = pd.DataFrame(metadata).T&#10;    label = df[df[1]==&quot;label&quot;].index.values&#10;    regular = df[df[1] != df[1]].index.values&#10;    &#10;    # Create the Tree, for more options see&#10;    # For details see:&#10;&#10;    clf = GradientBoostingClassifier(&#10;        n_estimators=10,&#10;        max_features=&quot;sqrt&quot;)&#10;        &#10;    # learn it&#10;    clf.fit(data[regular], data[label])&#10;&#10;    # Return also the list of regulars and labels for later application&#10;    &#10;    return (clf,regular,label[0]), data&#10;"/>
          </operator>
          <connect from_port="training" to_op="BDT (sklearn)" to_port="input 1"/>
          <connect from_op="BDT (sklearn)" from_port="output 1" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Apply Model (2)" width="90" x="45" y="34">
            <parameter key="script" value="import pandas as pd&#10;&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;&#10;def rm_main(clfinfo, data):&#10;    clf = clfinfo[0]&#10;    regular = clfinfo[1]&#10;    label = clfinfo[2]&#10;    meta = data.rm_metadata&#10;    predictions = clf.predict(data[regular])&#10;    confidences = clf.predict_proba(data[regular])&#10;&#10;&#10;    predictions = pd.DataFrame(predictions, columns=[&quot;prediction(&quot;+label+&quot;)&quot;])&#10;    confidences = pd.DataFrame(confidences,&#10;                               columns=[&quot;confidence(&quot; + str(c) + &quot;)&quot; for c in clf.classes_])&#10;&#10;    data = data.join(predictions)&#10;    data = data.join(confidences)&#10;    data.rm_metadata = meta&#10;    data.rm_metadata[&quot;prediction(&quot;+label+&quot;)&quot;] = (&quot;nominal&quot;,&quot;prediction&quot;)&#10;&#10;    for c in clf.classes_:&#10;        data.rm_metadata[&quot;confidence(&quot;+str(c)+&quot;)&quot;] = (&quot;numerical&quot;,&quot;confidence_&quot;+str(c))&#10;&#10;    return data, clf&#10;"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model (2)" to_port="input 1"/>
          <connect from_port="test set" to_op="Apply Model (2)" to_port="input 2"/>
          <connect from_op="Apply Model (2)" from_port="output 1" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>
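
In case the escaped scripts are hard to follow: the training script returns the classifier together with the names of the regular attributes and the label, so the apply script knows which columns to feed it. Below is a simplified sketch of that role lookup. It assumes, as the scripts in this thread suggest, that `rm_metadata` maps each column name to a `(type, role)` tuple and that regular attributes have the role `None`; `split_roles` is a made-up helper name.

```python
def split_roles(metadata):
    """Split column names into regular attributes and the single label.

    `metadata` is assumed to look like data.rm_metadata:
    {column_name: (value_type, role)}, with role None for regular attributes.
    """
    labels = [name for name, (_, role) in metadata.items() if role == "label"]
    regular = [name for name, (_, role) in metadata.items() if role is None]
    if len(labels) != 1:
        raise ValueError("expected exactly one label column, got %d" % len(labels))
    return regular, labels[0]


# Hand-written metadata for illustration only.
example_metadata = {
    "Age": ("integer", None),
    "Credit Amount": ("integer", None),
    "Credit Rating": ("binominal", "label"),
    "row_id": ("integer", "id"),
}
regular, label = split_roles(example_metadata)
print(regular, label)
```

Columns with any other special role (such as an id) end up in neither list, so they are not used for training.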

SGolbert · December 2017

I think the process is correct; there were similar processes with R in the forum.

As a side note, can I ask why you need to use the Python decision tree? By using the Execute Python operator several times (2 times per CV fold) you are generating a huge overhead and also interfering with the parallelization features of RapidMiner. I would say that the smarter thing to do would be to use the Decision Tree operator or do the CV inside the Execute Python operator.
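
To illustrate the second option, here is a rough sketch of running the cross-validation inside a single Execute Python operator with scikit-learn's `cross_val_predict`, which returns one out-of-fold prediction per row. The label name, the output column name and the fold count are assumptions for the example.

```python
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def rm_main(data):
    label = "Credit Rating"  # assumed label column name
    y = data[label]
    X = data.drop(columns=[label])

    clf = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5,
                                 random_state=50)

    # Each row is predicted by a model trained on the other folds only.
    out = data.copy()
    out["prediction(" + label + ")"] = cross_val_predict(clf, X, y, cv=10)
    return out
```

The returned table could then go to a Performance operator once the new column has been given the prediction role, which keeps Python to one call per process run instead of two per fold.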

10383721 · December 2017

It is for our assignment to introduce the functionality of Execute Python in Rapid Miner.

Thanks for the info!

Thomas_Ott · May 2018

@JEdward Thanks for sharing, your sample code is going to be a life saver for me!!
