Logistic regression: Select or change reference group
I am new to RapidMiner but have been working with logistic regression in SAS for years. When working with categorical attributes in logistic regression, how does RapidMiner choose which category is the reference category? Is it possible to assign a different reference category?
For example, say I have Race in my model with five possible values (white, black, asian, other, and unknown), and RapidMiner assigns a weight of 0 to black, with all other weights being relative to black. I want to change this so that asian or white is the reference group with a weight of 0. Is there a way to do this?
Thanks.
Best Answer
earmijo Member Posts: 271 Unicorn
One solution is to create the dummy variables yourself.
In this first example, I let RM choose the reference categories (they turn out to be Female for Sex and First for Passenger Class).
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="238">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
<parameter key="attribute_name" value="Survived"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="238">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Age|Sex|Passenger Class"/>
</operator>
<operator activated="true" class="h2o:logistic_regression" compatibility="7.5.000" expanded="true" height="103" name="Logistic Regression" width="90" x="514" y="238"/>
<connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
<connect from_op="Logistic Regression" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Then you get:
Say you want the reference categories to be Male and Third class. You then have to create the dummies yourself with Nominal to Numerical and set the comparison groups. This gives you more control, at the cost of a bit more work.
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="238">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
<parameter key="attribute_name" value="Survived"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="238">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Age|Sex|Passenger Class"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="7.5.003" expanded="true" height="103" name="Nominal to Numerical" width="90" x="447" y="238">
<parameter key="use_comparison_groups" value="true"/>
<list key="comparison_groups">
<parameter key="Sex" value="Male"/>
<parameter key="Passenger Class" value="Third"/>
</list>
</operator>
<operator activated="true" class="h2o:logistic_regression" compatibility="7.5.000" expanded="true" height="103" name="Logistic Regression" width="90" x="648" y="238"/>
<connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
<connect from_op="Logistic Regression" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Then you get:
You can, of course, reproduce the original result by setting the comparison groups to Female and First:
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="238">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
<parameter key="attribute_name" value="Survived"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="238">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Age|Sex|Passenger Class"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="7.5.003" expanded="true" height="103" name="Nominal to Numerical" width="90" x="447" y="187">
<parameter key="use_comparison_groups" value="true"/>
<list key="comparison_groups">
<parameter key="Sex" value="Female"/>
<parameter key="Passenger Class" value="First"/>
</list>
</operator>
<operator activated="true" class="h2o:logistic_regression" compatibility="7.5.000" expanded="true" height="103" name="Logistic Regression" width="90" x="648" y="238"/>
<connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
<connect from_op="Logistic Regression" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
And you get:
Answers
Hi Chris,
The way to control which target (label) variable the model learns is the Set Role operator. Just select the attribute name and set the target role to 'label'.
Thanks for the reply, but I think maybe I didn't state my question clearly. I have the label set correctly; that's not the issue. What I'm trying to do is determine which level of a categorical independent variable is used as the reference group, i.e. the level that gets a weight of zero. The weights/coefficients that the model generates for the other levels are relative to that reference group.
In my particular model, race is one of the independent variables. When I run the model, RapidMiner sets "black" as the reference group for the categorical race variable, so all the race coefficients are relative to the "black" group. Instead I want to set the "white" group as the reference and show the coefficients for each race category relative to "white". Some races currently have positive coefficients relative to black but may have negative coefficients when compared to the white group. Race isn't the only categorical predictor in the model; it's just the one I'm using in my example since it's easily understood.
Does that help clear up what I'm trying to do?
Thanks.
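As a side note on why the signs can flip: under dummy coding each level's coefficient is a difference in log-odds from the reference level, so changing the reference only shifts every coefficient by a constant. The sketch below assumes a plain maximum-likelihood fit without regularization (with regularization the identity holds only approximately). The symbols and the numbers in the comments are illustrative, not output from RapidMiner.

```latex
% \beta_k^{(r)}: coefficient of level k when level r is the reference, so \beta_r^{(r)} = 0.
% \beta_0^{(r)}: intercept under reference r. Switching the reference from r to s gives:
\beta_k^{(s)} = \beta_k^{(r)} - \beta_s^{(r)}, \qquad
\beta_0^{(s)} = \beta_0^{(r)} + \beta_s^{(r)}
% Made-up numbers: relative to black, white = 0.5 and asian = 0.2.
% Relative to white: asian = 0.2 - 0.5 = -0.3 and black = 0 - 0.5 = -0.5.
```

So a level that is positive relative to black becomes negative relative to white whenever its coefficient is smaller than white's.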
That's perfect, exactly what I was trying to do. Thanks for your help!
You can also use the "Nominal to Numerical" operator and use the "effect coding" option, which allows you to specify your own comparison groups.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
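Applied to the Race example from the question, the comparison-group setting from the accepted answer might look like the fragment below. It is a sketch only: it reuses the parameter keys from the processes above, and the attribute name Race and the value white are assumed to match the spelling in your own data.

```xml
<!-- Sketch: place this operator before Logistic Regression, as in the accepted answer's second process. -->
<!-- "Race" and "white" are assumed names; use the exact attribute name and value from your data. -->
<operator activated="true" class="nominal_to_numerical" compatibility="7.5.003" expanded="true" name="Nominal to Numerical">
  <parameter key="use_comparison_groups" value="true"/>
  <list key="comparison_groups">
    <parameter key="Race" value="white"/>
  </list>
</operator>
```

Other categorical predictors can be given their own reference level by adding one more entry per attribute to the comparison_groups list, as the accepted answer does for Sex and Passenger Class.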