[SOLVED] Survival Analysis in RapidMiner -- Help Preparing Dataset

earmijo · December 2014

I teach a course on Data Mining in an MBA program. I have done it for several years now and I use RapidMiner as the main software program.

This year I want to introduce the topic of Survival Analysis in Data Mining. The main application is to model customer retention. I have searched this forum and I have concluded that the standard models for doing SA are not available and will not be available anytime soon.

That was bad news for me because I don't want to use two packages (I could use R). And then I found this magnificent paper by Singer & Willett on Discrete-Time Survival Analysis. http://gseacademic.harvard.edu/~willetjo/pdf%20files/Singer%20&%20Willett%201993.pdf

Bottom line: All you need is Logistic Regression. So far so good. There is one little problem: the dataset has to be put into a specific format (the so-called person-period format).
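
The reason logistic regression is enough can be sketched as follows. This is the usual discrete-time hazard formulation with a logit link; the symbols ($h_{ij}$, $\alpha_j$, $\beta_1$, $\beta_2$, $y_{ij}$, $T_i$, $t_i$) are notation chosen here for illustration, not copied from the post.

```latex
% Discrete-time hazard: probability that individual i has the event in
% period j, given that it has not happened before period j.
h_{ij} = \Pr(T_i = j \mid T_i \ge j,\ x_{1i}, x_{2i})

% Logit link: one intercept per period plus the covariate effects.
\operatorname{logit}(h_{ij}) = \log\frac{h_{ij}}{1 - h_{ij}}
  = \alpha_j + \beta_1 x_{1i} + \beta_2 x_{2i}

% y_{ij} = 1 only in the period where the event occurs (0 otherwise, and 0
% throughout for censored cases); t_i is the last period individual i was observed.
L = \prod_{i} \prod_{j=1}^{t_i} h_{ij}^{\,y_{ij}} \, (1 - h_{ij})^{1 - y_{ij}}
```

Each factor in the product corresponds to one person-period row with a 0/1 outcome, so the likelihood has the same form as that of an ordinary logistic regression fitted to the expanded dataset.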

I'll explain with an example:

Suppose I have the following dataset:

id,month,event,x1,x2
1,5,0,0.19,0.65
2,6,1,0.41,0.33
3,7,0,0.22,0.79
4,8,1,0.56,0.91
5,9,0,0.71,0.36

id = patient's id
month = months to event or censoring time
event = 1 if the event (death, for instance) occurred, 0 if censored (the event had not taken place by the time the study finished)
x1, x2 = potential explanatory variables

To be able to run the model suggested by Singer & Willett I need that dataset in the format below.

id,month,event,x1,x2
1,1,0,0.19,0.65
1,2,0,0.19,0.65
1,3,0,0.19,0.65
1,4,0,0.19,0.65
1,5,0,0.19,0.65
2,1,0,0.41,0.33
2,2,0,0.41,0.33
2,3,0,0.41,0.33
2,4,0,0.41,0.33
2,5,0,0.41,0.33
2,6,1,0.41,0.33
3,1,0,0.22,0.79
3,2,0,0.22,0.79
3,3,0,0.22,0.79
3,4,0,0.22,0.79
3,5,0,0.22,0.79
3,6,0,0.22,0.79
3,7,0,0.22,0.79
4,1,0,0.56,0.91
4,2,0,0.56,0.91
4,3,0,0.56,0.91
4,4,0,0.56,0.91
4,5,0,0.56,0.91
4,6,0,0.56,0.91
4,7,0,0.56,0.91
4,8,1,0.56,0.91
5,1,0,0.71,0.36
5,2,0,0.71,0.36
5,3,0,0.71,0.36
5,4,0,0.71,0.36
5,5,0,0.71,0.36
5,6,0,0.71,0.36
5,7,0,0.71,0.36
5,8,0,0.71,0.36
5,9,0,0.71,0.36

We want to create a separate observation for each period in which each person was observed, up to the month in which the event occurred.

Thus a person who died in month 1 contributes 1 person-month; a person who died in month 6 (like individual 2) contributes 6 person-months. For that person, the variable event is 0 for the first 5 periods and 1 for the sixth period.

Censored individuals (those who were still alive when the study ended) contribute as many periods as they were observed. For instance, individual 5 contributes 9 periods. For all of those periods the variable event takes the value 0.
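
The expansion described above can be sketched in a few lines of Python, which may be handy for checking RapidMiner output against an independent implementation. This is only an illustration: the function name `to_person_period` is made up here, and the data is the five-row example from this post.

```python
import csv
import io


def to_person_period(rows):
    """Expand one-row-per-person records into one row per person per month."""
    expanded = []
    for row in rows:
        last_month = int(row["month"])
        for month in range(1, last_month + 1):
            new_row = dict(row)
            new_row["month"] = month
            # event is 0 in every period except the last one, where it keeps
            # the original value (1 = event occurred, 0 = censored).
            new_row["event"] = int(row["event"]) if month == last_month else 0
            expanded.append(new_row)
    return expanded


# The toy dataset from the post, in its original one-row-per-person form.
SAMPLE = """id,month,event,x1,x2
1,5,0,0.19,0.65
2,6,1,0.41,0.33
3,7,0,0.22,0.79
4,8,1,0.56,0.91
5,9,0,0.71,0.36
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
person_period = to_person_period(rows)

for r in person_period:
    print(r["id"], r["month"], r["event"], r["x1"], r["x2"], sep=",")
```

The printed rows should match the person-period table listed earlier in the post.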

Help is greatly appreciated.

awchisholm · December 2014

hello earmijo

I made a small process that you could use and modify as you need. It uses the Fill Data Gaps and Cartesian Product operators with some macros to control it.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="75">
        <list key="attribute_values">
          <parameter key="id" value="1"/>
          <parameter key="month" value="5"/>
          <parameter key="event" value="0"/>
          <parameter key="x1" value="0.19"/>
          <parameter key="x2" value="0.65"/>
        </list>
        <list key="set_additional_roles">
          <parameter key="month" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="real_to_integer" compatibility="6.1.000" expanded="true" height="76" name="Real to Integer" width="90" x="45" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="month|id"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="6.1.000" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="120">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="month"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="extract_macro" compatibility="6.1.000" expanded="true" height="60" name="Extract Macro" width="90" x="313" y="75">
        <parameter key="macro" value="maxMonth"/>
        <parameter key="macro_type" value="data_value"/>
        <parameter key="attribute_name" value="month"/>
        <parameter key="example_index" value="1"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="fill_data_gaps" compatibility="6.1.000" expanded="true" height="76" name="Fill Data Gaps" width="90" x="447" y="75">
        <parameter key="use_gcd_for_step_size" value="false"/>
        <parameter key="start" value="1"/>
        <parameter key="end" value="%{maxMonth}"/>
      </operator>
      <operator activated="true" class="cartesian_product" compatibility="6.1.000" expanded="true" height="76" name="Cartesian" width="90" x="447" y="210"/>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Real to Integer" to_port="example set input"/>
      <connect from_op="Real to Integer" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Select Attributes" from_port="original" to_op="Cartesian" to_port="right"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Fill Data Gaps" to_port="example set input"/>
      <connect from_op="Fill Data Gaps" from_port="example set output" to_op="Cartesian" to_port="left"/>
      <connect from_op="Cartesian" from_port="join" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
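
For readers who do not have RapidMiner at hand, the idea behind the process above can be approximated in Python. This is a rough sketch of what the operators appear to do, judging from the XML, and not a description of RapidMiner's internals; the variable names are illustrative.

```python
from itertools import product

# The single example generated by "Generate Data by User Specification".
original = {"id": 1, "month": 5, "event": 0, "x1": 0.19, "x2": 0.65}

# Extract Macro: read the month value of the example into maxMonth.
max_month = original["month"]

# Fill Data Gaps (start=1, end=maxMonth): one row per month 1..maxMonth.
months = [{"month": m} for m in range(1, max_month + 1)]

# Cartesian Product: pair every month row with the original row.
# The month column of the original row is dropped here so that the
# generated month values are the ones that end up in the result.
others = {k: v for k, v in original.items() if k != "month"}
joined = [{**left, **right} for left, right in product(months, [others])]

for row in joined:
    print(row)
```

In this sketch the event value of the original row is copied to every generated row. That is harmless for a censored example like this one (event = 0), but an example with event = 1 would need the earlier periods reset to 0.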

regards

Andrew

earmijo · December 2014

Andrew:

Brilliant. Thank you very much. Although it took me a few hours to figure out how to extend your program (I am that slow), I finally did it.

Here's the code in case anybody needs to accomplish the same task. I'm not sure it's the most elegant or efficient code since I'm a rookie, but it does the job. I'll try to turn it into a template and post it here.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve test" width="90" x="45" y="75">
        <parameter key="repository_entry" value="//Clases/test"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.015" expanded="true" height="76" name="Set Role" width="90" x="179" y="75">
        <parameter key="attribute_name" value="month"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="loop_examples" compatibility="5.3.015" expanded="true" height="94" name="Loop Examples" width="90" x="313" y="75">
        <process expanded="true">
          <operator activated="true" class="filter_example_range" compatibility="5.3.015" expanded="true" height="76" name="Filter Example Range" width="90" x="45" y="75">
            <parameter key="first_example" value="%{example}"/>
            <parameter key="last_example" value="%{example}"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes" width="90" x="45" y="165">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="month|"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="5.3.015" expanded="true" height="76" name="Subprocess" width="90" x="179" y="210">
            <process expanded="true">
              <operator activated="true" class="extract_macro" compatibility="5.3.015" expanded="true" height="60" name="Extract Macro (2)" width="90" x="45" y="30">
                <parameter key="macro" value="EventValue"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="event"/>
                <parameter key="example_index" value="1"/>
                <list key="additional_macros"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes (2)" width="90" x="45" y="120">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="event"/>
                <parameter key="invert_selection" value="true"/>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="5.3.015" expanded="true" height="76" name="Generate Attributes" width="90" x="179" y="120">
                <list key="function_descriptions">
                  <parameter key="event" value="0"/>
                </list>
              </operator>
              <connect from_port="in 1" to_op="Extract Macro (2)" to_port="example set"/>
              <connect from_op="Extract Macro (2)" from_port="example set" to_op="Select Attributes (2)" to_port="example set input"/>
              <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="extract_macro" compatibility="5.3.015" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="30">
            <parameter key="macro" value="maxMonth"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="month"/>
            <parameter key="example_index" value="1"/>
            <list key="additional_macros"/>
          </operator>
          <operator activated="true" class="fill_data_gaps" compatibility="5.3.015" expanded="true" height="76" name="Fill Data Gaps" width="90" x="313" y="30">
            <parameter key="use_gcd_for_step_size" value="false"/>
            <parameter key="start" value="1"/>
            <parameter key="end" value="%{maxMonth}"/>
          </operator>
          <operator activated="true" class="cartesian_product" compatibility="5.3.015" expanded="true" height="76" name="Cartesian" width="90" x="313" y="165"/>
          <operator activated="true" class="set_data" compatibility="5.3.015" expanded="true" height="76" name="Set Data" width="90" x="447" y="30">
            <parameter key="example_index" value="%{maxMonth}"/>
            <parameter key="attribute_name" value="event"/>
            <parameter key="value" value="%{EventValue}"/>
            <list key="additional_values"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.3.015" expanded="true" height="76" name="Set Role (2)" width="90" x="514" y="165">
            <parameter key="attribute_name" value="month"/>
            <list key="set_additional_roles"/>
          </operator>
          <connect from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Select Attributes" from_port="original" to_op="Subprocess" to_port="in 1"/>
          <connect from_op="Subprocess" from_port="out 1" to_op="Cartesian" to_port="right"/>
          <connect from_op="Extract Macro" from_port="example set" to_op="Fill Data Gaps" to_port="example set input"/>
          <connect from_op="Fill Data Gaps" from_port="example set output" to_op="Cartesian" to_port="left"/>
          <connect from_op="Cartesian" from_port="join" to_op="Set Data" to_port="example set input"/>
          <connect from_op="Set Data" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="5.3.015" expanded="true" height="76" name="Append" width="90" x="447" y="75"/>
      <connect from_op="Retrieve test" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Loop Examples" from_port="output 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Here's a link to the toy dataset:

https://db.tt/YdNiQ8rG

wilsonchua · October 2015

Is it possible to use the RapidMiner Windowing operator for this?
