[SOLVED] Survival Analysis in RapidMiner -- Help Preparing Dataset
I teach a Data Mining course in an MBA program. I have taught it for several years now, and I use RapidMiner as the main software package.
This year I want to introduce the topic of Survival Analysis in Data Mining. The main application is to model customer retention. I have searched this forum and I have concluded that the standard models for doing SA are not available and will not be available anytime soon.
That was bad news for me because I don't want to use two packages (I could use R). And then I found this magnificent paper by Singer & Willett on Discrete-Time Survival Analysis. http://gseacademic.harvard.edu/~willetjo/pdf%20files/Singer%20&%20Willett%201993.pdf
Bottom line: all you need is Logistic Regression. So far so good. There is one small problem: the dataset has to be put into a specific format (the so-called person-period format).
I'll explain with an example:
Suppose I have the following dataset:
id,month,event,x1,x2
1,5,0,0.19,0.65
2,6,1,0.41,0.33
3,7,0,0.22,0.79
4,8,1,0.56,0.91
5,9,0,0.71,0.36
id = patient's id
month = months to event or censoring time
event = 1 if the event (death, for instance) occurred, 0 if censored (the event had not taken place by the time the study finished)
x1, x2 are potential explanatory variables.
To be able to run the model suggested by Singer & Willett, I need that dataset in the format below.
id,month,event,x1,x2
1,1,0,0.19,0.65
1,2,0,0.19,0.65
1,3,0,0.19,0.65
1,4,0,0.19,0.65
1,5,0,0.19,0.65
2,1,0,0.41,0.33
2,2,0,0.41,0.33
2,3,0,0.41,0.33
2,4,0,0.41,0.33
2,5,0,0.41,0.33
2,6,1,0.41,0.33
3,1,0,0.22,0.79
3,2,0,0.22,0.79
3,3,0,0.22,0.79
3,4,0,0.22,0.79
3,5,0,0.22,0.79
3,6,0,0.22,0.79
3,7,0,0.22,0.79
4,1,0,0.56,0.91
4,2,0,0.56,0.91
4,3,0,0.56,0.91
4,4,0,0.56,0.91
4,5,0,0.56,0.91
4,6,0,0.56,0.91
4,7,0,0.56,0.91
4,8,1,0.56,0.91
5,1,0,0.71,0.36
5,2,0,0.71,0.36
5,3,0,0.71,0.36
5,4,0,0.71,0.36
5,5,0,0.71,0.36
5,6,0,0.71,0.36
5,7,0,0.71,0.36
5,8,0,0.71,0.36
5,9,0,0.71,0.36
We want to create a separate observation for each period in which each person was observed, up to and including the period in which the event or censoring occurred. Thus a person who died in month 1 contributes 1 person-period; one who died in month 6 (like individual 2) contributes 6 person-periods. For individual 2, the variable event is 0 for the first 5 periods and 1 for the sixth period.
Censored individuals (those who were still alive at the end of the study) contribute as many periods as they were observed. For instance, individual 5 contributes 9 periods. For all the periods observed, the variable event takes the value 0.
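For anyone who wants to check the target format outside RapidMiner, here is a minimal sketch of the expansion in plain Python. It is only an illustration of the rule described above, not a RapidMiner process; the function name to_person_period and the inline RAW string are made up for this example, and the column names are the ones from the toy dataset.

```python
import csv
import io

# Toy dataset from the post, inlined so the sketch is self-contained.
RAW = """id,month,event,x1,x2
1,5,0,0.19,0.65
2,6,1,0.41,0.33
3,7,0,0.22,0.79
4,8,1,0.56,0.91
5,9,0,0.71,0.36
"""


def to_person_period(rows):
    """Expand one row per person into one row per person per period.

    Hypothetical helper: 'month' becomes the period index (1..last month),
    and 'event' is 1 only in the last period of a non-censored person.
    """
    expanded = []
    for row in rows:
        last_period = int(row["month"])
        for period in range(1, last_period + 1):
            new_row = dict(row)  # copy id, x1, x2 unchanged
            new_row["month"] = period
            new_row["event"] = int(row["event"]) if period == last_period else 0
            expanded.append(new_row)
    return expanded


rows = list(csv.DictReader(io.StringIO(RAW)))
person_period = to_person_period(rows)

for r in person_period:
    print(r["id"], r["month"], r["event"], r["x1"], r["x2"], sep=",")
```

With the toy data this should give 5 + 6 + 7 + 8 + 9 = 35 rows, with event equal to 1 only in the last row of individuals 2 and 4.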
Help is greatly appreciated.
Answers
I made a small process that you could use and modify as you need. It uses the Fill Data Gaps and Cartesian Product operators with some macros to control it.
regards
Andrew
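The RapidMiner process itself is not reproduced in this thread. As a rough analogue of the Cartesian Product idea only, the sketch below crosses every person with every period from 1 to the longest observed month and then keeps the periods that person was actually observed. This is an assumption about the general approach, written in Python, and is not Andrew's actual process; the variable names are hypothetical.

```python
from itertools import product

# Toy dataset from the question, one record per person.
people = [
    {"id": 1, "month": 5, "event": 0, "x1": 0.19, "x2": 0.65},
    {"id": 2, "month": 6, "event": 1, "x1": 0.41, "x2": 0.33},
    {"id": 3, "month": 7, "event": 0, "x1": 0.22, "x2": 0.79},
    {"id": 4, "month": 8, "event": 1, "x1": 0.56, "x2": 0.91},
    {"id": 5, "month": 9, "event": 0, "x1": 0.71, "x2": 0.36},
]

# All candidate periods: 1 up to the longest observation time.
max_month = max(p["month"] for p in people)
periods = range(1, max_month + 1)

# Cartesian product of people x periods, filtered to observed periods.
cartesian_rows = []
for person, period in product(people, periods):
    if period <= person["month"]:
        cartesian_rows.append({
            "id": person["id"],
            "month": period,
            # event fires only in the person's final, non-censored period
            "event": person["event"] if period == person["month"] else 0,
            "x1": person["x1"],
            "x2": person["x2"],
        })

print(len(cartesian_rows))
```

The cross join builds 5 x 9 = 45 candidate rows before filtering, so for large datasets a direct per-person expansion may be cheaper.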
Brilliant. Thank you very much. Although it took me a few hours to figure out how to extend your process (I am that slow), I finally did it.
Here's the code in case anybody needs to accomplish the same task. I'm not sure it's the most elegant or efficient code, since I'm a rookie, but it does the job. I'll try to turn it into a template and post it here. Here's a link to the toy dataset:
https://db.tt/YdNiQ8rG