Survival analysis
I started using RapidMiner today and I think it's great!
What I’m looking for specifically is a method for predicting cancer patient survival from multiple measurements on histological specimens. This can be done with Python’s "DeepSurv" and R’s "randomForestSRC" packages. I know a bit of R, so I got the latter to work, but I struggle with Python and DeepSurv. DeepSurv may be more accurate, so it would be interesting to compare the results obtained with these (and possibly other) packages.
So, my question is: Has anyone ever implemented a (patient) survival prediction model in RapidMiner?
The difference from the "normal" process is that the model is trained on two target variables: a dichotomous variable, like “churn” in the example database, and a time variable (i.e., the survival time). One does not merely want to know IF someone died but also how long that took, because depending on the time, the “IF-variable” can mean totally different things. E.g., someone died, but only after a very long time. That would obviously correspond to a good prognosis.
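To illustrate how the event variable and the time variable are used together, here is a minimal sketch of a Kaplan-Meier survival estimate in plain Python. It is not tied to RapidMiner, DeepSurv, or randomForestSRC; the function name `kaplan_meier` and the toy data are made up for illustration. Patients with event = 0 are treated as censored, i.e., they were still alive when last observed.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times  -- observed follow-up time per patient
    events -- 1 if the patient died at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Everyone with follow-up >= t is still at risk just before t,
        # including patients who are censored later on.
        at_risk = sum(1 for u in times if u >= t)
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical toy data: follow-up in months, 1 = died, 0 = censored
times = [5, 8, 12, 12, 20, 30]
events = [1, 0, 1, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(f"S({t}) = {s:.3f}")
```

The censored patients never count as deaths, but they still contribute to the number at risk up to their last follow-up. A plain classifier trained on the dichotomous variable alone cannot use that information.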
Any ideas would be very welcome.
Thank you
Arnulf
Best Answer
Telcontar120 (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member)
Sadly, there is no direct implementation of a survival model algorithm in RapidMiner! I've often wished there was something like a Cox regression. Maybe a feature request here would be helpful!
Of course, you can run either Python or R scripts inside RapidMiner, so that would be one potential route if you want to use RapidMiner for some of its other helpful features, like data ETL and model training and management, while still using one of those libraries for the actual machine learning algorithm.
In the meantime, you can also come up with your own native RapidMiner workarounds. Here are a couple that I have used before:
1) You can create two separate models, the first to predict a given event (like default) as a typical binominal label, and the second to predict "time to default" which of course you can only build on cases where there has been a default. These models can be used to manage risk for those that have not experienced the event yet, by looking at both scores in combination.
2) You can create separate labels for your target event at different time horizons in meaningful intervals: e.g., default in the first 30 days, default in the first 60 days, default in the first 90 days, etc. Each label is separate; they are usually cumulative rather than independent, although you could define each interval as exclusive for other use cases. With this setup, you have a series of scores that express the likelihood of the event occurring within each time range, and these can also be used for appropriate risk management.
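The second workaround can be sketched in a few lines of Python. The function name `horizon_labels` and the 30/60/90-day horizons are only illustrative. One assumption goes beyond the answer above: cases whose follow-up ended before a horizon without the event get `None` for that horizon, because their outcome at that point is unknown, and they would need to be excluded when training the model for that horizon.

```python
def horizon_labels(time, event, horizons=(30, 60, 90)):
    """Build one cumulative label per horizon for a single case.

    time  -- days until the event, or until last follow-up if no event
    event -- 1 if the event (e.g. default, death) occurred, else 0

    Returns {horizon: label} where label is
      1    -- the event occurred on or before the horizon
      0    -- the case reached the horizon without the event
      None -- follow-up ended before the horizon (assumption: unknown)
    """
    labels = {}
    for h in horizons:
        if event == 1 and time <= h:
            labels[h] = 1
        elif time >= h:
            labels[h] = 0
        else:
            labels[h] = None
    return labels

# Hypothetical cases: event on day 45, and follow-up ending on day 45
print(horizon_labels(45, 1))
print(horizon_labels(45, 0))
```

Each horizon then becomes its own binominal label, and one model is trained per label on the cases where that label is not `None`.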
Answers
Hello
I found a paper that may help with your question.
https://ieeexplore.ieee.org/abstract/document/8080031
I hope this helps
Sara
Kind regards,
Arnulf