"Text Mining: Ranking Word Vector Occurrences for Output to OLS Model"
Background:
I have a file of short (<170 char) text descriptions of chief medical complaints, recorded when a patient is logged at the reception of a medical facility. I also have the total service time associated with each patient. There is already an established OLS regression that uses other attributes logged at reception to predict a patient's length of stay (LOS). I want to see whether I can extract a signal from the text field to improve the performance of the OLS model. Initially, there doesn't appear to be much lift from looking at the text field alone. My hypothesis is that while most of the text is just noise, certain n-grams should provide a fairly strong signal for a long (>1 std. dev.) or short (<1 hour) LOS.
Questions:
1) How can I show the performance (contribution) of each word vector in RapidMiner toward predicting the Long or Short LOS label?
2) Specifically, how do I output a weight factor that can then be used in the OLS?
3) Any other ideas for alternative approaches to combining text mining with OLS models?
Thanks!
Answers
The Linear Regression and SVM (linear) operators in RapidMiner each have a weights output that provides a weighting factor for each attribute.
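For illustration only, the same idea can be sketched outside RapidMiner in plain Python: turn each complaint into binary unigram/bigram indicators, fit a linear model, and read off one weight per n-gram. This is a rough sketch under assumptions, not the RapidMiner implementation. The toy records, the ridge penalty `l2`, the gradient-descent settings, and the helper names `ngrams`, `fit_term_weights`, and `text_score` are all made up for the example.

```python
# Hypothetical toy data: (chief complaint text, length of stay in hours).
RECORDS = [
    ("chest pain", 6.0),
    ("chest pain shortness of breath", 7.0),
    ("sore throat", 0.8),
    ("sore throat cough", 1.0),
    ("ankle sprain", 2.0),
    ("chest pain nausea", 6.5),
    ("cough", 1.0),
    ("ankle pain", 2.2),
]


def ngrams(text, max_n=2):
    """Return the set of word n-grams (n = 1..max_n) found in a text."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def fit_term_weights(records, l2=0.125, lr=0.1, epochs=3000):
    """Fit L2-penalised least squares on binary n-gram indicators.

    Plain batch gradient descent; the intercept is not penalised.
    Returns ({n-gram: weight}, intercept).
    """
    docs = [ngrams(text) for text, _ in records]
    vocab = sorted(set().union(*docs))
    index = {term: j for j, term in enumerate(vocab)}
    rows = [[index[t] for t in doc] for doc in docs]  # sparse indicator rows
    y = [los for _, los in records]
    n = len(y)
    w = [0.0] * len(vocab)
    b = sum(y) / n  # start the intercept at the mean LOS
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # gradient of the penalty term
        grad_b = 0.0
        for row, target in zip(rows, y):
            err = b + sum(w[j] for j in row) - target
            grad_b += err / n
            for j in row:
                grad_w[j] += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return dict(zip(vocab, w)), b


weights, intercept = fit_term_weights(RECORDS)

# Rank n-grams from "associated with longer stay" to "shorter stay".
ranking = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
for term, weight in ranking[:3] + ranking[-3:]:
    print(f"{term:15s} {weight:+.3f}")


def text_score(text):
    """Collapse a complaint into one numeric value: the sum of its n-gram weights."""
    return sum(weights.get(t, 0.0) for t in ngrams(text))
```

The ranked `weights` correspond to question 1, and `text_score` is one possible answer to question 2: a single numeric column that could be added as an extra regressor to the existing OLS model. If you try this, compute the score on records that were not used to fit the weights; otherwise the text signal will look stronger than it really is.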
Best regards,
Marius