Prediction with Optional Features
Here's a question/scenario that has me going "hmm"... I am faced with a regression problem where some examples in my dataset have attributes {A, B, C} while others have attributes {A, B, C, D, E}. I'm scratching my head as I consider different ways to model the data to ultimately predict the target variable.
I understand at a basic level that my regression formula can't be Y = f(A, B, C, D, E) unless I have a way to impute/default values for "D" and "E" on the examples that lack those features. My thought process is "my model can make a more accurate prediction when it has more information," which is the hypothesis I want to prove with this data.
Anybody have experience developing a model (or models) when some of the attributes are "optional"?
Best Answer
Telcontar120
You have a few different approaches here:
1. Build a model on all the attributes and limit it to only the records that have all attributes populated.
2. Build a model with all the attributes and use missing-value replacement (multiple options here) for any that are missing.
3. Build a model with only the smaller set of attributes that are common to all examples.
4. Build two separate models: one for the larger-attribute dataset and one for the smaller-attribute dataset.
It's probably not the case that one of these approaches is always better than the others; it depends on your application and use case. They each have different pros and cons. Option 1 will usually give the most accurate model, but it won't be able to score every example, while option 3 will give the most broadly applicable model, but it won't be as powerful.
I've had good experience with the last option, which is essentially a segmented scorecard, although it requires enough examples of each type to train a good model separately. Option 2 is also a good possibility if there are known reasons why the additional attributes are missing that can be used to assign reasonable replacement values.
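To make options 2 and 4 concrete, here is a minimal sketch in Python with pandas and scikit-learn (in RapidMiner you would wire up the equivalent operators instead). The column names A–E, the target "y", and the file name are placeholder assumptions, not anything from the original post:

```python
# Minimal sketch of options 2 and 4, assuming pandas/scikit-learn.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("examples.csv")   # hypothetical dataset
common = ["A", "B", "C"]           # attributes present on every example
extra = ["D", "E"]                 # attributes present only on some examples

# Option 2: one model over all attributes, replacing missing D/E values.
imputed_model = make_pipeline(
    SimpleImputer(strategy="median"),  # one of several replacement strategies
    LinearRegression(),
)
imputed_model.fit(df[common + extra], df["y"])

# Option 4: two separate models (a segmented scorecard).
has_extra = df[extra].notna().all(axis=1)
full_model = LinearRegression().fit(df.loc[has_extra, common + extra],
                                    df.loc[has_extra, "y"])
small_model = LinearRegression().fit(df[common], df["y"])
```

Note that the median strategy above is just one choice; as mentioned, the right replacement depends on why the values are missing in the first place.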
Answers
Thanks for the input, Brian. It makes a lot of sense when I stop and think about it. Essentially, the attributes/features present in one subset of the data are there for a good reason (i.e., the device's configuration is such that it gives us additional data points), so imputing values for the other subset isn't even a valid premise. Thanks for helping me think that one through. It's probably a level-setting of expectations I need to do, as now we're looking at several models, perhaps based on device configuration. Logically it makes sense, but there's the aspect of managing them, etc.
Thanks!
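For what it's worth, managing the segmented models at scoring time can come down to a simple routing rule. A hedged sketch, reusing the hypothetical full_model/small_model from the earlier example and assuming the presence of D/E is the configuration signal:

```python
# Route an example to the right segmented model at scoring time.
# Treating the presence of D/E as the device-configuration signal is an
# illustrative assumption.
import pandas as pd

def predict_one(example: dict) -> float:
    """Score one example with whichever model matches its configuration."""
    if all(example.get(k) is not None for k in ("D", "E")):
        row = pd.DataFrame([example])[["A", "B", "C", "D", "E"]]
        return float(full_model.predict(row)[0])
    row = pd.DataFrame([example])[["A", "B", "C"]]
    return float(small_model.predict(row)[0])
```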