Dealing with an important but often missing attribute
What is a good way to use an attribute that is important when a value is available, but is missing for a large percentage of the data set?
I have an example set containing data that go back about 20 years. Each example has 20-30 attributes, most of which are available for the entire 20-year span. However, there are some attributes that are only available for recent data (the past 5 years or so) and are missing for all the examples prior to that time. These newer attributes, if present, are likely to be strong predictors for the regression problem I'm trying to solve.
My preferred model is nearest neighbors (actually W-LWL), as it has been found to work quite well when using attributes that are available throughout the timespan. However, if I simply fill in the missing values with the average (MissingValueReplenishment), then such a large fraction of the dataset has a single value that the attribute doesn't get selected or weighted highly.
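To make the effect concrete, here is a minimal sketch in plain Python. It uses synthetic data, and the `mean_fill` helper and the `newer_attr` column are hypothetical stand-ins rather than the actual example set or RapidMiner's operator. It assumes one attribute recorded only for the last 5 of 20 years and shows what mean replacement does to it.

```python
import random
import statistics

def mean_fill(values):
    """Replace None entries with the mean of the observed entries.

    Hypothetical helper: a stand-in for mean replacement in general,
    not the MissingValueReplenishment operator itself.
    """
    observed = [v for v in values if v is not None]
    fill_value = statistics.fmean(observed)
    return [fill_value if v is None else v for v in values], fill_value

random.seed(0)
n_years, rows_per_year = 20, 50  # assumed sizes, for illustration only

# Synthetic attribute: present only for the most recent 5 years.
newer_attr = [
    random.gauss(10.0, 2.0) if year >= n_years - 5 else None
    for year in range(n_years)
    for _ in range(rows_per_year)
]

observed = [v for v in newer_attr if v is not None]
filled, fill_value = mean_fill(newer_attr)

# Share of rows that were missing before the fill.
missing_fraction = sum(v is None for v in newer_attr) / len(newer_attr)
# Share of rows that carry the single fill value after the fill.
constant_fraction = sum(v == fill_value for v in filled) / len(filled)

# Filling with the mean keeps the mean but shrinks the variance
# in proportion to the share of rows that were actually observed.
var_observed = statistics.pvariance(observed)
var_filled = statistics.pvariance(filled)

print(f"missing before fill:      {missing_fraction:.2f}")
print(f"rows at fill value after: {constant_fraction:.2f}")
print(f"variance ratio (filled / observed): {var_filled / var_observed:.2f}")
```

With three quarters of the rows sharing one value, distances along this attribute are zero for most pairs of examples, which is consistent with the attribute not being selected or weighted highly.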
Is there an alternate way of modeling this such that it would take advantage of these useful-but-rare attributes only when they are present?