Feature importance operators fail on datasets with features without any data
When an ExampleSet contains even a single feature that consists only of missing values, the following operators:
- Weight by Information Gain Ratio
- Weight by Information Gain
- Weight by Gini
- Weight by Uncertainty
fail with:
Exception: java.lang.ArrayIndexOutOfBoundsException
Message: 0
Similarly, Weight by Rules fails with:
Exception: com.rapidminer.example.AttributeTypeException
Message: Cannot map index of nominal attribute to nominal value: index 0 is out of bounds!
Known workaround: Apply Remove Useless Attributes first.
Expected result: Zero weight for features without any data.
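To make the expected behaviour concrete, here is a minimal Python sketch of an information-gain weight that returns zero for a feature without any data. It is not RapidMiner's implementation; the function names, the use of `None` for a missing value, and the choice to simply ignore missing values are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature, labels):
    """Information gain of a nominal feature; None marks a missing value.

    Assumption: examples with a missing feature value are ignored.
    """
    known = [(v, y) for v, y in zip(feature, labels) if v is not None]
    if not known:
        # A feature without any data carries no evidence of relevance.
        return 0.0
    groups = {}
    for value, label in known:
        groups.setdefault(value, []).append(label)
    known_labels = [y for _, y in known]
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy(known_labels) - remainder


labels = ["yes", "yes", "no", "no"]
empty_weight = information_gain([None, None, None, None], labels)
useful_weight = information_gain(["a", "a", "b", "b"], labels)
```

The early return for an empty `known` list is the whole point: the all-missing feature gets a weight of zero instead of triggering an out-of-bounds access further down.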
Justification:
- Sometimes I want to report the relevance of all the features in the dataset.
- I dislike it when a time-consuming process fails because of an unlucky random seed in cross-validation.
Proposed action: Add a parameterized test that checks whether every feature weighting operator can handle a feature without any data (be it a nominal, numerical, or date column).
Reasoning: I didn't test all the operators. And there is a good chance other operators might share the same "halt the world" trait.
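The shape of such a parameterized check can be sketched in Python. This is only an illustration of the idea, not a test against the real operators: `robust_weight` and `fragile_weight` are hypothetical stand-ins for weighting schemes, and the column type is treated as metadata only, since an all-missing column has no values to distinguish it.

```python
from collections import Counter

LABELS = ["yes", "no", "yes", "no"]
COLUMN_TYPES = ("nominal", "numerical", "date")


def robust_weight(values, labels):
    """Hypothetical weighter: share of examples holding the most common value."""
    known = [v for v in values if v is not None]
    if not known:
        return 0.0  # no data, so zero weight
    return Counter(known).most_common(1)[0][1] / len(values)


def fragile_weight(values, labels):
    """Hypothetical weighter that assumes at least one known value exists."""
    known = [v for v in values if v is not None]
    first = known[0]  # raises IndexError when the feature has no data
    return sum(v == first for v in known) / len(values)


def check_all_missing(weighters):
    """Run every weighter on an all-missing column of every type.

    Returns a list of (weighter name, column type, problem) tuples.
    """
    failures = []
    for name, weigh in weighters.items():
        for column_type in COLUMN_TYPES:
            column = [None] * len(LABELS)  # type is metadata; no values exist
            try:
                weight = weigh(column, LABELS)
            except Exception as error:
                failures.append((name, column_type, type(error).__name__))
                continue
            if weight != 0.0:
                failures.append((name, column_type, f"weight {weight}"))
    return failures


failures = check_all_missing({"robust": robust_weight, "fragile": fragile_weight})
```

Every weighter and column type combination is exercised by the same loop, so adding a new operator to the dictionary is enough to cover it.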
Answers
I definitely do not propose to handle missing values as a placeholder for any value, because then we would have to return ranges (or distributions) instead of point estimates, whenever there is at least one missing value in a feature.
Nevertheless, I would argue that a Java error is not the best possible result. If it were, operators like Decision Tree would also have to be modified to return a Java error whenever the dataset contains a variable with all missing values.