How to ensure all nominal values appear in each slice when doing XValidation?
Hi,
I am trying to use CrossValidation with Evolutionary Weights and Nearest Neighbor learning as described by Ingo at http://rapid-i.com/rapidforum/index.php/topic,41.msg87.html#msg87 . Specifically, I have this excerpt:
<operator name="WrapperXValidation" class="WrapperXValidation" expanded="yes">
<parameter key="number_of_validations" value="5"/>
<parameter key="sampling_type" value="shuffled sampling"/>
<operator name="EvolutionaryWeighting" class="EvolutionaryWeighting" expanded="yes">
<parameter key="maximum_number_of_generations" value="20"/>
<parameter key="p_crossover" value="0.5"/>
<parameter key="population_size" value="2"/>
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="number_of_validations" value="5"/>
<operator name="WeightLearner" class="NearestNeighbors">
<parameter key="k" value="10"/>
<parameter key="weighted_vote" value="true"/>
</operator>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="Performance" class="Performance">
</operator>
</operator>
</operator>
</operator>
<operator name="WeightedModelLearner" class="NearestNeighbors">
<parameter key="k" value="10"/>
<parameter key="weighted_vote" value="true"/>
</operator>
<operator name="WeightedApplierChain" class="OperatorChain" expanded="yes">
<operator name="WeightedModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
<parameter key="keep_model" value="true"/>
</operator>
<operator name="WeightedPerformance" class="Performance">
</operator>
</operator>
</operator>
This works for me as long as all the attributes are numerical. However, I have a couple of nominal attributes I want to include, and when I try to include them, I get:

AttributeTypeException Process failed Message: Attribute 'myNomAttrib': Cannot map index of nominal attribute to nominal value: index -1 is out of bounds!
What I think is happening is that when the ModelApplier node inside the XValidation node executes, sometimes the holdout data contains a nominal value for the myNomAttrib attribute that did not occur in the training data, and that is causing the ModelApplier to fail.
If my assessment is correct, how can I avoid this situation? My first inclination was to use stratified sampling, but that only appears to work for nominal labels, not nominal attributes.
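To make the suspected mechanism concrete, here is a minimal sketch in plain Java (this is not RapidMiner's actual code; the class and variable names are made up) showing how a value-to-index mapping built only from the training fold yields -1 for a value that first appears in the holdout fold:

import java.util.*;

// Illustration only: a nominal mapping built from the training fold cannot
// resolve a value that occurs only in the holdout fold, so lookups return -1,
// matching the "index -1 is out of bounds" error above.
public class NominalMappingSketch {
    public static void main(String[] args) {
        // mapping built during training
        List<String> trainingValues = Arrays.asList("red", "green", "blue");
        Map<String, Integer> valueToIndex = new HashMap<>();
        for (String v : trainingValues) {
            valueToIndex.put(v, valueToIndex.size());
        }

        // a value that only occurs in the holdout slice
        String holdoutValue = "purple";
        int index = valueToIndex.getOrDefault(holdoutValue, -1);
        System.out.println("Index for '" + holdoutValue + "': " + index); // prints -1
    }
}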
Thanks,
Keith
Answers
Thanks for pointing this out. I needed some time to find a data set where this occurs (it is of course more likely for smaller data sets with lots of nominal values), and I can confirm the problem. You were right that the re-mapping between training and test set was not possible in those cases. We fixed this by simply using the internal value from the test set whenever it was not known to the training set. We have added this fix to the CVS version, which will of course also be available in the next release and in the next update of the RapidMiner Enterprise Edition.
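Conceptually, the fix behaves like the following sketch (plain Java for illustration, not the actual RapidMiner source; all names are invented): when a test value cannot be re-mapped onto the training mapping, the test set's own internal index is kept instead of returning -1.

import java.util.*;

// Conceptual sketch of the fix: fall back to the test set's own internal index
// for values the training mapping does not know, instead of failing with -1.
public class RemappingFixSketch {
    static int remap(String value,
                     Map<String, Integer> trainingMapping,
                     Map<String, Integer> testMapping) {
        Integer trainIndex = trainingMapping.get(value);
        if (trainIndex != null) {
            return trainIndex;              // known from training: keep that index
        }
        return testMapping.get(value);      // unknown value: use the test set's index
    }

    public static void main(String[] args) {
        Map<String, Integer> trainingMapping = Map.of("red", 0, "green", 1);
        Map<String, Integer> testMapping = Map.of("red", 0, "green", 1, "purple", 2);

        System.out.println(remap("green", trainingMapping, testMapping));  // 1, as before
        System.out.println(remap("purple", trainingMapping, testMapping)); // 2 instead of -1
    }
}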
By the way: we are currently planning a revision of the RM data core which will cover two important aspects: 1) we will provide the possibility of working on data sets of arbitrary size without the need for external databases, by providing a new data access and caching mechanism, and 2) we will get rid of the internal mappings for nominal values, which often cause compatibility problems like this one and require huge development efforts to get everything right. This new data core will be part of the upcoming version 5.0 of RapidMiner.
However, for now the fixed version should solve your problem.
Cheers,
Ingo