"Assessing features performance on different datasets"
Hello,
My question is:
How can I identify the features that work best across several different datasets? Those features would have to be robust and transferable, i.e. independent of the specific characteristics of any individual dataset.
My data:
- two-class problem
- 7 datasets with about 50 identical numerical features (ranges can differ significantly, but the goal is not to find robust thresholds but rather to identify the key features that perform well across all datasets)
- Each dataset with about 5000 instances for training and testing
My ideas so far:
- select an optimal feature subset for each of the 7 datasets (e.g. by wrapper feature selection) and simply count each feature's occurrences over all 7 results
- also, calculate the "information gain" of each feature on the individual datasets; the average over all 7 results will (hopefully) reveal the robust features
Do you think these ideas are worth following? Can you give me hints on potential problems, improvements, suitable RapidMiner operators etc., as I'm relatively new to RapidMiner and data mining?
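To make this more concrete, here is a rough sketch of both ideas in Python, with scikit-learn standing in for the RapidMiner operators; `load_datasets()` is only a placeholder for my own data loading, and the learner and subset size are arbitrary choices for the sketch:

```python
# Rough sketch of both ideas; scikit-learn stands in for the RapidMiner
# operators. load_datasets() is a placeholder for my own data loading and is
# assumed to return a list of (X, y) pairs with identical columns.
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

def load_datasets():
    """Placeholder: return a list of (X, y) tuples, where X is a DataFrame
    with the same ~50 numerical columns in every dataset and y is binary."""
    raise NotImplementedError

datasets = load_datasets()
feature_names = datasets[0][0].columns

# Idea 1: wrapper feature selection per dataset, then count how often each
# feature was selected across the 7 datasets.
selection_counts = pd.Series(0, index=feature_names)
for X, y in datasets:
    fss = SequentialFeatureSelector(
        DecisionTreeClassifier(random_state=0),  # arbitrary learner for the sketch
        n_features_to_select=10,                 # arbitrary subset size
        direction="forward",
        cv=5,
    )
    fss.fit(X, y)
    selection_counts[fss.get_support()] += 1

# Idea 2: information gain (here: mutual information) per dataset, averaged.
gains = pd.concat(
    [pd.Series(mutual_info_classif(X, y, random_state=0), index=feature_names)
     for X, y in datasets],
    axis=1,
)
mean_gain = gains.mean(axis=1)

print(selection_counts.sort_values(ascending=False).head(15))
print(mean_gain.sort_values(ascending=False).head(15))
```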
Thanks and Greetings
ollestrat
Answers
Your ideas make sense. I just want to add that you could probably make your estimate of the best multi-purpose feature set even more robust if you take into account not only the single optimal feature subset but the results of multiple feature selection runs for each data set. More often than not the feature set gets overfitted during the selection process, and using a wrapper validation approach with an inner and an outer cross-validation helps to overcome this issue.
Hardly anybody knows that the RapidMiner cross-validation operators can be used to average not only the performance but also other averagable objects like feature weights. So I would suggest calculating those averaged feature weights for each data set and then averaging the results again over all data sets. Perhaps an aggregation function other than the plain mean would work even better here.
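In rough Python pseudo-code, with scikit-learn as a stand-in for the RapidMiner operators and mutual information as just one possible weighting scheme, the idea would look something like this:

```python
# Sketch: weight the features on every training fold of a cross-validation,
# average the weights per data set, then aggregate the per-dataset averages.
# Mutual information is only one possible weighting scheme here.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.feature_selection import mutual_info_classif

def averaged_weights(X, y, n_splits=10, seed=0):
    """Average feature weights over the training folds of one data set."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    weights = [
        mutual_info_classif(X.iloc[train_idx], y.iloc[train_idx], random_state=seed)
        for train_idx, _ in folds.split(X)
    ]
    return pd.Series(np.mean(weights, axis=0), index=X.columns)

# datasets: list of (X, y) pairs with identical columns, as in the question.
per_dataset = pd.concat([averaged_weights(X, y) for X, y in datasets], axis=1)

# The plain mean is the simple choice; the minimum would be a stricter
# aggregation that only rewards features carrying weight in every data set.
robust_by_mean = per_dataset.mean(axis=1).sort_values(ascending=False)
robust_by_min = per_dataset.min(axis=1).sort_values(ascending=False)
```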
Just my 2c, cheers,
Ingo
I set up a workflow that repeats the FSS on every dataset 10 times (10-fold "Wrapper X-Validation"), and indeed the subsets vary to a fair degree, as you supposed, so averaging the subsets seems to be a good choice.
However, I didn't quite get how I can benefit from assessing the performance of the feature subsets with a further classifier (it is nested twice: "Optimize Selection" inside the "Wrapper X-Validation"). "Optimize Selection" is a wrapper method, and the "Wrapper X-Validation" again requires a classifier. Choosing exactly the same classifier will not lead to performance values significantly different from the FSS performance evaluation inside "Optimize Selection", and choosing a different classifier does not make sense either, since a wrapper FSS is inherently biased towards the classifier it was built around. I'm probably misunderstanding something here.
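For reference, here is the nesting as I understand it, sketched with scikit-learn as a stand-in for the RapidMiner operators (X and y stand for one of the 7 datasets; learner and subset size are arbitrary):

```python
# The nesting as I understand it; scikit-learn stands in for the RapidMiner
# operators, and X, y stand for one of the 7 data sets.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

learner = DecisionTreeClassifier(random_state=0)

select_then_train = Pipeline([
    # inner wrapper FSS, analogous to "Optimize Selection"
    # (the 5-fold inner CV only guides the search for a good subset)
    ("fss", SequentialFeatureSelector(learner, n_features_to_select=10, cv=5)),
    # the final model is trained on the selected subset
    ("clf", learner),
])

# outer 10-fold CV, analogous to the "Wrapper X-Validation": as far as I can
# tell it scores the whole select-then-train procedure, with the same learner
# used in both places.
outer_scores = cross_val_score(select_then_train, X, y, cv=10)
```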
Greetings
ollestrat
Option 1: Is it possible to simply merge all the datasets?
Option 2: Get ideas from this presentation of the JointBoost algorithm:
http://courses.engr.illinois.edu/ece598/ffl/paper_presentations/HaoTang_JointBoosting.pdf
Concerning JointBoosting: I didn't get it at first glance and need a closer look at it. Thank you, though.
It's designed to find features that are shared among different concepts across different datasets. The slides show that tackling the harder problem of learning all concepts at once yields better results than separately learning to discriminate each concept from the rest on its own dataset.
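A much simplified analogue of that comparison (not JointBoost itself, just one pooled model versus one model per dataset, assuming the datasets really do share identical columns) could be sketched like this:

```python
# Simplified analogue of the comparison in the slides (not JointBoost itself):
# one classifier trained on all data sets pooled together versus one classifier
# per data set, evaluated on the same held-out parts.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# datasets: list of (X, y) pairs with identical columns, as in the question.
splits = [train_test_split(X, y, test_size=0.3, random_state=0) for X, y in datasets]

# Joint model: concatenate all training parts and learn once.
X_pool = pd.concat([X_tr for X_tr, _, _, _ in splits])
y_pool = pd.concat([y_tr for _, _, y_tr, _ in splits])
joint = GradientBoostingClassifier(random_state=0).fit(X_pool, y_pool)

# Separate models: one per data set, compared on the same held-out parts.
for i, (X_tr, X_te, y_tr, y_te) in enumerate(splits):
    separate = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    print(f"dataset {i}: separate={separate.score(X_te, y_te):.3f}  "
          f"joint={joint.score(X_te, y_te):.3f}")
```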