Questions on Automodel (AM)
Hello, I have a few questions on Automodel (AM):
1. How does "weights" (given under "General" tab) differ from "feature sets". For example, in one simulation, AM shows that a certain input has an importance of 1, however by examining feature sets in a couple of algorithms (say 4 out 7) that were selected by AM for this analysis, these 4 algorithms do not select this particular input (when I view "feature sets").
2. In "Optimal trade-offs between complexity and error" graph. I can find a model of complexity of 4 and an error of 15%. However, for this particular algorithm the accuracy was 72%. I guess I am not sure on how these two relate to each other.
3. Given the above, what would be the best way to identify the critical inputs in a dataset? Say that I am trying to identify critical inputs in one dataset using AM, and this is my thought process: find the critical inputs for GLM, LR, DL, DT, RF, GBT, etc., so that I can pinpoint inputs that re-occur across algorithms. This is my way of identifying such parameters (i.e., if they show up in different algorithms, then they are of high importance to the dataset). Any tips on this are appreciated. Thanks!
1. How does "weights" (given under "General" tab) differ from "feature sets". For example, in one simulation, AM shows that a certain input has an importance of 1, however by examining feature sets in a couple of algorithms (say 4 out 7) that were selected by AM for this analysis, these 4 algorithms do not select this particular input (when I view "feature sets").
2. In "Optimal trade-offs between complexity and error" graph. I can find a model of complexity of 4 and an error of 15%. However, for this particular algorithm the accuracy was 72%. I guess I am not sure on how these two relate to each other.
3. Given the above, what would be the best way to know the critical inputs in a dataset? Say that I trying to identify critical inputs in one dataset using AM and this is my thought process: what are these critical inputs for GLM, LR, DL, DT, RF, GBT etc. such that I can pinpoint identified inputs that re-occur between algorithms. I guess, this is my way of identifying such parameters (i.e. if they show up in different algorithms, then they are of high importance to the dataset). Any tips on this are appreciated. Thanks!
Best Answer
IngoRM (RapidMiner Founder):

Hey @mzn,

Ok, here we go:

1. How do "weights" (given under the "General" tab) differ from "feature sets"?

The weights in the General tab are simply the correlations of the attributes with the label. They are independent of any modeling and just give some general guidance on what is likely to matter more. However, interactions between attributes picked up by models often make other variables much more important for that model than those with the highest correlations, and sometimes a combination of variables with zero correlation can beat a single one with, say, 0.7. You can open up the process, by the way, to see how the data is prepared and how the correlations are calculated.

The chart in "Feature Sets", on the other hand, is model-specific and takes those interactions into account. Those charts are actually a direct output of my research, and if you are really bored, feel free to check out my PhD thesis here: https://www-ai.cs.uni-dortmund.de/PublicPublicationFiles/mierswa_2008a.pdf

If you just want the highlights, I would recommend the two webinars on this topic instead. I also wrote a series of blog posts about this some time ago.
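To make the distinction concrete, here is a minimal sketch in Python with scikit-learn (an illustration only, not RapidMiner's own implementation; the data and column names are made up). Simple label correlations, which is what the General-tab weights amount to, can make an attribute look unimportant even though a model that exploits interactions ranks it highly:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 5000
x1, x2, x3 = rng.normal(size=(3, n))
# The label is driven mainly by the x1*x2 interaction, plus a small linear effect of x3.
y = (x1 * x2 + 0.3 * x3 > 0).astype(int)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "label": y})

# "Weights" analogue: correlation of each attribute with the label.
# Only x3 shows a clear correlation; x1 and x2 look unimportant on their own.
print(df[["x1", "x2", "x3"]].corrwith(df["label"]))

# Model-specific view: a random forest can pick up the x1*x2 interaction,
# so x1 and x2 typically come out far more important than their correlations suggest.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(df[["x1", "x2", "x3"]], y)
print(dict(zip(["x1", "x2", "x3"], model.feature_importances_.round(3))))
```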
2. In the "Optimal trade-offs between complexity and error" graph, I can find a model with a complexity of 4 and an error of 15%. However, for this particular algorithm the accuracy was 72%.

The 15% error (= 85% accuracy) is the training error during the feature selection run used to find the points on this trade-off chart. The 72% accuracy (= 28% error rate) is the test error for this feature set on a hold-out set which was not used for running the feature engineering optimization. It is important that you do a correct validation for the feature selection as well, not just for the model building. Here you have a perfect example why: the error rate which can be expected in production is actually 28%, not 15%! You can again open up the process for the particular model to see the details of the validation there.
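The same point about validating the feature selection can be sketched in Python with scikit-learn (again just an illustration on assumed synthetic data, not the actual Auto Model process): the selection only ever sees the training split, and the honest performance estimate comes from a hold-out split it never touched:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

base = LogisticRegression(max_iter=1000)
# The feature selection is fitted on the training split only.
selector = SequentialFeatureSelector(base, n_features_to_select=3, cv=5)
selector.fit(X_train, y_train)

model = base.fit(selector.transform(X_train), y_train)
# The training accuracy is typically optimistic; the hold-out accuracy is the honest estimate.
train_acc = accuracy_score(y_train, model.predict(selector.transform(X_train)))
test_acc = accuracy_score(y_test, model.predict(selector.transform(X_test)))
print(f"training accuracy: {train_acc:.2f}   hold-out accuracy: {test_acc:.2f}")
```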
3. Given the above, what would be the best way to know the critical inputs in a dataset?

I know this may sound a bit philosophical, but in my opinion there are no "critical inputs for the data set", only "critical inputs for a specific model on a data set". Each model type picks up different things in the data; some can work with feature interactions, some cannot. So different features are often important for one model but less so for others. I am not a big fan of "averaging" those rankings across different model types. I know that people do this, it just does not make a lot of sense to me. Even if you weight the ranks by model performance when averaging, things are not much better in my opinion.

I would rather argue that you should identify a good model (based on the validation performance) and then state the most critical features for THAT model. If you run the feature selection with the setting "Accuracy", this will in general be the set in the top left corner of the Pareto trade-off chart.

But this only tells you the set of features, not which feature is more or less important within that set. One way of figuring this out is to look at the colors of the Predictions entry. Columns with a lot of bolder colors across all rows are in general more important than those which are mostly light-colored. One of the next versions will put this into numbers, but for now you need to go visual. This is also true, by the way, if you do not apply feature selection at all.

Hope this helps,
Ingo
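To illustrate the "top left corner" idea, here is a minimal sketch in plain Python (the feature names, complexities, and error values are invented for illustration): keep the Pareto-optimal feature sets from a complexity/error trade-off and, when optimizing for accuracy, report the one with the lowest error:

```python
# Candidate feature sets as they might appear on a complexity/error trade-off chart.
candidates = [
    {"features": ("s",),                           "complexity": 1, "error": 0.31},
    {"features": ("s", "load"),                    "complexity": 2, "error": 0.24},
    {"features": ("s", "load", "width"),           "complexity": 3, "error": 0.22},
    {"features": ("s", "load", "height"),          "complexity": 3, "error": 0.19},
    {"features": ("s", "load", "height", "width"), "complexity": 4, "error": 0.15},
]

def dominated(c, others):
    # c is dominated if some other set is at least as simple and strictly more accurate.
    return any(o["complexity"] <= c["complexity"] and o["error"] < c["error"] for o in others)

pareto = [c for c in candidates if not dominated(c, candidates)]
best_for_accuracy = min(pareto, key=lambda c: c["error"])

print("Pareto front:", [(c["complexity"], c["error"]) for c in pareto])
print("Most accurate Pareto-optimal set:", best_for_accuracy["features"])
```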
Answers
I really liked this statement, and I think this is what I was missing! Thanks again.
No, it is a numerical value (in this particular case, it is the spacing between two different components, say columns in a building). Thank you.
Sorry for the late reply. I am re-running the analysis using a different machine and will get back to you sometime tomorrow or Tuesday morning (I can definitely share the data and screenshots). Thank you for your time.
So, I have re-run the analysis on two different computers and found the following:
1. My home PC yields good results, as you can see here, where the factor "s" has a positive correlation (as expected).
2. My office PC shows that the factor "s" has a negative correlation (which is not quite true).
This is what I have found: the database files are identical (except that one had two columns with rounded digits, in the second case), so I guess this was the issue. Thank you!
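If useful, one hedged way to sanity-check that conclusion in Python with pandas (the file name and the "label" column name below are placeholders, not from this thread) is to recompute the correlation on both the original and the rounded version of the column and compare the signs:

```python
import pandas as pd

# Placeholder file and column names; substitute your own data set.
df = pd.read_csv("building_data.csv")
raw = df["s"].corr(df["label"])               # correlation on the original values
rounded = df["s"].round(0).corr(df["label"])  # correlation after rounding the column
print(f"corr(raw) = {raw:+.3f}   corr(rounded) = {rounded:+.3f}")
# If the raw correlation is already close to zero, rounding alone can be enough to flip its sign.
```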
Ingo