Columns with too many values
Chemical_eng
Member Posts: 16 Contributor II
in Help
Hello.
I am using AutoModel for a regression problem (my target is continuous). I have 3 input parameters with categorical values. One of them has 27 values, another has 16, and a third has 107. I have toggled off the option "Remove columns with too many values". Does this ensure that one-hot encoding is performed correctly for the column with 107 values?
What does it mean when many categories have a coefficient of 0 in the generalized linear model? Is their impact not being taken into account?
Thanks
Best Answer
yyhuang Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
Hi @Chemical_eng,
Thanks for sharing your experience using AutoML for a regression problem.
Q: I have toggled off the option of "Remove columns with too many values". Does this ensure that the one hot encoding is performed correctly for the column with 107 values?
Yes and no. By default, RapidMiner AutoML uses the "Target Encoding" operator to remove attributes with too many values; it performs no encoding. However, the GLM algorithm itself handles categorical columns directly by one-hot encoding them internally, so you don't have to transform nominal attributes to numerical ones beforehand for GLM. We strongly recommend against one-hot encoding categorical columns with many levels into many binary columns, as this is very inefficient. That is why we perform target encoding before the GLM internal one-hot encoding.
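To make the difference between the two encodings concrete, here is a minimal pure-Python sketch. It is not how RapidMiner implements either step: the function names, the toy `cabin`/`fare` data, and the plain per-level mean used for target encoding are all assumptions for illustration (a production target encoder may add smoothing or other safeguards). It shows only that one-hot encoding creates one column per level, while target encoding keeps a single numeric column.

```python
from collections import defaultdict


def one_hot_encode(values):
    """Return (levels, rows): one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return levels, rows


def target_encode(values, targets):
    """Replace each level with the mean target seen for that level.

    Assumption: a plain per-level mean, with no smoothing.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {v: sums[v] / counts[v] for v in sums}
    return [means[v] for v in values]


# Hypothetical toy data: a nominal column and a continuous target.
cabin = ["A", "B", "A", "C", "B", "A"]
fare = [10.0, 20.0, 14.0, 30.0, 22.0, 12.0]

levels, one_hot = one_hot_encode(cabin)
encoded = target_encode(cabin, fare)

print(levels, len(one_hot[0]))  # 3 levels -> 3 indicator columns
print(encoded)                  # still a single numeric column

# A column with 107 distinct values expands to 107 indicator columns.
wide_levels, wide_rows = one_hot_encode([f"v{i}" for i in range(107)])
print(len(wide_rows[0]))
```

With 27, 16 and 107 levels, one-hot encoding the three columns from the question would produce 150 indicator columns in total, which is the inefficiency mentioned above.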
I tested the Titanic data in AutoML to predict the passenger fare.
Open the process here.
In Design view, you can locate the operator that handles nominal attributes (another tip: activate the Tree view). Here it is.
Inside the subprocess "Basic Feature Engineering", you can find "Target Encoding" instead of one-hot encoding, as shown in my example. If you turn on "Remove columns with too many values" with the maximum number of values set to 10, the Target Encoding model will remove the attribute "Life boat", but by default it performs no encoding. Here you can customize it by replacing it with one-hot encoding operators.
Q: What does it mean when for different categories in the generalized linear model I have coefficient 0 for many categories, is it not taking the impact?
The many zero coefficients usually come from the "regularization" in GLM. Simply put, regularization is used to reduce the number of predictors in the model in order to reduce the variance of the prediction error, to handle correlated predictors, and to avoid overfitting. https://en.wikipedia.org/wiki/Regularization_(mathematics)
Again, in the process view, you can toggle off the option of regularization.
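As a rough illustration of why regularization can produce coefficients that are exactly 0, here is a small sketch of soft-thresholding, the shrinkage step behind L1 (lasso-type) penalties. It assumes the GLM's regularization includes an L1 component, which this thread does not state; the coefficient names, values and penalty are hypothetical.

```python
def soft_threshold(coef, penalty):
    """Shrink a coefficient toward zero by `penalty`.

    Any coefficient whose magnitude is at most `penalty` becomes exactly 0.
    """
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0


# Hypothetical unregularized coefficients for five category indicators.
raw = {"cat_A": 4.0, "cat_B": -0.5, "cat_C": 0.25, "cat_D": -2.5, "cat_E": 0.75}
penalty = 1.0  # hypothetical regularization strength

shrunk = {name: soft_threshold(c, penalty) for name, c in raw.items()}
zeroed = [name for name, c in shrunk.items() if c == 0.0]

print(shrunk)
print(zeroed)  # categories with weak effects are dropped from the model
```

In this picture, a zero coefficient means the category's estimated effect was too small to survive the penalty, so it is treated like the baseline. With regularization turned off, those categories would get small nonzero coefficients instead.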
Hope it helps.
Cheers,
YY
Answers
As the screenshot shows, we have a dropdown list of all possible values in the categorical variable.
If you are available for a follow-up, I could walk you through the details in a quick call.