Discretization before or after Feature Selection?
Hello RapidMiner community,
I posted this question yesterday evening as well, but it somehow disappeared after I edited it. I'm not sure if it will come back, so I thought I would ask again.
I have the following situation: I have a labelled dataset with 80+ features and ~3 million rows. I want to do a feature selection to get the ~10 most relevant features. The resulting features have to be discretized, as each one may only take a limited number of distinct values. For example, if a feature has values between 0 and 100, I will have to discretize it into 2-5 bins. Now I am unsure whether I have to discretize all 80 variables first and then do the feature selection, or whether I can discretize only the 10 most relevant features. How would this affect my result? I greatly appreciate your answers and explanations!
Best Answers
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @green_tea,
I would say that discretization must be performed before feature selection, and on the training set.
Don't forget to apply the same pre-processing step(s) to your test set...
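As a rough illustration outside RapidMiner (plain Python, made-up values, and hypothetical helper names `fit_equal_width_edges` and `apply_bins`), the bin edges are learned from the training set only and then reused unchanged on the test set:

```python
from bisect import bisect_right

def fit_equal_width_edges(values, n_bins):
    """Learn the interior bin edges from the training values only."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def apply_bins(values, edges):
    """Map each value to a bin index using previously fitted edges."""
    return [bisect_right(edges, v) for v in values]

# Illustrative numbers, not taken from the dataset in the question.
train = [0, 12, 25, 40, 55, 70, 88, 100]
test = [5, 49, 51, 120]  # 120 lies outside the training range

edges = fit_equal_width_edges(train, n_bins=4)  # fitted on train only
train_binned = apply_bins(train, edges)
test_binned = apply_bins(test, edges)           # same edges, no refitting

print(edges, train_binned, test_binned)
```

Values outside the training range (such as 120 here) simply fall into the first or last bin, so the test set never changes the bin definitions.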
Hope it helps,
Regards,
Lionel
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
The issue is that features in general can behave quite differently after discretization than in their raw forms. Discretization both masks information and transforms the input space. While it is "allowable" to do it either way, I think you would need to be pretty careful if you did feature selection first, because what you have selected will not necessarily have the same relationship to your label after you transform it subsequently.
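To make the "masks information" point concrete, here is a small sketch in plain Python. The data is a toy construction (the label is 1 only in a middle band of the feature), and `mutual_information` is a hypothetical helper written for this example, not a RapidMiner operator. It scores the same feature against the label under two different binnings:

```python
from bisect import bisect_right
from collections import Counter
from math import log2

def mutual_information(feature, label):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, label))
    f_counts = Counter(feature)
    l_counts = Counter(label)
    return sum(
        (c / n) * log2(c * n / (f_counts[f] * l_counts[l]))
        for (f, l), c in joint.items()
    )

# Toy data: the label is 1 only when the feature lies in [40, 60).
x = list(range(100))
y = [1 if 40 <= v < 60 else 0 for v in x]

two_bins = [bisect_right([50], v) for v in x]               # 0-49 | 50-99
five_bins = [bisect_right([20, 40, 60, 80], v) for v in x]  # five bands of 20

mi_two = mutual_information(two_bins, y)
mi_five = mutual_information(five_bins, y)
print(f"2 bins: {mi_two:.3f} bits, 5 bins: {mi_five:.3f} bits")
```

With two bins the feature scores 0 bits because both halves contain the same share of positives, while with five bins it scores about 0.72 bits. A selection step run on one representation could therefore rank this feature very differently from one run on the other.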
It also matters to this discussion what types of models you are using for both feature selection and your subsequent work. Some modeling algorithms inherently discretize their continuous inputs (e.g., tree-based algorithms), in which case your selection can probably be done afterwards based on what is used in the initial screening, but you will be better off using the splits that those trees find when doing your discretization. Other approaches create functional relationships (e.g., linear regression or neural networks), in which case a discretized input could behave very differently from its raw form.
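The idea of reusing tree splits as bin edges can be sketched as follows. This is only a one-level "stump" in plain Python that picks the single threshold with the lowest weighted Gini impurity, which is an assumption standing in for whatever tree learner you actually use. The data and the helper names `gini` and `best_split` are made up for illustration:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the threshold that minimises the weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_impurity = impurity
    return best_threshold

# Illustrative feature values and binary labels.
values = [3, 8, 15, 22, 31, 47, 58, 64, 79, 90]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

threshold = best_split(values, labels)
print(threshold)
```

On this toy data the supervised split lands at 26.5 and separates the classes cleanly, whereas a two-bin equal-width cut at the midpoint of the range (46.5) would place the value 31, which has label 1, in the lower bin.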