Continuous vs Categorical
Hi All,
I have a small question regarding variable types. I have a continuous variable called temperature which has only 2 values, {90, 220}, in my entire data set.
I am a little confused about whether to treat this feature as categorical, since it only ever takes 2 values in my data set, or as continuous.
Does choosing one or the other influence model performance?
Thanks in advance.
Regards,
Vishnu
Best Answer
sgenzer (Community Manager)
Hmm, OK. Basically, it depends on whether you care that 90 is less than 220. If you treat it as a two-valued nominal (binominal) attribute, RapidMiner will just treat the values like "apple" and "orange", with no order between them. If you want to use the fact that 220 is greater than 90 for some reason, you should keep it numerical.
That's all I can really say off the cuff unless I know more about your use case.
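To make the distinction concrete, here is a minimal Python sketch (not RapidMiner itself) of the two views of the same column. The sample values and the label names "low" and "high" are made up for illustration.

```python
# Hypothetical sample: the only two temperatures that occur in the data.
temperatures = [90, 220, 220, 90]

# Numerical view: order and distance are meaningful.
numeric = [float(t) for t in temperatures]
print(numeric[1] > numeric[0])   # the ordering 220 > 90 is available
print(numeric[1] - numeric[0])   # so is the size of the gap

# Nominal view: the values become labels; only equality is meaningful.
labels = {90: "low", 220: "high"}   # label names are arbitrary
nominal = [labels[t] for t in temperatures]

# One-hot (dummy) coding of the labels: one 0/1 column per category,
# which carries no information about order or distance.
categories = sorted(set(nominal))
one_hot = [[1 if v == c else 0 for c in categories] for v in nominal]
print(categories)
print(one_hot)
```

In the nominal view nothing says that "high" is above "low"; a learner only knows that the two labels differ.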
Scott
Answers
hello @k_vishnu772 - so that really, really depends on your use case. I could make arguments one way or the other, depending on what you want to do.
Some quick requests so we can help you:
• Post your XML process here in this thread (see this post for instructions on How to Post on the Community)
• Attach your dataset if possible (use a fictionalized version if there are privacy concerns)
• Make sure you have all necessary extensions installed (see https://youtu.be/pjBqG3xtXx4)
Scott
@sgenzer
Hi Sir, thanks for your reply. I cannot disclose the data as I do not have the right to do that. So I just want to understand why you say it depends on the use case. Could you please explain a use case that you have in mind so that I can relate it to my problem?
Thanks in advance.
Regards,
Vishnu
You also need to think about other potential values in the data if you are going to apply the model to future samples. If you treat the attribute as nominal in your development data, then any future value that is not exactly 90 or 220 may be something your model cannot handle. So I would recommend either keeping the temperature numerical, or at least binning it using one of the Discretize operators (e.g., you could make temperature <100 vs >=100), because that way you will be able to handle future numerical values that were not present in your development sample.
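As a rough illustration of the binning idea outside RapidMiner, the Python sketch below compares an exact-value nominal lookup with a threshold bin. The function name `bin_temperature`, the bin labels, and the future values 95 and 150 are hypothetical; the threshold of 100 is the one suggested above.

```python
def bin_temperature(value, threshold=100.0):
    """Map a numeric temperature to one of two bins (<100 vs >=100)."""
    return "below_100" if value < threshold else "100_or_above"

# Nominal coding built from the development data only: exact values.
known_labels = {90: "temp_90", 220: "temp_220"}

# Future samples; 95 and 150 never appeared during development.
future_values = [90, 220, 95, 150]

for v in future_values:
    exact = known_labels.get(v)  # None when the value was never seen
    print(v, exact, bin_temperature(v))
```

The exact-value lookup has no answer for 95 or 150, while the binned version assigns every numeric value to one of the two groups.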
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi all,
Good observation about binning!
If you keep the variable numerical, then 220 is not only bigger than 90, it is more than double! This could mess with some (linear) model types.
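One way to see what the numerical coding implies for a plain linear model is the small least-squares sketch below. The response values are invented, and `fit_line` is a hypothetical helper written for this example, not a RapidMiner operator. It only covers an unregularized straight-line fit; learners that are sensitive to feature scale (for example regularized or distance-based ones) can behave differently.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: temperature and some numeric response.
temps = [90, 90, 220, 220]
response = [1.0, 3.0, 7.0, 9.0]

# Coding A: raw temperature as a number.
a0, a1 = fit_line(temps, response)

# Coding B: dummy variable, 0 for 90 and 1 for 220.
dummy = [0 if t == 90 else 1 for t in temps]
b0, b1 = fit_line(dummy, response)

# Both codings give the same fitted values at the two observed temperatures.
print(a0 + a1 * 90, b0)
print(a0 + a1 * 220, b0 + b1)

# Only the numeric coding can say anything about an unseen value such as 150,
# and it does so by assuming the effect is linear in temperature.
print(a0 + a1 * 150)
```

With only two observed values, the two codings agree on the training data; they differ in the scale of the coefficient (per degree versus per category) and in what they assume about temperatures in between.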
Regards,
Sebastian