Replicating RapidMiner RandomForest Results in Python
Hi, I have a Random Forest binary classification model which, after dimensionality reduction, has 13 variables. Most are numeric. However, I also have a date and a couple of polynominal attributes (e.g. SIC code). I am getting an accuracy of almost 75%, which, given the complexity of the problem, I am reasonably pleased with.
However, I would now like to try to replicate the RapidMiner results in Python. To do so, I would like to understand a little better how RapidMiner handles string data in its calculations. For example, one of my string attributes is a SIC code (Standard Industrial Classification). These codes look numeric, but I am treating them as polynominal to stop the algorithm from trying to assign an order of importance to them, which wouldn't make sense.
When it comes to attributes like these, I don't know how RapidMiner is using them. Python libraries like sklearn require all Random Forest inputs to be numeric and suggest things like one-hot encoding for converting non-numeric data to numeric. However, there are over 800 unique SIC codes in my data, so one-hot encoding is not practical, and the SIC code appears to be an attribute of very high importance, which I cannot just remove.
Is RapidMiner performing one-hot encoding in the background here?
What Python library should I use to behave most like RapidMiner, i.e. one that allows polynominal attributes and dates?
Answers
Machine learning models differ in which attribute and label types (numerical, nominal) they can handle. The Random Forest algorithm handles both numerical and nominal attributes. If you need to encode the SIC code for an SVM, which cannot handle nominal attributes, try the dummy coding or unique integers methods in the "Nominal to Numerical" operator.
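For anyone doing the same thing on the Python side, here is a minimal pandas sketch of the two encodings mentioned above. The column name `sic_code` and the toy values are made up for illustration, and this is only a rough analogue of the RapidMiner operator, not a claim about how it works internally.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"sic_code": ["2834", "7372", "2834", "6021"]})

# Dummy coding: one 0/1 column per distinct value of the attribute.
dummies = pd.get_dummies(df["sic_code"], prefix="sic", dtype=int)

# Unique integers: each distinct value gets an arbitrary integer ID.
# Note that the IDs carry no meaningful order, even though they are numbers.
codes, uniques = pd.factorize(df["sic_code"])

print(dummies)
print(codes, list(uniques))
```

With 800+ distinct codes, the dummy-coded frame would have 800+ columns, whereas the integer coding stays a single column but introduces an artificial ordering.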
In other cases, e.g. zip codes in the United States, the attribute looks numeric (10003, 02184), but we want to make it nominal to keep the leading zeros in the zip code. We use the "Numerical to Polynominal" operator to convert the zip codes.
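The same leading-zero issue shows up when loading such a column in Python. As a small sketch (the CSV content and the column names `zip` and `label` are invented for the example), reading the column as text in pandas keeps the zeros:

```python
import io
import pandas as pd

# Hypothetical CSV content; only the two zip codes come from the post above.
csv_text = "zip,label\n10003,a\n02184,b\n"

# Default parsing infers an integer column, so the leading zero is lost.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to string keeps the codes exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

print(as_numbers["zip"].tolist())
print(as_text["zip"].tolist())
```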
HTH!
You should find a function in your environment that implements this functionality.
However, dummy coding is of course functionally equivalent - it just creates hundreds of new 0/1 attributes.
As an optimization, you could look at your trees in RapidMiner and check whether only a few values of the nominal attributes are actually relevant. You would then keep only these and change the rest to a constant value.
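A rough sketch of that optimization in pandas, assuming you have already read the relevant values off the trees. The `relevant` list, the column names, and the placeholder `"OTHER"` are all hypothetical:

```python
import pandas as pd

# Hypothetical toy data standing in for the real SIC code column.
df = pd.DataFrame({"sic_code": ["2834", "7372", "2834", "6021", "0111", "7372"]})

# Assumed list of values that actually appear in the tree splits.
relevant = ["2834", "7372"]

# Keep the relevant values and replace everything else with one constant.
df["sic_reduced"] = df["sic_code"].where(df["sic_code"].isin(relevant), "OTHER")

# Dummy coding now yields len(relevant) + 1 columns instead of hundreds.
dummies = pd.get_dummies(df["sic_reduced"], prefix="sic", dtype=int)

print(df)
print(dummies)
```

The reduced column can then be dummy coded and passed to a library that needs numeric inputs, at the cost of losing whatever signal the collapsed values carried.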
Dortmund, Germany
Okay, sorry. That's what I meant as well!
BR,
martin
Dortmund, Germany
@SGolbert - thank you - I agree that leaving variables as polynominal isn't desirable. Unfortunately, as indicated in my OP, the SIC codes are apparently predictive, and I can't confirm that for certain until I clean them properly, so I cannot just omit them. Secondly, SIC codes are not my only polynominal variable - there are also occupation codes, which may be highly predictive too. And there are many of those as well.