Data mining
Hi guys, I was working on an assignment and ran into a problem, and I don't know where to start. I'm really new to RapidMiner and would like to know if anyone could help me. I have to estimate Feature 8, which is the number of maintenance interventions the device has had. What can I do? Thanks, André
Best Answer
yyhuang
Association and correlations are not meant for predictions. They are more like “descriptive” models. What is your purpose here?
Answers
I worked on your training data a bit to build regression trees based on clean features. The predictive model performs pretty well with 10-fold cross-validation. The RMSE is as follows:
My process is attached for your reference.
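For anyone reading without the attached process, here is a rough sketch of what "10-fold cross-validation RMSE" means. It is not YY's RapidMiner process: it swaps in a tiny k-nearest-neighbours regressor instead of regression trees, runs on synthetic data invented for the illustration, and pools the squared errors over all held-out rows (averaging per-fold RMSE is another common convention). All names are hypothetical.

```python
import math
import random

def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k nearest training rows (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cross_val_rmse(X, y, n_folds=10, k=3, seed=0):
    """RMSE pooled over the out-of-fold predictions of an n_folds-fold split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    # Every row lands in exactly one fold and is predicted exactly once.
    folds = [idx[f::n_folds] for f in range(n_folds)]
    squared_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            pred = knn_predict(train_X, train_y, X[i], k)
            squared_errors.append((y[i] - pred) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Synthetic stand-in data (assumption): two numeric features, noisy linear target.
rng = random.Random(42)
X = [(rng.uniform(0, 10), rng.uniform(0, 5)) for _ in range(200)]
y = [2.0 * a + b + rng.gauss(0, 0.5) for a, b in X]

rmse = cross_val_rmse(X, y)
print(f"10-fold CV RMSE: {rmse:.3f}")
```

The point is that every prediction is made by a model that never saw that row, so the RMSE estimates the error on new devices rather than on the training data.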
Cheers,
YY
Thanks
André
I used the CSV files you posted in another thread. They are attached here as well.
Cheers,
YY
What can I understand from this?
P.S. feat1 could potentially cause some data leakage if we apply target encoding to categorical attributes with so many values. I don't have the context here, but you can try dropping it by configuring "Target Encoding".
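To see why a categorical attribute with very many values can leak, here is a small sketch. It assumes the extreme case where every row has its own category (the IDs and target values below are made up, not taken from the CSV files). The leave-one-out variant is one common mitigation that I am adding for illustration; it is not what the RapidMiner operator necessarily does, and simply dropping the column remains the simpler option.

```python
from collections import defaultdict

def category_stats(categories, targets):
    """Sum and count of the target per category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return sums, counts

def naive_target_encode(categories, targets):
    """Replace each category with the mean target of all rows that share it."""
    sums, counts = category_stats(categories, targets)
    return [sums[c] / counts[c] for c in categories]

def leave_one_out_encode(categories, targets):
    """Same idea, but a row's own target is excluded from its category mean."""
    sums, counts = category_stats(categories, targets)
    global_mean = sum(targets) / len(targets)
    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            encoded.append(global_mean)  # singleton category: fall back to overall mean
    return encoded

# Hypothetical data: one distinct ID per row, target = number of interventions.
device_ids = ["dev-a", "dev-b", "dev-c", "dev-d"]
interventions = [0.0, 3.0, 1.0, 7.0]

naive = naive_target_encode(device_ids, interventions)
loo = leave_one_out_encode(device_ids, interventions)
print(naive)  # identical to the target column: the "feature" is the answer
print(loo)    # every row falls back to the overall mean, 2.75
```

With the naive encoding the model can read the target straight off the encoded column, which looks great in training and fails on unseen devices.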
P.P.S. You can round up the predictions after scoring if you prefer integers.
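A minimal sketch of that post-processing step, outside RapidMiner. The post doesn't say whether "round up" means rounding to the nearest integer or always rounding upwards, so the hypothetical helper below offers both. Clipping at zero is my own assumption, on the grounds that a count of interventions cannot be negative.

```python
import math

def to_count(pred, mode="nearest"):
    """Turn a real-valued prediction into a non-negative integer count."""
    if mode == "ceil":
        value = math.ceil(pred)         # always round upwards
    else:
        value = math.floor(pred + 0.5)  # round half up (built-in round() rounds halves to even)
    return max(0, value)

# Made-up regression outputs for illustration.
predictions = [0.2, 1.5, 2.49, -0.3]
nearest = [to_count(p) for p in predictions]
ceiled = [to_count(p, "ceil") for p in predictions]
print(nearest)  # [0, 2, 2, 0]
print(ceiled)   # [1, 2, 3, 0]
```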
HTH!
André
I hope it makes sense
André
According to your definition, the model is predicting "Feat 8, which is the number of maintenance interventions."
I would stick to regression models (KNN, regression tree, Random Forest, GLM and GBT are good choices for regression) because you are predicting a numerical target. If the target is categorical, say true/false or broken/normal, then go with classification.
Besides visualization for data exploration and outlier detection, you can also use some of the outlier detection models (e.g. the Tukey test for exponential distributions...)
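As a rough illustration of a Tukey-style check, the sketch below implements the standard Tukey fences (the 1.5 x IQR rule). That is my reading of "Tukey test" here; it does not cover any variant adapted to exponential distributions, and the sample values are invented.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up attribute values with one suspicious reading.
readings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]
outliers = tukey_outliers(readings)
print(outliers)  # [40]
```

Flagged rows are worth inspecting before removal, since on a heavily skewed attribute this rule tends to flag many legitimate large values.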
I fully understand why you use the regression method and why the classification method is not the best fit. But I was at a loss as to why you don't use, for example, the associations & correlations method. Is there a reason?