"RapidMiner and R, where an Integration is necessary"
Hello,
I wrote my bachelor thesis about the integration of R in RapidMiner and its potential for data mining.
While working on it, and even after finishing the thesis, one question remained unanswered.
Are there any processes that cannot be done with RapidMiner operators alone, so that the integration of R (Execute R operator) is a must? If so, which ones?
Do you have any examples for a situation like this?
best regards
Answers
I think there is never a "must", because you can always write native RM operators. However, experience shows that the barrier to writing R or Python is lower than for native Java.
For me, the use cases for R or Python are either a file format for which no native RM operator is available (or web services using OAuth2...) or plotting. If you want to produce plots for scientific papers, you might prefer R's ggplot over the standard RM plots.
~Martin
Dortmund, Germany
Is it possible to get your thesis somehow (either online or as a copy)?
I'm always interested in academic work about RapidMiner.
Sven
Sorry for the long wait.
Actually, the thesis is written in German. I guess you guys cannot do much with it.
Thanks for the replies though.
If there are any use cases where you think R is much better to use than RM, let me know!
best regards
Eike
Even though your thesis is written in German, I am very interested in reading it. (German is my 4th language.)
Sven
@Sven: I didn't know that you speak German. Good to know!
Dortmund, Germany
For example, Decision Trees and Random Forests in RapidMiner only do classification, not regression. The ones in R do both.
Also, I particularly like the forecast R package that includes state of the art time series algorithms for automatic forecasting. (ARIMA, ETS, ...)
Spatial statistics for geographic information is another topic that's not supported in stock RapidMiner. I recently tried an approach with the built-in scripting and some libraries with some encouraging results.
I got the chance to have a look at your thesis. I would like to make one remark:
On page 21 you show how you implement Naive Bayes in RM, and afterwards you show how you do it in R. However, in R you do something different: you learn a Naive Bayes model on the data and apply it to the same data with no validation. This is wrong if you want to get a predictive value out of it.
This runs into a thought I have quite often: a program should allow you to do things easily. I am not sure how X-Val works in R, but apparently it requires some more work, so this leads to quick-and-dirty rather than correct solutions. This is a major disadvantage of R compared to RM.
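To make the difference concrete, here is a minimal sketch of the two evaluations side by side. It uses Python with scikit-learn rather than R, and the iris dataset and the 10 folds are assumptions chosen only for illustration; they are not taken from the thesis.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Illustrative data set (an assumption, not the data used in the thesis).
X, y = load_iris(return_X_y=True)

# Resubstitution: fit and score on the same rows, which gives an optimistic estimate.
model = GaussianNB().fit(X, y)
train_accuracy = model.score(X, y)

# 10-fold cross-validation: every fold is scored on rows the model did not see.
cv_scores = cross_val_score(GaussianNB(), X, y, cv=10)
cv_accuracy = cv_scores.mean()

print(f"accuracy on the training data:    {train_accuracy:.3f}")
print(f"10-fold cross-validated accuracy: {cv_accuracy:.3f}")
```

Only the second number says something about how the model behaves on unseen data.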
Dortmund, Germany
I am a big fan of both programs. Without any doubt, RapidMiner is easier to use.
I teach in an MBA program. At some point I used R in my statistics courses, and students crucified me for subjecting them to such torture.
I use RapidMiner in a 2nd-year course on data mining, and students have never complained.
Having admitted that, obviously R has advantages.
How long does it take for RapidMiner to add a new algorithm to the list of available algorithms?
I'm waiting for Random Forest for prediction problems (continuous dependent variable). And Random Forest is hardly a new method.
Just to be fair to R: with the right library, doing machine learning can be fairly straightforward. An example with the wonderful caret library: 4 lines of code!
Here I attach code for comparing 4 different models in R (Linear Regression, Stochastic Gradient Boosting, Random Forest and KNN):
https://s3.amazonaws.com/mirlitus/DM-INCAE-3.R
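For readers without R at hand, a rough scikit-learn analogue of such a four-model comparison might look like the sketch below. It is not a port of the linked script: the diabetes dataset, the 5 folds, the RMSE metric and all hyperparameters are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative regression data with a continuous target (an assumption).
X, y = load_diabetes(return_X_y=True)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # KNN is distance based, so the features are scaled inside each fold.
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X, y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    results[name] = -scores.mean()  # flip the sign back to a plain RMSE

# Print the models from lowest to highest cross-validated error.
for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:>18}: RMSE {rmse:.1f}")
```

All four models are scored on the same folds, so the errors are directly comparable.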
I am totally on your side and can fully understand your line of thought. I am a big fan of Python myself, even though it takes me way longer to build things.
I integrated scikit-learn's random forest into RM with Python like this: http://data-analytics.ghost.io/how-to-get-the-best-out-of-python-and-rapidminer/ It should work for R too. I am no expert in R, so I cannot judge that.
I am totally convinced that R is capable of all of these things. I am further convinced that standard problems are easily solvable in R. But I encountered the following problem in Python, which drove me nuts:
I wanted to train an algorithm together with its preprocessing (normalization and PCA) in an x-val. I wanted to evaluate the method with a custom performance measure that includes example weights and confidence information. Afterwards I wanted to optimize on that.
Most of it could be done, but you needed to get deep into sklearn's pipelines. The custom scoring function was only possible if you did the x-val in a more manual fashion (k-fold in sklearn, own class etc.).
Can you show me how to do the following?
- Learn normalization, PCA and model TOGETHER in an x-val
- Use a custom scoring function, let's say weighted accuracy for confidences > 0.75
- Optimize this
I think this is kind of a complex task. In RM this is easy (at least it feels easy to me).
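One possible way to approach the three points above in scikit-learn is sketched here. Several things are assumptions for illustration only: the breast cancer dataset, logistic regression as the model, the parameter grid, and the reading of the scoring rule as "accuracy over the rows whose top class confidence exceeds 0.75". The helper `confident_accuracy` is a hypothetical name, and example weights are not covered.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

CONFIDENCE_THRESHOLD = 0.75  # threshold taken from the question above


def confident_accuracy(estimator, X, y):
    """Hypothetical scorer: accuracy over rows with confidence above the threshold."""
    proba = estimator.predict_proba(X)
    confidence = proba.max(axis=1)
    predicted = estimator.classes_[proba.argmax(axis=1)]
    mask = confidence > CONFIDENCE_THRESHOLD
    if not mask.any():
        return 0.0  # assumption: a fold without confident predictions scores 0
    return float(np.mean(predicted[mask] == np.asarray(y)[mask]))


X, y = load_breast_cancer(return_X_y=True)

# Scaling, PCA and model are refit together on the training part of each fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Illustrative grid (an assumption): step name, two underscores, parameter name.
param_grid = {
    "pca__n_components": [2, 5, 10],
    "model__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring=confident_accuracy,  # a callable with signature (estimator, X, y)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best confident accuracy: {search.best_score_:.3f}")
```

Because the preprocessing lives inside the pipeline, it never sees the held-out fold, and the grid search optimizes directly on the custom score.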
Best,
Martin
P.S: Thanks for the great discussion :-)
Dortmund, Germany