Difference between WEKA and RapidMiner
Hi @all,
I don't know if this is the right category for this topic, but...
Can anyone please tell me what are the main differences between WEKA and RapidMiner and what makes RapidMiner so special?
Thanks in advance
JJP
Answers
Hmm, hopefully this will not turn into just another RapidMiner vs. Weka discussion. But anyway, here are some links:
* In the following thread, Martin explains why he and his company preferred RapidMiner and points out some differences:
http://rapid-i.com/rapidforum/index.php/topic,362.0.html
* A Google search for Weka and RapidMiner would have given you the following link, which leads to a statement of mine in the KDnuggets newsletter (I would actually rather not be reminded of this discussion):
http://www.kdnuggets.com/news/2007/n24/5i.html
* There was also a study done for the Data Mining Cup 2007 showing some differences of RapidMiner compared to other open source data mining solutions as well as proprietary ones:
http://www.prudsys.de/Service/Downloads/bin/DMC2007_schieder_tuchemnitz.pdf
Finally, you could also have a look at our KDD 2006 paper, which explains some of the conceptual ideas behind RapidMiner and shows those differences as well. There were also some threads in the old forum at SourceForge that discuss some of the differences.
But in any case: why not simply try RapidMiner and find out for yourself? The learning curve might be steep (hey, data mining is a complicated topic after all...) but it's usually worth the effort.
Cheers,
Ingo
Judging from the benchmark between Weka, RapidMiner and KNIME, RM is a bit weak in data preparation. The solution is "datacleaner", available here:
http://datacleaner.eobjects.org
It is pure Java with query batch optimization, and so efficient that for some analysis purposes you do not need data mining at all. Clementine has a "Data Quality Audit" showing feature histograms. With a tool such as "datacleaner", it can go back to bed.
c.v.
Thanks for providing the link to the Data Cleaner project. However, I am not aware of any data cleaning or data preprocessing functionality offered by Data Cleaner that is not already provided by RapidMiner. Could you name any?
RapidMiner actually provides significantly more data preprocessing functions and operators than Weka, KNIME, and SPSS Clementine. Feature histograms are also available in RapidMiner and RapidMiner also provides many data cleaning features. If you are not aware of those, I can recommend the RapidMiner training course on Advanced Data Preprocessing for Data Mining with RapidMiner as well as a series of webinars on data preprocessing and data cleaning with RapidMiner.
Best regards,
Ralf
I think that ETL tools and Data Mining Tools cannot be compared directly.
This can be illustrated by how the data flow is organized in an ETL tool such as Kettle (Pentaho Data Integration): in iterators. A Kettle process is a good one if every step processes only one row at a time. This way you can load, process and save the data in small portions instead of loading everything into memory at once (like R, *snicker*). RapidMiner has improved regarding such tasks, but as far as I can see it is still not possible to (e.g.) load data row-wise from a CSV file. I know it is possible to do this by loading data from a database, but then it is not possible to monitor the processed rows, i.e. ...
- show the current process state of the rows
- if a row could not be processed without an error, store it in an extra file to check manually what has happened
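The row-wise style described above (one row at a time, visible progress, failed rows set aside in an extra file) can be sketched in plain Python. This is only an illustration using the standard `csv` module, not how Kettle or RapidMiner implement it; the function name `stream_csv`, its parameters and the reject-file layout are assumptions made up for this example.

```python
import csv
from typing import Callable, Dict, Tuple

Row = Dict[str, str]


def stream_csv(in_path: str, out_path: str, reject_path: str,
               transform: Callable[[Row], dict],
               report_every: int = 1000) -> Tuple[int, int]:
    """Process a CSV file one row at a time (hypothetical helper).

    Rows for which `transform` raises are written to `reject_path`
    together with the error message instead of aborting the run.
    Returns (rows_written, rows_rejected).
    """
    written = rejected = 0
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst, \
         open(reject_path, "w", newline="") as rej:
        reader = csv.DictReader(src)
        writer = None  # created lazily, once the output columns are known
        reject_writer = csv.writer(rej)
        reject_writer.writerow(["line", "error", "raw_row"])
        for row in reader:
            try:
                result = transform(row)
            except Exception as exc:  # keep going, but remember the bad row
                reject_writer.writerow([reader.line_num, str(exc), dict(row)])
                rejected += 1
            else:
                if writer is None:
                    writer = csv.DictWriter(dst, fieldnames=list(result))
                    writer.writeheader()
                writer.writerow(result)
                written += 1
            seen = written + rejected
            if seen % report_every == 0:  # show the current process state
                print(f"{seen} rows seen, {rejected} rejected")
    return written, rejected
```

Only one row is held in memory at any time, so the input size does not matter. A step such as sorting cannot be written this way, because it needs to see all rows before it can emit the first one.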
If one step does not satisfy this condition (like sorting), the process gets really slow. Pentaho Corp has bought Weka to include the data mining framework into their application (http://wiki.pentaho.com/display/DATAMINING/Using+the+Knowledge+Flow+Plugin), but frankly: I do not think that embedding one dataflow philosophy into another one was a good idea.
Another point is the separation of data management and data analysis. Departments have to talk to each other, but in general I think these are different areas with different targets and responsibilities.
Conclusion:
I would use ETL tools for cleaning (which does not include steps like discretization, but rather steps like duplicate checking), for managing data, and for shifting data from one source to another. Let the DW specialists take care of that. But when it comes to solving actual data mining problems, I would ask the DW guys to tell me how to get exactly the data I want and then perform the analysis with RapidMiner.
my (of course subjective) point of view
kind regards,
Steffen
I agree that the way the data flow is organized is a major differentiator between most ETL and data mining tools. And Kettle and Weka do have different flow logics. However, to some extent, RapidMiner offers both flow logics:
Ralf
- It prepares preprocessing, verifying a few consistency points in your data
- It gives you the main pattern to use in predicates or in regexps when you do linguistic analysis, NER, indexing, etc...
For each string, "String analysis" can give the number of blank spaces (useful for trimming), lower/uppercase, and the number of words in a string (string vs nominal).
In another profiler (FEBRL, to name it), you can use distances between words, for the phonetic indexing that is widespread in data quality:
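As a rough illustration of the per-string measures listed above, here is a small Python sketch. It is not DataCleaner's API; the function name `profile_string` and the keys of the returned dictionary are assumptions chosen for this example.

```python
def profile_string(value: str) -> dict:
    """Return simple profiling measures for one string (hypothetical helper)."""
    return {
        "length": len(value),
        # Blank spaces anywhere in the string.
        "blank_spaces": value.count(" "),
        # True if trimming would change the value.
        "needs_trimming": value != value.strip(),
        "uppercase_chars": sum(ch.isupper() for ch in value),
        "lowercase_chars": sum(ch.islower() for ch in value),
        # Many words suggest free text, a single word a nominal value.
        "word_count": len(value.split()),
    }


if __name__ == "__main__":
    print(profile_string("  New York "))
```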
- soundex, phonex, phonix, metaphone, NYSIIS, etc...
- block/canopy indexing
Other distances, such as Jaro-Winkler or Levenshtein, are available here:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
All this stuff is indeed string data quality and is not in RapidMiner, except for a few algorithms in TextInput (TF-IDF, cosine distance).
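To make two of the measures named above concrete, here is a minimal Python sketch of the Levenshtein edit distance and of the classic American Soundex code. It is written from the textbook definitions and is not taken from FEBRL or from the linked string metrics library; the function names are assumptions for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


_SOUNDEX_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                   "4": "l", "5": "mn", "6": "r"}
_SOUNDEX_CODES = {ch: digit
                  for digit, letters in _SOUNDEX_GROUPS.items()
                  for ch in letters}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    last_code = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = _SOUNDEX_CODES.get(ch, "")  # vowels have no code
        if code and code != last_code:
            encoded += code
        last_code = code
    return (encoded + "000")[:4]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(soundex("Meyer"), soundex("Meier"))
```

Names that sound alike, such as "Meyer" and "Meier", receive the same Soundex code, which is why such codes are used as keys for phonetic indexing.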
c.v.
http://www.slideshare.net/MayurSurani/data-mining-tools-45159317
This makes a clear statement that RapidMiner is the best.
I tried to visit the first suggested site for a comparison between WEKA and RapidMiner
http://rapid-i.com/rapidforum/index.php/topic,362.0.html
and got Error 404 file not found. As it was the first link that I had ever followed from the Community, I thought it reasonable to report the problem.
Best wishes for an exciting venture
Laurie
I also tried to follow the link to
http://www.prudsys.de/Service/Downloads/bin/DMC2007_schieder_tuchemnitz.pdf
On that occasion I got
Seite nicht gefunden. (which I translate as "site not found")
Keep up the good work, but check some of the links
Best wishes
Laurie
Hi Laurie,
Thanks for reporting. We migrated from an old forum system to this new community portal a couple of months ago and unfortunately not all links were automatically replaced during this migration process. So whenever you see a link still going to "...rapid-i.com/...", it is unfortunately not going to work.
But anyways: Have fun here in the community,
Ingo