distributed data mining support
As data grows larger and larger, data mining algorithms can no longer finish their computations in a reasonable time.
Distributed data mining might be a good solution. There are already data mining algorithms built on MPI, and MapReduce-based implementations such as Apache Mahout, which runs on Hadoop.
Google now offers a Prediction API for data mining that supports datasets of up to 100M. I think large-scale data mining is becoming a common requirement. In my opinion, this is also a good chance for RapidMiner to surpass Clementine and the other commercial tools and become the number one data mining tool in the world.
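To illustrate the MapReduce idea mentioned above, here is a minimal single-JVM sketch in plain Java. It is not Hadoop or Mahout code, just the programming model: map every record to a (key, 1) pair, shuffle the pairs by key, and reduce each group by summing. The example records and the "class label is the third column" layout are made up for illustration only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Single-JVM illustration of the MapReduce idea behind frameworks like
// Hadoop/Mahout: map each record to (key, 1), shuffle by key, reduce by
// summing. The records and the "label in the third field" layout are
// assumptions made for this sketch.
public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "5.1,3.5,setosa", "6.2,2.9,versicolor", "5.0,3.4,setosa");

        Map<String, Integer> classCounts = records.parallelStream()
                // map phase: emit (label, 1) for every record
                .map(r -> new SimpleEntry<>(r.split(",")[2], 1))
                // shuffle + reduce phase: group by label and sum the 1s
                .collect(Collectors.groupingBy(SimpleEntry::getKey,
                        Collectors.summingInt(SimpleEntry::getValue)));

        System.out.println(classCounts); // e.g. {setosa=2, versicolor=1}
    }
}
```

In Hadoop or Mahout the same map and reduce functions are executed on many machines over blocks of a distributed file system, which is what makes the model scale beyond one computer.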
Answers
At RCOMM 2010 there was a talk by Alexander Arimond on integrating MapReduce into RapidMiner.
Currently it is specific to a few algorithms, but we have already started conversations about extending it into a general distributed-computing plugin for RapidMiner. Of course it is not a matter of weeks to get such an extension out, but I would be surprised if we didn't have it within a year.
I have experimented with running RM on a distributed LSF cluster with hundreds of cores and hundreds of gigabytes of RAM. It does work in its current state for independent computations like cross-validation or parallel parameter optimization, though I doubt it is optimized for such a system. Keep us posted.
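To make the "independent computations" point concrete, here is a minimal sketch of parallel k-fold cross-validation: every fold is a self-contained train/evaluate task, so the folds can simply be handed to a thread pool (or, in principle, to cluster nodes). The "model" here is a trivial placeholder (predict the training mean); it is not RapidMiner's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of parallel k-fold cross-validation: each fold is an independent
// train/evaluate task. The learner is a placeholder (predict the mean of
// the training part), used only to show the parallel structure.
public class ParallelCrossValidation {

    // Placeholder evaluation: "model" = mean of training values,
    // score = mean absolute error on the held-out fold.
    static double scoreFold(double[] data, int fold, int k) {
        double trainSum = 0, trainN = 0, err = 0, testN = 0;
        for (int i = 0; i < data.length; i++) {
            if (i % k == fold) { testN++; } else { trainSum += data[i]; trainN++; }
        }
        double mean = trainSum / trainN;
        for (int i = 0; i < data.length; i++) {
            if (i % k == fold) { err += Math.abs(data[i] - mean); }
        }
        return err / testN;
    }

    public static void main(String[] args) throws Exception {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        int k = 5;
        ExecutorService pool = Executors.newFixedThreadPool(k);
        List<Future<Double>> results = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            final int f = fold;
            results.add(pool.submit(() -> scoreFold(data, f, k))); // independent task
        }
        double total = 0;
        for (Future<Double> r : results) total += r.get();
        pool.shutdown();
        System.out.println("mean CV error = " + total / k);
    }
}
```

On a real cluster you would ship the fold index plus the data location to each node instead of a local thread, but the structure of the computation is the same.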
-Gagi
I think there are two different cases here:
1. The task needs a lot of computation.
2. The task has a lot of data.
In the first case, the parallel extension is great. I have never seen it running on multiple computers, but I think it can be done. Probably it is not well optimized, but it works.
The second case is currently not solved in RapidMiner. If the data does not fit into memory, you have very little (almost no) chance of getting anything done. So I think the main goal of this project should be to provide data analysis operators for very large datasets. Of course, using many machines will also make the computation faster, but my interest is not in optimizing runtime; it is in handling large datasets.
Let me know if you agree or disagree!
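To make the "does not fit into memory" point concrete, here is a minimal sketch of an out-of-core computation: a single pass over a CSV that may be far larger than RAM, keeping only constant-size state (Welford's online mean/variance). The file name "huge.csv" and the "numeric value in column 0" layout are assumptions for the sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of an out-of-core computation: one pass over a CSV that may be far
// larger than RAM, keeping only O(1) state (Welford's online mean/variance).
// The path "huge.csv" and "numeric value in column 0" are assumptions.
public class StreamingStats {
    public static void main(String[] args) throws IOException {
        long n = 0;
        double mean = 0.0, m2 = 0.0;

        try (BufferedReader in = Files.newBufferedReader(Paths.get("huge.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                double x = Double.parseDouble(line.split(",")[0]);
                n++;
                double delta = x - mean;      // Welford update
                mean += delta / n;
                m2 += delta * (x - mean);
            }
        }
        System.out.printf("n=%d mean=%.4f variance=%.4f%n",
                n, mean, n > 1 ? m2 / (n - 1) : 0.0);
    }
}
```

Several model types (Naive Bayes, linear models trained by stochastic gradient descent, and so on) can be built with similar single-pass updates, which is roughly what operators for very large datasets would have to do.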
That makes sense. I was under the impression that RM already has some ability to work directly on databases, which could get around some memory problems and handle huge datasets. Interestingly, RAM is reaching ever higher densities and SSDs are becoming ever faster, so fitting lots of data into fast memory is more and more feasible. I would think there would have to be dramatic improvements in algorithm performance to justify distributing the computation across multiple machines rather than simply adding a RAM disk. The hard part is porting single-threaded algorithms to multithreaded ones with a significant performance gain, but again this depends on the algorithm. If you cannot break the problem down into many independent pieces, there is no hope of distributing it.
-Gagi
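As an illustration of the "work directly on the database" idea, here is a minimal JDBC sketch that pushes an aggregation into SQL so only the small summary result crosses into Java memory. The connection URL, credentials, and the table/column names are placeholders, not anything RapidMiner-specific.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of pushing work into the database instead of loading rows into RAM:
// the GROUP BY runs inside the DB engine and only the tiny summary comes back.
// URL, credentials, and the table/column names are placeholders.
public class InDatabaseAggregation {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost/warehouse"; // placeholder
        try (Connection con = DriverManager.getConnection(url, "user", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT label, COUNT(*) AS n, AVG(value) AS avg_value "
                   + "FROM measurements GROUP BY label")) {
            while (rs.next()) {
                System.out.printf("%s: n=%d avg=%.3f%n",
                        rs.getString("label"), rs.getLong("n"), rs.getDouble("avg_value"));
            }
        }
    }
}
```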