ETL and OLAP
Boris_Petukhov
Member Posts: 1 Learner III
Hi All!
I'm doing some R&D and want to see if RapidMiner can be used in the same way MS SSIS and MS SSAS are used.
In other words, I need to be able to do the ETL work, build a star schema, and publish cubes to the clients.
Users should then be able to connect to the cubes and "slice and dice" the data in any way they need.
Do people use this package for this sort of thing?
Thanks in advance.
Boris Petukhov.
Answers
I guess you've moved on and this answer comes a few months too late for you - but it may help others searching for OLAP and Rapid-I / RM / RapidMiner. RapidMiner (afaik) is primarily a solution focused on data mining and use cases around data mining. It does provide some OLAP processing abilities (see http://www-ai.cs.uni-dortmund.de/LEHRE/VORLESUNGEN/MLRN/WS0809/rm-api/overview-summary.html), but it is not an OLAP engine itself. Its main purpose is also not ETL (although it can be (mis)used for ETL), so if you are looking for something to build a relational data model and an OLAP cube, you might turn to hand-coding or some open source ETL vendors (e.g. Kettle aka Pentaho Data Integration, Talend, or CloverETL) and OLAP or in-memory engines (such as Mondrian and Palo).
Having said this, whether RM is the right tool for you depends on what your use cases are and who your users are. Which architecture and technology types you use is, in my opinion, irrelevant, as long as they fulfill your requirements sufficiently.
Regards
ms
I am a newbie doing R&D on predictive analytics for one of our client processes. The situation is that we already have a huge database full of financial information. We have to identify our potential customers, active agents, and flourishing regions for our marketing department.
I came across RapidMiner, but I can't quite work out how it is going to help us. Could you kindly advise?
We have to set up data marts, design an analysis engine, and produce graphical output.
Does RapidMiner support this?
Well, yes, but RapidAnalytics (the server behind RapidMiner) might be better suited for this. With RapidMiner / RapidAnalytics you can set up a data mart, design every analysis process you can think of in the field of predictive analytics / data mining, use those models for scoring, and make use of the many visualization schemes available within RapidMiner. With RapidAnalytics, you are even able to turn those processes into services which can be directly integrated into your infrastructure, or create the visualizations on the fly for integration.
But since you asked here in the ETL / OLAP thread, I assume that you also want an answer on those topics. The point is that no general answer can be given, as MS has already pointed out. Let me comment a bit on his points:
About ETL
RapidMiner / RapidAnalytics are first of all solutions for (statistical) data analysis and predictive analytics / data mining. So those tools were not designed for performing in-database multi-terabyte transformations in a second, but for providing all the necessary tools for building great analysis processes. Where is the difference? In data mining, the modeling step - although it is often only a single step within a process of hundreds of operators - is usually the bottleneck. Runtimes are high, sometimes exponential, and the data has to be iterated over many times. This is not the optimal setting for databases, and most solutions hence perform those model calculations in-memory instead of in-database, simply because it is much faster. Hence, there is no point in loading / transforming much more data than you can model - you have to take a sample anyway. In such a setting, the traditional ETL approach is not useless, but it is not necessary: there is no point in transforming terabytes of data on the fly when you cannot model that much data later on anyway.
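To make the sampling point concrete, here is a minimal Python sketch (the file name, column names, and sizes are invented for illustration; RapidMiner itself would do this with its own sampling operators): rather than piping the full extract through an ETL process, draw a sample the model can actually handle and work in memory from there.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical full extract; in practice this could be millions of rows.
df = pd.read_csv("all_customers.csv")

# Model on a manageable sample instead of transforming everything:
# if the modeling step cannot handle the full volume anyway, there is
# little point in transforming all of it first.
sample = df.sample(n=100_000, random_state=42)

X = sample.drop(columns="churned")  # assumed numeric feature columns
y = sample["churned"]               # assumed binary target column
model = LogisticRegression(max_iter=1000).fit(X, y)
```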
On the other hand, traditional ETL tools sometimes cannot offer calculations which are useful for data mining. A simple example is an aggregation where the median (instead of a mean, max, min, or count) should be calculated. Since an exact median cannot be computed in a single streaming pass, it is often not available as a built-in aggregate in databases, and hence often not in ETL tools either. But it is still sometimes useful in data analysis.
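As a small illustration of that gap, a sketch in Python/pandas (the data is made up): a per-group median is a one-liner in an analysis environment, while many SQL databases offer no built-in MEDIAN aggregate.

```python
import pandas as pd

# Made-up transaction amounts per customer.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "amount": [10.0, 200.0, 30.0, 5.0, 7.0, 99.0],
})

# Median per group: trivial here, but often unavailable as a plain
# SQL aggregate (unlike SUM, MIN, MAX, AVG, or COUNT).
print(df.groupby("customer_id")["amount"].median())
```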
So we have two arguments here: first, it is often not necessary to pipe the full data through an ETL process, since the modeling cannot be done on that much data anyway. Second, many ETL tools have restrictions for exactly this reason, namely that as much as possible should be done in a pipelined fashion and / or directly in-database.
So what is the RapidMiner approach I often refer to as "Analytical ETL"? By default, data is retrieved from a database into memory, transformed and modeled there, and the results are written back. This is appropriate if data mining is your primary goal. But there is another powerful option which many RapidMiner users overlook: almost all preprocessing operators also provide an option "create view", which means that the data is not changed and stored, but all calculations are made on the fly. If you now read your data and transform it batch-wise (which is possible by using the appropriate input operators, or by creating the batches in loops yourself and making use of the limit definitions in your database), scalability is no longer an issue. You can transform data sets of arbitrary size this way.
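A rough sketch of that batch idea in Python with SQLite (table names, columns, and the transformation are invented; in RapidMiner this would be done with input operators or loops plus the database's limit clause): each batch is read, transformed on the fly, written back, and then discarded, so memory usage stays constant regardless of table size.

```python
import sqlite3

# Hypothetical source and target databases.
BATCH_SIZE = 100_000

src = sqlite3.connect("source.db")
dst = sqlite3.connect("warehouse.db")
dst.execute("CREATE TABLE IF NOT EXISTS facts (customer_id INTEGER, amount REAL)")

offset = 0
while True:
    # Read one batch via the database's limit definitions.
    rows = src.execute(
        "SELECT customer_id, amount FROM transactions LIMIT ? OFFSET ?",
        (BATCH_SIZE, offset),
    ).fetchall()
    if not rows:
        break
    # On-the-fly transformation (here a made-up currency conversion);
    # nothing is materialized beyond the current batch.
    transformed = [(cid, amount * 0.85) for cid, amount in rows]
    dst.executemany("INSERT INTO facts VALUES (?, ?)", transformed)
    offset += BATCH_SIZE

dst.commit()
```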
So, yes, it is possible to perform ETL processes with RapidMiner, and you can even do things which are not possible with other tools around. Are other ETL tools hence useless? Of course not: if your primary goal is ETL and not data mining, or if processing time is a really important issue (don't start with data mining then), those tools are the way to go. If, however, the primary goal is the analysis, you are often really fine with only using this "Analytical ETL" approach of RapidMiner. In fact, we have done more than 200 projects now and have never needed an additional ETL tool, but did everything with RapidMiner processes.
About OLAP
Again, MS is absolutely right: RapidMiner / RapidAnalytics is not an OLAP engine by itself. We will, however, release a new extension this year which will make working with cubes possible directly within our products. But for the moment, the best idea would be to use another tool for OLAP until you end up with a table which can be fed into RapidMiner.
And I fully agree with his statement that the choice of architecture and technology is irrelevant as long as it fulfills your requirements.
Thanks for this discussion. Cheers,
Ingo
Thank you for these valuable suggestions and points.
Cheers
Following up on yesterday's discussion, I would like to clarify some more points with you. At present, our customer data is about 9 million records in the database. Considering scalability, down the line in 4 years or so, it may cross 13 to 14 million entries.
Can RapidMiner / RapidAnalytics cope with such a huge volume of data?
Is it scalable to that extent?
Yes, if you use a 64-bit machine with sufficient memory, you can work directly on this amount of data without having to think about batch + view processing at all. Of course it depends on the concrete preprocessing processes, but in general RapidMiner / RapidAnalytics should be able to work directly on data sets of that size. If your hardware is not sufficient, you can always change to the batch + view approach stated above. So no problem with this.
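A back-of-envelope calculation (the attribute count and value size are purely assumptions for illustration) shows why 13 to 14 million rows are usually fine in memory on a 64-bit machine:

```python
# All numbers except the row count are assumptions for illustration.
rows = 14_000_000        # projected customer records
attributes = 50          # assumed number of numeric attributes
bytes_per_value = 8      # one double-precision value

gb = rows * attributes * bytes_per_value / 1024**3
print(f"~{gb:.1f} GB")   # ~5.2 GB of raw values: feasible with enough RAM
```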
As a side note: recently, we ourselves successfully processed 120 million transactions - some parts were done per batch, some parts even directly in the database by sending SQL statements for certain preprocessing steps. After those RapidMiner processes, the data was condensed in a way that it fit perfectly into memory, and we were able to create the desired models. Actually, the processes we created would have been able to handle far more tuples than the 120 million we got from our customer - although running time started to become the limiting factor (the complete ETL + modeling process took about 4 days).
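For readers wondering what "directly in the database" can look like, a minimal Python/SQLite sketch (table and column names are invented): the heavy aggregation runs as SQL inside the database, and only the condensed per-customer result is pulled into memory for modeling.

```python
import sqlite3

# Hypothetical transaction store.
conn = sqlite3.connect("transactions.db")

# The aggregation is pushed down to the database as plain SQL, so the
# raw transactions never have to be loaded into memory.
condensed = conn.execute(
    """
    SELECT customer_id,
           COUNT(*)    AS n_transactions,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount
    FROM transactions
    GROUP BY customer_id
    """
).fetchall()

# 'condensed' now holds one row per customer instead of one row per
# transaction: small enough to fit in memory and feed to a model.
print(len(condensed), "customers")
```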
Hope that helps. Cheers,
Ingo
That's great to hear, and thank you for your prompt responses.
Cheers.
Does RapidMiner offer anything like this GPU acceleration?
http://www.jedox.com/de/produkte/palo-gpu-accelerator.html
Best regards.
Not yet, but there are several groups working on GPU support right now all over the world. I have also seen the first amazing results recently (a speed-up by a factor of several hundred), and I am pretty positive we will hear more about that during RCOMM 2011 this year in Dublin.
Another interesting - although not open source - option would be the combination of RapidAnalytics with Ingres VectorWise. We are working a lot with Ingres on the integration and have achieved speed-ups of up to a factor of 100 for several mining schemes. Some results on this were presented at last year's RCOMM, by the way.
Cheers,
Ingo