Is 16GB of RAM the only way to go?
Hello everybody, I'm new here and new in general to Data Mining...
I've been reading some threads here and have also been through the WEKA mailing list, and I have come to the sad conclusion that the only way to process large and complex streams of data is to have as much RAM as possible - using RapidMiner or WEKA, of course. I mention this because I have been shown some pretty crazy flows in Clementine being run successfully on modest systems with 4, 2, or even 1 GB of RAM. I've been told that Clementine creates a lot of temporary files on the hard disk and has somewhat optimized stream execution code - optimized at least enough to let you put in anything you wish and let your HDD space handle it, without having to worry about RAM, heap sizes, and the like. Is this correct?
The first thing I tried to do with RapidMiner (on a 4 GB RAM rig) was to convert a 100-field x 3,000,000-record SPSS file into ARFF, and it wouldn't get past the Read SPSS node: out of memory error! I ran into the same thing with WEKA, which contrasts sharply with Clementine handling far more data with only 1 GB of RAM.
Regarding having to use 16 GB of RAM as a rule... am I sadly right? Is it not possible, for example, to make RapidMiner use Windows' virtual memory? Set it to any crazy amount and let RapidMiner use it - that would be a charm. It probably isn't very efficient at all, but hey, it's definitely better than not being able to get the job done at all.
On the other hand, do the enterprise versions of RapidMiner have optimized stream execution code? If I buy the software, how would it cope with my need for huge data flows?
I'm no programmer and I couldn't help you with anything, but come on, guys! If SPSS can manage huge amounts of data and flows, then you should be able to do so as well! Remember, they also use Java... it's not like there is a language limitation, right?
Thank you for your great program, and for your kind attention.
Cheers.
Answers
-Gagi
I understand what you are saying - I have read about it in books and my partner has told me about it as well. Data MODELLING shouldn't be about throwing in huge chunks of input and expecting to get something out of it. But in order to clean that huge chunk, I think the data mining tool should be able to manage it. The huge streams I saw at my partner's Clementine workstation didn't include a single bit of modelling; they were all data exploration, comprehension, and preparation streams.
Anyway, consider the situation where many features have predictive value. It would be a pity to prune them down just for memory's sake. I still believe that, beyond all the data mining suggestions, normal practices, and common scenarios, the tools should be prepared for any kind of stream. I mean, all the books say it on the first page: data mining could be the evolution of statistics to cope with the massive amount of data available today. But it seems the memory issue - what should be a little catch - puts great open source applications such as this one with their backs against the wall when compared to licensed software such as Clementine. I still ask, though, whether the licensed versions of RapidMiner have optimized stream execution.
I'm sorry if I'm talking nonsense; please remember I'm just starting out and my education comes only from my partner, a bunch of books, and playing around a little bit with RapidMiner, Clementine, and WEKA.
Thanks!
Please don't make this an open vs. closed source discussion: that is simply not the issue. If Clementine has built-in streaming: fine. So has RapidMiner, but it is simply not the default (for a bunch of reasons). In order to perform preprocessing (not modelling) on data sets of arbitrary size, you will have to use a combination of the following (a rough sketch of the row-by-row idea follows the list):
- a database as data input
- the Stream Database operator, either configured for your database, or the default one together with an appropriately configured database
- the option "create view" for all preprocessing operators where possible
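For what it's worth, the general idea behind such a streamed pass over a database table can be sketched in plain JDBC. This is not RapidMiner's actual Stream Database implementation, and the connection URL, table, and column names are made up for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details - replace with your own database.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
        // Some drivers (e.g. PostgreSQL) only honor the fetch size when
        // auto-commit is disabled.
        conn.setAutoCommit(false);

        // Forward-only, read-only cursor with a small fetch size, so the driver
        // pulls rows in batches instead of loading the whole table into memory.
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(1000);

        ResultSet rs = stmt.executeQuery("SELECT field1, field2 FROM big_table");
        long rows = 0;
        while (rs.next()) {
            // Process one row at a time; only the current batch is held in memory.
            double value = rs.getDouble("field1");
            // ... per-row preprocessing would go here ...
            rows++;
        }
        System.out.println("Scanned " + rows + " rows");

        rs.close();
        stmt.close();
        conn.close();
    }
}
```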
Setting up processes that make use of this streaming approach is the point where people usually have to rely on our Enterprise Support, since designing such processes is no longer a trivial task. But it is definitely possible; we ourselves have recently successfully transformed far more than 100 million records with RapidMiner - without a significant memory footprint. This is of course mainly useful for preprocessing and more traditional BI results; there is no point in building a predictive model on a data set of this size, simply due to running time restrictions.

By the way: on a 64-bit system it should indeed be possible to use more memory than is physically available and let the OS and Java handle the temp-file approach, similar to what you described for Clementine. It's probably sufficient to adapt the amount of memory in one of our start scripts and start RapidMiner with that script. But calculations will become ridiculously slow then, and I would recommend designing better processes and keeping control of what is happening instead of using this shotgun approach.
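As a side note, and just plain Java rather than anything RapidMiner-specific: after adjusting the heap setting in the start script, a quick way to check how much memory the JVM was actually granted is:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Maximum heap the JVM will try to use (the -Xmx setting), in megabytes.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxMb + " MB");
    }
}
```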
We are able to do this. You just haven't found the right buttons yet.
Cheers,
Ingo
Indeed, it would seem I still need to explore RapidMiner better; the truth is I only tried the conversion I mentioned before and got the error.
Thanks a lot!
Just let me add: RapidMiner itself is free, but our support and expert knowledge are not. And making things more scalable without losing too much performance / accuracy is definitely part of this expert knowledge, as I'm sure everybody understands.
Cheers,
Ingo
Description from the APPEND operator:
"This operator merges two or more given example sets by adding all examples in one example table containing all data rows. Please note that the new example table is built in memory and this operator might therefore not be applicable for merging huge data set tables from a database. In that case other preprocessing tools should be used which aggregates, joins, and merges tables into one table which is then used by RapidMiner."
What other free, data-miner-friendly preprocessing tools would you recommend?
Thanks again.
Anything that builds these joins directly in the database. I'm always using RapidMiner myself, so I don't know of a specific one to recommend.
Greetings,
Sebastian
Thanks!
PS: Ingo, I haven't been contacted by sales yet.
Well, for the append step itself you have two options within RapidMiner: use streamed data access and write the result out, or write the data into a database and append it there via SQL execution. This is basically the same thing an open-source ETL tool would do, and since the same is possible directly within RapidMiner, you would not really have to change tools.
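For the second option, the append can happen entirely inside the database, so nothing has to fit into the client's memory. A minimal JDBC sketch, assuming two hypothetical tables with identical schemas:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AppendInDatabase {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details - replace with your own database.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
        Statement stmt = conn.createStatement();

        // Append all rows of table_b to table_a inside the database;
        // the database does the work and the client never loads the rows.
        int appended = stmt.executeUpdate(
                "INSERT INTO table_a SELECT * FROM table_b");
        System.out.println("Appended " + appended + " rows");

        stmt.close();
        conn.close();
    }
}
```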
However, you could also try Talend as an ETL tool for this. They are a partner company of Rapid-I, and maybe you prefer their solution for this ETL step over performing it within RapidMiner.
Cheers,
Ingo
I am testing such a process, but it failed due to lack of Java heap space:
1. Stream database, the table has about 5 million rows and 42 columns
2. Select Attributes, select a subset of the fields
3. Set Role, set one attribute as label
4. Linear Regression
Then I ran the process. After 10 minutes, it failed due to lack of Java heap space:
Jun 25, 2010 3:15:53 PM SEVERE: Process failed: Java heap space
Jun 25, 2010 3:15:53 PM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
+- Stream Database[1] (Stream Database)
+- Select Attributes[1] (Select Attributes)
+- Set Role[1] (Set Role)
==> +- Linear Regression[1] (Linear Regression)
The problem with this setup is that Linear Regression has to copy all the data into a numerical matrix in order to invert it. This numerical matrix must be stored in main memory, and that is what causes the memory problem.
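To get a feel for the numbers: just holding 5 million rows with 42 numeric attributes as doubles already takes roughly 1.7 GB, before any working copies are made for the matrix computations. A back-of-the-envelope check:

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long rows = 5000000L;      // example set size from the process above
        long columns = 42;         // attributes before selection
        long bytesPerDouble = 8;   // one double value per cell

        long bytes = rows * columns * bytesPerDouble;
        System.out.printf("Data matrix alone: %.2f GB%n", bytes / 1e9);
        // Prints roughly 1.68 GB - and the regression still needs additional
        // working copies on top of that, so the default heap runs out.
    }
}
```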
For large data sets I would suggest using linear scanning algorithms like Naive Bayes or the Perceptron.
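To illustrate why such single-pass learners are memory-friendly: an online perceptron only ever keeps its weight vector in memory and updates it one example at a time, so the example set itself never has to be materialized. A rough sketch of the idea (not RapidMiner's implementation; the data source is left abstract):

```java
import java.util.Iterator;

public class OnlinePerceptron {
    /** One labeled example: feature values plus a label of +1 or -1. */
    public static final class Example {
        final double[] features;
        final int label;
        public Example(double[] features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    private final double[] weights;   // the only state kept in memory
    private double bias = 0.0;
    private final double learningRate;

    public OnlinePerceptron(int numAttributes, double learningRate) {
        this.weights = new double[numAttributes];
        this.learningRate = learningRate;
    }

    /** One pass over the data; each example is seen once and then discarded. */
    public void train(Iterator<Example> examples) {
        while (examples.hasNext()) {
            Example e = examples.next();
            if (e.label * score(e.features) <= 0) {   // misclassified: update
                for (int i = 0; i < weights.length; i++) {
                    weights[i] += learningRate * e.label * e.features[i];
                }
                bias += learningRate * e.label;
            }
        }
    }

    private double score(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    public int predict(double[] x) {
        return score(x) >= 0 ? 1 : -1;
    }
}
```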
Greetings,
Sebastian
I looked into the RapidMiner code and found that few modeling operators support streaming data processing; it seems to be mainly intended for data preprocessing, right?
Well, I think that's correct, but how exactly is your criterion of "supporting streaming data processing" defined?
Greetings,
Sebastian