Should I use a database or not?
siamak_want
Member Posts: 98 Contributor II
Hi RM forum,
I am working on a relatively huge text data set (CSV file) with about 300,000 records and about 40,000 columns. I read the data with a "CSV Reader" and then process it with classification and clustering operators. The problem is that, as far as I know, the "CSV Reader" operator first loads the entire CSV file into main memory. The CSV file is several gigabytes and takes up more than half of my main memory, so many algorithms starve from an insufficient amount of memory.
As a workaround, I converted my data set to the fantastic "RM sparse format". I don't know how the "sparse reader" delivers the data, but I hoped it could alleviate the problem of insufficient main memory. Nevertheless, the problem still exists.
Now, in order to handle the problem of insufficient main memory, I have a new idea: using a database (like PostgreSQL) :)! I think that a database can fetch an appropriate number of examples into main memory and deliver them to the other operators (e.g. classification operators). Please tell me whether I am right about this. I need to know the advantages and disadvantages of using a database versus using plain files.
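Just to illustrate what I mean by fetching a proper number of examples: in plain Python with psycopg2 it would look roughly like the sketch below (the connection details and the table name "documents" are made up, and this is only the general idea, not how RapidMiner works internally).

```python
import psycopg2

def process(batch):
    # Placeholder: hand the chunk to whatever learning/scoring step comes next.
    pass

# Connection details and the table name "documents" are made up for this example.
conn = psycopg2.connect(dbname="textdata", user="rm", host="localhost")

# A named (server-side) cursor streams rows instead of pulling the whole table at once.
cur = conn.cursor(name="batched_read")
cur.execute("SELECT * FROM documents")

while True:
    batch = cur.fetchmany(10000)   # only ~10,000 examples are held in memory at a time
    if not batch:
        break
    process(batch)

cur.close()
conn.close()
```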
Any help would be appreciated.
Answers
Btw, the sparse data format only works well if you have a lot of zeros in your data - otherwise it will use even more memory than the standard data format.
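The reason is that a sparse representation stores an index next to every non-zero value, so if most entries are non-zero it actually needs more space than the plain dense representation. A quick way to see the effect (using scipy here just for illustration, nothing RapidMiner-specific):

```python
import numpy as np
from scipy import sparse

dense = np.random.rand(1000, 1000)      # practically no zeros
mostly_zero = dense * (dense > 0.99)    # roughly 99% zeros

for name, mat in [("dense data", dense), ("mostly-zero data", mostly_zero)]:
    csr = sparse.csr_matrix(mat)
    csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
    print(name, "- plain:", mat.nbytes, "bytes, sparse:", csr_bytes, "bytes")
```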
Best, Marius
Thanks for your straightforward answer (as always). So I will definitely use a database, as you suggested. But I think there is an important issue with this:
A DBMS (for example PostgreSQL) can only handle tables with fewer than 2,000 columns. I am working on a text data set, so I have to deal with about 80,000 features, which does not fit into a regular database table. Do you know of any workaround?
Another thing I am curious about: what should I do when the table is so big that it does not fit into main memory at all? Can the DBMS or RM load the data incrementally?
Thanks again Marius.
For big tables you can of course read only parts of the table from the DB and work on that subset (which is the usual way of dealing with large datasets: learn on a subset which fits into memory, then apply the model to the large dataset incrementally).
For reading data from the DB incrementally there is the Read Database (Incrementally) operator in RapidMiner, but I have to admit that I have not used it so far. Usually I rather load the subset of data I need manually with the normal Read Database operator.
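For example, pulling a random subset that fits into memory boils down to a single query; here is a small sketch in Python/pandas (the table name and sample size are made up, and inside RapidMiner you would simply put such a query into the Read Database operator):

```python
import pandas as pd
import psycopg2

# Hypothetical table "documents"; 20,000 rows is just an example size that fits into memory.
conn = psycopg2.connect(dbname="textdata", user="rm", host="localhost")
sample = pd.read_sql_query(
    "SELECT * FROM documents ORDER BY random() LIMIT 20000",  # simple, though not the fastest, way to sample
    conn,
)
conn.close()
```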
Best, Marius
I think you have touched on an interesting point... would you please explain a bit more? Do you mean I should use a sampling method to select some examples manually and then apply the resulting model to the whole data set? And what about clustering?
Thanks in advance.
- create a representative sample of your data which fits into main memory
- create the cluster model and store it in the repository
- create a process which loads the model, iteratively reads parts of your data, applies the model, and writes the result back to the database (a rough sketch of this loop follows below)
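Just to make the last step concrete, the loop could look roughly like the following sketch in Python (all names - the table "documents", the result column "cluster", the stored model file - are placeholders; in RapidMiner you would build the same thing from operators such as Read Database, Apply Model and Write Database):

```python
import joblib
import psycopg2

# Placeholder names throughout; adjust table, columns and model file to your setup.
model = joblib.load("cluster_model.pkl")      # the model built on the representative sample

conn = psycopg2.connect(dbname="textdata", user="rm", host="localhost")
read_cur = conn.cursor(name="apply_stream")   # server-side cursor: streams rows in chunks
read_cur.execute("SELECT * FROM documents")   # first column is assumed to be the row id
write_cur = conn.cursor()

while True:
    rows = read_cur.fetchmany(10000)          # keep only one chunk in memory
    if not rows:
        break
    ids = [r[0] for r in rows]
    X = [r[1:] for r in rows]                 # assumes the remaining columns are numeric features
    labels = model.predict(X)                 # apply the stored cluster model to the chunk
    write_cur.executemany(
        "UPDATE documents SET cluster = %s WHERE id = %s",
        [(int(label), row_id) for label, row_id in zip(labels, ids)],
    )

read_cur.close()
conn.commit()                                 # one commit at the end keeps the named cursor valid
write_cur.close()
conn.close()
```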
Hope this helps,
Marius