Should I use a database or not?
siamak_want
Member Posts: 98 Contributor II
Hi RM forum,
I am working on a relatively huge text data set (CSV file) with about 300,000 records and about 40,000 columns. I read the data with a "CSV Reader" and then process it with classification and clustering operators. The problem is that, as far as I know, the "CSV Reader" operator first loads the entire CSV file into main memory. The CSV file is several gigabytes and takes up more than half of my main memory, so many algorithms starve from an insufficient amount of memory.
As a workaround, I converted my data set to the fantastic "RM sparse format". I don't know how the "sparse reader" delivers the data, but I hoped it could alleviate the problem of insufficient main memory. Nevertheless, the problem still exists.
Now, in order to handle the problem of insufficient main memory, I have a new idea: using a database (like PostgreSQL) :)! I think that a database can fetch an appropriate number of examples into main memory and deliver them to the other operators (e.g. classification operators). Please tell me whether I am right about this. I need to know the advantages and disadvantages of using a database versus using plain files.
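Just to illustrate what I mean by fetching a proper number of examples: in plain Python with psycopg2 it would look roughly like the sketch below (the connection details and the table name "documents" are made up, and this is only the general idea, not how RapidMiner works internally).

```python
import psycopg2

def process(batch):
    # Placeholder: hand the chunk to whatever learning/scoring step comes next.
    pass

# Connection details and the table name "documents" are made up for this example.
conn = psycopg2.connect(dbname="textdata", user="rm", host="localhost")

# A named (server-side) cursor streams rows instead of pulling the whole table at once.
cur = conn.cursor(name="batched_read")
cur.execute("SELECT * FROM documents")

while True:
    batch = cur.fetchmany(10000)   # only ~10,000 examples are held in memory at a time
    if not batch:
        break
    process(batch)

cur.close()
conn.close()
```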
Any help would be appreciated.
Answers
Btw, the sparse data format only works well if you have a lot of zeros in your data - otherwise it will use even more memory than the standard data format.
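The reason is that a sparse representation stores an index next to every non-zero value, so if most entries are non-zero it actually needs more space than the plain dense representation. A quick way to see the effect (using scipy here just for illustration, nothing RapidMiner-specific):

```python
import numpy as np
from scipy import sparse

dense = np.random.rand(1000, 1000)      # practically no zeros
mostly_zero = dense * (dense > 0.99)    # roughly 99% zeros

for name, mat in [("dense data", dense), ("mostly-zero data", mostly_zero)]:
    csr = sparse.csr_matrix(mat)
    csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
    print(name, "- plain:", mat.nbytes, "bytes, sparse:", csr_bytes, "bytes")
```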
Best, Marius
Thanks for your straightforward answer (as always). So I will definitely use a database, as you suggested. But I think there is an important issue with this:
A DBMS (for example PostgreSQL) can only handle tables with fewer than 2,000 columns. I am working on a text data set, so I have to deal with about 80,000 features, which does not fit into a regular database table. Do you know of any workaround?
Another thing I am curious about: what should I do when the table is so big that it does not fit into main memory at all? Can the DBMS or RM load the data incrementally?
Thanks again Marius.
For big tables you can of course read only parts of the table from the DB and work on that subset (which is the usual way of dealing with large datasets: learn on a subset which fits into memory, then apply the model to the large dataset incrementally).
For reading data from the DB incrementally there is the Read Database (Incrementally) operator in RapidMiner, but I have to admit that I have not used it so far. Usually I rather load the subset of data I need manually with the normal Read Database operator.
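For example, pulling a random subset that fits into memory boils down to a single query; here is a small sketch in Python/pandas (the table name and sample size are made up, and inside RapidMiner you would simply put such a query into the Read Database operator):

```python
import pandas as pd
import psycopg2

# Hypothetical table "documents"; 20,000 rows is just an example size that fits into memory.
conn = psycopg2.connect(dbname="textdata", user="rm", host="localhost")
sample = pd.read_sql_query(
    "SELECT * FROM documents ORDER BY random() LIMIT 20000",  # simple, though not the fastest, way to sample
    conn,
)
conn.close()
```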
Best, Marius
I think you have touched on an interesting point... would you please explain a bit more? Do you mean I should use a sampling method to select some examples manually and then apply the resulting model to the whole data set? And what about clustering?
Thanks in advance.
- create a representative sample of your data which fits into main memory
- create the cluster model and store it in the repository
- create a process which loads the model, iteratively reads parts of your data, applies the model, and writes the result back to the database (a rough sketch of this loop follows below)
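Just to make the last step concrete, the loop could look roughly like the following sketch in Python (all names - the table "documents", the result column "cluster", the stored model file - are placeholders; in RapidMiner you would build the same thing from operators such as Read Database, Apply Model and Write Database):

```python
import joblib
import psycopg2

# Placeholder names throughout; adjust table, columns and model file to your setup.
model = joblib.load("cluster_model.pkl")      # the model built on the representative sample

conn = psycopg2.connect(dbname="textdata", user="rm", host="localhost")
read_cur = conn.cursor(name="apply_stream")   # server-side cursor: streams rows in chunks
read_cur.execute("SELECT * FROM documents")   # first column is assumed to be the row id
write_cur = conn.cursor()

while True:
    rows = read_cur.fetchmany(10000)          # keep only one chunk in memory
    if not rows:
        break
    ids = [r[0] for r in rows]
    X = [r[1:] for r in rows]                 # assumes the remaining columns are numeric features
    labels = model.predict(X)                 # apply the stored cluster model to the chunk
    write_cur.executemany(
        "UPDATE documents SET cluster = %s WHERE id = %s",
        [(int(label), row_id) for label, row_id in zip(labels, ids)],
    )

read_cur.close()
conn.commit()                                 # one commit at the end keeps the named cursor valid
write_cur.close()
conn.close()
```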
Hope this helps,
Marius