Very large dataset file size despite only 2,000 examples
Hi, I am a newbie so apologies in advance if I'm missing something obvious.
I am working on a binary classifier using a large synthetic credit card fraud dataset, which I have split and sampled into a training and a testing dataset, both with balanced classes, 1,000 of each. However, something seems to be off somewhere along the line. The full dataset of 6.3M examples occupies 538MB, yet my training and test datasets take up 95.3MB when they should be only a tiny fraction of that size. They also behave like 100MB files, taking ages to open, and the training dataset caused Auto Model to crash. Can somebody tell me where I am going wrong, please? TIA, Ray.
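For context, here is a rough pandas sketch of the balanced sampling step described above (it assumes the PaySim "isFraud" label column; the file name is a placeholder and this is not the actual RapidMiner process):

```python
import pandas as pd

# Hedged sketch: draw a balanced 1,000-per-class sample from the full
# PaySim data. Assumes an "isFraud" label column; "paysim.csv" is a
# placeholder file name, not taken from the original post.
df = pd.read_csv("paysim.csv")

balanced = df.groupby("isFraud").sample(n=1000, random_state=42)

# 2,000 rows out of 6.3M should occupy only a tiny fraction of the
# 538MB source file, which is exactly the expectation raised above.
print(balanced.shape)
```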
Best Answer
MartinLiebig (Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor; Posts: 3,533; RM Data Scientist)

Hi @Ray_C,

let me explain what @kayman and I are checking for: if you store nominal data, the actual table does not contain any strings. Instead it contains integers, and RapidMiner maintains a mapping HashMap (aka a dictionary in Python) to map the integers back to their respective strings. This is very efficient for most data, where you only have a few distinct strings.

There are two cases where this can create big file sizes.

Duplication of data: if you have columns with unique strings in them, we actually store the values twice, once as an integer in the table and once in the mapping. This causes unnecessarily big file sizes.

Not cleaned-up tables: if you remove all but one string from a column using a Filter Examples operator (or your sample), the mapping is not cleaned up. The table still "knows" what other strings may exist (and there is some good reasoning for this). So even though you filtered rows away, you didn't really filter their values out of the mapping, and the file keeps the same size.

Remove Unused Values cleans up the mapping table and is thus the usual answer for these issues.

Can you check whether you have nominal columns with a very high number of different strings in them? That would be the first issue.

Best,
Martin

- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
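To make the mapping mechanism concrete, here is a minimal Python/pandas sketch (purely illustrative, not RapidMiner's internal implementation): a categorical column stores integer codes plus a code-to-string mapping, and sampling rows does not shrink that mapping until the unused values are removed.

```python
import pandas as pd

# A column with a million unique strings, stored as a pandas Categorical:
# the table holds integer codes, plus a mapping of code -> string.
ids = pd.Categorical([f"TX{i:07d}" for i in range(1_000_000)])
df = pd.DataFrame({"transaction_id": ids})

# Keep only a tiny sample -- roughly analogous to Split Data / Sample.
sample = df.sample(n=2000, random_state=42)

# The sample still carries the full mapping of one million labels...
print(len(sample["transaction_id"].cat.categories))   # 1000000
print(sample.memory_usage(deep=True).sum())           # still large

# ...until unused values are dropped, the analogue of Remove Unused Values.
cleaned = sample["transaction_id"].cat.remove_unused_categories()
print(len(cleaned.cat.categories))                    # 2000
```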
Answers
I have added that operator before the Store operators, but it appears just to have moved the bottleneck from Auto Model's (non-)handling of the pseudo-100MB dataset back to the data prep process itself, where, as I type and for the past five minutes, one of the Remove Unused Values operators remains in progress and, I suspect, may never complete.
I think maybe I could do with carrying out some research on the handling of very large datasets.
Just in: RM has crashed with an OOM exception. I've got a Core i7 with 16GB of RAM, so I need to change the methodology for sure.
Try adding one on both split outputs, as the 'hidden' information will otherwise travel through. Also try ticking the 'include special attributes' option: since you use a role, the remove option might have limited impact if these are unique identifiers, because all your entries will then be special.
I am not sure what's going on, to be honest, specifically what the Remove Unused Values operator is doing or is supposed to do. This is a synthetic dataset with no missing values etc., so what would it be removing after the split?
Also, I find it unusual that there is no out-of-the-box answer for this issue (no disrespect intended). I assume that many, many people have worked on this dataset before (kaggle.com/ntnu-testimon/paysim1), and I am sure many will have split the data into test and training sets using the Split Data operator within RM.
Yet I can't seem to find any references online to anybody else experiencing this kind of issue. I am not trying to achieve anything that could be described as complex; I am barely off first base, with the only operation being the assignment of a label, which is required in order to obtain a balanced dataset. I just don't understand why the split datasets do not appear to be amenable to the sampling process.
I keep asking myself whether there is something fundamentally wrong with my approach, but the responses to date (much appreciated) do not suggest that there is.
This is helpful when you have nominal attributes with many different unique values, some of which might occur frequently enough to be useful, but most of which occur infrequently and are thus not useful. It would allow you to keep the largest ones and remap all the other values into a generic "Other" category much more easily than the normal Map operator (which would require you to list them all out individually).
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
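As an illustration of the grouping described above, here is a pandas sketch under assumed column names (not the RapidMiner Map operator itself):

```python
import pandas as pd

def group_rare_values(series: pd.Series, top_n: int = 10,
                      other_label: str = "Other") -> pd.Series:
    """Keep the top_n most frequent values; remap everything else to other_label."""
    keep = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(keep), other_label)

# Usage with a hypothetical 'merchant' column:
df = pd.DataFrame({"merchant": ["A", "B", "A", "C", "D", "A", "B", "E"]})
df["merchant_grouped"] = group_rare_values(df["merchant"], top_n=2)
print(df["merchant_grouped"].value_counts())
```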