best input data format for large data sets?
Hi,
I wanted to ask what's the recommended import format for large datasets?
My dataset has the following specs:
- 36,000 samples in total, split into 5 groups of 7,200 samples each
- the timestamp serves as the id; the label is an integer
- a theoretical maximum of 1,200,000 integer attributes (for now a subset of about 5,000 has been chosen, but more would be better)
Currently I am using an "import" process which does:
- CSV import (one CSV file for 7200 samples)
- define roles
- some normalization
- "write binary"
The binary files are re-read in the classification process, because that is faster than parsing all the CSVs every time. My problem is that if I increase the number of attributes in the CSV, the "import" process eats up all the memory (7 GB) and dies. I also experimented with the "Free Memory" operator, but it didn't help.
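For a sense of scale, here is a rough back-of-envelope estimate (my own assumption: about 8 bytes per value in a dense in-memory table; the real per-cell overhead is typically higher):

```python
# Back-of-envelope memory estimate for a dense in-memory table.
# Assumes 8 bytes per numeric value; real overhead is usually larger.
samples = 36_000
bytes_per_value = 8

for attributes in (5_000, 1_200_000):
    gib = samples * attributes * bytes_per_value / 2**30
    print(f"{attributes:>9} attributes -> ~{gib:,.1f} GiB dense")

# Roughly 1.3 GiB for 5,000 attributes, but over 300 GiB for 1,200,000
# attributes, which is why a dense import exhausts a 7 GB heap as the
# attribute count grows.
```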
My question now is: is there a better format than CSV for large datasets that can still be processed directly at decent speed, so that I could perhaps drop this import step? What would you recommend?
Thanks,
Harald
Answers
If your data is sparse (many zero values and significantly fewer non-zero attribute values), you may want to try the sparse file and data formats. They store only the non-zero values and are therefore the preferred representation for sparse data sets such as large text collections.
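As an illustration of the general idea (a minimal Python sketch, not a RapidMiner process; the file names and the column layout of id first and label last are assumptions, and you should check the documentation of your sparse reader for the exact line format it expects), a dense CSV can be converted into an SVMlight-style "index:value" file that keeps only the non-zero cells:

```python
import csv

# Convert a dense CSV (hypothetical layout: id, att1..attN, label) into an
# SVMlight-style sparse file: one line per example, "label index:value ...",
# keeping only the non-zero attribute values.
def dense_csv_to_sparse(csv_path, sparse_path):
    with open(csv_path, newline="") as src, open(sparse_path, "w") as dst:
        reader = csv.reader(src)
        next(reader)                      # skip the header row, if present
        for row in reader:
            label = row[-1]
            values = row[1:-1]            # drop id (first) and label (last)
            pairs = [
                f"{i + 1}:{v}"            # 1-based attribute indices
                for i, v in enumerate(values)
                if float(v) != 0.0
            ]
            dst.write(label + " " + " ".join(pairs) + "\n")

# Usage (hypothetical file names):
# dense_csv_to_sparse("group1.csv", "group1.sparse")
```

The resulting file grows with the number of non-zero values rather than with the total attribute count, which is what makes the sparse representation manageable in memory.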
Best regards,
Ralf
I managed it with the Read AML Operator and sparse storage. Thanks!
Greetings, Harald
Version 5 is much faster, so download version 5.