Dealing with really large text / CSV files
I have to deal with a huge tab-separated file containing thousands of lines but also thousands of columns, and the latter is bringing my system to its knees. It takes ages to load even a small number of records, presumably because the system wants to validate all of the content for all of these attributes.
So my idea was to load it as a plain text file, then chop it into rows using a regex matching the end of line, and then chop each row into attributes on the tab character, no questions asked.
Part 1 was easy, and indeed my file now loads in seconds rather than many minutes. But I'm not sure how to actually convert the rest of my data into columns using the cut command, as this would require me to name all of my columns.
Is there a better way to achieve this, perhaps using different logic / operators?
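To illustrate what I'm after, outside of RapidMiner the logic would be something like this (just a rough Python sketch; the file name is a placeholder):

```python
import re

# Placeholder path; the real file is a huge tab-separated export
path = "huge_file.tsv"

# Load the whole thing as plain text, no parsing or validation
with open(path, encoding="utf-8") as f:
    raw = f.read()

# Chop into rows on the end-of-line regex, then into attributes on tabs
rows = [line.split("\t") for line in re.split(r"\r?\n", raw) if line]

header, records = rows[0], rows[1:]
print(len(header), "columns,", len(records), "records")
```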
Best Answer
jczogalla (Employee-RapidMiner, RM Engineering)
Hi @kayman,
you could potentially do something like the following:
1) Split the document with Cut Document into a collection of rows
1a) Use the Replace Tokens operator with "(.*?)" and $1 to unwrap string values if necessary
2) Use Documents to Data operator to get an example set
3) Select only the text attribute
4) Use the Split operator with \t as delimiter and an ordered split to create generically named attributes
5) Rename columns with Rename by Example Values (row 1)
This should do exactly what you want (I hope). The last step even deletes the name row from the example set.
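For anyone who wants to sanity check the logic outside the tool, the same sequence boils down to something like this (a rough Python/pandas sketch, not RapidMiner itself; the file path and generic attribute names are just placeholders):

```python
import re
import pandas as pd

# 1) Load the file as plain text lines, one string per row, no validation
with open("huge_file.tsv", encoding="utf-8") as f:   # placeholder path
    lines = [line.rstrip("\n") for line in f]

# 1a) Unwrap quoted values, mirroring the Replace Tokens step ("(.*?)" -> $1)
lines = [re.sub(r'"(.*?)"', r"\1", line) for line in lines]

# 2)-4) Split each row on tabs and give the columns generic names
df = pd.DataFrame([line.split("\t") for line in lines])
df.columns = [f"att_{i + 1}" for i in range(df.shape[1])]

# 5) Rename the columns from the values in the first row, then drop that row
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)

print(df.shape)
```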
I hope this helps!
Jan
Answers
Brilliant, I overlooked the Split operator, but indeed that worked perfectly!