data ETL for multiple text line file
I have a data extract from a system that is definitely not data-mining friendly. Essentially the data is coming in a csv containing text blocks (paragraphs) where each block is an example and each line is an attribute, and the same relative line in each block is the same attribute, like this:
line 1
line 2
line 3
line 4
line 5
line 6
...
where line 1 and line 4 are the same attribute, line 2 & line 5, line 3 & line 6, etc. A sample file is attached--the format is accurate but the data have been anonymized. Each line may contain spaces or other special characters. A blank line separates each block from the others. The blocks all have the same number of lines so at least there are no missings to worry about.
I'd like to turn this into a typical exampleset where each block is an example with the relevant attributes, but I'm struggling to do the necessary conversions. Any ideas or help from the community? Thanks in advance!
Best Answer
-
BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
Hi,
I don't have a complete solution but I've done similar stuff before.
You might want to read the file with Read CSV making sure that the entire lines end up in the result (e. g. separator = $, no header line).
Then add an ID with Generate ID.
Calculate the modulo so you have the same number for the same attribute.
Pivot by the calculated modulo attribute.
Regards,
Balázs
1
Answers
Thanks @BalazsBarany. Very clever! I was able to create a solution by using Generate Attribute for both a modulus attribute to define the attribute and a floor attribute to define the index, and then using those to pivot. Quite simple actually.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts