Loop through DataSet with given Header

Artur_Heinle22 · December 2020

Hello everyone,

i have the following Problem. I have an unsorted dataset where i pulled the headers out of this data set. Now I want to build a new data set with the Data and the header. I have the Data Set in Rapidminer i only use Excel to make an example.

Data Set

Image: https://us.v-cdn.net/6030995/uploads/editor/f7/dm9d31iig1ed.jpg

Header

Image: https://us.v-cdn.net/6030995/uploads/editor/8o/bj3hyp9vbd3v.jpg

Result

Image: https://us.v-cdn.net/6030995/uploads/editor/va/my76i4jlkbf3.jpg

The data should then be sorted as in the result in a new Data Set.

Look for the Header in the Data Set. If you find one of the header in the row then pick the next value and put in in the new Data Set under the right header. Else ignore the value. If the Header is an specific header like Sonstiges or Index, took every value until the next header or end of the row an put every value until then in one value und safe it in the Result Data Set. The Header in the raw Data Set is not always in the order as in the Result Data Set.

Can you do that with Rapidminer Operators or do you have to build something yourself with Execute Sript?

Best regards,

Artur

honoxy12 · December 2020

dsd

honoxy12 · December 2020

Now, drag the Foreach loop container and join it with executing the SQL task created earlier. It runs the tasks specified inside the loop for each ...

Artur_Heinle22 · January 2021

I believe you mean with dsd = Document Structure Description. But i don't find an operator in RapidMiner for that.

And what did you mean SQL task created earlier?

kayman · January 2021

You could use multiple filter by value operators. Not the most efficient way but it get's the job done. Just cascade through your records attribute by attribute.

First filter by Article, then by Date, then by type and finally by color. Now you can use the aggregate operator to concatenate your Others and your indexes so you get single record, which you can then append all together filter by filter.

So filter article values
- - > filter Date values (for this article)
- -> filter Type values (for this date and article)
- - > filter by color value (for this... You get it)
- - > concat
And loop till end

Artur_Heinle22 · January 2021

Good idea but how should I filter by date etc. if the order is not correct and the data is spread over several columns and not all of one type among each other? I tried to move the data line by line within Rapidminer, but i find that not so easy like in Excel.

And how can i filter after more then one value und can put them together like i need that for the Sonstiges and Index column?

kayman · January 2021

Ah, I see... Didn't notice the spread...
Any idea if it's like a single character causing the 'jump' in your data? Is it the same in your excel? If not probably something went wrong with the conversion. Otherwise trying to get rid of this 'jumper' on import might make it a lot easier for you.

It seems like you have missing (the question marks) attributes that are caused by an unexpected tab or so, and empty fields that just indicate there is no data for the field but are expected.

If so you could first loop through all the records and remove all the attributes that are missing, if you then append these again your dataset might be lined out again.

Anyway, the order isn't a problem at all as rm will use grouping and sorting behind the scenes when filtering by value. The spread might be more problematic. From Att6 onwards it seems things start to go wrong.

You could also do some tricks by using the loop attributes operator in combi with loop examples, first you loop and validate the content, if it's empty or missing you ignore, if it contains Type you know the next attribute should be your Type value. You store these in macros, do the same for your other known fields and then you generate new attributes with your macro values and just remove the rest. This way you could also construct your table again as it should have been.

Bit hard to explain, but maybe you can figure it out with this. If not you may always share your file so I can have a play with it also. Seems like a nice challenge 😉. But first try and learn yourself of course

Artur_Heinle22 · January 2021

I filtered the data form word files where the word file has tables in it with the informations i need. This is where these distances com from. I also managed to streighten these distances again. The only problem is that i have about 900 ros over up to 3400 columns. And the Colums are irregular filled with infomrations. The information is also nested in the Dataset and not alwas in the right order. I need to find this nested blocks of information.

I can use the product type to form smaller groups where each group does not necessarily have 3400 columns, but i thought that there is problably a possiblity to write somthing only with Rapidminer operators that automatically /semi-automatically gibes me the desired data sets.

The header i get when i count the values and look at then in descending order. The one who looks like an header and not a number like 1 which will be counted quiet often, i ignore. If i have the header list i check it against the word files if it is realy an header.

I'm so desperate that I'm doing it with Python within Rapidminer using iterators and check the value if it is the header and than get me the next value und put it below the header in my data set. If it is one of the "special" header like sonstiges or id then take me the next values until the next header or end of the list.

I'll try to create a data set for testing.

kayman · January 2021

Yeah, the pain of datacleaning...

Unfortunately there are no magical tools that turn garbage into beautiful data automatically so you are kind of forced to use the loop by example and then for each example by attribute to get rid of the jumps and reorganize the order.

Any way to just export your baseset as an excel that you could share?

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Loop through DataSet with given Header

Answers