Text Processing - Cut Document - Similar entries separated by a number

exmenace · September 2021

I was having trouble finding the operator documentation that pertains to string matching or cutting documents in general.

I have a few different types of documents (.xml, .csv, .docx, .html) that list records, in order, separated by *Record (n)* in ascending numbers, starting with 1.

Each of these records has similar attributes, but the content is unformatted apart from the asterisks (*) that separate the records and attributes.

My hope was to cut the document by record, which I assumed I could do with a string matching query. I'm not sure how to do that, though, since every record is different and the only thing they have in common is the record number. That number varies, so I don't know how to write the expression.

kayman · September 2021

Is each of your records on its own line, like this:
Record 1*something*something else*and again something else
Record 2*something*something else*and again something else
Record 3*something*something else*and again something else

Or is everything on one line, more like this:

Record 1*something*something else*and again something else*Record 2*something*something else*and again something else*Record 3*something*something else*and again something else

In the first case you could simply use the Read CSV operator with * as the separator. Beware that * is a special character that needs to be escaped, so to use it correctly you need to enter \* instead of just *.

You could also use the Split operator; the same applies there. Use \* to make clear you want to split on the literal asterisk.
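To illustrate why the escape matters, here is a minimal sketch in plain Python rather than RapidMiner. It assumes the separator is interpreted as a regular expression, where an unescaped * is a quantifier and not a literal character. The sample line and variable names are only for illustration.

```python
import re

# One record per line, fields separated by asterisks (sample line for illustration)
line = "Record 1*something*something else*and again something else"

# In a regular expression a bare "*" is a quantifier, so re.split("*", line)
# fails with "nothing to repeat". Escaping it as \* matches a literal asterisk.
fields = re.split(r"\*", line)

print(fields)
```

A plain string split such as line.split("*") needs no escaping in Python, because it does not treat the separator as a regular expression.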

If everything is on one line, I recommend using the Split Document into Collection operator from the Toolbox extension.
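As a rough illustration of the one-line case, the variable record number can be matched with \d+ (one or more digits). The sketch below is plain Python, not the Toolbox operator itself, and it assumes the records look exactly like the one-line sample above. A similar pattern might work as the split expression in RapidMiner, but that is untested here.

```python
import re

# Everything on one line, as in the second sample (text is illustrative)
text = (
    "Record 1*something*something else*and again something else*"
    "Record 2*something*something else*and again something else*"
    "Record 3*something*something else*and again something else"
)

# Split on an asterisk only when it is directly followed by "Record <number>".
# The lookahead (?=...) keeps the "Record n" label at the start of each piece.
records = re.split(r"\*(?=Record \d+)", text)

# Each record can then be split into its fields on the remaining asterisks.
rows = [record.split("*") for record in records]

for row in rows:
    print(row)
```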

I've attached some samples to play around with, hope they get you started.

exmenace · September 2021

Thanks for the recommendations, I will check them out. I found a way to rewrite the CSV files so that the info isn't in separate rows. In the CSV and Excel files, all the info is in one column, with each category on a different row, something like this:

*Record 1:*

*Title:*

The Blah blah of blah

*Author:*

M. Blah

*Keywords:*

Blah, blah, blah

And so on, including a row with a lot of text for the abstract. My other program rewrites the data so that all the info for each record is in one cell, and then I have a process that will separate that into a more manageable table (hopefully).

I will check out your solution and compare the two, because I will still have to analyze the data after all this. Any other input is appreciated.

kayman · September 2021

Ah, I see. All in one row isn't a real problem, it's just a bit more complex. The split document operator should still do the trick: look for \*Record, put a special string in front of it, and use that string to split. That gives you a small document for every record, which you can split again on line breaks, or transpose to turn the rows into attributes.
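The marker idea can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under assumptions: the marker string @@SPLIT@@, the helper to_row, and the second record are made up, and labels are assumed to look like *Title:* on their own line.

```python
import re

# Sample input in the row-per-category layout; the second record is invented.
doc = """*Record 1:*
*Title:*
The Blah blah of blah
*Author:*
M. Blah
*Keywords:*
Blah, blah, blah
*Record 2:*
*Title:*
Another title
*Author:*
A. Nother
*Keywords:*
foo, bar
"""

MARKER = "@@SPLIT@@"  # hypothetical marker, any string absent from the data works

# Step 1: put the marker in front of every "*Record <number>:*" header.
marked = re.sub(r"\*Record \d+:\*", lambda m: MARKER + m.group(0), doc)

# Step 2: split on the marker to get one small chunk per record.
chunks = [c for c in marked.split(MARKER) if c.strip()]

def to_row(chunk):
    """Turn one chunk into a dict: label lines become keys, other lines values."""
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    row = {"Record": lines[0].strip("*:")}  # "*Record 1:*" -> "Record 1"
    key = None
    for ln in lines[1:]:
        label = re.fullmatch(r"\*(.+?):\*", ln)
        if label:
            key = label.group(1)
            row[key] = ""
        elif key is not None:
            # Text spread over several lines (e.g. an abstract) is joined.
            row[key] = (row[key] + " " + ln).strip()
    return row

# Step 3: one row per record, labels as attributes (the "transpose" step).
table = [to_row(c) for c in chunks]
for row in table:
    print(row)
```

Because values are collected until the next label line, this does not depend on every record having the same number of rows.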

Is the number of rows the same for every record? In your example, for instance, that would be 7 rows, then the next 7 for a new record, and so on.

If so, you could also use loop logic and filter out 7 rows at a time, starting at every 7th entry, using mod logic on the row number. That sounds far more complex than needed, btw :-)
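For what it's worth, the mod logic is short when written out in Python. This sketch assumes every record really has exactly 7 rows, as in the example; the constant ROWS_PER_RECORD and the second record's values are made up for illustration.

```python
ROWS_PER_RECORD = 7  # assumption: every record has exactly this many rows

# All rows of the single column, in order (second record is invented)
rows = [
    "*Record 1:*", "*Title:*", "The Blah blah of blah",
    "*Author:*", "M. Blah", "*Keywords:*", "Blah, blah, blah",
    "*Record 2:*", "*Title:*", "Another title",
    "*Author:*", "A. Nother", "*Keywords:*", "foo, bar",
]

records = []
for i, row in enumerate(rows):
    # A row index that is a multiple of 7 marks the start of a new record.
    if i % ROWS_PER_RECORD == 0:
        records.append([])
    records[-1].append(row)

for record in records:
    print(record)
```

If a single record has a missing or extra row, every later group shifts, so splitting on the Record header is the more robust option.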

exmenace · September 2021

I've definitely overcomplicated the whole thing. Thanks again

Altair RapidMiner Community