How to extract a specific part (section) from a large text (txt format)?
Enthusiast21
Member Posts: 6 Learner I
in Help
Dear RM Friends,
I have 500 txt files containing large reports, and I need to extract only one section from each of them. The reports all differ slightly; the only common pattern I can recognise is that the section's headline always starts with the same three words, but the rest of the headline varies and the following section is not the same either. My question is how, in general, I can extract part of a large text in RapidMiner. (I think I need to use regular expressions, but so far I could not find anything suitable for my task.)
Thank you very much for your support in advance!
Best Answer
kayman Member Posts: 662 Unicorn
Hi @Enthusiast21, as discussed, find attached an alternative approach to your problem: first split by page (double-sided), then filter on the pages containing your term (REPORT ON THE ANNUAL), and then use a looser way to figure out what is left-page and what is right-page content. It seems to work relatively well this way, and maybe you can take it further from there.
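For readers without the attachment, the first two steps (split by page, keep the pages that contain the term) can be sketched outside RapidMiner in Python. This is only an illustration and not the attached process: the function name `pages_containing` is hypothetical, and the assumption that pages are separated by a form feed character (`\f`) may not hold for your files, so adjust `page_break` as needed.

```python
def pages_containing(text, term="REPORT ON THE ANNUAL", page_break="\f"):
    """Return the pages of `text` that mention `term` (case-insensitive).

    Assumption: pages are separated by `page_break` (a form feed here).
    """
    pages = text.split(page_break)
    needle = term.lower()
    return [page for page in pages if needle in page.lower()]


if __name__ == "__main__":
    sample = "cover page\fREPORT ON THE ANNUAL accounts\fnotes"
    for page in pages_containing(sample):
        print(page)
```

The left/right page separation would then be applied only to the pages returned here.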
Answers
Are your sections bounded by line breaks, or does your next section start with something that resembles a pattern?
So the idea is to first split the content into left and right pages, and then get the section?
Splitting the page in two is something you can achieve by splitting on string length: the first 70 characters of a line belong to the first page, and characters 70 to 140 belong to the second page. Splitting and then merging gives you both pages in one flow.
A quick-and-dirty approach can be found in the attachment.
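As a rough illustration of the split-on-length idea (not the attached process), here is a minimal Python sketch. It assumes the two pages sit side by side on each text line and that 70 characters is the right column width for your files; the names `split_columns` and `merge_pages` are made up for this example.

```python
from typing import Tuple


def split_columns(page_text: str, width: int = 70) -> Tuple[str, str]:
    """Cut every line at `width`: the first part is the left page, the next part the right page."""
    left, right = [], []
    for line in page_text.splitlines():
        left.append(line[:width].rstrip())
        right.append(line[width:2 * width].rstrip())
    return "\n".join(left), "\n".join(right)


def merge_pages(page_text: str, width: int = 70) -> str:
    """Put the left page first and the right page after it, as one continuous text."""
    left, right = split_columns(page_text, width)
    return left + "\n" + right


if __name__ == "__main__":
    sample = "left one".ljust(70) + "right one\n" + "left two".ljust(70) + "right two"
    print(merge_pages(sample))
```

A fixed width only works when the layout is regular; if the gutter between the pages moves around, a looser rule (for example, splitting on a long run of spaces) may be needed.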
About the pattern: I have the beginning, which is "Independent Auditor's Report", but I don't know about the end, as it's a date. How do I avoid taking everything that ends somewhere with a date? What other types of pattern can I look for besides words?
Thank you so much for the support!
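One possible way to handle the date question, sketched in Python rather than as a RapidMiner process: match lazily from the headline up to the first line that consists only of a date, optionally preceded by a place name and a comma. Dates inside running sentences are then skipped. The date formats ("12 March 2020", "March 12, 2020"), the English month names, and the function name `extract_section` are all assumptions made for this example and would need adapting to the real reports.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Assumed date formats: "12 March 2020" or "March 12, 2020"
DATE = rf"(?:\d{{1,2}}[ \t]+(?:{MONTHS})|(?:{MONTHS})[ \t]+\d{{1,2}},?)[ \t]+\d{{4}}"


def extract_section(text, heading="Independent Auditor's Report"):
    """Return the text from the headline to the first date-only line, or None."""
    pattern = re.compile(
        # Headline: the fixed words, then anything up to the end of that line
        rf"^[ \t]*{re.escape(heading)}[^\n]*\n"
        # Section body, kept as short as possible (lazy match across lines)
        rf".*?"
        # Closing line: optional "Place, " followed by a date and nothing else
        rf"^[ \t]*(?:[A-Za-z .]+,[ \t]*)?{DATE}\.?[ \t]*$",
        re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    sample = (
        "Some intro\n"
        "Independent Auditor's Report to the Members of X\n"
        "We have audited the statements for the year ended 31 December 2019.\n"
        "London, 12 March 2020\n"
        "Next Section\n"
    )
    print(extract_section(sample))
```

Besides words, the end of a section can also be anchored on layout cues such as a blank line, a line in capitals, or a line of a certain length, depending on what the reports have in common.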
Then try again on your data after changing the encoding of the Decode URLs operator to UTF-8; this could also solve some encoding problems with your original text.