How to extract a specific part (section) from a large text (txt format)?

Enthusiast21 Member Posts: 6 Learner I
Dear RM Friends,

I have 500 txt files containing large reports, and I need to extract only one section from each report. As the reports all differ slightly, the only common pattern I can recognise is that the section's headline always starts with the same three words; the rest of the headline varies, and the following section is not the same either. My question is how, in general, I can extract part of a large text in RapidMiner. (I think I need to use regular expressions, but so far I could not find anything suitable for my task.)

Thank you very much for your support in advance! :smile:

Answers

  • kayman Member Posts: 662 Unicorn
    Regular expressions are indeed probably what you need. You already know where to start, so it's about where to end. You don't need to limit yourself to words. Whitespace can also be a good candidate.

    Are your sections bound by line breaks, or does your next section start with something that resembles a pattern?
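
As a rough illustration of the "known start, whitespace as end" idea, here is a minimal Python sketch (outside RapidMiner) that takes a headline beginning with fixed words and stops at the first blank line. The headline words and the sample text are hypothetical, and the blank-line boundary is an assumption about the files, not something confirmed in this thread.

```python
import re
from typing import Optional

# Hypothetical first three words shared by every target headline.
HEADING_START = "Summary of Findings"

# Match the headline line, then lazily take everything up to the
# first blank line (or the end of the text if there is none).
SECTION_RE = re.compile(
    r"^" + re.escape(HEADING_START) + r"[^\n]*\n(.*?)(?=\n[ \t]*\n|\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract_section(text: str) -> Optional[str]:
    """Return the headline plus its body, or None if no headline is found."""
    match = SECTION_RE.search(text)
    return match.group(0).strip() if match else None


# Made-up sample text, only to show the behaviour.
sample = (
    "Previous section text.\n"
    "\n"
    "Summary of Findings for 2018\n"
    "First line of the section.\n"
    "Second line of the section.\n"
    "\n"
    "Next Section With Another Title\n"
    "Unrelated text.\n"
)
section = extract_section(sample)
print(section)
```

The lazy `.*?` together with the lookahead is what keeps the match from running on into the next section.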
  • Enthusiast21 Member Posts: 6 Learner I
    Attached is part of one report containing two sections of the kind I need to extract (Independent Auditor's Report), which is another issue: some reports contain two parts I need to extract. In the attached file I also copied the end of the previous section and the beginning of the next one. The next section is always different across the reports, so I can't find a pattern. Each section I need ends with a date, which unfortunately is common to them but not unique, as there are other dates elsewhere in the report.
  • kayman Member Posts: 662 Unicorn
    edited December 2019
    Nice challenge :-)
    So the idea is to first split the content into left and right pages, and then get the section?

    Splitting the page in two is something you can achieve by splitting on string length: basically, the first 70 characters of each line belong to the first page, and characters 70 to 140 belong to the second page. Splitting and then merging gives you both pages in one flow.

    A quick and dirty approach can be found in the attachment.
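
The attached process is not reproduced here. As a separate, minimal Python sketch of the same split-and-merge idea, the function below cuts every line at a fixed width and appends the right-hand halves after the left-hand ones. The function name is made up for illustration, and the fixed 70-character width is taken from the post above and assumed to hold for every line.

```python
def split_columns(page_text: str, width: int = 70) -> str:
    """Turn a two-column page into one flow: left column first, then right."""
    left, right = [], []
    for line in page_text.splitlines():
        # Characters 0..width-1 belong to the left page,
        # characters width..2*width-1 to the right page.
        left.append(line[:width].rstrip())
        right.append(line[width:2 * width].rstrip())
    return "\n".join(left + right)


# Tiny made-up example with a width of 10 instead of 70.
page = "\n".join([
    "left one".ljust(10) + "right one",
    "left two".ljust(10) + "right two",
])
merged = split_columns(page, width=10)
print(merged)
```

A fixed cut only works if the columns are padded with spaces to a constant width; tab characters or proportional layouts would need a different approach.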
  • Enthusiast21 Member Posts: 6 Learner I
    Thank you for the solution to the first part of my problem. I'm sorry for the question, but as I am relatively new, may I ask where I should enter the XML code you sent me? I tried the XML panel, but after that I don't know how to make the process appear and then run it in RapidMiner.

    About the pattern: I have the beginning, which is Independent Auditor's Report, but I don't know about the end. It's a date, but how do I avoid taking everything that ends somewhere with a date? What other types of pattern can I look for besides words?

    Thank you so much for the support!
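
One common way to avoid running on to a later date is a non-greedy match: start at the headline and stop at the first date that follows it. Below is a minimal Python sketch of that idea. The date format ("March 15, 2019"), the place names and the sample text are assumptions made up for illustration, since the real reports are not shown in this thread.

```python
import re

HEADING = "Independent Auditor's Report"

# Assumed date format, e.g. "March 15, 2019"; adjust to the real reports.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE = r"(?:" + MONTHS + r")\s+\d{1,2},\s+\d{4}"

# Non-greedy ".*?" stops at the FIRST date after each headline.
SECTION_RE = re.compile(re.escape(HEADING) + r".*?" + DATE, re.DOTALL)


def extract_reports(text: str) -> list:
    """Return every headline-to-first-date span found in the text."""
    return [m.group(0) for m in SECTION_RE.finditer(text)]


# Made-up sample with two target sections and unrelated dates around them.
sample = (
    "Notes as of December 31, 2018\n"
    "Independent Auditor's Report to the Members\n"
    "We have audited the statements.\n"
    "London, March 15, 2019\n"
    "Directors' statement dated April 2, 2019\n"
    "Independent Auditor's Report on Internal Control\n"
    "Controls were effective.\n"
    "London, March 16, 2019\n"
    "Appendix\n"
)
reports = extract_reports(sample)
for report in reports:
    print(report)
    print("---")
```

Dates before the headline or after the closing date are ignored, and a report with two such sections yields two matches. The sketch would cut a section short if a date also appears inside its body, so in that case the end pattern needs something more specific, such as a date alone on a line.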
  • kayman Member Posts: 662 Unicorn
    Views -> XML -> paste the code, then click the green tick before saving.
  • Enthusiast21 Member Posts: 6 Learner I
    What could I do to remove the error?
  • kayman Member Posts: 662 Unicorn
    Install the Toolbox extension from the Marketplace. Alternatively, you can replace that operator with the common Append operator.
  • Enthusiast21 Member Posts: 6 Learner I
    Thank you! I did it, but now I have a new problem. Could you help me with it too?
  • kayman Member Posts: 662 Unicorn
    Hmm, there might be more issues with your original file. Could you first verify that it works with the 'for the forum' txt file you provided? That way we can be sure we are working under the same conditions.
    Then try again on your data after changing the encoding of the Decode URLs operator to UTF-8; this could also solve some encoding problems with your original text.


  • Enthusiast21 Member Posts: 6 Learner I
    With the file 'for the forum' it works perfectly. I don't understand why the original one doesn't, as I only copied part of its text into the new txt file that I uploaded here. I tried converting it to UTF-8 with an online tool, but the resulting file didn't give any better results. Is there another way to decode the file?
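
One way to check the encoding outside RapidMiner is to read the raw bytes and try a few candidate encodings in turn. The following is a minimal Python sketch of that idea. The function name, the candidate list (UTF-8, then Windows-1252) and the demo file are assumptions for illustration; nothing in this thread confirms which encoding the original reports use.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Decode a file with the first candidate encoding that works.

    Returns (text, encoding_used). Falls back to latin-1, which accepts
    any byte sequence but may show wrong characters.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"


# Demo: a file saved as Windows-1252 with a curly apostrophe (byte 0x92),
# which is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "report.txt"
    demo_file.write_bytes(b"Auditor\x92s Report")
    text, used = read_text_with_fallback(demo_file)

print(used)
print(text)
```

Once the working encoding is known, the text can be written back out with `Path.write_text(text, encoding="utf-8")` so that downstream tools see plain UTF-8.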
  • kayman Member Posts: 662 Unicorn
    Would you mind sharing the full text? You can send it by PM if that is OK for you.