How can I do text mining to extract a number and a word from a document and put both into a dataset?

GuiGui Member Posts: 10 Contributor II
How can I use text mining to extract a number and a word from a document and put both into a dataset, each as its own attribute?

The idea is to take a document (similar to a "bill of sale"), read it, and process it so that I end up with a simple ExampleSet such as:

Access key | Product code | Product name
     xxxxx     |      yyyyy        |      XPTO


Do you have any ideas, or is there a solution in another topic that I haven't found? It would help a lot.

Thanks. Best,

G.

Answers

  • GuiGui Member Posts: 10 Contributor II
    Hi Kayman, thanks for your time and help. I am really struggling with this problem.

    I would really appreciate your help with the regex. I am sharing a process together with the document that I need to handle as described.
    As for the patterns, I have three kinds of documents: PDFs (with one pattern), scanned documents (images on which I need to do the same thing: read, identify, split into an ExampleSet, and so on, with a different pattern), and a second kind of scanned document. I will need to build a separate process for each kind because the patterns differ.

    Attached are the process and a Notepad file with the XML.

    Thanks again.



  • kayman Member Posts: 662 Unicorn
    Getting the access key is not a real issue, but as I'm not familiar with the rest of your structure it's hard for me to know what is needed and what patterns it can have.

    It looks as if you start with a PDF that you convert to a text file, so it might be better to start with the PDF table extractor extension (available on the Marketplace: https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_pdf_table_extraction).

    This may reduce the complexity a lot, as you seem to have quite a few columns originally. Combining a few techniques may then work out better.

    Attached is an example that extracts the Access Key and stores it as a new attribute.
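
For readers without access to the attachment, the same idea (pull the access key out with a regular expression and repeat it on every product row) can be sketched outside RapidMiner. This is only an illustration in Python, not the attached process. The document layout is not shown in this thread, so the sample text, the "Access key:" label, the two-column product lines, and the names `SAMPLE`, `ACCESS_KEY_RE`, `PRODUCT_RE` and `extract_rows` are all assumptions that would need adjusting to the real pattern.

```python
import re

# Hypothetical sample text: the real document layout is not shown in the thread.
SAMPLE = """\
Access key: 12345678901234567890
Code   Description
00017  XPTO
00042  WIDGET A
"""

# Assumption: the key follows a literal "Access key:" label and is a run of digits.
ACCESS_KEY_RE = re.compile(r"Access key:\s*(\d+)")

# Assumption: a product line is a numeric code, spaces or tabs, then the name.
PRODUCT_RE = re.compile(r"^(\d+)[ \t]+(\S.*?)[ \t]*$", re.MULTILINE)


def extract_rows(text):
    """Return one dict per product line, each carrying the document's access key."""
    key_match = ACCESS_KEY_RE.search(text)
    access_key = key_match.group(1) if key_match else None
    return [
        {"access_key": access_key, "product_code": code, "product_name": name}
        for code, name in PRODUCT_RE.findall(text)
    ]


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

Each returned dict corresponds to one row of the desired ExampleSet (access key, product code, product name). Scanned documents would first need an OCR step, and each document kind would likely need its own pair of patterns.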


  • GuiGui Member Posts: 10 Contributor II
    Kayman,

    Can I send you a private message? Then I could share an image of the document's structure with you. If you have time for it, of course, that would be wonderful. Let me know if this is feasible.
