
Import a Word document into RapidMiner

BrilliantData Member Posts: 1 Learner III
edited December 2018 in Help

On a recent client project I needed to apply some common Natural Language Processing (NLP) techniques to surveys the client had gathered. One requirement was that the source documents had to remain in Word's .docx format and couldn't be exported to .txt. RapidMiner was the tool of choice for this engagement because it is graphical and has a very usable library for text analysis, but it doesn't have an operator that specifically imports .docx files.

 

A Word .docx file is essentially a zip archive containing an XML representation of the actual document. It stands to reason that if you can unzip the wrapper and get to the XML inside, you have a good chance of being able to read the document and run whatever analysis you need. RapidMiner has an operator for executing custom Python scripts (once you install the Python extension), so I chose to start there and see if it could handle those tasks.
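
A minimal sketch of that idea, using only Python's standard zipfile module. It assumes a standard .docx layout in which the body text is stored in word/document.xml; the function names are hypothetical and chosen just for illustration.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of the files stored inside a .docx archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_document_xml(path):
    """Return the raw XML bytes of the main document body."""
    with zipfile.ZipFile(path) as archive:
        # Assumption: a standard .docx keeps its body in word/document.xml.
        return archive.read("word/document.xml")
```

Calling list_docx_parts on a real .docx is a quick way to confirm that the file really is a zip archive before trying to parse anything.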

Using Python in RapidMiner

First we'll need to download the Python extension, which you can do by going to Extensions-->Marketplace in the menu at the top of the page. It's one of the most popular downloads, so just go to "Top Downloads," select it from the list, and click "Install Packages" at the bottom of the window. You'll need to restart RapidMiner afterwards for the extension's operators to become available.

 

image

 

To use a custom Python script, search for the "Execute Python" operator and drag it onto the workflow. Double-click and you'll see the usual parameter editing box on the top right of the screen, which should contain a button labeled "Edit Text." This is where we'll enter the code.

 

image

The Code

I try not to reinvent the wheel when coding, so I Googled the problem to see if someone had tackled it before me and someone definitely had. The code I used is below:

 

image

 

If you want to download it straight from Etienne's blog, just follow this link:

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
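
For readers who cannot see the screenshot, here is a minimal sketch of this kind of extraction using only the standard library. It is an approximation in the spirit of the linked post, not necessarily the exact script used on the project. The helper convert_docx_to_txt, the constants DOCX_PATH and TXT_PATH, and their placeholder values are assumptions for illustration. rm_main is the entry-point function that the Execute Python operator calls; this sketch assumes no input or output ports are connected.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the tags inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PARA = W_NS + "p"   # paragraph element
TEXT = W_NS + "t"   # text run element

# Hypothetical placeholder paths; replace them with your own locations.
DOCX_PATH = "C:/path/to/survey.docx"
TXT_PATH = "C:/path/to/survey.txt"


def get_docx_text(path):
    """Return the text of a .docx file, one blank line between paragraphs."""
    with zipfile.ZipFile(path) as archive:
        xml_content = archive.read("word/document.xml")
    tree = ET.fromstring(xml_content)

    paragraphs = []
    for paragraph in tree.iter(PARA):
        # Join the text runs that make up this paragraph.
        texts = [node.text for node in paragraph.iter(TEXT) if node.text]
        if texts:
            paragraphs.append("".join(texts))
    return "\n\n".join(paragraphs)


def convert_docx_to_txt(docx_path, txt_path):
    """Write the extracted text to a plain-text file for a later operator to read."""
    with open(txt_path, "w", encoding="utf-8") as handle:
        handle.write(get_docx_text(docx_path))


def rm_main():
    # Assumption: the operator is used only for its side effect of writing
    # the text file, so nothing is returned to RapidMiner.
    convert_docx_to_txt(DOCX_PATH, TXT_PATH)
```

Writing the text to a file keeps the original .docx untouched while still giving the rest of the process a plain-text file to work with.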

The initial workflow looked like this:

 

image

 

After using Etienne's code to unwrap the .docx file, it was easily readable by the "Read Document" operator. After that I transformed all words to lowercase, tokenized them, removed stop words, then converted the resulting word list to data and loaded it into a database for analysis. Simple.
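
Outside RapidMiner, the same preprocessing steps can be approximated in a few lines of Python. This is only a sketch of the idea: the stop-word list is a tiny made-up sample, the function name is hypothetical, and RapidMiner's own operators may tokenize and filter differently.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "was"}


def word_counts(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, drop stop words, count the rest."""
    lowered = text.lower()
    # Treat every run of letters as one token; digits and punctuation split tokens.
    tokens = re.findall(r"[^\W\d_]+", lowered)
    kept = [token for token in tokens if token not in stop_words]
    return Counter(kept)
```

The resulting counter maps each remaining word to its frequency, which is the kind of word list that can then be loaded into a database table.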

Answers

  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @BrilliantData - welcome to the community and thanks for sharing this! It's actually similar to another thread from last December about xlsx files (see https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Extract-Sheet-name-from-an-Excel-file/m-p/44747).

     

    Scott

     

     

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Wonderful solution to a common problem! If you would be willing to post an anonymized version of the process, I am sure there are many community members who would be grateful!

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • orsan_awawdi Member Posts: 3 Learner III

    This is brilliant.

    I can't find the Read Document operator. Any idea? I'm using RapidMiner Studio 8.1.

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Did you install the free text mining extension? All the document operators are in that extension and not in the base version of Studio. Just search for Text Processing on the Marketplace and it will come up.

     

  • orsan_awawdi Member Posts: 3 Learner III

    Yes, you are right, it is right there.

    For some reason, it is failing with an indentation error. I don't know why.

     

      File "<ipython-input-28-405e2fcdbb20>", line 21
        document = zipfile.ZipFile('C:/Users/orsana/Desktop/MMO.docx')
        ^
    IndentationError: unindent does not match any outer indentation level

     


  • orsan_awawdi Member Posts: 3 Learner III

    I think I know what is wrong here. I will fix it.

  • blake_galbreath Member Posts: 4 Contributor I
    This is a great article, but I still can't quite figure out how to actually get the Word document into the RM repository so that I can feed it into the process described above. I tried using the Import Data module, but it only seems to allow Binary, Excel, and CSV. Where do I go to import docx files?
  • rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    edited April 2020
    I got it as a Building Block.

    You just use the operator Open File to pass the Word Document, and then insert the building block here.

    Before pasting the building block into your system, remove the .txt extension I had to add.

    Usage:


  • blake_galbreath Member Posts: 4 Contributor I
    @rfuentealba, I believe this will work.