Extracting Entities with Rosette in RapidMiner Studio
Check out our Rosette Text Toolkit extension for RapidMiner and plug Rosette text analytics directly into your RapidMiner workflows. More info here: https://www.rosette.com/
Get up and running with Rosette for RapidMiner Studio with this quick start guide, which covers the installation and setup process. We also demonstrate how to get started extracting and linking entities with Rosette.
Installing RapidMiner and Rosette
If you aren’t already running RapidMiner Studio, download the application from RapidMiner’s website. To download the Rosette Text Toolkit extension, open RapidMiner Studio, navigate to the Extensions menu, and select Marketplace.
A new window will open. Search for “rosette” and select Rosette Text Toolkit from the list of results. Click the Install 1 Packages button at the bottom of the window and follow the click-through instructions to complete the installation.
Once the extension has finished installing, the Rosette operators will be visible in the Extensions folder of the Operators panel.
Getting a Rosette API Key
In order to activate the Rosette Text Toolkit for RapidMiner Studio, you’ll need an API key and a Rosette developer account. Head over to developer.rosette.com and complete the signup process.
You can create an account linked to either your email or your GitHub account. No credit card is required — our default plan gives you 10,000 calls a day for free! If you’re interested in upping your call quota, check out our paid plans.
Once you have completed the signup process and verified your account, click on the API Key tab on the top left of the menu bar to display your key.
Setting up your Rosette API Connection
Back in RapidMiner Studio, input your Rosette API key to start using any of Rosette’s operators. We’ll be looking at the entity extraction operator in the next section, so we’ll use it to set up the Rosette API connection now.
First, locate Extract Entities in the Rosette Text Toolkit folder in the Operators panel and drag it to the Process panel.
You can see the various settings for the Extract Entities operator in the Parameters panel to the right of the Process panel. The first parameter is Connection. Click the Rosette icon to the right of the box.
The Manage Connections window will open. Click the Add Connection button on the bottom left and select Rosette Connection from the Connection type dropdown list. Name your new connection and click the Create button.
Select your new Rosette API connection from the list on the left and enter your Rosette API key in the API KEY box. Use the Test button at the bottom of the window to verify that your connection is working. If you run into any trouble, confirm that you have copied your API key correctly. When you are satisfied that everything is running smoothly, click the Save all changes button to return to the Parameters panel.
Select your new connection from the Connection dropdown list.
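If you would like to sanity-check your key outside RapidMiner Studio, the connection test above can be approximated with a short script. This is only a sketch: the base URL, the /ping path, and the X-RosetteAPI-Key header name are assumptions based on Rosette's public REST API and should be checked against the current documentation, and build_ping_request is a hypothetical helper name. The code only builds the request and does not send it.

```python
import urllib.request

# Assumed base URL for the Rosette REST API; confirm against the current docs.
ROSETTE_BASE_URL = "https://api.rosette.com/rest/v1"


def build_ping_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request to the assumed /ping endpoint."""
    # Stray whitespace from copy and paste is a common cause of key errors.
    cleaned = api_key.strip()
    if not cleaned:
        raise ValueError("API key is empty")
    return urllib.request.Request(
        f"{ROSETTE_BASE_URL}/ping",
        headers={
            "X-RosetteAPI-Key": cleaned,  # assumed header name
            "Accept": "application/json",
        },
        method="GET",
    )
```

Passing the returned object to urllib.request.urlopen would send it; that step needs network access and a valid key, so it is left out here.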
Extracting Entities
Now that you’ve installed the Rosette for RapidMiner extension and set up your API key and connection, you’re almost ready to start analyzing. Last step: download RapidMiner’s Text Processing extension from the RapidMiner Marketplace. It is a helpful set of operators that allow you to load, filter, and analyze text from a variety of different sources.
With that installed, head back to RapidMiner Studio, where we’ll use three operators to create a simple entity extraction workflow, or process: Create Document and Documents to Data from Text Processing, and Extract Entities from Rosette. You can find the operators using the Operators Search Bar. Drag them into the Process panel and connect them together, maintaining the order listed above.
Select the Create Document operator. In the Parameters panel, check the add label box. Under label type, select text and enter ‘my_text’ for label value. Click the Edit Text button at the top of the panel and copy the text below into the popup window.
“Bill Murray will appear in new Ghostbusters film: Dr. Peter Venkman was spotted filming a cameo in Boston this… http://dlvr.it/BnsFfS.”
Hit the Apply Changes button to save your work.
Now select the Documents to Data operator. In the Parameters panel, enter ‘my_text’ in the text attribute field.
Execute the process using the blue “play” button. The results show five extracted entities. As you can see, Rosette correctly extracted both the names and the location included in the text.
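For readers curious about what such a process might send to Rosette, here is a rough sketch of an equivalent REST request built in Python. The endpoint path, the header name, and the "content" field are assumptions taken from Rosette's public REST API rather than from the RapidMiner extension itself, and build_entities_request is a hypothetical helper. The request is constructed but not sent, so no output is shown.

```python
import json
import urllib.request

# Assumed endpoint for entity extraction; confirm against the current docs.
ENTITIES_URL = "https://api.rosette.com/rest/v1/entities"


def build_entities_request(api_key: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the text to analyze."""
    body = json.dumps({"content": text}).encode("utf-8")  # assumed field name
    return urllib.request.Request(
        ENTITIES_URL,
        data=body,
        headers={
            "X-RosetteAPI-Key": api_key,  # assumed header name
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


# The sample text used in the Create Document operator above.
sample_text = (
    "Bill Murray will appear in new Ghostbusters film: Dr. Peter Venkman "
    "was spotted filming a cameo in Boston this… http://dlvr.it/BnsFfS."
)
request = build_entities_request("YOUR_API_KEY", sample_text)
```

Sending the request with urllib.request.urlopen requires a real key in place of the YOUR_API_KEY placeholder.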
Let’s make our input text a little longer. Add the sentence below to the parameter text and rerun the process.
“Another original Ghostbuster, Dan Akroyd, is also confirmed to have a cameo in the film.”
From the results we can see that Rosette extracts Dan Akroyd’s name as expected. However, eagle-eyed readers may have noticed that “Akroyd” is misspelled. (It should be “Aykroyd.”) This is not uncommon. Name misspellings appear frequently, everywhere from personal blogs to the New York Times online. If you are trying to track a particular entity across a large collection of documents, you want to make sure that you are identifying all possible spellings of that entity’s name. Rosette automatically extracts and links entities with spelling variations and other textual anomalies, unifying them into a single entry.
To demonstrate this functionality, let’s enable Link Entities in the Parameters panel for the Extract Entities operator.
Then, we’ll add a third line to the parameter text that includes the correct spelling of Dan Aykroyd’s name, like the one below:
“Actually, the correct spelling is Aykroyd.”
When we run the process again, a new QID column appears in the results. Notice that “Dan Akroyd” and “Aykroyd” have the same QID value — Rosette has correctly identified them as the same entity.
QID values are drawn from Wikidata, so if an entity has a Wikidata entry, Rosette should be able to link and resolve it.
QIDs are very useful for machine-reading purposes, but for humans they can be difficult to keep track of. Let’s turn on the Include Entity Name parameter, which will allow us to see the entity names in addition to their QIDs.
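Once the results include a QID column, spelling variants can be collapsed downstream by grouping on that column. The sketch below illustrates the idea in plain Python. The row layout, the "mention" and "qid" field names, and the placeholder QID strings are all hypothetical; real results would carry actual Wikidata identifiers, and the column names in your export may differ.

```python
from collections import defaultdict


def group_mentions_by_qid(rows):
    """Collect the mentions that share a QID; rows without a QID are skipped."""
    groups = defaultdict(list)
    for row in rows:
        qid = row.get("qid")
        if qid:
            groups[qid].append(row["mention"])
    return dict(groups)


# Hypothetical rows with placeholder QIDs, for illustration only.
rows = [
    {"mention": "Bill Murray", "qid": "Q-PLACEHOLDER-1"},
    {"mention": "Dan Akroyd", "qid": "Q-PLACEHOLDER-2"},
    {"mention": "Aykroyd", "qid": "Q-PLACEHOLDER-2"},
    {"mention": "Unlinked Example", "qid": None},
]
linked = group_mentions_by_qid(rows)
```

Grouping on the identifier rather than the surface text is what lets a misspelling such as “Akroyd” end up in the same bucket as the correct spelling.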
Try it Yourself
Now that you’ve got the Rosette Text Toolkit up and running with RapidMiner Studio, you are well equipped to handle a host of text analytics tasks. Incorporate results like the ones above into your pre-existing data processes, and check out our other operators, including Categorization, Sentiment Analysis, Morphological Analysis, Tokenization, Sentence Tagging, Name Translation, and Name Matching.
While you’re at it, keep us posted! We love to hear what our users are working on, and would be thrilled to share your Rosette for RapidMiner story on our blog and here in the RapidMiner Community.
Comments
Hello, I can't find Create Document or Documents to Data in my operators. I'm using version 7.3.
Help!!!
Hi,
Do you have the Text Mining extension installed? Go to Extensions > Marketplace, and search for Text Mining. Then install it.
Hello,
Would someone please guide me through this task?
Question: precision for k=3 and k=5 in k-fold cross-validation on a data set using an ID3 decision tree
Thanks a lot
Hi pedramahmadi,
You may have more luck with your question elsewhere. It's not related to the entity extraction process described above.
Best of luck,
Hannah from Rosette
Hello, I am trying to import an Excel file into RapidMiner. However, this Excel file has mixed data formats. For instance, a given column may contain some cells that are just numerical values, while other cells are plain text. When I import it, I get the error "cannot get numeric value from a text file". How should I solve this problem?
Hi amenaakhterchy
You may have better luck with this question elsewhere, as it doesn't pertain to the entity extraction guide. I just did a quick search and it looks like there are some helpful responses to this very similar question about mixed data formats in Excel files.
Best of luck,
Hannah from Rosette
Hello, I'm trying Rapidminer and Rosette for the first time and following this tutorial I'm already stuck at the first line.
I get: Could not create meta attributes
I followed exactly the same steps and registered the API key correctly. Can you help me?
Hi fabio_pertel
I believe the issue you were running into may have been caused by a bug related to our recent release of Rosette API 1.7, which our RapidMiner extension depends on. We just released a fix this afternoon. Can you try again and see if you are able to get results? If not, please email us at support@rosette.com.
Many thanks,
Hannah
Hi,
I'm just getting started with RM for text analytics. Everything has gone well working with structured data but I'm struggling with analysing text documents. Could you please provide a quick summary of how to extract entities from a PDF or Word Doc?
I've searched these forums and Google and the only solution that seems to work is converting the file into a txt file first. Which isn't ideal. Any help would be super appreciated.
Hi Ty,
Thanks for your question! Rosette works with raw text files, but RapidMiner makes it easy to prepare your text for processing if it's not in .txt format.
We recommend using RapidMiner's "Text Processing" extension. Just use the "Read Document" operator (which takes PDF as input) followed by the "Documents to Data" operator.
All the best,
Hannah