Delete hyphen (special chars) before tokenization
I would like to delete all hyphens from the text documents that I analyze in RapidMiner. I use the "Process Documents from Files" operator to analyze large PDF files. Each file contains a lot of hyphens, which I would like to delete before I tokenize the text (on non-letters). I've used the "Replace Tokens" operator. With it I can replace hyphens with other symbols, but I cannot replace them with nothing, i.e. an empty string (""). I've also tried to use my own customized stopword dictionary (non-letters, -), but that operator does not work at all. I've saved my dictionary, containing the characters and words I want to delete, as a text file (each entry on a new line). Can anybody help with this issue?
Best Answer
land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
Hi,
the problem with this operator is that it does not permit an empty value in the replacement list. It will automatically discard such an entry.
To circumvent this, you need to enter something that is actually text but evaluates to an empty string. The easiest way is to make use of regular expressions and their capturing groups. The idea is simply to define an empty capturing group and replace the match with this empty group. If you don't know about regular expressions, I would recommend reading a tutorial; they are really powerful and useful in many situations.
So in your case, instead of replacing "-" with "", you replace "()-" with "$1". The parentheses define a capturing group. As there is nothing inside them, the group is empty. You address a capturing group in the replacement term with $1.
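To see the empty-group trick outside RapidMiner, here is a minimal Java sketch. It assumes that Replace Tokens follows the same regular expression rules as java.util.regex, which the thread does not confirm; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class EmptyGroupDemo {
    // "()" is a capturing group that matches nothing, so group 1 is always "".
    // The pattern as a whole therefore matches a single hyphen.
    private static final Pattern HYPHEN = Pattern.compile("()-");

    // Hypothetical helper: replaces every hyphen with the (empty) content of group 1.
    static String removeHyphens(String text) {
        return HYPHEN.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Same sample text as in the Create Document operator of the example process.
        System.out.println(removeHyphens("This is a - hyphen-word"));
    }
}
```

Note that only the hyphen itself is removed: the spaces on both sides of the standalone hyphen stay, so "a - hyphen" ends up with two spaces in a row, while "hyphen-word" becomes "hyphenword".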
Here's an example that makes it work. Simply copy the XML and paste it into RapidMiner.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.1.001">
<operator activated="true" class="text:create_document" compatibility="7.1.001" expanded="true" height="68" name="Create Document" width="90" x="447" y="136">
<parameter key="text" value="This is a - hyphen-word"/>
<parameter key="add label" value="false"/>
<parameter key="label_type" value="nominal"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.1.001">
<operator activated="true" class="text:replace_tokens" compatibility="7.1.001" expanded="true" height="68" name="Replace Tokens" width="90" x="581" y="136">
<list key="replace_dictionary">
<parameter key="()-" value="$1"/>
</list>
</operator>
</process>
Greetings,
Sebastian
Answers
In Replace Tokens, try entering the replacement as a backslash followed by an actual space, i.e. "\ " without the double quotes. I have not tried that myself, but I remember doing something like that in the past.
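For what it's worth, this is how a backslash-plus-space replacement behaves in plain Java. This is only a sketch under the assumption that RapidMiner hands the replacement string to java.util.regex unchanged; whether the operator accepts such an entry is untested here, and the class and method names are invented for the example.

```java
public class BackslashSpaceDemo {
    // Hypothetical helper: replaces each hyphen using the replacement "\ ".
    static String replaceHyphens(String text) {
        // The Java literal "\\ " is two characters: a backslash and a space.
        // In a regex replacement string, a backslash escapes the next character,
        // so the hyphen is replaced by a single literal space.
        return text.replaceAll("-", "\\ ");
    }

    public static void main(String[] args) {
        System.out.println(replaceHyphens("hyphen-word"));
    }
}
```

In plain Java this turns "hyphen-word" into "hyphen word", so the hyphen is swapped for a space rather than deleted. That may be fine if the text is tokenized on non-letters afterwards, but it will split hyphenated words into two tokens.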