Text Tokenization Using Regular Expression For Text Mining

onurer007 · April 2014

Hello,
I have a problem and I need your help, please.
I want to tokenize an unstructured document using regular expressions. I have a text file in which each row contains a sentence, such as:

1. String1 String2 String3 String4 String5
2. String6 - String7 - -
...
n. String8 - String9 String10 - (assume the second and fifth strings don't exist.)

What I want is for the tokenization to extract each word and return the results as a table in Excel format, such as:

    S1       S2       S3       S4        S5
1.  String1  String2  String3  String4   String5
2.  String6  -        String7  -         -
3.
..
n.  String8  -        String9  String10  -

Which operators and which regular expression structure can I use in RapidMiner?
Thank you in advance for your help.

MariusHelf · April 2014

If your original document contains the dashes, you can simply read it with Read CSV and specify all blanks (space, tab, etc.) as the column separator.
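
For readers who want to see the same idea outside RapidMiner, here is a minimal Python sketch of splitting each row on runs of whitespace and writing the result as CSV, which Excel can open. It is not the Read CSV operator itself. The function names `tokenize_rows` and `to_csv`, the `row` header label, and the fixed count of five columns are assumptions made for illustration, based on the sample rows in the question.

```python
import csv
import io
import re


def tokenize_rows(lines, n_columns=5):
    """Split each line on whitespace into a row label plus n_columns values."""
    rows = []
    for line in lines:
        # \s+ matches any run of blanks (spaces, tabs, etc.).
        tokens = re.split(r"\s+", line.strip())
        if tokens == [""]:
            continue  # skip empty lines
        label, values = tokens[0], tokens[1:]
        # Assumption: short rows are padded with "-" and extra tokens are dropped.
        values = (values + ["-"] * n_columns)[:n_columns]
        rows.append([label] + values)
    return rows


def to_csv(rows, n_columns=5):
    """Render the rows as CSV text with a header row: row, S1, ..., Sn."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["row"] + [f"S{i}" for i in range(1, n_columns + 1)])
    writer.writerows(rows)
    return buffer.getvalue()


sample = [
    "1. String1 String2 String3 String4 String5",
    "2. String6 - String7 - -",
]
table = tokenize_rows(sample)
csv_text = to_csv(table)
print(csv_text)
```

This only works because the dashes keep every row at the same number of tokens; if missing strings were simply left out, whitespace splitting could not tell which column each token belongs to.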

Best regards,
Marius

Altair RapidMiner Community