RegEx query returns only one word instead of a complete sentence
Hey, I am new to RapidMiner and am trying to analyze text for my Bachelor thesis. I have already pre-processed (e.g. tokenized) the documents and would like to use "extract information" and regular expressions to get all sentences containing the word "Kenntnisse".
I have already tested some expressions on regex101.com and regexr.com, all worked.
Examples: ^.*(kenntnisse|Kenntnisse|kennt*) or (?m)^.*?(Kenntnisse).*$
But as soon as I use the query in "extract information", I only get the word "Kenntnisse", not the whole sentence / paragraph.
Can anyone help me?
Thanks guys!
Best Answers
kayman Member Posts: 662 Unicorn
You put the regex group () around kenntnisse only, so it's normal that this is the only thing returned: the prefix and suffix are not part of the group. If you want the whole sentence, put the ( at the beginning of the expression and the closing ) at the end.
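To make the point about group placement concrete, here is a minimal sketch in Python. It uses Python's re module as a stand-in for the RapidMiner operator (an assumption; the operator itself uses Java regex), and the sample text is made up for illustration.

```python
import re

# Hypothetical sample text, two lines
text = "Wir erwarten gute Kenntnisse in Java.\nTeamarbeit ist wichtig."

# Group only around the keyword: group(1) holds just that word.
narrow = re.search(r"^.*(Kenntnisse)", text, re.MULTILINE)

# Group around the whole pattern: group(1) holds the full line.
wide = re.search(r"^(.*Kenntnisse.*)$", text, re.MULTILINE)

print(narrow.group(1))  # Kenntnisse
print(wide.group(1))    # Wir erwarten gute Kenntnisse in Java.
```

Both expressions match the same line; the only difference is which part of the match ends up in the group that gets extracted.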
kayman Member Posts: 662 Unicorn
Well, magical is maybe a bit overrated, but I did figure out the issue.
What you are trying to do is extract multiple sentences in one go, and that isn't exactly supported. While the operator correctly provides what is selected with the regex it has no real clue what to do with the part that doesn't match, so it just keeps it as is, which is actually correct but may look strange. The operator just sees 'Ah, I have this in my content so I allow the full thing' the way it is constructed now.
You could use this to get the first match, the last match, or a match in between, but you cannot use it to say 'I want sentences one and five', as the operator cannot do that. The regex tester is a generic one, so its replace preview tricks us here: the operator does no replace, just a match...
A workaround would be to tokenise by sentence first and then do the extract, but that's pretty heavy. A better way around is to remove what you don't need rather than keep what you need.
You can use a negative lookahead for that, so something like
(?mi)^(?!.*kenntnisse).*$
and replace it with nothing. This works with the data operators; the document operators do not support replacing with nothing, so it's a bit more complex there.
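As a rough illustration of the "remove what you don't need" idea, here is a small Python sketch using the same expression. Python's re module stands in for the RapidMiner replace operator (an assumption; the operator uses Java regex), and the sample text is invented. The final filtering step is an addition of this sketch: replacing a line with nothing still leaves its line break behind.

```python
import re

# Hypothetical sample text; lines 1 and 3 contain the keyword
text = (
    "Gute Kenntnisse in SQL sind erforderlich.\n"
    "Wir bieten flexible Arbeitszeiten.\n"
    "kenntnisse in Python sind von Vorteil.\n"
    "Bewerbungen bitte per E-Mail."
)

# Match every line that does NOT contain the keyword (any casing) ...
without_keyword = re.compile(r"(?mi)^(?!.*kenntnisse).*$")

# ... and replace it with nothing. The line breaks themselves remain.
replaced = without_keyword.sub("", text)

# Drop the empty lines left behind by the replacement.
kept = [line for line in replaced.splitlines() if line]

print(kept)
```

Only the two lines containing the keyword survive, which is the multi-sentence result that a single extract cannot deliver.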
I've simplified your process a bit using this logic, so actually using a replace instead of extract, and at first glance it seems to work also. I'll attach this if the browser allows me, hope this gets you further. You can just import the attached rmp process.
(BTW, maybe it's best to remove your XML again, seems browsers have a hard time dealing with it once they get a certain size...)
Answers
Before I ask more questions, I will try a few more things and read further into the topic.
But I will be back.
So it looks like this query is working: (?i)[^.\s]*Kenntnisse*[^\n]*
Only one (the first) match is displayed in the results, where in the editor 4 matches are displayed.
Is there again something I have forgotten? I thought this was achieved by the "multiline mode", but it seems to make no difference.
The [^.\s] part basically means 'any character except a literal dot or whitespace', so it may not give you the results you need. This is probably the reason you only get the first match and multiline isn't working.
Maybe try something like this:
(?i)^.*\bkenntnisse\b.*$
The \b means word boundary, i.e. the position between a word character and a non-word character. So the above states: 'match if the word kenntnisse appears between start (^) and end ($), whatever the casing'.
Multiline mode makes ^ and $ match at the start and end of every line, so the expression is applied line by line. You probably wouldn't even need the ^ and $ characters, since .* does not cross line breaks and already stretches to both ends of the line, but they never hurt...
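For anyone who wants to see the effect of multiline mode and of "first match" versus "all matches" outside RapidMiner, here is a minimal Python sketch. It assumes Python's re module behaves like the operator's regex engine for this pattern, and the sample text is made up.

```python
import re

# Hypothetical sample text; lines 1 and 3 contain the keyword
text = (
    "Sehr gute Kenntnisse in Excel.\n"
    "Reisebereitschaft wird vorausgesetzt.\n"
    "Grundlegende kenntnisse in SAP."
)

pattern = r"(?i)^.*\bkenntnisse\b.*$"

# Without multiline mode, ^ and $ anchor to the whole text: no match here.
no_multiline = re.findall(pattern, text)

# With multiline mode, ^ and $ anchor to each line.
all_lines = re.findall(pattern, text, re.MULTILINE)

# search() stops at the first match; findall() collects every match.
first_only = re.search(pattern, text, re.MULTILINE).group(0)

# Without anchors, .* still stops at line breaks, so the result is the same.
unanchored = re.findall(r"(?i).*\bkenntnisse\b.*", text)

print(no_multiline)  # []
print(all_lines)     # both matching lines
print(first_only)    # Sehr gute Kenntnisse in Excel.
```

If a tool shows only one result while the editor highlights several, it is worth checking whether it behaves like search() here, i.e. returns just the first match.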
If you want to focus on multiple words you can use the following:
(?i)^.*\b(?:kenntnisse|other_word|something_else)\b.*$
The (?: xxx ) lets you group without 'storing' (capturing) the match.
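A short Python sketch of the difference between a capturing and a non-capturing group, again with Python's re module as a stand-in for the operator. The second keyword "erfahrung" and the sample text are hypothetical placeholders for other_word and your own data.

```python
import re

# Hypothetical sample text; lines 1 and 2 each contain one keyword
text = (
    "Gute Kenntnisse in Java.\n"
    "Mehrjährige Erfahrung im Vertrieb.\n"
    "Wir freuen uns auf Ihre Bewerbung."
)

# Capturing group: findall() returns only what the group stored.
capturing = re.findall(r"(?im)^.*\b(kenntnisse|erfahrung)\b.*$", text)

# Non-capturing group: nothing is stored, so findall() returns whole matches.
non_capturing = re.findall(r"(?im)^.*\b(?:kenntnisse|erfahrung)\b.*$", text)

print(capturing)      # ['Kenntnisse', 'Erfahrung']
print(non_capturing)  # the two full lines
```

This is the same effect as in the original question: as soon as a capturing group wraps only the keyword, the keyword is all you get back.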
@TobiTee, thanks for the process already. If you do not mind, could you also send me the Excel file (you can use my PM for that or just add it as an attachment here)? Then I can reconstruct the whole flow.
But after running the process I only received the first match.
I hope @kayman has some magical tips. Otherwise there is still Python...