Regular Expressions II - Needle I need You.
Dear reader, this is a continuation of this thread:
http://community.rapidminer.com/t5/RapidMiner-Studio/Regular-expressions/m-p/37750#M26008
I opened a new one- because the original thread is solved.
I try to grep a word of a text-value to generate a value with the Generate Attribute oparator.
replaceAll(haystack, "hay (.*) hay", "$1")
...doesn't work. I guess I need needle as result, but it never appears in $1-$3 probably because it's replaced.
Alternative methods like match() and find() only deliver true/false.
Is there an example expression and function which collects an arbitrary word of an attribute containing text, to provide a value for a new attribute (without text-extension)?
Needle I need you - Song!
https://www.youtube.com/watch?v=rNS6D4hSQdA
Best Answer
-
Edin_Klapic Employee-RapidMiner, RMResearcher, Member Posts: 299 RM Data Scientist
@BalazsBarany: Thank you for pointing out the problem with multiple whitespaces - I included them in this answer
Finally:
Attached is the XML for the process. For those not having access to version 7.4. some screenshots and explanations can be found below.
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
<list key="attribute_values">
<parameter key="haystack" value=""SELECT * FROM myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33""/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="single searchword" width="90" x="179" y="34">
<parameter key="macro" value="searchword"/>
<parameter key="value" value="needle"/>
</operator>
<operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched tablename regex" width="90" x="313" y="34">
<parameter key="macro" value="regexTablename"/>
<parameter key="value" value="(?di).*from\s+(\S+)\s+order.*"/>
</operator>
<operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched limit regex" width="90" x="447" y="34">
<parameter key="macro" value="regexLimit"/>
<parameter key="value" value="(?di).*limit\s+(\S+)\s+offset.*"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
<list key="function_descriptions">
<parameter key="searchword extraction" value="if(matches(haystack,"(?di).*"+%{searchword}+".*"),replaceAll(haystack,"(?di).*("+%{searchword}+").*","$1"),haystack)"/>
<parameter key="test if searchword matches" value="matches(haystack,"(?di).*needle.*")"/>
<parameter key="tablename" value="replaceAll(haystack,%{regexTablename},"$1")"/>
<parameter key="limit" value="replaceAll(haystack,%{regexLimit},"$1")"/>
</list>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="single searchword" to_port="through 1"/>
<connect from_op="single searchword" from_port="through 1" to_op="searched tablename regex" to_port="through 1"/>
<connect from_op="searched tablename regex" from_port="through 1" to_op="searched limit regex" to_port="through 1"/>
<connect from_op="searched limit regex" from_port="through 1" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>As you can see the searchword extraction uses the matches() function inside an if clause. Thus if the RegEx matches the replaceAll() function is only then applied.
Resulting ExampleSet:
The string which is tested (i.e. the content of the attribute haystack) is
SELECT * FROM
myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33The regex for the tablename is
(?di).*from\s+(\S+)\s+order.*
The regex for the limit is
(?di).*limit\s+(\S+)\s+offset.*
Important is the (?di) for dotall and case-insensitive mode for full replacement of the whole content.
Afterwards you can use Extract Macro for getting a variable with the desired value.
Hope this works for you
Best regards,
Edin
2
Answers
Oh no, you started a new thread and didn't tag anyone. They probably won't see this right away.
Flexibility Requirement:
Let's take another source text (real live example).
source:
Now I want to retrieve myNeedleTable and later 30.
regex:
..shows a strange Result preview, though Result List shows the right Group Matches.
That's why it replaces the matching string. If I invert the selection it does not match myNeedleTable.
I think I have to turn
...into its opposite.
...does not work.
Tadaa
Or separately
You are a Regex Guru.
How often do you do regex- every day?
(?mius)((?!(from)(.*)(order)).) left me one "F" to much.
I will choose your solution. Thank you!
I get other results then you...strange...
Something funny:
I think since RM seems to have no real matching function native (without extension),
I should consider if Regex is the right approach for my pipieline. I would need it as much flexible as possible.
Performance would also be a considerable point. The detour via replace seems a little bit 'overkill' too.
One of the functions matches() or find() should return needle. That would be the real smart deal.
In your first picture, you use \s?. This means "0 or 1 whitespace characters". The line feed is 1 or even 2 characters depending on your operating system.
I always use \s+ so I'm independent of the number of whitespace characters. If it is syntactically possible to leave out the whitespace, you could also use \s*.
Regards,
Balázs
Just an FYI, the Generate Attributes operator does have contains() and matches() functions. You can also you use RegEx in there and create new columns (attributes).
Thank you, Do you have an example where contains() or matches() returns needle (the searched word)?
My search term isn't static so "(.*)".
@Edin_Klapic
Wow thank you. That probably was a lot of work.
As I said before, I changed my strategy/pipeline.
I found initial attributing/tagging important stuff (with 'talking declarations') while data preparation, makes much more sense in my case, then extracting informations via replace() to generate attrbutes later in the pipeline. My data source is now "normalized" and optimized.
I'm still not convinced that replacing should substitute a generic matching in most cases.
But I'm sure there will be other cases, when this example could be helpful.
So a great contribution to the community.