Answers
Actually, does the name "text" have to be the name of my Excel file?
I have been following this post and your other post for quite some time, and I get the feeling that it may be a good idea to take a step back, leave the rather complicated text processing aside, and get used to the common concepts of RapidMiner and data mining with RapidMiner with the help of our tutorials. That will make it much easier for you to assemble your processes, and to debug them if anything does not work. There is also a good book available which can even be downloaded for free: search for "Data Mining for the Masses" by Matt North. In it the author explains many concepts of data mining using simple but realistic examples, starting at a very basic level and advancing to more and more complex topics. Most of the chapters use RapidMiner as the platform for doing the exercises.
If you have any further questions you are of course invited to ask for help here on the forums!
All the best,
Marius
Actually, I have tried Tokens by Content, but I couldn't figure out how to specify more than one expression in the corresponding field. I tried searching for sample syntax, but found nothing. :-\
Regular expressions are quite complex, and there are entire books on this topic alone. The basic syntax, however, can be learned quickly from tutorials on the internet.
RapidMiner contains a regular expression dialog, where you can directly test the expressions you entered. It is available in many parameters where you can enter regular expressions, e.g. Select Attributes (with attribute_filter_type = regular_expression). Since the dialog is quite new, not all fields have been ported, so it's not yet in Filter Tokens. There are also many free regex testers on the internet.
Happy Mining!
~Marius
Thanks for the help. I will try and give the feedback!
Best regards
Armen
.*dog.*|.*cat.*|.*fish.*
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
.*dog.*|.*cat.*|.*fish.*
The regular expression used here is a bit on the greedy side, meaning you can get a lot of results, but not necessarily the right ones, depending on how your text is structured.
In the given example it will only match the exact case, so if you have for instance Fish (with a capital F), it will not match. It will also capture fishing, hotdog, category and so on, and while that might be useful in some scenarios, it can also lead to unexpected results.
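To see this behaviour concretely, here is a minimal Python sketch. It assumes the pattern was meant to start with .* (a regex cannot begin with a bare *), and it uses re.fullmatch on the assumption that the filter requires the whole token to match. RapidMiner is Java-based, so its regex flavour may differ from Python's re in details, but these basic constructs behave the same in both.

```python
import re

# Pattern from the thread, with the leading dot restored (assumption).
PATTERN = re.compile(r".*dog.*|.*cat.*|.*fish.*")

def keeps(token: str) -> bool:
    """True when the whole token matches the pattern."""
    return PATTERN.fullmatch(token) is not None

for token in ["fish", "Fish", "fishing", "hotdog", "category", "bird"]:
    print(token, keeps(token))
# fish True
# Fish False
# fishing True
# hotdog True
# category True
# bird False
```

The capitalised Fish is dropped, while fishing, hotdog and category all slip through.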
There are ways to improve this, using some of the more advanced yet cool options of regular expressions.
You could start with groups, which already reduces the number of wildcards and makes the pattern more readable and less error prone.
The above then becomes .*(cat|dog|fish).*
It does exactly the same. It reads as: 'take whatever you want (the dot), as many times as you like (the asterisk), followed by either cat, dog or fish, and then again followed by whatever you want, as much as you want'.
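A quick way to convince yourself that the grouped form gives the same verdicts as the long form is to run both over a few samples. This is only a sketch in Python: the grouped pattern is written as .*(cat|dog|fish).* based on the description above, the long form assumes a leading dot, and the sample strings are made up for illustration.

```python
import re

# Long form from earlier in the thread (leading dot assumed) and the grouped form.
LONG_FORM = re.compile(r".*dog.*|.*cat.*|.*fish.*")
GROUPED = re.compile(r".*(cat|dog|fish).*")

def same_verdict(text: str) -> bool:
    """True when both patterns agree on whether the whole text matches."""
    return bool(LONG_FORM.fullmatch(text)) == bool(GROUPED.fullmatch(text))

# Hypothetical sample texts, not taken from the thread.
samples = ["fish", "Fish", "hotdog", "category", "a bird", "my cat and my dog"]
print(all(same_verdict(s) for s in samples))  # True
```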
This is what we call a greedy pattern: we don't care what we get or how much of it. This is typically no problem when dealing with short sentences, but it can cost you a lot of time and memory when you have long content.
So, one small improvement already: start the pattern with ^.*? instead of .*
Ok, 2 small changes. The first is the 'hat' (^), which means 'begin at the start of the sentence', and the second is the question mark, which makes the preceding .* lazy, meaning 'end at the first match'. So ^.*? is short for: begin at the start, and end as soon as you find the first match. This can save quite some time with large texts, as the original one will just keep looking for matches until it is at the end of the sentence.
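The difference between the greedy and the lazy prefix shows up in which keyword the group ends up holding. The sketch below is in Python and assumes the improved pattern is ^.*?(cat|dog|fish).* (the tail after the group is a guess); the sample sentence is made up.

```python
import re

GREEDY = re.compile(r".*(cat|dog|fish).*")
LAZY = re.compile(r"^.*?(cat|dog|fish).*")  # assumed form of the improved pattern

text = "my cat chased a dog"

# Greedy: the first .* grabs the whole text, then backtracks to the LAST keyword.
greedy_hit = GREEDY.fullmatch(text).group(1)
# Lazy: .*? grows one character at a time and stops at the FIRST keyword.
lazy_hit = LAZY.fullmatch(text).group(1)

print(greedy_hit)  # dog
print(lazy_hit)    # cat
```

Both patterns accept the sentence; the lazy one simply settles on the first keyword instead of running to the end of the text and working its way back.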
Now, we can still only match lower case. While it is good practice to convert all of your text to either lower or upper case in a text analysis workflow, there are of course occasions where we need the difference. Anyway, to ignore case we use the i flag, written as (?i) at the start of the pattern.
So now it will find cat, Cat, CAT, and whatever else, should that be a requirement of course.
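As a small check of the i flag, here is a Python sketch comparing the pattern with and without (?i). The exact pattern is an assumption pieced together from the description; only the (?i) prefix matters for this comparison.

```python
import re

CASE_SENSITIVE = re.compile(r"^.*?(cat|dog|fish).*")
IGNORE_CASE = re.compile(r"(?i)^.*?(cat|dog|fish).*")  # (?i) switches on ignore-case

for word in ["cat", "Cat", "CAT"]:
    print(word, bool(CASE_SENSITIVE.fullmatch(word)), bool(IGNORE_CASE.fullmatch(word)))
# cat True True
# Cat False True
# CAT False True
```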
You can combine several flags. While the i flag means ignore case, the m flag indicates that the text can span multiple lines, so that ^ matches at the start of every line. Combining them as (?im) means that every sentence, when using line breaks, gets the same treatment.
The order doesn't matter: (?mi) would work exactly the same.
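The effect of the m flag is easiest to see on a text with line breaks. This Python sketch uses an assumed pattern and a made-up three-line text; re.findall returns the content of the group for every match it finds.

```python
import re

# Hypothetical text with three lines.
text = "I own a Dog\nno pets here\nGold FISH are nice"

# Without m, ^ only matches at the very start of the text.
single_line = re.findall(r"(?i)^.*?(cat|dog|fish).*", text)
# With m, ^ matches at the start of every line; flag order is irrelevant.
flags_im = re.findall(r"(?im)^.*?(cat|dog|fish).*", text)
flags_mi = re.findall(r"(?mi)^.*?(cat|dog|fish).*", text)

print(single_line)  # ['Dog']
print(flags_im)     # ['Dog', 'FISH']
print(flags_mi)     # ['Dog', 'FISH']
```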
Now, we still have the problem that we can get things like category or hotdog in the results, so the final part is to use the word boundary, which ensures we only get a match when it is exactly the same word. A word boundary can be anything like a comma, a dot, a space, or the end or beginning of a sentence. Luckily there is a little helper again, \b: putting it on both sides of the group gives you an exact match, stopping at the first match, looking at every line you have.
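A minimal Python sketch of the word boundary in action follows. The full pattern, (?im)^.*?\b(cat|dog|fish)\b.*, is a reconstruction from the description rather than a confirmed copy of the original, and the sample strings are made up.

```python
import re

# \b on both sides of the group: the keyword must stand as a word of its own.
EXACT = re.compile(r"(?im)^.*?\b(cat|dog|fish)\b.*")

for text in ["hotdog", "category", "fishing", "a Cat, a dog."]:
    hit = EXACT.fullmatch(text)
    print(text, "->", hit.group(1) if hit else None)
# hotdog -> None
# category -> None
# fishing -> None
# a Cat, a dog. -> Cat
```

Words that merely contain a keyword are now rejected, and the lazy prefix still stops at the first whole-word hit.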
As an alternative, you could use the s flag when you have a lot of line breaks: it lets the dot match line breaks as well, so your text is treated as one single line.
FINAL EDIT: it seems the code block garbles the content a bit; all of the symbols of a pattern need to be on one single line.
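To illustrate the s flag, the Python sketch below matches a whole multi-line text in one go. The pattern and the text are assumptions for illustration; the only difference between the two patterns is the s in the flag group.

```python
import re

# Hypothetical text where the keyword sits on the second line.
text = "first line\nsecond line has a cat\nthird line"

# Without s, the dot stops at the first line break, so the whole text cannot match.
WITHOUT_S = re.compile(r"(?i)^.*?\b(cat|dog|fish)\b.*")
# With s, the dot also matches line breaks and the text is handled as one line.
WITH_S = re.compile(r"(?is)^.*?\b(cat|dog|fish)\b.*")

print(WITHOUT_S.fullmatch(text))         # None
print(WITH_S.fullmatch(text).group(1))   # cat
```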
Regex is probably not going to work here if that is what you want to achieve.
I have a question: what is the box called in RapidMiner where we write the regular expression?