How to obtain the list of words matched by entries containing the wildcards *, ?, #

EL75 · November 2020

Hello rapidminer community,

Hope you're doing well in this crisis period...

Here's my topic, and thank you all for your help and advice.

As the subject says, I'd like to obtain the list of words that correspond to entries containing the wildcards *, ?, #.

The reason is that I'm trying to migrate a dictionary from one platform to another.

In my original dictionary, I have words with the wildcards *, ?, #.

The new platform doesn't accept such characters and forces me to create a separate line for each inflected form.

These wildcards can appear in parts of words or in word sequences.

Using an asterisk « * » allows me, in my present dictionary, to capture all the parts of a text relating to these variations (plural, gender, grammatical inflection, etc.).

For example, SUPPORT* will mean SUPPORT, SUPPORTS, SUPPORTING, SUPPORTIVE, SUPPORTER, etc.

The pattern *SUPPORT* will additionally match every word that contains the substring "SUPPORT", such as UNSUPPORTEDLY, UNSUPPORTED, etc.

An expression made of several words can also be matched by joining the words with underscore characters. For example, the expression "going out" is matched by GO*_OUT.

But my needs go beyond the asterisk as a wildcard:

- « ? » replaces any single character in a word,

- « # » replaces any single digit (« ## » for two digits, etc.).
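
For illustration, the three wildcards can be translated mechanically into regular-expression syntax. The following is a minimal Python sketch, not a feature of either platform. The name wildcard_to_regex is hypothetical, and the mapping is an assumption: * stays inside one word, ? stands for any single character, # stands for one digit, and an underscore stands for a single space.

```python
import re

# Assumed meaning of each wildcard (an interpretation, not a platform spec)
WILDCARDS = {
    "*": r"\w*",  # zero or more word characters (stays inside one word)
    "?": r".",    # exactly one character (any except a newline)
    "#": r"\d",   # exactly one digit
    "_": r" ",    # underscore joins the words of an expression
}

def wildcard_to_regex(entry: str) -> str:
    """Translate a dictionary entry with wildcards into a regex string."""
    # Characters that are not wildcards are escaped so they match literally
    return "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in entry)

pattern = re.compile(wildcard_to_regex("GO*_OUT"), re.IGNORECASE)
print(bool(pattern.fullmatch("going out")))  # True
print(bool(pattern.fullmatch("go about")))   # False
```

The translation only produces a pattern; listing the actual words still requires a reference word list to test the pattern against.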

Therefore, I need to migrate my current dictionary (French words), which contains thousands of rows with ITEMS containing wildcards. Is there a solution that would give me, for each such entry, all the corresponding words?

I believe this is a tricky thing to solve, but I'm stuck in the process and can't move forward.

I would be so happy to find a solution.

Thank you so much in advance for any help.

kayman · November 2020

Regex is fairly similar but the syntax can be a bit scary at first...

In order to mimic SUPPORT* you'd need to use SUPPORT.*

The dot (.) basically means 'any character allowed', and the star (*) means 'repeat the preceding item as many times as you can find it'.

Which also means that if you want to match a single character, the dot alone is sufficient.

So .ES returns TES, YES, NES etc.
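
These two building blocks can be tried out in a few lines of Python. This is only a sketch using the standard re module; the word list is made up, and re.fullmatch is used so that the pattern has to cover the whole word.

```python
import re

# Made-up sample words for illustration
words = ["SUPPORT", "SUPPORTS", "SUPPORTER", "TES", "YES", "NES", "ES", "TEST"]

# SUPPORT.* : SUPPORT followed by zero or more characters of any kind
support_like = [w for w in words if re.fullmatch(r"SUPPORT.*", w)]

# .ES : exactly one character, then ES
one_char_then_es = [w for w in words if re.fullmatch(r".ES", w)]

print(support_like)      # ['SUPPORT', 'SUPPORTS', 'SUPPORTER']
print(one_char_then_es)  # ['TES', 'YES', 'NES']
```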

Now is where the problems start: you want your search to be case-insensitive, so you need to explicitly tell the regex compiler to ignore case. You do this by starting your query with the (?i) syntax.

So (?i).ES returns TES, but also Tes and TeS etc.

But it would also return honestly if the pattern is allowed to match anywhere inside a word, as it's just looking for es with a character in front...
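
The difference between matching a whole word and finding the pattern somewhere inside it can be seen in this small Python sketch (standard re module, made-up sample words). fullmatch requires the pattern to cover the entire word, while search accepts a hit anywhere, which is why honestly slips through.

```python
import re

words = ["TES", "Tes", "TeS", "honestly"]  # made-up sample
pattern = re.compile(r"(?i).ES")           # (?i) switches on case-insensitive matching

# fullmatch: the pattern must cover the entire word
whole_word = [w for w in words if pattern.fullmatch(w)]

# search: the pattern may match anywhere inside the word
anywhere = [w for w in words if pattern.search(w)]

print(whole_word)  # ['TES', 'Tes', 'TeS']
print(anywhere)    # ['TES', 'Tes', 'TeS', 'honestly']
```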

So that's where boundaries come into play: you tell the regex compiler where to start and/or end.

If you only have single words, it means your pattern needs to start at the beginning of a line, using the caret symbol (^).

So (?i)^.ES will now match only words that start with a single character followed by ES, regardless of upper or lower case...
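
A short Python sketch of the effect of the caret, again with made-up words. One assumption goes beyond the post: when the tool searches inside the string rather than matching it in full, the caret only pins the start, so the sketch also shows the dollar sign ($), which pins the end.

```python
import re

words = ["TES", "Tes", "honestly", "Tesla"]  # made-up sample

# ^ pins the match to the start of the string, so "honestly" drops out
anchored_start = [w for w in words if re.search(r"(?i)^.ES", w)]

# adding $ (not used in the post above) also pins the end of the string
anchored_both = [w for w in words if re.search(r"(?i)^.ES$", w)]

print(anchored_start)  # ['TES', 'Tes', 'Tesla']
print(anchored_both)   # ['TES', 'Tes']
```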

Fine, that's with 1 character up front, what if we need more, say 2 or 3 before ES?

One way is to go wild with dots, so (?i)^...ES will match everything that has 3 characters before ES (dot dot dot), but that's not very flexible, so we can also use ranges. With curly brackets you can define a range, so this:

(?i)^.{1,5}ES will match everything that starts with 1 to 5 characters followed by ES.
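
The range can be checked with a small Python sketch. The words are made up, and re.fullmatch is used so that ES has to sit at the very end of the word.

```python
import re

words = ["ES", "TES", "ROSES", "TABLES", "PROCESSES"]  # made-up sample

# .{1,5} : between one and five characters of any kind before ES
one_to_five = [w for w in words if re.fullmatch(r"(?i)^.{1,5}ES", w)]

print(one_to_five)  # ['TES', 'ROSES', 'TABLES']
```

ES is rejected because nothing precedes it, and PROCESSES because seven characters precede its final ES.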

Let's try the above with some wildcards too: say you need any word that contains ES, but it's not important whether it's at the beginning or somewhere in the middle.

The star basically means 0 or more so

(?i)^.*ES means match everything that has ES, even if it starts with it

If you need to have at least one match you use the + so

(?i)^.+ES means match everything that has ES, but there needs to be at least one other character in front of it

Or if you want at most one character in front, you can use the question mark, which means optional. Note that in regex a few special characters (. * ? | etc.) can have multiple uses, but let's keep it simple...

(?i)^.?ES means match everything that has ES, with either exactly one character or none in front of it.

So both TES and ES would match, but not SNES.
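
The three quantifiers can be compared side by side in Python. This is a sketch with made-up words; the helper name full is hypothetical, and re.fullmatch is used so that each pattern has to cover the whole word.

```python
import re

words = ["ES", "TES", "SNES", "ROSES", "TEST"]  # made-up sample

def full(pattern):
    """Return the sample words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(full(r"(?i).*ES"))  # ['ES', 'TES', 'SNES', 'ROSES']  zero or more characters in front
print(full(r"(?i).+ES"))  # ['TES', 'SNES', 'ROSES']        at least one character in front
print(full(r"(?i).?ES"))  # ['ES', 'TES']                   at most one character in front
```

TEST never appears because, under full matching, nothing is allowed after ES.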

The same goes for the suffix. The patterns above basically only match words that end with ES, provided the pattern has to cover the whole word. If you don't care about what comes after, you need to use something like

(?i)^.*ES.* This basically matches everything that contains ES, wherever in a word.

Hope this gets you started. It's a very powerful language and, once you've used it a few times, also surprisingly easy, even if it looks different at this stage...

kayman · November 2020

Have you tried with regex?
You will have some translation to do from the old format, but in essence the logic remains (more or less) the same...

The RapidMiner stem operators (part of the text processing add-ons) do allow * as well. I'm not sure about the other wildcards as I've never used them, but did you give these a try too?

EL75 · November 2020

Hi Kayman,

thank you so much for your help!

Could you explain how a regex could find all the inflected forms of a lemma?

For example, the process should:

- start reading at the first row and find, in the column named 'ITEM' of my dictionary, a word containing an asterisk (*)

- after finding SUPPORT*, look up all its inflected forms in a French word list (I have several), e.g. SUPPORT, SUPPORTS, SUPPORTING, SUPPORTIVE, SUPPORTER, etc.

- then create a line in my dictionary for each new word

- add the words found to the column ITEM

and continue the process until the last row…
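
The steps above can be sketched in Python outside RapidMiner. Everything in this sketch is an assumption made for illustration: the rows, the CATEGORY column, the word list and the function name expand_item are hypothetical, and only the asterisk is handled.

```python
import re

# Hypothetical data: dictionary rows with an ITEM column, plus a reference word list
dictionary_rows = [
    {"ITEM": "SUPPORT*", "CATEGORY": "help"},
    {"ITEM": "TABLE", "CATEGORY": "furniture"},
]
word_list = ["SUPPORT", "SUPPORTS", "SUPPORTER", "UNSUPPORTED", "TABLE"]

def expand_item(item, words):
    """Return the words matched by an ITEM, or the ITEM itself if it has no asterisk."""
    if "*" not in item:
        return [item]
    # Escape the entry, then turn each escaped asterisk into '.*'
    pattern = re.compile(re.escape(item).replace(r"\*", ".*"), re.IGNORECASE)
    return [w for w in words if pattern.fullmatch(w)]

# One output row per matched word; the other columns are copied unchanged.
# An entry with an asterisk and no match in the word list produces no row.
expanded_rows = [
    {**row, "ITEM": word}
    for row in dictionary_rows
    for word in expand_item(row["ITEM"], word_list)
]

print([row["ITEM"] for row in expanded_rows])
# ['SUPPORT', 'SUPPORTS', 'SUPPORTER', 'TABLE']
```

The result can only be as complete as the word list: a form that is missing from the list will not appear in the migrated dictionary.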

Another point is that an entry in the dictionary can contain multiple words, some of which contain a « * » and/or a « ? », e.g. the French expression « temp* d?ecran* ». This returns all verbatims dealing with time spent on screens, including misspellings and the presence or absence of accented characters, which is frequent in French (temps d’écran, temp d’ecran, temps d’écrans, etc.). Such cases are so frequent in French that, when doing semantic analysis of verbatims, it is really useful to capture all those expressions in the same « folder ».

For words containing a « ? », would the process be identical, considering that the « ? » can be anywhere in a word? For instance, as I’m working on a dataset of French verbatims from social networks, I have entries in my dictionary that capture the different ways people write about the same theme, misspellings included. The entry « sans_que_?es_parent*_le_voi* » is a good example of the overall issue with my dictionary's migration:

- « ?ES » should return « tes, des, les, mes »

- « parent* » should return « parent, parents »

- « voi* » should return « voie, voient, vois, voit, etc. »
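
A multi-word entry like this one can be expanded word by word and the results combined. The following Python sketch assumes a small made-up word list and hypothetical function names (expand_token, expand_entry). It treats ? as any single character and * as any run of characters, and it does not handle accents.

```python
import re
from itertools import product

# Hypothetical reference word list (lower case, no accents)
word_list = ["tes", "des", "les", "mes", "parent", "parents",
             "voie", "voient", "vois", "voit"]

def expand_token(token, words):
    """Return the words matched by one token, or the token itself if it has no wildcard."""
    if "*" not in token and "?" not in token:
        return [token]
    # Escape the token, then restore the two wildcards as regex syntax
    regex = re.escape(token).replace(r"\*", ".*").replace(r"\?", ".")
    return [w for w in words if re.fullmatch(regex, w, flags=re.IGNORECASE)]

def expand_entry(entry, words):
    """Expand every token of an underscore-joined entry and combine the results."""
    options = [expand_token(token, words) for token in entry.split("_")]
    return ["_".join(combo) for combo in product(*options)]

variants = expand_entry("sans_que_?es_parent*_le_voi*", word_list)
print(len(variants))  # 32
print(variants[0])    # sans_que_tes_parent_le_voie
```

The number of generated lines grows multiplicatively (here 4 x 2 x 4), and some combinations will be ungrammatical, so the output probably needs a review before it is loaded into the new platform.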

I thought perhaps a RapidMiner process, combined with regex, could allow me to do that?

Thank you for your help. I’m quite a beginner in coding and really stuck in this migration process...

best regards

EL75 · November 2020

Hi Kayman, I'm so glad to read your answer, which is very precise and instructive.

I’ve read it in detail and, with your help, I feel comfortable starting to play with regex!

Would you please tell me how (within RapidMiner) I can specify the target of the regex search? I mean that the regex takes its words from the original dictionary and, on the other side, has to find the words in another file.

best regards,

kayman · November 2020

I lost you a bit :-)
Could you share some examples, like reduced source files and elaborate what you would like to get as outcome? It's easier this way to get an overall view on the problem and potential solution.

EL75 · November 2020

Hi Kayman,
thank you very much for your help.
It took me some time to find the right way, and you're right, regex is powerful.
best regards,
PS: I've posted a new question about encoding apostrophes when exporting a CSV file, in case you know how to do it.
