function behaviour with replaceAll()
When using the replaceAll operator, it seems some functions are ignored while others seem to work fine.
As an example:
replaceAll(lower([myField]),"^(.)",upper("$1")) just returns the lowercased value as is, whereas the expected behaviour would be to get the first character returned in upper case. No error is thrown; the upper (and also lower) function is simply ignored when applied to the regex result.
replaceAll([myField],"^(.)",concat("-","$1","-")) nicely returns a concatenated field, as expected. So here the function works nicely with the regex match.
Any idea why?
(PS : I'm aware I can get the wanted result with other functions also, but that would only work for the simplified example as my actual regex is a bit more complex)
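One possible reading of the two examples above, offered as an assumption and not confirmed anywhere in this thread: the third argument is evaluated before replaceAll ever runs, so upper() only sees the literal text "$1", which has no letters to change, while concat() builds "-$1-", which still contains the group reference. A minimal Python sketch of that evaluation order follows; the `replace_all` helper is hypothetical and only mimics `$n` references on top of Python's `re` module, which natively writes them as `\g<n>`.

```python
import re

def replace_all(text, pattern, replacement):
    """Hypothetical stand-in for a replaceAll that understands $1, $2, ..."""
    # Translate $n into Python's \g<n> group-reference syntax.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, text)

# The argument is computed first: upper-casing "$1" changes nothing,
# because "$" and "1" have no upper-case form.
upper_arg = "$1".upper()

# So the replacement is still plain "$1" and the text comes back as is.
unchanged = replace_all("hello", r"^(.)", upper_arg)

# Concatenation yields "-$1-", which still carries the group reference.
wrapped = replace_all("hello", r"^(.)", "-" + "$1" + "-")
```

Under this reading nothing is being ignored: the function runs, but on the placeholder text instead of the matched character.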
Answers
hi @kayman - that is one nasty RegEx you are building there in Generate Attributes
I have no idea why you would see that strange behavior BUT if it were me, I would build that expression in three Generate Attributes operators rather than in one nasty formula:
att1 lower([myField])
att2 upper("$1") [--- not sure what this does...upper case $? ---]
att3 replaceAll(att1,"^(.)",att2)
Scott
Trust me, that's not a nasty one :-)
What I want to achieve is to camel case some uppercased content in a given string, so the attribute flow will not work.
Example :
Attr : This is a STRING
Should become
Attr : This is a String
Now, if there were only a few words like this I could deal with replacing them one by one, but there are a load of them. So in essence I want to be able to replace some defined uppercased words with lower (or camel) case, but definitely not all of them.
Getting the words in question is fairly simple, that would be something like
replaceAll([myAttr],"(WORD1|WORD2|WORD3)", [replaceWithLogic])
The $1 operator is simply my matched word, so for instance WORD1
Where the replace logic could be something like concat(prefix("$1",1),lower(suffix("$1",length("$1")-1)))
In other words, take the first char and leave it as is, and turn everything else to lower case. It should work in theory, but the operator happily ignores everything and just returns the value as is.
So instead of getting expected Word1, it produces WORD1WORD1.
In standard regex you could also use something like (W)(ORD1) and replace this with $1\L$2 to get Word1, but this syntax is not supported in the normal regex replacements either.
The behaviour is not consistent: some functions handle the matched group ($x) correctly, others do not. But it also doesn't fail as such; the function just does not get applied.
Hope this makes some sense...
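For comparison, here is a minimal Python sketch of the kind of replacement described above, where the case change is applied to the matched word itself. It relies on `re.sub` accepting a function as the replacement, which is a Python feature; whether RapidMiner's replaceAll offers anything equivalent is not established in this thread. The word list is hypothetical, since the post only shows placeholders such as WORD1.

```python
import re

# Hypothetical word list; the thread only gives placeholders like WORD1.
WORDS = ["STRING", "WORD1", "WORD2"]

def capitalize_listed_words(text, words):
    """Rewrite each listed upper-case word as first letter upper, rest lower."""
    # Build \b(?:STRING|WORD1|WORD2)\b, escaping each word in case it
    # contains regex metacharacters.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    # The function receives each match object, so the case change is applied
    # to the matched text and not to a literal placeholder.
    return re.sub(pattern, lambda m: m.group(0).capitalize(), text)

print(capitalize_listed_words("This is a STRING", WORDS))  # This is a String
```

Words that are not in the list are left untouched, which matches the requirement that only some upper-case words should change.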
yes it makes sense. Huh. I have a feeling there is a much easier way to do this but my mind is a blank. One thing that I hope(?) you know is that the RegEx engine is different in Java than in JavaScript. I accidentally discovered this a while back when I was using online RegEx builders and then found RapidMiner being wonky. @Telcontar120 got me onto this book "Regular Expressions in 10 Minutes" by Ben Forta which does a good job for me.
Can you post a sample of the data set and your process so I can play with it? I love this kind of stuff.
Scott
There are alternatives indeed, but they can end up pretty hard to maintain. The problem with my content is that sometimes terms are in uppercase and other times in camel case, and I need them to be consistent in the end. What I do now is use a replace by document flow, like below:
section,from,to
Generic,\\bSPORT\\b,Sport
Generic,\\bCINEMA\\b,Cinema
Generic,\\bGAME\\b,Game
and many more...
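As a rough illustration of how such a from/to table drives the replacement, here is a minimal Python sketch that applies each row in turn. It is not the RapidMiner process itself. One assumption: each "from" value is meant as the regex \bWORD\b, and the doubled backslashes in the code below are just Python string escaping for that.

```python
import csv
import io
import re

# Rows copied from the table above. Assumption: every "from" value stands for
# the regex \bWORD\b; "\\b" inside this Python string is a single \b.
RULES_CSV = """section,from,to
Generic,\\bSPORT\\b,Sport
Generic,\\bCINEMA\\b,Cinema
Generic,\\bGAME\\b,Game
"""

def apply_rules(text, rules_csv=RULES_CSV):
    """Apply every from/to rule in order and return the rewritten text."""
    for row in csv.DictReader(io.StringIO(rules_csv)):
        text = re.sub(row["from"], row["to"], text)
    return text

print(apply_rules("SPORT and CINEMA, not GAMES"))  # Sport and Cinema, not GAMES
```

The word boundaries keep GAMES from being touched by the GAME rule, at the cost of needing one row per term.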
This works pretty well and is a reasonable alternative. Some things could be done more easily with a single expression (ok, the regex can become scary then), but that doesn't really work as expected. I know and understand there are differences between the various flavours, but that is not the real issue in this scenario, since there is a match and the regex itself is valid. The problem is that the operator behaves differently with the output depending on the function used.
I'm not sure in what format the result is returned behind the scenes; my guess is that the operator recognizes that there is a result, but not what format it is in.
Below is a sample. It is a bit simple, but it shows that the script is at least working for some functions.