Count UPPER CASE Tokens

nfridge1 · May 2020

I have a spreadsheet with a text column and a label column. I would like to represent the text values with some token metadata. I'm using "Process Documents", and inside "Process Documents" I'm tokenizing the text value. I would like to achieve the following:
1. Add an attribute to the exampleset which contains a count of the number of tokens which were UPPER CASE.
2. Add an attribute to the exampleset which is a count of the number of adjective tokens.
On point (2) I have made some progress by using "Filter Tokens (by POS Tags)". This doesn't give me quite what I want, though. I want a count of the number of adjectives, not just a bag of words filtered to contain only adjectives.
On point (1) I have no idea how to proceed.
Thank you.

kayman · May 2020

@nfridge1, the count is fairly easy. Use the Generate Attributes operator with an expression like this:

length(replaceAll([Text],"[^A-Z]",""))

Basically this means: replace every character that's not an uppercase letter (A-Z) with nothing, and then count the length of the remainder.

So if your original Text would be "JusT FoR FuN" the replacement would return JTFRFN and length would be 6, which is then what will be returned.
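
For anyone who wants to check the logic outside RapidMiner, here is a minimal Python sketch of the same idea. The function names are made up for illustration. Note that the expression above counts uppercase letters; the second function is only an assumption about what a count of UPPER CASE tokens could look like (whitespace-separated tokens written entirely in capitals A-Z).

```python
import re

def count_uppercase_letters(text: str) -> int:
    # Mirrors length(replaceAll([Text],"[^A-Z]","")): drop every character
    # that is not an ASCII capital letter and count what is left.
    return len(re.sub(r"[^A-Z]", "", text))

def count_uppercase_tokens(text: str) -> int:
    # Assumption: tokens are separated by whitespace, and a token counts
    # as upper case only if it consists entirely of the letters A-Z.
    return sum(1 for token in text.split() if re.fullmatch(r"[A-Z]+", token))
```

With the example text "JusT FoR FuN", the first function returns 6, while the second returns 0 because no single token is fully capitalised.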

Not sure what you mean by the adjective tokens. Do you have an example of what you have and what you want to achieve?

Edit: If you just want to count the words, you could use something similar to the above, but this time remove everything that's not the separator (comma, space, or whatever you used) and add 1. That should give you the total number of tokens.

So if you have something like Token1, Token2, Token3, use:

length(replaceAll([MyTokens],"[^,]",""))+1
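
The same separator trick, as a rough Python sketch. The function name and the default separator are assumptions for illustration, not something RapidMiner provides.

```python
import re

def count_tokens(joined: str, separator: str = ",") -> int:
    # Keep only the separator characters, count them, and add 1,
    # as in length(replaceAll([MyTokens],"[^,]",""))+1.
    only_separators = re.sub(f"[^{re.escape(separator)}]", "", joined)
    return len(only_separators) + 1
```

One thing to watch: an empty value still yields 1, because there are zero separators plus one, so empty or missing texts may need to be handled separately.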

Telcontar120 · May 2020

This is an interesting question. There are probably multiple ways of doing this and I'd be interested to see what some of our regex wizards like @kayman have to say.
But my first suggestion for #1 is to use the Generate Aggregation operator with the "count" function and use a regular expression to select only those attributes with a name that is entirely uppercase, which would be: [A-Z]+
(and this could be modified if you want to allow numbers or other special characters as well).
For the 2nd one, once you have a dataset with just the adjective tokens, you can skip the regular expression filtering and just use Generate Aggregation directly to get the count.
In both cases, this will provide the count for all tokens, regardless of whether they are in each individual document or not.
If you want a count of only the ones that appeared in each document, in the Process Documents operator you could use the word vector creation method of binary term occurrences and then simply use the sum function inside Generate Aggregation instead. Or use term occurrences as your word vector creation method and then the sum function will give you the actual count of such tokens. So you have several options.
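
To make the difference between the two word vector options concrete, here is a small Python sketch for a single document. It is only an illustration under assumptions: the function name is hypothetical, the token list stands in for one row of the word vector, and the original case of the tokens is assumed to be preserved.

```python
import re
from collections import Counter

def uppercase_token_counts(tokens):
    # Term occurrences for one document: token -> number of appearances.
    occurrences = Counter(tokens)
    # Keep only "attributes" whose name is entirely uppercase, like [A-Z]+.
    upper = {t: n for t, n in occurrences.items() if re.fullmatch(r"[A-Z]+", t)}
    # Summing binary term occurrences gives the number of distinct
    # uppercase tokens present in the document.
    distinct = len(upper)
    # Summing term occurrences gives how often uppercase tokens appear.
    total = sum(upper.values())
    return distinct, total
```

For the tokens NASA, launch, NASA, ESA, Moon this gives 2 distinct uppercase tokens and 3 uppercase occurrences in total.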
