Count UPPER CASE Tokens
I have a spreadsheet with a text column and a label column. I would like to represent the text values with some token metadata. I'm using "process documents", and inside it I'm tokenizing the text value. I would like to achieve the following:
1. Add an attribute to the exampleset which contains a count of the number of tokens which were UPPER CASE.
2. Add an attribute to the exampleset which is a count of the number of adjective tokens.
On point (2) I have made some progress by using "filter tokens by pos tag". This doesn't give me quite what I want though. I want a count of the number of adjectives, not just bag-of-words filtered to only contain adjectives.
On point (1) I have no ideas for how to proceed.
Thank you.
Best Answer
kayman Member Posts: 662 Unicorn
@nfridge1, the count is fairly easy. Use the Generate Attributes operator with something like this:
length(replaceAll([Text],"[^A-Z]",""))
Basically, this replaces everything that's not an uppercase letter (A-Z) with nothing and then takes the length of the remainder.
So if your original Text were "JusT FoR FuN", the replacement would return JTFRFN, whose length is 6, and that is the value returned.
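For anyone who wants to check the regex outside RapidMiner, here is a minimal Python sketch of the same idea. It is not RapidMiner code, and the function names are made up for illustration. The first function mirrors the expression above, which counts uppercase letters. The second is an assumed variant that counts whole tokens written entirely in uppercase, splitting on whitespace, in case that is closer to what the question asks for.

```python
import re


def count_uppercase_letters(text: str) -> int:
    """Mirror of length(replaceAll([Text], "[^A-Z]", "")): delete every
    character that is not A-Z and measure what is left."""
    return len(re.sub(r"[^A-Z]", "", text))


def count_uppercase_tokens(text: str) -> int:
    """Hypothetical variant: count whitespace-separated tokens that consist
    only of the letters A-Z."""
    return sum(1 for token in text.split() if re.fullmatch(r"[A-Z]+", token))


print(count_uppercase_letters("JusT FoR FuN"))  # 6 (J, T, F, R, F, N)
print(count_uppercase_tokens("JusT FoR FuN"))   # 0 (no token is fully uppercase)
print(count_uppercase_tokens("THIS is LOUD"))   # 2 (THIS, LOUD)
```

The two counts differ on mixed-case text, so it is worth deciding which one you actually need before building the attribute.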
I'm not sure what you mean by the adjective tokens. Do you have an example of what you have and what you want to achieve?
Edit: If you just want to count the words, you could use something similar to the above, but now you remove everything that's not the separator (comma, space, or whatever you used) and add 1. That should give you the total number of tokens.
So if you have something like Token1, Token2, Token3, use
length(replaceAll([MyTokens],"[^,]",""))+1
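The separator trick can be sketched in Python the same way. Again, this is only an illustration and not RapidMiner code, and the function name and its `separator` parameter are assumptions: keep only the separator characters, count them, and add 1.

```python
import re


def count_tokens_by_separator(tokens: str, separator: str = ",") -> int:
    """Mirror of length(replaceAll([MyTokens], "[^,]", "")) + 1: delete
    everything except the separator, then add 1 to the separator count."""
    only_separators = re.sub("[^" + re.escape(separator) + "]", "", tokens)
    return len(only_separators) + 1


print(count_tokens_by_separator("Token1, Token2, Token3"))  # 3
print(count_tokens_by_separator("one two three four", " "))  # 4
print(count_tokens_by_separator(""))                         # 1
```

As the last line shows, the formula returns 1 for an empty string, so empty values would need to be handled separately if they can occur in your data.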
Answers
But my first suggestion for #1 is to use the Generate Aggregation operator with the "count" function and use a regular expression to select only those attributes with a name that is entirely uppercase, which would be: [A-Z]+
(and this could be modified if you want to allow numbers or other special characters as well).
For the 2nd one, once you have a dataset with just the adjective tokens, you can skip the regular expression filtering and just use Generate Aggregation directly to get the count.
In both cases, this will provide the count for all tokens, regardless of whether they are in each individual document or not.
If you want a count of only the ones that appeared in each document, in the Process Documents operator you could use the word vector creation method of binary term occurrences and then simply use the sum function inside Generate Aggregation instead. Or use term occurrences as your word vector creation method and then the sum function will give you the actual count of such tokens. So you have several options.
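To make the difference between the two word vector options concrete, here is a small Python sketch with made-up data. It is not RapidMiner output. One document is represented as a dictionary of token attributes, the attribute names are filtered with the [A-Z]+ pattern (assumed here to match the whole name), and the selected values are summed the way the sum function in Generate Aggregation would.

```python
import re

# Pattern from the answer above; fullmatch is used so the whole name must match.
UPPERCASE_NAME = re.compile(r"[A-Z]+")


def sum_uppercase_attributes(row: dict) -> int:
    """Sum the values of all attributes whose name is entirely uppercase."""
    return sum(
        value for name, value in row.items() if UPPERCASE_NAME.fullmatch(name)
    )


# Hypothetical word vector for a single document (term occurrences).
term_occurrences = {"NASA": 2, "launch": 1, "USA": 1, "the": 3}

# The same document as binary term occurrences (1 if the token is present).
binary_occurrences = {
    name: 1 if count > 0 else 0 for name, count in term_occurrences.items()
}

print(sum_uppercase_attributes(term_occurrences))    # 3 uppercase tokens in total
print(sum_uppercase_attributes(binary_occurrences))  # 2 distinct uppercase tokens
```

With term occurrences the sum is the number of uppercase tokens in the document. With binary term occurrences it is the number of distinct uppercase tokens.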
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts