Can I use POS expressions (chunks) with the text mining operators?
Hi there,
I know I can use the Filter Tokens (by POS Tags) operator to filter on single POS tags, but how would I generate chunks?
I am, for instance, interested in combinations of adjectives and nouns, or noun sequences, but this does not seem to work for me.
Let's assume I have a dummy sentence like this one : "I have a broken computer, there is no picture, this thing sucks"
I would like to chunk this using, for instance, (JJ.* NN.*+)|(DT NN.*+)|NN.*+
-> so either an adjective followed by noun(s), or a determiner followed by noun(s), or a simple noun phrase
so after some further processing my output would become something like [broken computer],[no picture],[thing sucks]
But the operator seems to accept single POS tags only. Is this correct, or am I doing it completely wrong?
Answers
What is your tokenizer set at? non letters? Have you tried setting it to linguistic sentences?
Hi Thomas,
As I was testing on single sentences (handpicked) I did not use a tokenizer yet. The POS operator works pretty ok when selecting a single POS (like JJ.*|NN.*) but seems to be unable to handle sequences (so like any JJ followed by NN).
I can do the same with a Python operator, so I am not really stuck if RM does not support it; it would just be nice to be able to do it with the standard operators. Maybe something for the next version?
Or I may be having issues with the syntax, I am not too sure about that one either.
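For readers wondering what the Python route could look like, here is a minimal sketch in plain Python. It assumes the tokens are already POS-tagged; the tags below are assigned by hand for the dummy sentence (a real tagger may label some tokens differently), and the helper name `chunk` is hypothetical. It is not the poster's actual script, only one way to match tag sequences rather than single tags.

```python
import re

# Hand-assigned Penn Treebank tags for the dummy sentence (an assumption;
# a real tagger may disagree on some tokens).
tagged = [
    ("I", "PRP"), ("have", "VBP"), ("a", "DT"), ("broken", "JJ"),
    ("computer", "NN"), (",", ","), ("there", "EX"), ("is", "VBZ"),
    ("no", "DT"), ("picture", "NN"), (",", ","), ("this", "DT"),
    ("thing", "NN"), ("sucks", "VBZ"),
]

# Sequence pattern over tags: adjective + noun(s), determiner + noun(s),
# or noun(s) alone. Each tag is wrapped as "<TAG>" so the regex sees the
# tags in order as one string.
CHUNK = re.compile(
    r"(?:<JJ[^>]*>(?:<NN[^>]*>)+)|(?:<DT>(?:<NN[^>]*>)+)|(?:<NN[^>]*>)+"
)

def chunk(tagged_tokens):
    """Return the word sequences whose tag sequence matches CHUNK."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    # Character offset at which each token's "<TAG>" starts in tag_string.
    starts = []
    pos = 0
    for _, tag in tagged_tokens:
        starts.append(pos)
        pos += len(tag) + 2
    chunks = []
    for match in CHUNK.finditer(tag_string):
        first = starts.index(match.start())
        last = first + match.group().count("<")  # number of tags matched
        chunks.append(" ".join(word for word, _ in tagged_tokens[first:last]))
    return chunks

chunks = chunk(tagged)
print(chunks)
```

With these hand-assigned tags, "sucks" is a verb (VBZ), so it does not end up inside a noun chunk; the result depends entirely on what the tagger produces.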
It should be able to do it based on your regex structure, but I think it needs to operate inside the Process Documents from Data operator with a Tokenize operator set to linguistic sentences.
Try this:
Hi Thomas,
Maybe I was not clear enough. The example you show works, but it filters on either the JJ or the NN tag, not on a sequence of these. The OR worked for me as well (even without tokenizing on sentences), but I need more of an AND scenario.
What I need to achieve is a filter on JJ, only if followed by one or more NN (or other combinations).
Assume I have the following sentences:
"hello what do I need to be able to group multiple pos tokens? Can I use regular groups or is that too complex?"
Using a chunk rule like this one in Python, <JJ><NN.*>+, would return me
Using the expression JJ.*|NN.* as in the RM example logic correctly returns
So being able to group POS tags would be much more powerful, but given that an expression like JJ NN.* returns an empty match, I assume this is not possible.
Hope this makes it clearer.
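One plausible explanation for the empty match, sketched below in Python: if a filter tests the expression against each token's tag on its own, an alternation such as JJ.*|NN.* can succeed, but an expression containing a space can never equal a single tag. This is an assumption about how such a filter behaves, not a description of RapidMiner's implementation; `filter_per_token` is a hypothetical helper and the tags are assigned by hand.

```python
import re

# A few hand-tagged tokens (assumption, for illustration only).
tagged = [("regular", "JJ"), ("groups", "NNS"), ("too", "RB"), ("complex", "JJ")]

def filter_per_token(tagged_tokens, expression):
    """Keep the tokens whose own tag fully matches the expression."""
    pattern = re.compile(expression)
    return [word for word, tag in tagged_tokens if pattern.fullmatch(tag)]

# OR scenario: each tag is checked alone, so adjectives and nouns both pass.
either = filter_per_token(tagged, r"JJ.*|NN.*")

# Sequence attempt: no single tag contains "JJ" followed by a space and "NN",
# so nothing passes and the result is empty.
sequence = filter_per_token(tagged, r"JJ NN.*")

print(either)
print(sequence)
```

Under that assumption, sequence matching has to look at the tags of neighbouring tokens together, which a per-token filter cannot do no matter how the expression is written.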
Hmm, in this case I'm stumped. Maybe @mschmitz has an idea.
Phew, this is rather a question for @hhomburg or @RalfKlinkenberg