Filter Stopwords (English) takes out a non-stopword token
Greetings community,
I am learning to use RapidMiner to extract and analyse occurrences of selected keywords in annual reports prepared by commercial entities. RapidMiner works well for all the keywords I study, except for one.
For some reason, the Filter Stopwords (English) operator filters out the word 'important' across the whole corpus of documents I study.
For example, I have a document in which a manual search finds the following words of interest:
important - 11
importantly - 4
importance - 4
Using Process Documents from Files with the Filter Stopwords (English) operator ON, I can see only the occurrences of 'importantly' and 'importance'. With this operator OFF, I also get the expected 11 occurrences of 'important'.
I tried changing the tokenizing mode from 'non letters' to 'linguistic tokens', but it did not help.
Question: Is this a (known) error?
(I don't see the </> icon to share my process.)
Kind regards,
Answers
Process and test document added.
Hi @AO1,
I'm able to replicate this on all the versions available to me. I will see if I can find out more from the development team. In the meantime, I would suggest using Filter Stopwords (Dictionary) for more fine-grained control.
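As a rough illustration of the dictionary idea outside RapidMiner, here is a minimal Python sketch. It is not how the operator is implemented. The stopword list, the sample sentence, and the helper names `tokenize` and `count_keywords` are all hypothetical; the point is only that a list you control can leave 'important' in place.

```python
import re
from collections import Counter

# Hypothetical custom stopword list (an assumption for illustration).
# 'important' is deliberately NOT in it, so it survives filtering.
CUSTOM_STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "to", "in", "more"}


def tokenize(text):
    """Split on runs of non-letter characters and lowercase each token."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def count_keywords(text, keywords, stopwords):
    """Count each keyword after removing tokens found in the stopword set."""
    kept = [t for t in tokenize(text) if t not in stopwords]
    counts = Counter(kept)
    return {k: counts[k] for k in keywords}


# Made-up sample text, not taken from the reports discussed in this thread.
sample = (
    "It is important to note the importance of this. "
    "More importantly, an important point."
)
result = count_keywords(
    sample, ["important", "importantly", "importance"], CUSTOM_STOPWORDS
)
print(result)
```

Comparing counts like these against a manual search is a quick way to confirm that a custom list keeps the keywords you care about.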
Best,
Roland