Language filter to retain English only
Best Answers
-
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
In theory you could tokenize based on spaces, which would give you a set of "words" that could be in multiple languages. You could then use the filter token with dictionary operator to retain only those tokens that appear in a given language dictionary (which you would need to supply as a txt file). This would be a crude language filter using only native RapidMiner operators, but I think the accuracy would not be as high as you would like, because of ambiguous words and because of how mixed-language texts would have to be handled.
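For readers who want to see the dictionary idea outside RapidMiner, here is a minimal Python sketch of it. It is not the RapidMiner operator itself: the function names `load_dictionary` and `filter_tokens`, the one-word-per-line file layout, and the tiny sample word set are all assumptions made for illustration.

```python
def load_dictionary(path):
    """Read one word per line from a plain-text file into a lowercase set.

    The one-word-per-line layout is an assumption about the txt file.
    """
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def filter_tokens(text, dictionary):
    """Split on whitespace and keep only tokens found in the dictionary."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation so "mat," still matches "mat".
        cleaned = token.strip(".,;:!?\"'()[]")
        if cleaned.lower() in dictionary:
            kept.append(cleaned)
    return kept


# Tiny in-memory word set for illustration; a real run would call
# load_dictionary() on a full English word list instead.
english_words = {"the", "cat", "sat", "on", "mat"}
print(filter_tokens("The cat sat on the mat, el gato duerme.", english_words))
```

The ambiguity problem mentioned above shows up directly in this approach: a word such as "pain" is valid in both English and French, so it passes the filter no matter which language the surrounding text is in.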
-
JamieLim Member Posts: 3 Learner I
I ended up using Python to split the paragraphs into sentences, then separated the English sentences from the non-English ones, which gave a pretty good filter. The filtered text was then passed into RapidMiner and tokenized. A few non-English words were still left, and I removed these by adding them to a stopwords dictionary.
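The post does not say which Python library was used, so the following is only a rough sketch of the same three steps (split into sentences, keep the English ones, drop leftover words with a custom stopword list) using the standard library. The function-word heuristic, the `ENGLISH_HINTS` set, the 0.2 threshold, and all function names are assumptions for illustration; a dedicated language-identification library would normally be more accurate.

```python
import re

# Small sample of common English function words (an illustrative assumption).
ENGLISH_HINTS = {
    "the", "a", "an", "and", "is", "are", "of", "to", "in", "it",
    "this", "that", "was", "for", "on", "with",
}


def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def looks_english(sentence, threshold=0.2):
    """Guess English when enough words are common English function words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in ENGLISH_HINTS)
    return hits / len(words) >= threshold


def keep_english(paragraph):
    """Return only the sentences that look English, joined by spaces."""
    return " ".join(s for s in split_sentences(paragraph) if looks_english(s))


def drop_extra_stopwords(tokens, extra_stopwords):
    """Remove leftover unwanted tokens, mirroring the custom stopword list."""
    return [t for t in tokens if t.lower() not in extra_stopwords]


sample = (
    "This is a short note about the product. "
    "Das Produkt ist sehr gut und billig. "
    "It was shipped to the office in time."
)
filtered = keep_english(sample)
print(filtered)
print(drop_extra_stopwords(filtered.split(), {"shipped"}))
```

Filtering at sentence level rather than token level is what lets this approach cope with mixed-language paragraphs, at the cost of discarding any sentence that mixes languages internally.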
Answers
- Manually classify a set of documents and train a ML model to discriminate between them, then apply the model on all new documents.
- Use an external API such as Google Translate or AWS Translate to do this for you
Scott
Or in other words, sometimes there is no quick-and-dirty answer.
Scott
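The first suggestion above (manually classify some documents, train a model, apply it to new documents) could look roughly like the sketch below. It is a small character-trigram Naive Bayes classifier written with the standard library only; the class name, the "english"/"other" labels, and the six training sentences are made up for illustration, and a real project would need far more labelled data and would more likely use RapidMiner's own learners or an established ML library.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageClassifier:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = {}       # label -> Counter of n-grams
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()      # all n-grams seen in training

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocabulary.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            total = sum(counts.values())
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hand-labelled toy examples standing in for the manually classified documents.
train_docs = [
    "the quick brown fox jumps over the lazy dog",
    "this is a simple sentence in the english language",
    "where is the nearest train station",
    "der schnelle braune fuchs springt über den faulen hund",
    "dies ist ein einfacher satz in deutscher sprache",
    "wo ist der nächste bahnhof",
]
train_labels = ["english"] * 3 + ["other"] * 3

model = NaiveBayesLanguageClassifier().fit(train_docs, train_labels)
print(model.predict("the lazy dog is in the train station"))
print(model.predict("der faule hund ist ein fuchs"))
```

Character n-grams are used here instead of whole words so the model can still make a guess on words it never saw during training.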