"Text Mining: How to split data according to language"
Hi there,
I am trying to split my text corpus by the language each text is written in, but I can't get it to work and would appreciate some help.
First, I classified the language of each text in the corpus using a Naive Bayes based language detector, so I already know which texts are, e.g., German or English. Now I want to select only the German or only the English texts in order to analyze them separately, but I don't know which operators to use. I already tried the Filter Examples operator, but it looks like only the prediction labels for the languages are filtered, and the corresponding texts are omitted.
Can anybody help?
Thanks in advance!!
Ute
Answers
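As a point of comparison outside RapidMiner, the underlying idea is to filter or group whole rows on the predicted language label, so that the text travels with its label. A minimal Python sketch, assuming each record holds a `text` field and a `prediction` field (both field names, the helper name `split_by_language`, and the sample data are hypothetical):

```python
from collections import defaultdict

# Hypothetical sample data: each record keeps the text together with the
# language label predicted by the detector.
records = [
    {"text": "Das ist ein Beispiel.", "prediction": "de"},
    {"text": "This is an example.", "prediction": "en"},
    {"text": "Noch ein deutscher Satz.", "prediction": "de"},
]


def split_by_language(rows, label_key="prediction"):
    """Group whole records (text included) by their predicted language."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    return dict(groups)


by_language = split_by_language(records)

# Each subset still carries the original text, ready for separate analysis.
german_texts = [row["text"] for row in by_language.get("de", [])]
english_texts = [row["text"] for row in by_language.get("en", [])]
```

If the text column disappears after filtering in a RapidMiner process, it may be worth checking whether the text attribute is still part of the example set at the point where the filter is applied.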
Alternatively, you can use a language detection API through the "Enrich data by WebService" operator to create such an attribute. I personally used http://detectlanguage.com/ and it was very good and easy to implement.
Hope this helps
Igor
Igor, I tried your suggestion of using the "Enrich data by WebService" operator to create such an attribute, but I am not sure about:
1. What query type to use
2. What the regular expression would have to look like for this to work
I do have an API key from detectlanguage.com and I am able to pass data to the detectlanguage.com service. Now the question is: how do I get the language value parsed out?
Thanks in advance for your help.
hello @tibi - welcome to the community. This is an old thread but maybe I can help? Can you please post your XML process (see instructions on the right)?
Thanks.
Scott
Thank you for writing back. Attached is my XML code. I edited the code so that it does not show my API key.
hello @tibi - looks like an encoding issue. Give this a try (again deleting API key):
Scott
Scott,
Yes. That is what it was. Thank you!
One more thing. When I have a text string with two languages in it, the web API actually returns two sets of values for language, isReliable and confidence. I actually need these values. Here is an example of what gets returned by the API in this situation:
I assume I have to edit the jsonpath queries for the Enrich Data by Webservice operator. Any suggestions, please?
Thanks,
Tibor
ok I think that would be fine but...can you please give me a text string that will give that result?
Scott
[EDIT: OK, I got a snippet from the DetectLanguage site. I have never found a reliable way to parse anything beyond simple JSON with that operator, so, strangely enough, I find it more straightforward to convert to XML and go from there. It looks completely bizarre, but until RapidMiner makes a good Read JSON operator, this is what I have found works best for me.]
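For anyone who wants to check outside RapidMiner what the per-detection values look like, here is a minimal Python sketch. It assumes the response wraps a list of detections under `data.detections`, each carrying the language, isReliable and confidence fields mentioned in this thread. That nesting and the sample values are assumptions made for illustration, not the actual API output posted here, and `extract_detections` is a hypothetical helper name.

```python
import json

# Assumed response shape with made-up values; the real payload returned by
# the service may differ, so adjust the keys to match what you receive.
sample_response = """
{"data": {"detections": [
  {"language": "en", "isReliable": true, "confidence": 12.5},
  {"language": "de", "isReliable": false, "confidence": 3.2}
]}}
"""


def extract_detections(raw_json):
    """Return one (language, isReliable, confidence) tuple per detection."""
    payload = json.loads(raw_json)
    detections = payload.get("data", {}).get("detections", [])
    return [
        (d.get("language"), d.get("isReliable"), d.get("confidence"))
        for d in detections
    ]


rows = extract_detections(sample_response)
for language, is_reliable, confidence in rows:
    print(language, is_reliable, confidence)
```

The same idea should carry over to a path-based query: address the list of detections first, then pull each of the three fields per element rather than only the first match.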