Calculate number of unique words in text and number of repeating paragraphs
How can I calculate the number of unique words (tokens without stopwords) in each text document? In addition, how can I find the number of repeating sentences or paragraphs? Are there any operators for this in the Text Mining extension?
Best Answer
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi,
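you can simply use a Process Documents operator with vector creation set to Binary Term Occurrences and use Generate Aggregation afterwards to get the sum of each row. Since every word that occurs in a document contributes exactly 1, the row sum is the number of unique words in that document.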
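For readers who want to check the idea outside RapidMiner, here is a minimal Python sketch of the same logic: build the set of distinct non-stopword tokens per document, which is equivalent to summing a row of binary term occurrences. The stopword list, the `tokenize` helper, and the sample documents are assumptions made for illustration and are not part of the RapidMiner process.

```python
import re

# Tiny illustrative stopword list (an assumption; use a full list in practice).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def unique_word_count(text, stopwords=STOPWORDS):
    """Count distinct non-stopword tokens (the row sum of a binary term vector)."""
    return len({tok for tok in tokenize(text) if tok not in stopwords})

# Hypothetical sample documents.
docs = [
    "The cat sat on the mat and the cat slept.",
    "A dog is a dog.",
]
counts = [unique_word_count(d) for d in docs]
print(counts)
```

Each entry of `counts` corresponds to one document, in the same order as `docs`.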
~Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
Thank you, I think that will work. And what about repeating sentences? I tried a similarity measure first, but my documents are too long, so it does not work.
Hi,
Simply tokenize on linguistic sentences and do the same trick as for words.
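As a rough illustration of the same trick on sentences, the Python sketch below splits a text into sentences and counts exact repeats. The regular-expression splitter is a naive stand-in for RapidMiner's linguistic sentence tokenization, and the function names and sample text are assumptions for the example.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def repeated_sentences(text):
    """Return {normalized sentence: count} for sentences that occur more than once."""
    # Normalize case and whitespace so trivial differences do not hide repeats.
    normalized = [" ".join(s.lower().split()) for s in split_sentences(text)]
    counts = Counter(normalized)
    return {s: n for s, n in counts.items() if n > 1}

# Hypothetical sample text.
text = "Terms apply. See the manual. Terms apply. Contact support. Terms apply."
repeats = repeated_sentences(text)
print(repeats)
```

Because this only counts sentences, it stays linear in document length, which avoids the pairwise comparisons that make similarity measures slow on long documents.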
~Martin
Hi Martin,
Thank you for the answer. I have a follow-up question: if the sentences are not completely the same but very similar (e.g. 2-3 words are changed), how could I find the repeating text parts then?
Hi ln777,
you are always allowed to ask questions - that's what we are here for. The only question is whether we can answer them.
I would create a similarity/synonym dictionary. I would go for WordList to Data, take the sentences as input for a second Process Documents operator, tokenize on words, and calculate cross distances on the result. There I would require a high cosine similarity to define a "synonym". This dictionary can then be used to replace text in the original document.
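To make the cosine-similarity step concrete, here is a simplified Python sketch. It is not the RapidMiner process described above: it only compares word-count vectors of sentences pairwise and reports pairs above a threshold, without building or applying a replacement dictionary. The threshold of 0.8, the helper names, and the sample sentences are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

def bag_of_words(sentence):
    """Word-count vector of a sentence, stored as a Counter."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def near_duplicates(sentences, threshold=0.8):
    """Return (i, j, score) for every sentence pair at or above the threshold."""
    vectors = [bag_of_words(s) for s in sentences]
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs

# Hypothetical sentences: the first two differ in only two words.
sentences = [
    "The quarterly report was sent to all regional managers on Monday.",
    "The quarterly report was sent to all regional directors on Friday.",
    "Please remember to water the plants.",
]
pairs = near_duplicates(sentences)
print(pairs)
```

Comparing all pairs grows quadratically with the number of sentences, so for very long documents it may help to drop exact repeats first and compare only the remaining distinct sentences.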
~Martin