
"Doubts on the text plug-in"

Lorenzo Member Posts: 7 Contributor II
edited May 2019 in Help
Good afternoon everyone.
I'm very new to text mining and to RapidMiner.
I used the RapidMiner text plug-in and I have a few questions for anyone kind enough to answer.
1) When I process a group of texts I get a big matrix where the items (documents) are the rows and the features (the stems) are the columns. Which metric fills each cell (I want to be sure about the meaning of the number inside each cell)? Can I change it? How?
2) Which of these metrics (if there is more than one) is the most suitable for using the matrix in a clustering analysis? (I simply want an opinion.)
3) The stemmer and the tokenizer divide my text into words (if the text is "always happy or sad" I'll get the stems corresponding to always, happy...).
Is it possible in RM to work not on single words but on groups of words? In medical and scientific texts I very often have terms such as "acetic anhydride" that should be treated as a single token.
I apologize for always being too verbose :)
Thanks for your kind attention. I hope someone can help.
Lorenzo

Answers

  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hello Lorenzo,

    ad 1)

    The values are usually the TFIDF (term frequency - inverse document frequency) values for all terms (just google for this). Which tokens are taken into account and how they are changed depends on the inner operators of the TextInput operator. You can also select plain term frequency (without weighting by the inverse document frequency) or just binary occurrences, i.e. a flag indicating whether or not the word is part of the corresponding text. The corresponding parameter is called "vector_creation".
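
To make the three cell metrics concrete, here is a minimal Python sketch of how binary occurrence, term frequency and TFIDF values could be computed for already tokenized documents. It is not RapidMiner code: the function name `build_vectors`, the mode names, and the textbook formula tf * log(N / df) with length-normalized tf are assumptions made for illustration, and the plug-in may normalize differently.

```python
import math
from collections import Counter


def build_vectors(docs, mode="tfidf"):
    """Return (vocabulary, matrix) for a list of tokenized documents.

    Rows are documents, columns are terms. The mode names are
    illustrative and are not RapidMiner's own parameter values.
    """
    if mode not in ("binary", "tf", "tfidf"):
        raise ValueError("unknown mode: " + mode)

    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for doc in docs for tok in set(doc))

    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for term in vocab:
            if mode == "binary":
                # Flag: is the term present in this document at all?
                value = 1.0 if counts[term] > 0 else 0.0
            else:
                # Term frequency relative to document length (assumed variant).
                tf = counts[term] / len(doc) if doc else 0.0
                if mode == "tf":
                    value = tf
                else:
                    # Terms that occur in every document get weight 0.
                    value = tf * math.log(n_docs / df[term])
            row.append(value)
        matrix.append(row)
    return vocab, matrix


docs = [["happy", "sad"], ["happy", "happy"]]
vocab, tfidf = build_vectors(docs, mode="tfidf")
for row in tfidf:
    print(dict(zip(vocab, row)))
```

In this toy input "happy" appears in both documents, so its TFIDF weight is zero everywhere, while "sad" keeps a positive weight in the only document that contains it.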

    ad 2)

    Actually, I always use TFIDF, since it is the only one of these measures that "weighs" the terms according to how typical they are for the documents.

    ad 3)

    Just add the operator "TermNGramGenerator" as an additional inner operator to create this type of pair / tuple.
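
As a rough illustration of what term n-gram generation does, the following Python sketch appends every run of adjacent tokens up to a maximum length to the token list. The function name `term_ngrams`, the `max_length` parameter and the underscore joiner are assumptions made for this example and are not taken from the operator's documentation.

```python
def term_ngrams(tokens, max_length=2, joiner="_"):
    """Return the original tokens plus all n-grams of adjacent tokens.

    n runs from 2 up to max_length; each n-gram is the tokens joined
    with `joiner` (the underscore is an assumed convention).
    """
    result = list(tokens)
    for n in range(2, max_length + 1):
        for i in range(len(tokens) - n + 1):
            result.append(joiner.join(tokens[i:i + n]))
    return result


print(term_ngrams(["acetic", "anhydride", "reacts"], max_length=2))
# ['acetic', 'anhydride', 'reacts', 'acetic_anhydride', 'anhydride_reacts']
```

With this, "acetic_anhydride" becomes a feature of its own next to the single words, so it gets its own column in the document matrix.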

    Cheers,
    Ingo