Wrong TFIDF Values

smarto Member Posts: 1 Learner III
Hey Rapid community. I discovered something with the TFIDF that I don't understand. Whether I use "Generate TFIDF" or "Process Documents" with this option, it seems like the most frequent words are delivered without any value at all.

I analyzed 10 documents, with a couple of different sets and a couple of different setups, but I run into the same problem over and over.

[three screenshots, images not preserved]

These are screenshots from RM and MySQL. What am I doing wrong?

Answers

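  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    "it seems like the most frequent words are delivered without any value at all."
    Yes, that follows from the definition of TF-IDF: the IDF factor penalizes words which appear in many documents, and the penalty grows with the number of documents that contain the word. Imagine a word which appears in all documents: it does not help to tell the documents apart, so it contains no information at all. Under the standard definition its IDF is log(N/N) = 0, and therefore its TF-IDF value is 0 no matter how often it occurs.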

    For the exact definition of TF-IDF you could start with the Wikipedia article: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

    Best, Marius
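
To make the effect described in the answer concrete, here is a minimal sketch in plain Python. It assumes the textbook variant tf * log(N / df); RapidMiner's operators may use a different normalization, so the exact numbers are not expected to match its output. The function name `tf_idf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample corpus: "the" occurs in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
weights = tf_idf(docs)
for doc_weights in weights:
    print({term: round(value, 3) for term, value in doc_weights.items()})
```

In this sketch "the" is the most frequent word, yet it gets weight 0 in every document because log(3/3) = 0, while a word that occurs in only one document, such as "bird", gets the largest weight. That matches the behaviour reported in the question.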