Understanding TFIDF calculation
Hi,
To provide a brief background to my exercise:
My objective is to create an SVM classifier model that classifies each piece of feedback (attribute) into one of the categories (label) in my dependent variable. For this, I am trying to generate word vectors from the feedback verbatims, which I pass as attributes.
Here is my question:
When I manually calculated the TFIDF values and compared them with those shown in the Data view of RapidMiner, they were very different. If anything, the sum of the squared TFIDF values in every row of the Data view seemed to add up to 1.
So far I had assumed that the following formula was used for the calculation:
TFIDF(term) = (number of occurrences of the term in that particular category / total number of occurrences of all terms across all categories) * log(number of all categories / number of categories in which that particular term appears)
Please help me understand the reason for this difference.
Many thanks in advance,
Ram
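For reference, the formula as written in the post above can be sketched in Python. This is only an illustration of the manual calculation, not RapidMiner's implementation: the function name and the toy counts are invented, and the natural logarithm is an assumption because the post does not state a log base.

```python
import math

def tfidf_as_posted(term_count_in_category, total_term_count,
                    n_categories, n_categories_with_term):
    """TFIDF as described in the post (hypothetical helper, natural log assumed)."""
    # Term frequency: occurrences of the term in this category,
    # divided by occurrences of all terms across all categories.
    tf = term_count_in_category / total_term_count
    # Inverse document frequency over categories.
    idf = math.log(n_categories / n_categories_with_term)
    return tf * idf

# Toy numbers, invented for illustration only.
value = tfidf_as_posted(term_count_in_category=3, total_term_count=100,
                        n_categories=4, n_categories_with_term=2)
print(value)
```

Note that a term appearing in every category gets a value of 0 under this formula, because log(1) = 0.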
Answers
Did you use a word list for the text input? Besides the words themselves, the word list stores the number of occurrences. These counts are then used for the TFIDF calculation so that it stays consistent with the training set at apply time.
Greetings,
Sebastian
I did use the word list, and it is being saved the way you describe. However, the actual TFIDF values produced by RapidMiner are quite different from the ones I calculated using the formula in my post. Is this because of some normalization that I had not accounted for?
Thanks again,
Ram
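The observation that the squared values in each row add up to 1 is what one would see if each row vector were scaled to unit Euclidean (L2) length after the raw TFIDF values are computed. This thread does not confirm that RapidMiner does exactly this, so the following is only a sketch under that assumption; the function name and the sample numbers are made up.

```python
import math

def l2_normalize(row):
    """Scale a row of values to unit Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0.0:
        # An all-zero row has no direction; return it unchanged.
        return list(row)
    return [v / norm for v in row]

# Raw TFIDF values for one row, invented for illustration only.
raw_row = [0.02, 0.05, 0.0, 0.01]
normalized = l2_normalize(raw_row)

# For any non-zero row this sum equals 1 up to floating-point error.
sum_of_squares = sum(v * v for v in normalized)
print(normalized, sum_of_squares)
```

After such scaling the individual values no longer match the raw formula; only the ratios between values within a row are preserved.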