Term Occurrences and Frequency - I have to be missing something
I am following along with this post to check my intuition, because I was seeing results that didn't make sense, to me anyway.
https://community.rapidminer.com/discussion/46333/term-frequencies-and-tf-idf-how-are-these-calculated
The only difference that I see in my process to start is that I am reading in my data from Excel and not creating it by hand.
Here are the term occurrences after making it lower case, extracting stop words, tokenizing, and counting the tokens.
Just like the post, I am using very simple sentences to keep the vocabulary small.
Now, here is the same exact data. The only difference is that I am now using term frequency within the Process Documents operator.
Of course there is a very good chance that I am missing a setting along the way, but why is the first example .577 for each of the three words, when the basic sentence, unprocessed, was "I like turtles"?
Thanks in advance.
Best Answer
lionelderkrikor (RapidMiner Certified Analyst, Member)
Hi @btibert,
To reproduce the results displayed by RapidMiner (according to the thread you shared):
From the results of Term Occurrences:
You calculate the "classic Term Frequency", that is, each term's count divided by the total number of tokens in the document (see the process in the attached file):
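As a rough illustration of that step, here is a minimal Python sketch. The occurrence vectors are assumptions reconstructed from the term frequency values quoted in this answer (three tokens in the first document, four in the third, each occurring once over a 7-word vocabulary), and `term_frequency` is a hypothetical helper name, not a RapidMiner function.

```python
def term_frequency(counts):
    """Divide each term count by the total number of tokens in the document."""
    total = sum(counts)
    if total == 0:
        # An empty document has no tokens; return zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    return [c / total for c in counts]

# Assumed occurrence vectors over the 7-word vocabulary (one list per document).
doc1_counts = [0, 0, 0, 1, 1, 1, 0]  # three tokens, e.g. "I like turtles"
doc3_counts = [1, 1, 1, 0, 0, 1, 0]  # four tokens

doc1_tf = term_frequency(doc1_counts)
doc3_tf = term_frequency(doc3_counts)

print([round(v, 3) for v in doc1_tf])  # [0.0, 0.0, 0.0, 0.333, 0.333, 0.333, 0.0]
print([round(v, 3) for v in doc3_tf])  # [0.25, 0.25, 0.25, 0.0, 0.0, 0.25, 0.0]
```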
Then the term frequency word vectors that are shown in RapidMiner are normalized vectors. This is exactly the same as unit vector normalization that you may have seen in physics classes. In broad brush strokes, the norm of a (Euclidean) vector is its length or size. If you have a 1x2 vector, you can find the norm by simple Pythagorean Theorem. For a 1x7 vector like each document above, you use Pythagorean Theorem but in 7-dimensional space.
Hence the norm of the first document term frequency vector is:
SQRT[ (0)^2 + (0)^2 + (0)^2 + (0.333)^2 + (0.333)^2 + (0.333)^2 + (0)^2 ] = 0.577
The norm of the second document term frequency vector is the same as for the first document (0.577),
and the norm of the third document term frequency vector is:
SQRT[ (0.25)^2 + (0.25)^2 + (0.25)^2 + (0)^2 + (0)^2 + (0.25)^2 + (0)^2 ] = 0.5
In order to look at all the documents equally, we want all the document vectors to have the same length. So we divide each document term frequency vector by its respective norm to get a "document term frequency unit vector" – also called a normalized term frequency vector :
0 / 0.577 = 0
0.333 / 0.577 = 0.577
0.25 / 0.5 = 0.5
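The same arithmetic can be checked with a short Python sketch. This is only an illustration of unit vector (L2) normalization under the values quoted above; `l2_normalize` is a hypothetical helper name, and the positions of the non-zero entries are assumed from the norm calculations in this answer.

```python
import math

def l2_normalize(vector):
    """Divide every component by the vector's Euclidean norm (its length)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        # A zero vector has no direction; return it unchanged.
        return list(vector)
    return [v / norm for v in vector]

# Term frequency vectors for the first and third documents (assumed layout).
doc1_tf = [0, 0, 0, 1 / 3, 1 / 3, 1 / 3, 0]
doc3_tf = [0.25, 0.25, 0.25, 0, 0, 0.25, 0]

doc1_unit = l2_normalize(doc1_tf)
doc3_unit = l2_normalize(doc3_tf)

print([round(v, 3) for v in doc1_unit])  # [0.0, 0.0, 0.0, 0.577, 0.577, 0.577, 0.0]
print([round(v, 3) for v in doc3_unit])  # [0.5, 0.5, 0.5, 0.0, 0.0, 0.5, 0.0]
```

After normalization every document vector has length 1, which is why a three-word sentence shows 0.577 (that is, 1 / SQRT(3)) per word rather than 0.333.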
So we obtain :
Hope this helps,
Regards,
Lionel
Answers
Are you sure you are not using TF/IDF?
BR,
Martin
Dortmund, Germany
Regards,
Lionel