Term Occurrences and Frequency - I have to be missing something

btibert · November 2019

I am following along with this post to check my intuition, because I was seeing results that didn't make sense, to me anyway.
https://community.rapidminer.com/discussion/46333/term-frequencies-and-tf-idf-how-are-these-calculated

The only difference that I see in my process to start is that I am reading in my data from Excel and not creating it by hand.

Here are the term occurrences after converting to lower case, filtering out stop words, tokenizing, and counting the tokens.

Image: https://us.v-cdn.net/6030995/uploads/editor/zq/i903zcygol79.png

Just like the post, I am using very simple sentences to keep the vocabulary small.

Now, here is the exact same data; the only difference is that I am now using term frequency within the Process Documents operator.

Image: https://us.v-cdn.net/6030995/uploads/editor/xz/3gjk4d9li993.png

Of course there is a very good chance that I am missing a setting along the way, but why is the first example .577 for each of the three words, when the basic sentence, unprocessed, was "I like turtles"?

Thanks in advance.

lionelderkrikor · November 2019

Hi @btibert,

To reproduce the results displayed by RapidMiner (following the thread you shared):

From the results of Term Occurrences:

Image: https://us.v-cdn.net/6030995/uploads/editor/qt/0bmya90fbv9y.png

You calculate the "classic term frequency" (see the process in the attached file):

Image: https://us.v-cdn.net/6030995/uploads/editor/qz/los8tkzdylzh.png
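For readers who want to check this step outside RapidMiner, here is a minimal Python sketch of the "classic" term frequency: each term's count divided by the total number of tokens in the document. The count vectors are assumptions inferred from the 0.333 and 0.25 values used in this thread (each term appearing once), and the function name `term_frequencies` is purely illustrative, not a RapidMiner API.

```python
def term_frequencies(counts):
    """Divide each term count by the total number of tokens in the document."""
    total = sum(counts)
    if total == 0:
        # An empty document has no meaningful frequencies; return zeros.
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Assumed occurrence counts over a 7-term vocabulary (not copied from the screenshots).
doc1_counts = [0, 0, 0, 1, 1, 1, 0]  # three terms, once each
doc3_counts = [1, 1, 1, 0, 0, 1, 0]  # four terms, once each

doc1_tf = term_frequencies(doc1_counts)
doc3_tf = term_frequencies(doc3_counts)

print([round(v, 3) for v in doc1_tf])
print([round(v, 3) for v in doc3_tf])
```

With three tokens in a document, every term that occurs once gets 1/3; with four tokens, 1/4.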

Then the term frequency word vectors that are shown in RapidMiner are normalized vectors. This is exactly the same as the unit vector normalization that you may have seen in physics classes. In broad brush strokes, the norm of a (Euclidean) vector is its length or size. If you have a 1x2 vector, you can find the norm with the Pythagorean theorem. For a 1x7 vector like each document above, you use the same theorem, but in 7-dimensional space.
Hence the norm of the first document term frequency vector is:

SQRT [ (0)^2 + (0)^2 + (0)^2 + (0.333)^2 + (0.333)^2 + (0.333)^2 + (0)^2 ] = 0.577

and the norm of the second document term frequency vector is:

the same as for the first document

and the norm of the third document term frequency vector is:

SQRT [ (0.25)^2 + (0.25)^2 + (0.25)^2 + (0)^2 + (0)^2 + (0.25)^2 + (0)^2 ] = 0.5

In order to look at all the documents equally, we want all the document vectors to have the same length. So we divide each document term frequency vector by its respective norm to get a "document term frequency unit vector", also called a normalized term frequency vector:

0 / 0.577 = 0
0.333 / 0.577 = 0.577
0.25 / 0.5 = 0.5

So we obtain:

Image: https://us.v-cdn.net/6030995/uploads/editor/ga/l4hsdbdlabln.png
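The normalization above can also be verified with a few lines of Python. This is only a sketch under stated assumptions: the two term frequency vectors are rebuilt from the 0.333 and 0.25 values quoted in this post, and `l2_normalize` is an illustrative helper name, not something RapidMiner exposes.

```python
import math


def l2_normalize(vector):
    """Divide a vector by its Euclidean norm so that its length becomes 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        # A zero vector cannot be scaled to unit length; return it unchanged.
        return list(vector)
    return [v / norm for v in vector]


# Term frequency vectors taken from the values in this thread (7-term vocabulary).
doc1_tf = [0, 0, 0, 1 / 3, 1 / 3, 1 / 3, 0]
doc3_tf = [0.25, 0.25, 0.25, 0, 0, 0.25, 0]

doc1_unit = l2_normalize(doc1_tf)
doc3_unit = l2_normalize(doc3_tf)

print([round(v, 3) for v in doc1_unit])  # non-zero entries are 0.577
print([round(v, 3) for v in doc3_unit])  # non-zero entries are 0.5
```

The 0.577 in the first document is therefore 1/sqrt(3): three equally frequent terms in a vector scaled to length 1.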

Hope this helps,

Regards,

Lionel

MartinLiebig · November 2019

Hi,
are you sure you don't use TF/IDF?

BR,
Martin

btibert · November 2019

See below, and my dataset/process attached. Entirely possible I am missing something obvious, just not sure what it could be.

Image: https://us.v-cdn.net/6030995/uploads/editor/7v/scfuf4uhogx3.png

btibert · November 2019

Absolutely fantastic, thanks! I completely missed (as the title suggested) the normalized part; I just saw the output I expected and stopped reading like a dummy. Many thanks for the example process as well. I haven't had a chance to wrap my head around looping the way you did it, but it appears straightforward enough.

lionelderkrikor · November 2019

You're welcome, @btibert

Regards,

Lionel
