"Undoing the cosine normalization in 'Process Documents' operator"
Hello,
I have noticed that 'Process Documents' does not output raw term frequencies when the corresponding mode is selected. As stated in http://rapid-i.com/rapidforum/index.php/topic,3728.msg13943.html#msg13943, cosine normalization is applied to the raw frequencies before the result is output.
Is there a way to get raw term frequencies for each document, without normalization?
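For readers unfamiliar with the term, cosine normalization can be sketched in plain Python as follows. This is only an illustration of the general technique (dividing a vector by its Euclidean length), not RapidMiner's actual code; the function name and the sample counts are made up.

```python
import math

def cosine_normalize(counts):
    """Scale a term-count vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        # An empty document has no direction; return all zeros.
        return [0.0 for _ in counts]
    return [c / norm for c in counts]

# Hypothetical occurrences of three terms in one document.
raw_counts = [3, 4, 0]
normalized = cosine_normalize(raw_counts)
print(normalized)  # [0.6, 0.8, 0.0]
```

After this step the values no longer sum to the number of tokens, which is why the raw counts cannot be read directly from the output.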
Answers
Yes, there is - I had to do the same thing. Basically, get the term occurrences, then normalise each row using the proportion transformation option.
regards
Andrew
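As a rough illustration of the idea outside RapidMiner, the proportion step amounts to dividing each term's occurrence count by the total for that document. The sketch below is plain Python under that assumption; the function name and the sample document are hypothetical, and it does not reproduce any RapidMiner operator.

```python
def row_proportions(occurrences):
    """Divide each term count by the document's total count."""
    total = sum(occurrences.values())
    if total == 0:
        # Avoid division by zero for a document with no terms.
        return {term: 0.0 for term in occurrences}
    return {term: n / total for term, n in occurrences.items()}

# Hypothetical term occurrences for a single document (one row).
doc = {"data": 2, "mining": 1, "text": 1}
frequencies = row_proportions(doc)
print(frequencies)  # {'data': 0.5, 'mining': 0.25, 'text': 0.25}
```

The resulting values sum to 1 per document, unlike cosine-normalized values, whose squares sum to 1.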
Long story short - you have saved the day once again! To help other RapidMiner newbies, here is a more detailed description of what happened:
1. I had tried using 'Term Occurrences' before, but concluded that it was not the number of times a specific term occurs in the document. The reason: when I manually counted the tokens in a document and compared that with the sum of the components of its term occurrence vector, the two numbers did not match;
2. At the same time, when I used Andrew's sample process, the sum of the vector components was correct.
The problem turned out to be trivial. My RapidMiner process applied term pruning: any term that occurred fewer than two times in the corpus was removed from the vectors. However, the token list output by RapidMiner still included the pruned terms, which is why the token count was higher than the vector sum. The example provided by Andrew did not apply pruning, so its results were consistent.
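The mismatch is easy to reproduce in a few lines of Python. This is a sketch under the assumption described above (terms with fewer than two occurrences in the whole corpus are dropped from the vectors but not from the token lists); the two tiny documents and the function name are invented for illustration.

```python
from collections import Counter

def compare_counts(docs, min_count=2):
    """Return (token count, pruned vector sum) for each document."""
    # Count every term across the whole corpus.
    corpus_counts = Counter(tok for tokens in docs for tok in tokens)
    # Keep only terms that reach the pruning threshold.
    kept = {t for t, n in corpus_counts.items() if n >= min_count}
    result = []
    for tokens in docs:
        vector = Counter(t for t in tokens if t in kept)
        result.append((len(tokens), sum(vector.values())))
    return result

# Hypothetical tokenized documents; "rare" and "tools" occur only once.
docs = [
    ["text", "mining", "text", "rare"],
    ["mining", "tools", "text"],
]
print(compare_counts(docs))  # [(4, 3), (3, 2)]
```

With pruning disabled (`min_count=1`), both numbers agree for every document, which matches what I saw with Andrew's process.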
Hope this helps someone. As a side note, I would recommend that the RapidMiner team make the 'Process Documents' operator generate term vectors consistent with the token list, to avoid confusing dumb users like me :P
Thanks, Andrew!