
Term Frequencies greater than 1

spok2spok2 Member Posts: 6 Contributor II
edited November 2018 in Help
Dear all,

I use "Text Processing - Process Documents From Files" to calculate word vectors for documents.

As I read here: http://rapid-i.com/rapidforum/index.php?PHPSESSID=0aba344304fbb94614ad24f236d974e4&topic=3728.0
term frequencies are normalized (as I expected).

To me, this means that term frequencies should always have values < 1.
In my case I use TF-IDF for vector creation, as proposed, and get some term frequencies in the range of 1E+10 or 1E+11.
The documents in question appear to be "normal".

Any ideas why this happens? What am I not understanding?
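
To make the expectation in the question concrete, here is a minimal Python sketch of one common TF-IDF scheme: length-normalized term frequency, multiplied by log(N / df), followed by scaling each document vector to unit Euclidean length. This is an assumption for illustration, not necessarily the exact formula RapidMiner uses; the function name `tf_idf_vectors` and the toy documents are hypothetical. Under this scheme every weight lies between 0 and 1, so values around 1E+10 would point to a numerical problem rather than to unusual documents.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, L2-normalized.

    Hypothetical helper: assumes tf = count / document length and
    idf = log(N / df), which may differ from RapidMiner's definition.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Length-normalized term frequency times inverse document frequency.
        raw = {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Guard against an all-zero vector (every term occurs in every document).
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vectors = tf_idf_vectors(docs)
max_weight = max(w for vec in vectors for w in vec.values())
print(max_weight <= 1.0)
```

Because each vector is divided by its own length, no single component can exceed 1 in this sketch, whatever the corpus looks like.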



Answers

  • spok2spok2 Member Posts: 6 Contributor II
    Does no one have an idea?
    Is my reasoning wrong?
    Can term frequencies be greater than 1?
    Are there circumstances where it is better to use a different method for vector creation?
    Under which circumstances is each method for vector creation most appropriate?

    Many thanks in advance for any hint ...
  • MartinLiebigMartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    1E11 sounds like an error, because you need to divide by log(something). If that something is close to zero, problems might occur with the precision of doubles.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadrezamohammadreza Member Posts: 23 Contributor II
    As Martin mentioned, 1E+11 is far too large for a normalized term frequency. I suggest trying "binary term occurrences" for vector creation first to see whether you get the anticipated results. If that result is acceptable and there is nothing strange in your process, there might be a numerical problem, which happens mostly because of extremely low term frequencies. I believe this can be mitigated by using the "prune" property: you just need to define a lower bound and an upper bound for the occurrences to avoid extremely low values.
  • spok2spok2 Member Posts: 6 Contributor II
    Dear all,

    thanks a lot for your hints ...

    I'll try and see.

    BR
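
The two suggestions in the replies, binary term occurrences and pruning, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `binary_occurrence_vectors` and its `min_docs` / `max_docs` parameters are hypothetical, and pruning is assumed here to mean dropping terms whose document count falls outside absolute bounds, which may not match how the RapidMiner operator defines it.

```python
from collections import Counter


def binary_occurrence_vectors(docs, min_docs=1, max_docs=None):
    """Build 0/1 term vectors, keeping only terms inside the pruning bounds.

    Hypothetical helper: min_docs and max_docs are absolute document counts.
    """
    if max_docs is None:
        max_docs = len(docs)
    doc_sets = [set(doc) for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in doc_sets for term in terms)
    # Pruning: drop terms that are too rare or too common.
    vocab = sorted(t for t, n in df.items() if min_docs <= n <= max_docs)
    # Binary term occurrence: 1 if the term appears in the document, else 0.
    vectors = [[1 if t in terms else 0 for t in vocab] for terms in doc_sets]
    return vocab, vectors


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vocab, vectors = binary_occurrence_vectors(docs, min_docs=2, max_docs=2)
print(vocab)
print(vectors)
```

Binary vectors contain only 0 and 1, so if this run looks sensible while the TF-IDF run does not, the weighting step is the likely source of the extreme values.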