
What is a good threshold for CosineSimilarity Measure?

mrc · Member · Posts: 2 · Learner I
Hi RM community,

I'm using the Cosine Similarity measure in the Cross Distances operator to determine how relevant each document in a corpus of 5000 documents is to a reference document. I'm getting results ranging from 0.8 to 1.6, without any significant breakpoint between relevant and not-so-relevant documents. How can I determine a mathematically sound threshold, so that documents below the threshold can be categorized as relevant and the ones above as not relevant? In short, how does one determine a threshold for cosine similarity measures with the Cross Distances operator?

Thanks so much, any insights will be greatly appreciated as I'm very new to this!

Marcia

Answers

  • sgenzer · Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator · Posts: 2,959 · Community Manager
    cc @yyhuang ???
  • mrc · Member · Posts: 2 · Learner I
    I do not have an answer yet, but since posting this I've used the Normalize operator to normalize the results between 0 and 1. I am now trying to decide what threshold makes sense, leaning towards 0.25 or 0.5. I'd like to justify my threshold choice with a mathematically sound answer, but so far I have not come across one. Any insights to help?

    Thanks much!
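
For anyone following along, here is a minimal Python sketch of what the normalization step above does to a threshold. It assumes the Normalize operator was used as a plain min-max (range) rescaling, which the post does not state, and the five scores are made-up values chosen only to span the 0.8 to 1.6 range from the question. The helper names `min_max_normalize` and `raw_threshold` are hypothetical, not RapidMiner functions. The point it illustrates: a cut-off of 0.25 or 0.5 on the normalized scale is the same as a fixed cut-off on the raw scale, so normalizing alone does not supply a justification for the choice.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All scores identical: no spread to rescale, so map everything to 0.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def raw_threshold(normalized_threshold, lo, hi):
    """Map a cut-off on the 0..1 scale back to the original score scale."""
    return lo + normalized_threshold * (hi - lo)


# Made-up scores for illustration only (assumption: lower means more relevant,
# as described in the question).
scores = [0.8, 0.95, 1.1, 1.3, 1.6]
normalized = min_max_normalize(scores)

# Documents kept as "relevant" with a normalized cut-off of 0.25.
relevant = [s for s, n in zip(scores, normalized) if n <= 0.25]

# The same cut-offs expressed on the raw 0.8..1.6 scale.
raw_cut_025 = raw_threshold(0.25, min(scores), max(scores))
raw_cut_050 = raw_threshold(0.5, min(scores), max(scores))

print(relevant)
print(round(raw_cut_025, 3), round(raw_cut_050, 3))
```

Because the mapping depends only on the minimum and maximum score in the corpus, a single outlier document would shift what 0.25 or 0.5 means on the raw scale.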