What is a good threshold for CosineSimilarity Measure?
Hi RM community,
I'm using the Cosine Similarity measure in the Cross Distance operator to determine the relevance of documents in a corpus of 5000 documents to a reference document. I'm getting results ranging from 0.8 to 1.6, without any significant breakpoint between relevant and not-so-relevant documents. How can I determine a threshold that is mathematically sound so that I know that documents below the threshold can be categorized as relevant and the ones above as not relevant? In short, how does one determine a threshold for cosine similarity measures with the cross distances operator?
Thanks so much, any insights will be greatly appreciated as I'm very new to this!
Marcia
Answers
now trying to decide what threshold makes sense - leaning towards 0.25 or 0.5. I’d like to justify my threshold choice with a mathematically sound answer but so far I have not come across one. Any insights to help?
When I use the Cross Distance operator with the Cosine Similarity measure on text documents, the cosine similarities I get usually range from 0 to 1.
Just remember to enable the "compute similarities" option for the cosine measure.
If I compute distances instead of similarities, the results can fall outside the [0,1] range. The higher the similarity, the lower the distance.
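To make the relationship between similarity and distance concrete, here is a minimal Python sketch (not RapidMiner code). It assumes the distance reported for this measure is the angle arccos(similarity) in radians. That is an assumption, not something confirmed in this thread, though it would be consistent with the 0.8 to 1.6 range in the question, since π/2 ≈ 1.57 corresponds to a similarity of 0. The vectors and function names are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero (empty) documents
    return dot / (norm_a * norm_b)

def similarity_to_angle(sim):
    # ASSUMPTION: the "distance" is the angle between the vectors, in radians.
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, sim)))

def angle_to_similarity(angle):
    # Inverse of the assumed conversion above.
    return math.cos(angle)

# Hypothetical term-weight vectors (e.g. TF-IDF weights, all non-negative).
reference = [1.0, 2.0, 0.0, 1.0]
document = [1.0, 1.0, 1.0, 0.0]

sim = cosine_similarity(reference, document)
angle = similarity_to_angle(sim)
print(f"similarity = {sim:.4f}, angle distance = {angle:.4f}")
print(f"distance 0.8 -> similarity {angle_to_similarity(0.8):.4f}")
```

Under that assumption, a distance of 0.8 corresponds to a similarity of about 0.70, and a distance near 1.57 corresponds to a similarity near 0. With non-negative term weights the similarity cannot go below 0, which is why it stays within [0,1].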
When you pick a similarity threshold for text documents, a value above 0.5 usually indicates strong similarity. The distribution in the histogram may look different for another use case, so always check the histogram before you pick the threshold.
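As a rough illustration of the histogram check, the Python sketch below bins a list of similarities and reports what share of documents falls above a candidate threshold. The similarity values are invented for the example, and the helper names `histogram` and `share_above` are hypothetical; in practice you would export the similarities from your own process.

```python
def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Count how many values fall into each of `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)  # keep edge values inside the range
        counts[idx] += 1
    return counts

def share_above(values, threshold):
    """Fraction of values strictly greater than the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)

# Made-up similarities of ten documents to a reference document.
similarities = [0.05, 0.12, 0.18, 0.22, 0.31, 0.35, 0.48, 0.62, 0.71, 0.93]

counts = histogram(similarities)
for i, count in enumerate(counts):
    # Simple text histogram: one '#' per document in the bin.
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} | {'#' * count}")

for threshold in (0.25, 0.5):
    print(f"share above {threshold}: {share_above(similarities, threshold):.0%}")
```

If the bars show a visible gap or dip, that is a natural place to put the threshold. If they do not, the choice between candidates such as 0.25 and 0.5 mostly comes down to how many documents you are prepared to review.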
Cheers,
YY