What is a good threshold for CosineSimilarity Measure?

mrc · January 2020

Hi RM community,

I'm using the Cosine Similarity measure in the Cross Distances operator to determine the relevance of each document in a corpus of 5000 documents to a reference document. I'm getting results ranging from 0.8 to 1.6, without any significant breakpoint between relevant and not-so-relevant documents. How can I determine a mathematically sound threshold, so that documents below the threshold can be categorized as relevant and the ones above as not relevant? In short, how does one determine a threshold for cosine similarity measures with the Cross Distances operator?

Thanks so much, any insights will be greatly appreciated as I'm very new to this!

Marcia

sgenzer · January 2020

cc @yyhuang ???

mrc · January 2020

I do not have an answer yet, but since posting this I’ve used the Normalize operator to normalize the results between 0 and 1. I am now trying to decide what threshold makes sense, leaning towards 0.25 or 0.5. I’d like to justify my threshold choice with a mathematically sound argument, but so far I have not come across one. Any insights to help?

Thanks much!

yyhuang · January 2020

Hi @mrc, thanks for sharing your findings.
When I use the Cross Distances operator with cosine similarity on text documents, the cosine similarities usually range from 0 to 1.
Just remember to enable the "compute similarities" option for the cosine measure.

Image: https://us.v-cdn.net/6030995/uploads/editor/sp/j20csbax75nq.png

If I calculate distances instead of similarities, the results may fall outside the [0, 1] range. The higher the similarity, the lower the distance.
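To make the similarity/distance relationship concrete, here is a minimal Python sketch outside RapidMiner. It assumes non-negative term-weight vectors (as with term counts or TF-IDF), and it assumes the distance is the angle arccos(similarity) in radians. Whether Cross Distances uses exactly that conversion is an assumption on my part, although the 0.8 to 1.6 range reported above would be consistent with it, since π/2 ≈ 1.57. The vectors and function names are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))

def angular_distance(a, b):
    """Assumed distance: the angle in radians, arccos of the similarity."""
    return math.acos(cosine_similarity(a, b))

# Hypothetical term-weight vectors (all entries non-negative).
reference = [2.0, 1.0, 0.0, 1.0]
close_doc = [1.0, 1.0, 0.0, 1.0]
unrelated_doc = [0.0, 0.0, 3.0, 0.0]  # shares no terms with the reference

for name, doc in [("close", close_doc), ("unrelated", unrelated_doc)]:
    sim = cosine_similarity(reference, doc)
    dist = angular_distance(reference, doc)
    print(f"{name}: similarity={sim:.3f}, distance={dist:.3f}")
```

With non-negative vectors the similarity stays in [0, 1], while the angle runs from 0 (same direction) up to π/2 (no shared terms), so a distance above 1 is expected under this assumption.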
When you pick a similarity threshold for text documents, a value higher than 0.5 usually indicates strong similarity. The distribution in the histogram may look different for another use case, so always double-check the histogram before you pick the threshold.

Image: https://us.v-cdn.net/6030995/uploads/editor/ju/0oysb4bzdbk9.png
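The histogram check can also be sketched in a few lines of Python, in case it helps to see the idea outside RapidMiner. The scores below are invented for illustration (they are not results from this thread), and `histogram` and `select_relevant` are hypothetical helper names. The 0.5 cut-off is just the rule of thumb above, not a derived value.

```python
def histogram(scores, bins=10):
    """Count scores in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score {s} is outside [0, 1]")
        # A score of exactly 1.0 goes into the last bin.
        index = min(int(s * bins), bins - 1)
        counts[index] += 1
    return counts

def select_relevant(scores, threshold=0.5):
    """Indices of documents whose similarity is higher than the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

# Invented similarity scores between a reference and ten documents.
scores = [0.05, 0.12, 0.18, 0.22, 0.31, 0.58, 0.64, 0.71, 0.83, 0.9]

bins = 5
counts = histogram(scores, bins=bins)
for i, count in enumerate(counts):
    low = i / bins
    high = (i + 1) / bins
    print(f"{low:.1f}-{high:.1f} {'#' * count}")

print("relevant documents:", select_relevant(scores, threshold=0.5))
```

If the histogram shows a sparse region between two groups of scores, placing the threshold in that region is easier to justify than a fixed value. If there is no such region, any cut-off is a judgment call and is worth validating against a few manually checked documents.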

Cheers,
YY
