consider adding soft-cosine distance #21

Open
behrica opened this issue Oct 10, 2022 · 4 comments

Comments

@behrica
Contributor

behrica commented Oct 10, 2022

Soft cosine is useful for comparing TF-IDF text representations, as an alternative to plain cosine.

https://en.wikipedia.org/wiki/Cosine_similarity#Soft_cosine_measure

The similarity function s_i_j should be pluggable (passed as an input to the function).
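For reference, the soft cosine measure from the linked Wikipedia section can be written as follows, where a and b are the two feature vectors and s_ij is the similarity between features i and j. This is a restatement of the standard definition, not a proposed API:

```latex
\operatorname{soft\_cosine}(a, b) =
  \frac{\sum_{i,j} s_{ij}\, a_i\, b_j}
       {\sqrt{\sum_{i,j} s_{ij}\, a_i\, a_j}\;\sqrt{\sum_{i,j} s_{ij}\, b_i\, b_j}}
```

With s_ij = 1 for i = j and 0 otherwise, this reduces to the ordinary cosine similarity.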

@genmeblog
Contributor

Thanks for the idea! I think I need your support here. I understand the definition. However, I have no idea how to build a convenient API for that. A set of examples would be helpful.

@behrica
Contributor Author

behrica commented Oct 14, 2022

I think it should simply allow plugging in any (distance) function which takes 2 values and returns a float.

(soft-cosine [1 2 3 4] [2 3 5 6] (fn [x y] ...)) ; the fn calculates the distance of x and y

A concrete case comes from NLP.

A language-aware function:

(defn word-dist [token-1 token-2]
...
)
with this spec

(word-dist   "I"  "I") = 1
(word-dist   "like"  "like") = 1
(word-dist   "I"  "like") = 
(word-dist   "fruits"  "banana") = 0.5

(soft-cosine ["I" "like" "fruits"] ["I" "like" "banana"] word-dist) = .... > 0.6 (not sure about the concrete number)
It would compare:
"I" -> "I" = 1
"like" -> "like" = 1
"fruits" -> "banana" = 0.5

In practice we would map all tokens to numbers first (this builds the vocabulary),
so soft-cosine would be called with vectors of ints (if token frequencies are used)
or floats (if TF-IDF is used).
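A minimal sketch of how this could look, assuming the pluggable function receives vocabulary indices rather than vector values. The names `soft-dot`, `vocab` and `index-sim` are hypothetical, the 0 similarity for unrelated words is an assumption, and none of this is an existing fastmath API:

```clojure
(defn soft-dot
  "Sum of sim(i, j) * a_i * b_j over all index pairs."
  [a b sim]
  (reduce +
          (for [i (range (count a))
                j (range (count b))]
            (* (sim i j) (nth a i) (nth b j)))))

(defn soft-cosine
  "Soft cosine similarity of vectors a and b.
  `sim` takes two feature indices and returns their similarity.
  Not guarded against zero vectors (the denominator would be 0)."
  [a b sim]
  (/ (double (soft-dot a b sim))
     (* (Math/sqrt (double (soft-dot a a sim)))
        (Math/sqrt (double (soft-dot b b sim))))))

;; Hypothetical vocabulary: position in the vector = token index.
(def vocab ["I" "like" "fruits" "banana"])

(defn word-dist
  "Toy token similarity following the spec above.
  Assumption: unrelated words get similarity 0."
  [t1 t2]
  (cond
    (= t1 t2) 1.0
    (= #{t1 t2} #{"fruits" "banana"}) 0.5
    :else 0.0))

(defn index-sim
  "Adapts word-dist to work on vocabulary indices."
  [i j]
  (word-dist (nth vocab i) (nth vocab j)))

;; Token counts for "I like fruits" and "I like banana".
(soft-cosine [1 1 1 0] [1 1 0 1] index-sim)
;; approximately 0.83

;; The identity similarity gives the plain cosine, approximately 0.67.
(soft-cosine [1 1 1 0] [1 1 0 1] (fn [i j] (if (= i j) 1.0 0.0)))
```

This version is quadratic in the vocabulary size, so a real implementation would probably iterate only over non-zero entries or take a precomputed similarity matrix.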

@behrica
Contributor Author

behrica commented Oct 14, 2022

I found here:
https://github.com/TeamCohen/secondstring/blob/master/src/com/wcohen/ss/SoftTFIDF.java

an old Java implementation which combines TF-IDF and soft cosine.

I would prefer to have this separated.

We already have the TF-IDF part here:
https://github.com/scicloj/scicloj.ml.smile/blob/main/src/scicloj/ml/smile/nlp.clj#L285

This gives me the two vectors above, for which I want to get the distance.

The "classical" way is to use simple cosine distance, but this cannot deal with "similarity of tokens".
The only way to do that would be to "normalize" the vocabulary beforehand, and somehow say that "fruits" and "banana" are the same thing, and remove one. But this is too strict a normalisation.

Soft cosine should be better, hopefully.

@behrica
Contributor Author

behrica commented Oct 14, 2022

Another example would be to plug in text embeddings (word2vec).
They can also calculate a semantic distance between any 2 words.

There is a Java implementation here, and so I would plug in this concrete function:
https://javadoc.io/static/org.deeplearning4j/deeplearning4j-nlp/1.0.0-M2.1/org/deeplearning4j/models/embeddings/wordvectors/WordVectors.html#similarity(java.lang.String,java.lang.String)

(after first doing the vocabulary mapping token <-> index)
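The token <-> index mapping could be handled by a small adapter like the one below. This is only a sketch: `token-sim->index-sim`, `toy-token-sim` and the toy similarity table are hypothetical stand-ins, and a real embedding model would replace the stub.

```clojure
(defn token-sim->index-sim
  "Wraps a token-level similarity (two strings -> number) so that it can
  be called with vocabulary indices. `vocab` is a vector of tokens.
  Results are memoized because the same pairs are looked up repeatedly."
  [vocab token-sim]
  (memoize
   (fn [i j]
     (token-sim (nth vocab i) (nth vocab j)))))

;; Stand-in for an embedding model: a lookup table of pair similarities.
(def toy-similarities {#{"fruits" "banana"} 0.5})

(defn toy-token-sim
  "Toy similarity: 1 for identical tokens, table lookup otherwise,
  and 0 for unknown pairs (an assumption)."
  [t1 t2]
  (if (= t1 t2)
    1.0
    (get toy-similarities #{t1 t2} 0.0)))

(def embedding-index-sim
  (token-sim->index-sim ["I" "like" "fruits" "banana"] toy-token-sim))

(embedding-index-sim 2 3) ; similarity of "fruits" and "banana"

;; With DL4J, the token-level function might instead be
;;   (fn [t1 t2] (.similarity word-vectors t1 t2))
;; where `word-vectors` is an already loaded WordVectors instance
;; (loading is not shown here).
```

Keeping the adapter separate means the soft cosine function itself only ever sees indices and numbers, and stays independent of any NLP library.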
