Optimization Ideas #58
Replies: 2 comments
-
I tried simhash-cpp earlier for BigScience; it was fast, but also less approachable from the Python ecosystem. So I would say Cython code is one step towards optimization on the optimization-vs-user-friendliness curve. Alternatively, we can wait for the no-GIL version of Python, or use Rust, which has become quite popular recently in AI communities.
-
The way this repo uses ... My preference between Cython and Rust is definitely Rust. Integrating Rust into a Poetry project without sweeping changes is a bit of a challenge, though. One example where this was done: https://github.com/sdispater/pendulum/blob/master/build.py If you have experience, knowledge, or opinions on how we should approach this, feel free to share. Since there is interest, I'll try to prepare some PRs. I am also hoping to implement some changes, such as double-hashing, to minimize object movement and calls between Rust and Python in preparation for such integration.
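One possible reading of "double-hashing" here is the trick of deriving all the MinHash hash values for a token from just two base hashes, so that only two integers per token would need to cross a Rust/Python boundary. The sketch below is pure Python and illustrative only: the function names, the use of SHA-1, and the modulus are assumptions, not the repo's code.

```python
import hashlib
import struct

# Assumption: a Mersenne prime modulus, chosen here only for illustration.
MERSENNE_PRIME = (1 << 61) - 1


def base_hashes(token: str) -> tuple:
    """Derive two 64-bit base hashes from one SHA-1 digest of the token."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    h1, h2 = struct.unpack("<QQ", digest[:16])
    return h1, h2 | 1  # make the stride odd, so it is never 0 before reduction


def minhash_signature(tokens, num_perm: int = 128) -> list:
    """MinHash signature where the i-th hash of a token is (h1 + i * h2) mod p."""
    signature = [MERSENNE_PRIME] * num_perm
    for token in set(tokens):
        h1, h2 = base_hashes(token)  # only two hashes are computed per token
        for i in range(num_perm):
            value = (h1 + i * h2) % MERSENNE_PRIME
            if value < signature[i]:
                signature[i] = value
    return signature


def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

For example, `estimated_jaccard(minhash_signature(a), minhash_signature(b))` approximates the Jaccard similarity of the token sets `a` and `b`. The inner loop over `i` is the part that would presumably move into the compiled extension.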
-
Following #55 I've tried some more things.
I tried `blingfire.text_to_words()` to replace the `NON_ALPHA.split(content)` tokenization in minhash.py etc. The original Python `re` version was just as fast as that.
`re2` is generally believed to be faster than Python `re`. Turns out, in this particular case, `re` is much faster than `re2`.
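A comparison like this can be reproduced with a small `timeit` harness. The sketch below is a minimal version under some assumptions: the `NON_ALPHA` pattern is a stand-in rather than the repo's exact definition, and `re2` and `blingfire` are optional and only benchmarked if they are installed.

```python
import re
import timeit

# Assumption: a stand-in for the repo's NON_ALPHA pattern, not its exact definition.
NON_ALPHA = re.compile(r"\W+")


def tokenize_re(content: str) -> list:
    """Baseline: split on runs of non-word characters, dropping empty strings."""
    return [t for t in NON_ALPHA.split(content) if t]


def collect_tokenizers() -> dict:
    """Return the baseline plus any optional tokenizers that are installed and work."""
    tokenizers = {"re": tokenize_re}
    try:
        import re2  # optional third-party RE2 binding

        pattern = re2.compile(r"\W+")
        candidate = lambda s: [t for t in pattern.split(s) if t]
        candidate("smoke test")  # skip this tokenizer if the call fails
        tokenizers["re2"] = candidate
    except Exception:
        pass
    try:
        from blingfire import text_to_words  # returns a space-separated string

        candidate = lambda s: text_to_words(s).split(" ")
        candidate("smoke test")
        tokenizers["blingfire"] = candidate
    except Exception:
        pass
    return tokenizers


def benchmark(content: str, repeat: int = 5, number: int = 20) -> dict:
    """Best-of-`repeat` seconds per call for each available tokenizer."""
    results = {}
    for name, fn in collect_tokenizers().items():
        timings = timeit.repeat(lambda: fn(content), repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results
```

Calling `benchmark(some_document)` returns a dict of seconds per call keyed by tokenizer name. The tokenizers do not produce identical tokens (for instance, punctuation handling differs), so the numbers compare speed only, not output.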
My guess is that it will be much more fruitful to write the fingerprinting part in Cython.
(from More Pandas #55)