More Pandas #55
Replies: 4 comments
-
Can you give an example of such optimization? Most of the string manipulation (tokenization) happens during the fingerprint calculation, so I don't think converting the arrow dataset to a pandas dataframe first and then performing string operations will gain much, unless it is some kind of batched processing where the overhead is marginalized.
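For concreteness, here is a sketch of what such batched processing could look like, assuming a pyarrow dataset on disk; the path, column name, and string operations are placeholders, not the repo's actual pipeline:

```python
# Illustrative batched processing: the per-batch conversion overhead is
# amortized over many rows. Path and column names are placeholders.
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")

for batch in dataset.to_batches(batch_size=10_000):
    df = batch.to_pandas()  # one conversion per 10k rows, not per row
    # vectorized string ops instead of a per-row Python loop
    tokens = df["text"].str.lower().str.split()
```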
-
My plan was to use dataframes to enable vectorized string processing, which would enable batching. I was under the impression that zero-copy conversions with pandas would be "overhead free". I am reevaluating these underlying assumptions, as they don't seem that straightforward anymore.
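A quick probe of that assumption (an illustrative snippet reflecting my understanding of current pyarrow/pandas behavior, so treat it as a sketch rather than gospel):

```python
import pandas as pd
import pyarrow as pa

table = pa.table({"text": ["foo bar", "baz qux"]})

# With NumPy-backed pandas, string columns have to be copied into Python
# object arrays, so a strict zero-copy conversion fails:
try:
    df = table.to_pandas(zero_copy_only=True)
except pa.ArrowInvalid as err:
    print("not zero-copy:", err)

# pandas >= 2.0 can keep the Arrow buffers instead of copying:
df = table.to_pandas(types_mapper=pd.ArrowDtype)
```

So "overhead free" only seems to hold for some numeric columns, or with Arrow-backed dtypes; the string columns we care about are exactly the ones that copy.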
-
Each batch becomes a column-oriented dict, if I remember correctly, so one needs to convert each batch to pandas first:
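Something along these lines, assuming a Hugging Face `datasets` style batched `map` where each batch arrives as a dict of lists (the function and column names are made up):

```python
import pandas as pd

def process(batch):
    # batch is column-oriented: {"text": ["...", "..."], ...}
    df = pd.DataFrame(batch)  # dict-of-lists -> DataFrame (this copies)
    tokens = df["text"].str.lower().str.split()  # placeholder string ops
    return {"tokens": tokens.tolist()}

# hypothetical usage: dataset = dataset.map(process, batched=True)
```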
I think I tried this before because I wanted to use … My guess is that it will be much more fruitful to write the fingerprinting part in Cython.
-
It seems like that is the path, but even then they use a for loop within the Cython function.
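For reference, a plain-Python sketch of the kind of per-token loop in question; a Cython port would mostly add static types around the same loop so it compiles to C. The hash scheme and names here are assumptions, not the repo's actual fingerprint:

```python
# MinHash-style fingerprint with an explicit per-token loop; this is the loop
# that stays a `for` even inside a Cython function. Scheme is illustrative.
from hashlib import sha1

def fingerprint(tokens, num_perm=4):
    mins = [2**64 - 1] * num_perm
    for tok in tokens:                 # the per-token loop in question
        for i in range(num_perm):
            h = int.from_bytes(sha1(f"{i}:{tok}".encode()).digest()[:8], "big")
            if h < mins[i]:
                mins[i] = h
    return mins

print(fingerprint("the quick brown fox".split()))
```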
-
In this repo, we do a lot of string manipulation and then numerical calculations.
For the former we rely mostly on Python types and data structures, and for the latter, numpy
(there is some overlap, however).
I feel that Python works well for simple string manipulations, but for mass string manipulation it is a bottleneck.
Numpy is not well suited to accelerating string manipulations either.
I am thinking of utilizing pandas for this step (see the sketch below).
Pros: might be faster and, depending on the implementation, cleaner code.
Cons: might introduce more complexity for little gain.
Neutral: won't introduce further dependencies, since pandas is already used here.
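To make the idea concrete, a minimal sketch of the two paths (illustrative data, not from this repo):

```python
import pandas as pd

texts = ["The quick brown fox", "jumps over the lazy dog"] * 1000

# pure-Python path: per-string work in an interpreted loop
tokens_py = [t.lower().split() for t in texts]

# pandas path: the same operations expressed over the whole column
s = pd.Series(texts)
tokens_pd = s.str.lower().str.split()

assert tokens_py == tokens_pd.tolist()
```

One caveat I'm aware of: with the default object dtype, pandas `.str` methods still loop in Python under the hood, so the speedup may be smaller than the vectorized syntax suggests; Arrow-backed string dtypes change that picture.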