More Pandas #55
Replies: 4 comments
-
Can you give an example of such optimization? Most of the string manipulation (tokenization) happens during the fingerprint calculation, so I don't think converting the arrow dataset to a pandas dataframe first and then performing string operations will gain much, unless it is some kind of batched processing where the overhead is marginalized.
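For concreteness, here is a sketch of what such batched processing could look like, assuming a pyarrow dataset on disk; the path, column name, and string operations are placeholders, not the repo's actual pipeline:

```python
# Illustrative batched processing: the per-batch conversion overhead is
# amortized over many rows. Path and column names are placeholders.
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")

for batch in dataset.to_batches(batch_size=10_000):
    df = batch.to_pandas()  # one conversion per 10k rows, not per row
    # vectorized string ops instead of a per-row Python loop
    tokens = df["text"].str.lower().str.split()
```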
-
My plan was to use dataframes to enable vectorized string processing, which would enable batching. I was under the impression that zero-copy conversions with pandas would be "overhead free". I am reevaluating these underlying assumptions, as they don't seem that straightforward anymore.
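A quick probe of that assumption (an illustrative snippet reflecting my understanding of current pyarrow/pandas behavior, so treat it as a sketch rather than gospel):

```python
import pandas as pd
import pyarrow as pa

table = pa.table({"text": ["foo bar", "baz qux"]})

# With NumPy-backed pandas, string columns have to be copied into Python
# object arrays, so a strict zero-copy conversion fails:
try:
    df = table.to_pandas(zero_copy_only=True)
except pa.ArrowInvalid as err:
    print("not zero-copy:", err)

# pandas >= 2.0 can keep the Arrow buffers instead of copying:
df = table.to_pandas(types_mapper=pd.ArrowDtype)
```

So "overhead free" only seems to hold for some numeric columns, or with Arrow-backed dtypes; the string columns we care about are exactly the ones that copy.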
-
Each batch becomes a column-oriented dict, if I remember correctly, so one needs to convert each batch to pandas first:
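Something along these lines, assuming a Hugging Face `datasets` style batched `map` where each batch arrives as a dict of lists (the function and column names are made up):

```python
import pandas as pd

def process(batch):
    # batch is column-oriented: {"text": ["...", "..."], ...}
    df = pd.DataFrame(batch)  # dict-of-lists -> DataFrame (this copies)
    tokens = df["text"].str.lower().str.split()  # placeholder string ops
    return {"tokens": tokens.tolist()}

# hypothetical usage: dataset = dataset.map(process, batched=True)
```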
I think I tried this before because I wanted to use … My guess is that it will be much more fruitful to write the fingerprinting part in Cython.
-
It seems like that is the path, but even then they use a for loop within the Cython function.
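For reference, a plain-Python sketch of the kind of per-token loop in question; a Cython port would mostly add static types around the same loop so it compiles to C. The hash scheme and names here are assumptions, not the repo's actual fingerprint:

```python
# MinHash-style fingerprint with an explicit per-token loop; this is the loop
# that stays a `for` even inside a Cython function. Scheme is illustrative.
from hashlib import sha1

def fingerprint(tokens, num_perm=4):
    mins = [2**64 - 1] * num_perm
    for tok in tokens:                 # the per-token loop in question
        for i in range(num_perm):
            h = int.from_bytes(sha1(f"{i}:{tok}".encode()).digest()[:8], "big")
            if h < mins[i]:
                mins[i] = h
    return mins

print(fingerprint("the quick brown fox".split()))
```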
-
In this repo, we do a lot of string manipulation and then numerical calculations.
For the former we rely mostly on Python types and data structures, and for the latter, numpy
(there is some overlap, however).
I feel that Python works well for simple string manipulations, but for mass string manipulation it is a bottleneck.
Numpy is not well suited to accelerating string manipulations either.
I am thinking of utilizing pandas for this step (see the sketch below).
Pros: might be faster and, depending on the implementation, cleaner code.
Cons: might introduce more complexity for little gain.
Neutral: won't introduce further dependencies, since pandas is already used here.
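To make the idea concrete, a minimal sketch of the two paths (illustrative data, not from this repo):

```python
import pandas as pd

texts = ["The quick brown fox", "jumps over the lazy dog"] * 1000

# pure-Python path: per-string work in an interpreted loop
tokens_py = [t.lower().split() for t in texts]

# pandas path: the same operations expressed over the whole column
s = pd.Series(texts)
tokens_pd = s.str.lower().str.split()

assert tokens_py == tokens_pd.tolist()
```

One caveat I'm aware of: with the default object dtype, pandas `.str` methods still loop in Python under the hood, so the speedup may be smaller than the vectorized syntax suggests; Arrow-backed string dtypes change that picture.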