
[Go] Need help on reading parquet from S3 #37

Open
Zeeyi13 opened this issue Mar 22, 2024 · 3 comments

Zeeyi13 commented Mar 22, 2024

Hi team,

I would like to read a parquet file from S3 with high performance. Is there any hint or example for me to start with? I have a couple of ideas, but I'm not sure which one is recommended, or whether there is a better solution.

One approach is to write a customized reader (internally leveraging the S3 API to fetch byte ranges) and pass it to file.NewParquetReader().

Another approach is to first issue an S3 range request for the last 8 bytes of the parquet file to locate the footer and metadata, and then issue further S3 range requests to read each row group using file.NewPageReader().
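
For reference, a rough sketch of the first approach: an io.ReaderAt / io.Seeker backed by ranged S3 GETs, handed directly to file.NewParquetReader. The bucket and key names are placeholders, the arrow module version (v16 here) may differ, and in some aws-sdk-go-v2 releases HeadObjectOutput.ContentLength is a plain int64 rather than a pointer.

```go
package main

import (
	"context"
	"fmt"
	"io"

	"github.com/apache/arrow/go/v16/parquet/file"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// s3Object satisfies io.ReaderAt and io.Seeker by translating every ReadAt
// into an S3 GetObject call with a Range header.
type s3Object struct {
	client *s3.Client
	bucket string
	key    string
	size   int64
	pos    int64
}

func (o *s3Object) ReadAt(p []byte, off int64) (int, error) {
	rng := fmt.Sprintf("bytes=%d-%d", off, off+int64(len(p))-1)
	out, err := o.client.GetObject(context.TODO(), &s3.GetObjectInput{
		Bucket: aws.String(o.bucket),
		Key:    aws.String(o.key),
		Range:  aws.String(rng),
	})
	if err != nil {
		return 0, err
	}
	defer out.Body.Close()
	return io.ReadFull(out.Body, p)
}

func (o *s3Object) Seek(offset int64, whence int) (int64, error) {
	switch whence {
	case io.SeekStart:
		o.pos = offset
	case io.SeekCurrent:
		o.pos += offset
	case io.SeekEnd:
		// The parquet reader seeks to the end to learn the file size,
		// which is why we fetch the object size up front via HeadObject.
		o.pos = o.size + offset
	}
	return o.pos, nil
}

func main() {
	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket, key := "my-bucket", "path/to/data.parquet" // placeholders

	head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		panic(err)
	}

	src := &s3Object{client: client, bucket: bucket, key: key, size: aws.ToInt64(head.ContentLength)}

	rdr, err := file.NewParquetReader(src)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	fmt.Println("row groups:", rdr.NumRowGroups(), "rows:", rdr.NumRows())
}
```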

Component(s)

Go

@zeroshade (Member)

Personally, I would use https://github.com/wolfeidau/s3iofs to open the file, which internally leverages the S3 API to fetch byte ranges, and just pass it to file.NewParquetReader as you suggested.

I would only go down to creating your own page readers if you find the above isn't performant enough. It's unlikely that going down to that level would provide much in the way of performance gains.
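
A minimal sketch of that suggestion, for anyone landing here: the s3iofs constructor name (NewWithClient) and the assumption that the opened file implements io.ReaderAt and io.Seeker should be confirmed against the s3iofs README; the arrow module version (v16) and object names are placeholders.

```go
package main

import (
	"context"
	"fmt"

	"github.com/apache/arrow/go/v16/parquet"
	"github.com/apache/arrow/go/v16/parquet/file"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/wolfeidau/s3iofs"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}

	// Assumed constructor: an fs.FS rooted at the bucket, backed by the given S3 client.
	s3fs := s3iofs.NewWithClient("my-bucket", s3.NewFromConfig(cfg))

	f, err := s3fs.Open("path/to/data.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// file.NewParquetReader needs io.ReaderAt + io.Seeker; the s3iofs file is
	// expected to provide both via ranged GETs.
	src, ok := f.(parquet.ReaderAtSeeker)
	if !ok {
		panic("s3iofs file does not implement ReaderAt/Seeker")
	}

	rdr, err := file.NewParquetReader(src)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	fmt.Println("rows:", rdr.NumRows())
}
```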


Zeeyi13 commented Mar 22, 2024

Thanks @zeroshade for the quick reply.

I just tried the s3iofs file reader: a 140 MB file takes 12 minutes to read, versus roughly 14 seconds or less when reading from local disk. Some slowness is expected when reading from S3, but 12 minutes is too long for our application. I need to check whether there is another way to improve the performance.

@zeroshade (Member)

12 minutes seems really bad, much worse than I'd expect. I've definitely seen better performance from S3 than that in the past, so I wonder where that time is being spent.
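
Before dropping down to custom page readers, one thing worth ruling out is serial column-chunk fetching. A sketch of enabling parallel column reads via pqarrow, under the same assumptions as above (module version v16, hypothetical helper name); src can be the s3iofs file or the custom S3 reader shown earlier:

```go
package parquetread

import (
	"context"
	"fmt"

	"github.com/apache/arrow/go/v16/arrow/memory"
	"github.com/apache/arrow/go/v16/parquet"
	"github.com/apache/arrow/go/v16/parquet/file"
	"github.com/apache/arrow/go/v16/parquet/pqarrow"
)

// readAll reads the whole file into an Arrow table with parallel column reads.
func readAll(ctx context.Context, src parquet.ReaderAtSeeker) error {
	rdr, err := file.NewParquetReader(src)
	if err != nil {
		return err
	}
	defer rdr.Close()

	// Parallel: true fetches and decodes column chunks concurrently instead of
	// one column at a time, which matters a lot over high-latency object storage.
	arrRdr, err := pqarrow.NewFileReader(rdr,
		pqarrow.ArrowReadProperties{Parallel: true, BatchSize: 64 * 1024},
		memory.DefaultAllocator)
	if err != nil {
		return err
	}

	tbl, err := arrRdr.ReadTable(ctx)
	if err != nil {
		return err
	}
	defer tbl.Release()

	fmt.Println("rows read:", tbl.NumRows())
	return nil
}
```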

assignUser transferred this issue from apache/arrow on Aug 30, 2024