I would like to read a Parquet file from S3 with high performance. Is there a hint or an example I can start from? I have a couple of ideas, but I'm not sure which one is recommended, or whether there is a better solution.
One approach is to write a custom reader that internally uses the S3 API to fetch a range of bytes, and pass it to file.NewParquetReader().
Another approach is to call the S3 API to fetch the last 8 bytes of the Parquet file to locate the footer, read the metadata first, and then issue further S3 requests to read each row group's data using file.NewPageReader().
Component(s)
Go
Personally, I would use https://github.com/wolfeidau/s3iofs to open the file, which internally uses the S3 API to fetch the byte ranges, and just pass it to file.NewParquetReader like you suggested.
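For reference, the "custom reader" idea can be sketched roughly as below. This is only an illustration, not how s3iofs is implemented. `RangeFetcher`, `RangeReader`, and `NewRangeReader` are hypothetical names, and the fetch callback stands in for an S3 GetObject call with a Range header. It assumes the reader handed to file.NewParquetReader only needs `ReadAt` and `Seek`, which is worth checking against the version of the library in use.

```go
package main

import (
	"errors"
	"io"
)

// RangeFetcher is a hypothetical callback that returns bytes [off, off+length)
// of the remote object. A real one would issue an S3 ranged GET.
type RangeFetcher func(off, length int64) ([]byte, error)

// RangeReader adapts a RangeFetcher to io.ReaderAt and io.Seeker.
type RangeReader struct {
	fetch RangeFetcher
	size  int64 // total object size, e.g. obtained from a HEAD request
	pos   int64 // cursor used only by Seek
}

// NewRangeReader wraps fetch for an object of the given size.
func NewRangeReader(size int64, fetch RangeFetcher) *RangeReader {
	return &RangeReader{fetch: fetch, size: size}
}

// ReadAt fetches exactly the requested range, clamped to the object size.
// Every call is one round trip, so many small reads will be slow.
func (r *RangeReader) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, errors.New("rangereader: negative offset")
	}
	if off >= r.size {
		return 0, io.EOF
	}
	want := int64(len(p))
	if off+want > r.size {
		want = r.size - off
	}
	if want == 0 {
		return 0, nil
	}
	data, err := r.fetch(off, want)
	if err != nil {
		return 0, err
	}
	n := copy(p, data)
	if n < len(p) {
		return n, io.EOF // short read: reached the end of the object
	}
	return n, nil
}

// Seek only moves the local cursor; it never touches the network.
func (r *RangeReader) Seek(offset int64, whence int) (int64, error) {
	var base int64
	switch whence {
	case io.SeekStart:
		base = 0
	case io.SeekCurrent:
		base = r.pos
	case io.SeekEnd:
		base = r.size
	default:
		return 0, errors.New("rangereader: invalid whence")
	}
	if base+offset < 0 {
		return 0, errors.New("rangereader: negative position")
	}
	r.pos = base + offset
	return r.pos, nil
}
```

Since each `ReadAt` maps to one request, a wrapper like this is only as fast as the read pattern above it allows. Buffering or read-ahead on top of it may matter more than the wrapper itself.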
I would only go down to creating your own page readers if you find the above isn't performant enough. It's unlikely that going down to that level would provide much in the way of performance gains.
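If you did go down to that level, the first step would be decoding the last 8 bytes of the file. In the Parquet format these are a 4-byte little-endian footer length followed by the magic bytes `PAR1`. Here is a minimal sketch of that step. `FooterRange` is a hypothetical helper and not part of the Arrow Go library, and it only handles files with a plaintext footer.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// trailerSize is the 4-byte footer length plus the 4-byte magic.
const trailerSize = 8

// FooterRange takes the total file size and the last 8 bytes of the file,
// and returns the offset and length of the footer metadata, which is the
// byte range to request next.
func FooterRange(fileSize int64, trailer []byte) (off, length int64, err error) {
	if len(trailer) != trailerSize {
		return 0, 0, errors.New("parquet trailer must be exactly 8 bytes")
	}
	if string(trailer[4:]) != "PAR1" {
		return 0, 0, errors.New("missing PAR1 magic at end of file")
	}
	length = int64(binary.LittleEndian.Uint32(trailer[:4]))
	off = fileSize - trailerSize - length
	if off < 4 { // the first 4 bytes of the file hold the leading magic
		return 0, 0, errors.New("footer length exceeds file size")
	}
	return off, length, nil
}
```

For example, a 1000-byte file whose trailer encodes a footer length of 16 has its metadata at bytes 976 through 991. Fetching a larger tail in a single request can return the trailer and the footer together and save a round trip.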
I just tried the s3iofs file reader: a 140 MB file takes 12 minutes to read, versus about 14 s or less when reading from local disk. Some slowdown is expected when reading from S3, but 12 minutes is too long for our application. I need to check whether there is another way to improve the performance.
12 minutes seems really bad, much worse than I'd expect. I've definitely seen better performance from S3 than that in the past, so I wonder where that time is being spent.
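One way to see where the time goes is to wrap the reader and count how many `ReadAt` calls are made, how many bytes they return, and how long they take in total. The sketch below uses only the standard library. `CountingReaderAt` and `ReaderAtFunc` are hypothetical names, and it assumes Go 1.19 or later for `atomic.Int64`. A very large call count with a small average read size would point to per-request latency rather than bandwidth, but that is only a guess until measured.

```go
package main

import (
	"io"
	"sync/atomic"
	"time"
)

// ReaderAtFunc lets a plain function act as an io.ReaderAt,
// which is handy for tests and stand-ins.
type ReaderAtFunc func(p []byte, off int64) (int, error)

// ReadAt calls the underlying function.
func (f ReaderAtFunc) ReadAt(p []byte, off int64) (int, error) { return f(p, off) }

// CountingReaderAt forwards reads to inner and records simple statistics.
type CountingReaderAt struct {
	inner io.ReaderAt
	calls atomic.Int64
	bytes atomic.Int64
	nanos atomic.Int64
}

// NewCountingReaderAt wraps inner, for example an S3-backed reader.
func NewCountingReaderAt(inner io.ReaderAt) *CountingReaderAt {
	return &CountingReaderAt{inner: inner}
}

// ReadAt times the inner read and updates the counters.
func (c *CountingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	start := time.Now()
	n, err := c.inner.ReadAt(p, off)
	c.nanos.Add(int64(time.Since(start)))
	c.calls.Add(1)
	c.bytes.Add(int64(n))
	return n, err
}

// Stats reports the number of reads, the bytes returned,
// and the summed time spent inside the inner reader.
func (c *CountingReaderAt) Stats() (calls, bytesRead int64, elapsed time.Duration) {
	return c.calls.Load(), c.bytes.Load(), time.Duration(c.nanos.Load())
}
```

To pass the wrapper to file.NewParquetReader it would presumably also need to forward `Seek` to the wrapped reader. That is left out here to keep the sketch short.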