improve performance of Table.add_files by parallelizing #1335
Comments
what do you have in mind to parallelize this part of the code?
I believe @vtk9 is suggesting that the files be read in parallel rather than sequentially. I could be mistaken, but it seems that if you have 10,000 files, each one is read one after the other. That can be quite time-consuming, even though I understand we are only reading the metadata of each parquet file. One option could be something like (pseudo-code alert):

```python
def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
    futures = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for file_path in file_paths:
            futures.append(executor.submit(scan_file, file_path))
        for future in concurrent.futures.as_completed(futures):
            yield future.result()
```
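For illustration, here is a self-contained, runnable version of that pattern. The `scan_file` below is a hypothetical stand-in for the per-file parquet metadata read, not the pyiceberg API:

```python
import concurrent.futures
from typing import Iterator, List

def scan_file(file_path: str) -> dict:
    # Hypothetical stand-in for reading one parquet file's footer metadata.
    return {"path": file_path}

def scan_files_parallel(file_paths: List[str]) -> Iterator[dict]:
    # Submit one task per file, then yield each result as it finishes.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(scan_file, p) for p in file_paths]
        for future in concurrent.futures.as_completed(futures):
            yield future.result()

results = list(scan_files_parallel([f"file-{i}.parquet" for i in range(5)]))
```

Note that `as_completed` yields in completion order, not submission order, so downstream code must not rely on the original file ordering.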
Apologies @kevinjqliu, I forgot to link the relevant slack thread https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1731611943890879 Exactly, thank you @bigluck!
@vtk9 thanks for the context from slack, I must have missed that thread
@kevinjqliu when I find time, yes. I would definitely love this feature in the next release
Sounds good! Feel free to ping me for review. I'll add this issue to the 0.8.1 milestone for now
Yes, looks like this shouldn't be too hard. I think it would be good to re-use the `ExecutorFactory`. I would refactor `_parquet_files_to_data_files`:

```python
def _parquet_files_to_data_files(table_metadata: TableMetadata, file_paths: List[str], io: FileIO) -> Iterable[DataFile]:
    """Convert a list of files into DataFiles.

    Returns:
        An iterable that supplies DataFiles that describe the parquet files.
    """
    from pyiceberg.io.pyarrow import parquet_file_to_data_file

    executor = ExecutorFactory.get_or_create()
    futures = [
        executor.submit(parquet_file_to_data_file, io, table_metadata, file_path)
        for file_path in file_paths
    ]
    return [f.result() for f in futures if f.result()]
```

@kevinjqliu I would not classify this as a bugfix, so I'm not sure if this is appropriate for 0.8.1.
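As a rough sketch of why the submit-then-gather shape above preserves input order (unlike `as_completed`), here is the same pattern with a plain `ThreadPoolExecutor` and a trivial stand-in for the per-file conversion. The names `to_data_file` and `convert_parallel` are hypothetical, not the pyiceberg API:

```python
import concurrent.futures
from typing import List

def to_data_file(path: str) -> str:
    # Hypothetical stand-in for the per-file conversion work.
    return path.upper()

def convert_parallel(paths: List[str]) -> List[str]:
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Collecting futures in submission order, then calling .result()
        # in that same order, keeps the output aligned with the input list
        # regardless of which task finishes first.
        futures = [executor.submit(to_data_file, p) for p in paths]
        return [f.result() for f in futures]

converted = convert_parallel(["a.parquet", "b.parquet", "c.parquet"])
```

Each `.result()` call blocks until that particular future is done, so the gather step completes when the slowest task completes.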
Makes sense, this is a feature
Feature Request / Improvement
`Table.add_files()` processes the list of files in sequential order. Part of this flow can be parallelized, particularly iceberg-python/pyiceberg/table/__init__.py, Lines 591 to 593 in 3ccdc44.