
improve performance of Table.add_files by parallelizing #1335

Open
vtk9 opened this issue Nov 18, 2024 · 9 comments

Comments

@vtk9

vtk9 commented Nov 18, 2024

Feature Request / Improvement

Table.add_files() processes the list of files sequentially. Part of this flow can be parallelized, particularly:

    data_files = _parquet_files_to_data_files(
        table_metadata=self.table_metadata, file_paths=file_paths, io=self._table.io
    )

@kevinjqliu
Contributor

_parquet_files_to_data_files is a generator and uses parquet_files_to_data_files, which is also a generator.

What do you have in mind to parallelize this part of the code?

@bigluck
Contributor

bigluck commented Nov 20, 2024

I believe @vtk9 is suggesting that the files be read in parallel rather than sequentially.

I could be mistaken, but it seems that if you have 10,000 files, each one is read one after the other. This approach can be quite time-consuming, even though I understand that we are only reading the metadata of each parquet file.

One option could be to have something like (pseudo-code alert):

def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
    futures = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for file_path in file_paths:
            futures.append(executor.submit(scan_file, file_path))
        for future in concurrent.futures.as_completed(futures):
            yield future.result()
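
A self-contained, runnable version of that pattern might look like the sketch below. The `scan_file` helper here is hypothetical, a stand-in for the real metadata read (in pyiceberg, that step would build a DataFile from the parquet footer):

```python
import concurrent.futures
from typing import Dict, Iterator, List

def scan_file(file_path: str) -> Dict[str, object]:
    # Hypothetical stand-in for reading a parquet footer; the real code
    # would open the file via FileIO and build a DataFile from its metadata.
    return {"path": file_path, "record_count": 100}

def files_to_metadata(file_paths: List[str]) -> Iterator[Dict[str, object]]:
    # Submit every file to a thread pool and yield results as they complete.
    # Note: as_completed yields in completion order, not submission order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(scan_file, p) for p in file_paths]
        for future in concurrent.futures.as_completed(futures):
            yield future.result()

results = list(files_to_metadata([f"file_{i}.parquet" for i in range(4)]))
```

One caveat of `as_completed` is that results arrive out of submission order; that should be acceptable here since `add_files` does not depend on file ordering.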

@vtk9
Author

vtk9 commented Nov 20, 2024

Apologies @kevinjqliu, I forgot to link the relevant Slack thread: https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1731611943890879

Exactly, thank you @bigluck!
I tried something like this and there's a noticeable improvement even when add_files contains ~30 files.

@kevinjqliu
Contributor

Thanks @bigluck, that makes sense! I think _parquet_files_to_data_files might be a good place to add the parallelism.

@vtk9 is this something you would like to contribute?

@kevinjqliu
Contributor

@vtk9 thanks for the context from slack, I must have missed that thread

@vtk9
Author

vtk9 commented Nov 20, 2024

@kevinjqliu when I find time, yes.

I would definitely love this feature in the next release of pyiceberg and will prioritize it, given enough heads-up before a release (if possible).

@kevinjqliu
Contributor

sounds good! Feel free to ping me for review. I'll add this issue to the 0.8.1 milestone for now

@kevinjqliu kevinjqliu added this to the PyIceberg 0.8.1 release milestone Nov 20, 2024
@Fokko
Contributor

Fokko commented Nov 20, 2024

Yes, looks like this shouldn't be too hard. I think it would be good to re-use the ExecutorFactory:
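
For context, the shape of such a factory is a process-wide singleton around a thread pool, so all callers share one executor. A minimal sketch of that idea (an illustration, not pyiceberg's actual implementation) is:

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Optional

class ExecutorFactory:
    """Illustrative singleton so all callers share one thread pool."""

    _instance: Optional[Executor] = None

    @staticmethod
    def get_or_create() -> Executor:
        # Lazily create the shared executor on first use.
        if ExecutorFactory._instance is None:
            ExecutorFactory._instance = ThreadPoolExecutor()
        return ExecutorFactory._instance
```

Reusing one shared pool avoids spinning up a fresh ThreadPoolExecutor on every add_files call and keeps the worker count configurable in a single place.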

I would refactor parquet_files_to_data_files to let it take a single file instead of an Iterator, and then call it parquet_file_to_data_file.

def _parquet_files_to_data_files(table_metadata: TableMetadata, file_paths: List[str], io: FileIO) -> Iterable[DataFile]:
    """Convert a list of files into DataFiles.

    Returns:
        An iterable that supplies DataFiles that describe the parquet files.
    """
    # The refactored single-file helper proposed above.
    from pyiceberg.io.pyarrow import parquet_file_to_data_file

    executor = ExecutorFactory.get_or_create()
    futures = [
        executor.submit(
            parquet_file_to_data_file,
            io,
            table_metadata,
            file_path,
        )
        for file_path in file_paths
    ]

    # Call .result() once per future and drop any empty results.
    return [data_file for f in futures if (data_file := f.result())]

@kevinjqliu I would not classify this as a bugfix, so I'm not sure if this is appropriate for 0.8.1.

@kevinjqliu kevinjqliu removed this from the PyIceberg 0.8.1 release milestone Nov 20, 2024
@kevinjqliu
Contributor

Makes sense, this is a feature

@Fokko Fokko added this to the PyIceberg 0.9.0 release milestone Nov 20, 2024