Processing time for paddleocr with multiprocessing #14431

FRAki73 · 2024-11-15T05:16:07Z

FRAki73
Nov 15, 2024

I measured the OCR processing time by running the code below.
As the results show, the processing time is about 5 times longer when the same function is executed through multiprocessing.
I would like to know why it takes longer and what I can do about it.
Can someone please help me?

Processing time of OCR: 1.1000 [sec]
OCR Result: The difficult thing in the life is
Processing time of OCR: 5.5675 [sec]
OCR Result: The difficult thing in the life is

import time
import multiprocessing
from multiprocessing import Process

#paddle OCR
from paddleocr import PaddleOCR

def normalEntry():
    processes = []
    p = Process(target=OCR_runnable, args=())
    processes.append(p)
    p.start()

    for process in processes:
        process.join()

def OCR_runnable():

    ocr = PaddleOCR(use_angle_cls=False, lang='en', show_log=False)

    start_time = time.time()
    result = ocr.ocr("./test.png", cls=False)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Processing time of OCR: {elapsed_time:.4f} [sec]")
    print("OCR Result: " + result[0][0][1][0])

if __name__ == '__main__':
    multiprocessing.freeze_support()
    OCR_runnable()
    normalEntry()

Originally posted by @FRAki73 in #14221

freemedom · 2024-11-22T16:49:45Z

freemedom
Nov 22, 2024

I only see this code print the time once.

0 replies

FRAki73 · 2024-11-24T13:08:06Z

FRAki73
Nov 24, 2024
Author

I only see this code print the time once.

The time is printed whenever OCR_runnable() is called.
OCR_runnable() is called once directly and then called again in a separate process by multiprocessing.
Therefore, the time should appear twice.
If it appears only once for you, there may be some other problem.

0 replies

freemedom · 2024-11-24T13:43:36Z

freemedom
Nov 24, 2024

How about trying OCR on several images? (ocr = PaddleOCR(use_angle_cls=False, lang='en', show_log=False) only needs to run once.)
Also, try show_log=True; it will show the time taken by each of the three stages.
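
The suggestion above (build the PaddleOCR object once, then run it on several images) could be sketched as a small timing loop. This is only an illustration under stated assumptions: `time_calls` and `fake_ocr` are hypothetical names, and `fake_ocr` is a stand-in workload so the snippet runs without PaddleOCR installed. In the real script it would be replaced by a call such as `ocr.ocr(path, cls=False)` on an already constructed `ocr` object.

```python
import time


def time_calls(fn, inputs):
    """Call fn once per input and return the elapsed seconds of each call."""
    timings = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        timings.append(time.perf_counter() - start)
    return timings


def fake_ocr(path):
    # Hypothetical stand-in for ocr.ocr(path, cls=False); it only burns CPU.
    return sum(i * i for i in range(50_000))


if __name__ == "__main__":
    # Same image three times: if only the first call is slow, the cost is a
    # one-time warm-up rather than a per-call penalty.
    for n, seconds in enumerate(time_calls(fake_ocr, ["./test.png"] * 3), 1):
        print(f"call {n}: {seconds:.4f} [sec]")
```

Running the same loop both directly and inside the child process would show whether the slowdown affects every call or only the first one.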

0 replies

FRAki73 · 2024-11-25T02:14:50Z

FRAki73
Nov 25, 2024
Author

Thank you for your advice.
I changed the parameter show_log from False to True, and the results are shown below.
The rec_res step takes about 6 times longer under multiprocessing than in a direct call (0.86 s vs. 5.27 s), while the dt_boxes step takes about the same time in both cases.
Why is there a difference in processing time when there should be no data sharing between processes?

・Direct call
[2024/11/25 11:05:48] ppocr DEBUG: dt_boxes num : 4, elapsed : 0.2573506832122803
[2024/11/25 11:05:49] ppocr DEBUG: rec_res num : 4, elapsed : 0.863243579864502
Processing time of OCR: 1.1323 [sec]

・Called via multiprocessing
[2024/11/25 11:06:08] ppocr DEBUG: dt_boxes num : 4, elapsed : 0.25519514083862305
[2024/11/25 11:06:13] ppocr DEBUG: rec_res num : 4, elapsed : 5.2654759883880615
Processing time of OCR: 5.5323 [sec]
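
One thing that might be worth comparing is what each process actually sees at run time. The sketch below is not a diagnosis: it only gathers a few settings (start method, CPU counts, common thread-count environment variables) in the parent and in a child so they can be compared. The function names are hypothetical, and whether PaddleOCR's recognition step depends on any of these values is an assumption that this snippet does not confirm.

```python
import multiprocessing
import os


def process_snapshot():
    """Collect settings that can differ between a parent and a child process."""
    snapshot = {
        "pid": os.getpid(),
        "start_method": multiprocessing.get_start_method(),
        "cpu_count": os.cpu_count(),
        # Thread-count variables read by many numeric libraries; None if unset.
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "MKL_NUM_THREADS": os.environ.get("MKL_NUM_THREADS"),
    }
    # sched_getaffinity exists only on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        snapshot["usable_cpus"] = len(os.sched_getaffinity(0))
    return snapshot


def _report(queue):
    # Runs inside the child and sends its snapshot back to the parent.
    queue.put(process_snapshot())


def snapshot_in_child():
    """Run process_snapshot() in a separate process and return its result."""
    queue = multiprocessing.Queue()
    child = multiprocessing.Process(target=_report, args=(queue,))
    child.start()
    result = queue.get()
    child.join()
    return result
```

Calling `process_snapshot()` and `snapshot_in_child()` from under the usual `if __name__ == '__main__':` guard and printing both dictionaries gives a side-by-side view of the two processes.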

0 replies

freemedom · 2024-11-25T12:49:04Z

freemedom
Nov 25, 2024

Strange. I'm afraid I don't know what causes that.

0 replies

asdat3 · 2024-12-16T21:25:35Z

asdat3
Dec 16, 2024

Hey @FRAki73, I am running into the same issue while trying to process multiple images at once. Do you have any advice? A total of 40 minutes for all the images is far too long, and my CPU is only running at about 10% usage (26-core server CPU).

0 replies

FRAki73 · 2024-12-19T15:07:56Z

FRAki73
Dec 19, 2024
Author

Hello @asdat3, I have not solved this problem yet, so I have no suggestions. Sorry.
I hope someone can solve it and advise us.

0 replies

GreatV · 2024-12-20T02:50:36Z

GreatV
Dec 20, 2024
Maintainer

The significant increase in processing time when using multiprocessing in your code occurs due to the initialization of the PaddleOCR model inside the OCR_runnable function for each process. This initialization is computationally expensive, and when using multiprocessing, each spawned process creates its own instance of the PaddleOCR class instead of sharing a single instance. This leads to redundant resource allocation and model initialization, which significantly impacts performance.

Why is this happening?

Model Initialization Cost:
In the OCR_runnable function, the line ocr = PaddleOCR(...) initializes the PaddleOCR model. This involves loading the model weights, configuration, and other resources, which is time-consuming. When using multiprocessing, each process independently initializes its own PaddleOCR object, duplicating this overhead.
Isolation in Multiprocessing:
Multiprocessing in Python spawns separate processes with independent memory spaces. Unlike multithreading, where threads share memory, processes in multiprocessing cannot share objects like the PaddleOCR model unless explicitly shared through inter-process communication mechanisms. This means that each process has to perform its own model initialization.
Overhead of Process Creation:
Creating and managing processes also has its own overhead. Each process needs to allocate resources and establish communication channels, which adds to the total execution time.
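
Whether initialization actually falls inside the measured interval can be checked by timing the two phases separately. The following is a minimal sketch, assuming nothing about PaddleOCR internals: `timed` and `FakeModel` are hypothetical names, and `FakeModel` is a stand-in that merely sleeps so the snippet runs without PaddleOCR installed.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the with-block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


class FakeModel:
    # Hypothetical stand-in for PaddleOCR: slow to build, quicker to call.
    def __init__(self):
        time.sleep(0.05)

    def ocr(self, path):
        time.sleep(0.01)
        return path


if __name__ == "__main__":
    results = {}
    with timed("init", results):
        model = FakeModel()
    with timed("inference", results):
        model.ocr("./test.png")
    for label, seconds in results.items():
        print(f"{label}: {seconds:.4f} [sec]")
```

If the "inference" figure alone grows in the child process, then initialization cost is not what the measurement is capturing.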

How to Fix the Issue

To address the increased processing time, you can use one of the following approaches:

1. Initialize the Model Once and Share It

Instead of initializing the PaddleOCR model within the OCR_runnable function repeatedly, initialize it once in the main process and pass it to each worker process. However, this is not straightforward in Python's multiprocessing because objects are not shared by default. Instead, you can use shared memory or other techniques like multiprocessing.Manager.

Example:

import time
import multiprocessing
from paddleocr import PaddleOCR

def OCR_runnable(ocr):
    start_time = time.time()
    result = ocr.ocr("./test.png", cls=False)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Processing time of OCR: {elapsed_time:.4f} [sec]")
    print("OCR Result: " + result[0][0][1][0])

if __name__ == '__main__':
    multiprocessing.freeze_support()

    # Initialize the PaddleOCR model once
    ocr_model = PaddleOCR(use_angle_cls=False, lang='en', show_log=False)

    # Use multiprocessing Pool for process management
    with multiprocessing.Pool(processes=1) as pool:
        pool.apply(OCR_runnable, args=(ocr_model,))

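However, note that pool.apply pickles its arguments in order to send them to the worker, so this only works if the PaddleOCR object can be pickled, and the worker receives a copy rather than a shared instance. PaddleOCR may not be directly shareable across processes in this way.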

2. Use Threading Instead of Multiprocessing

Since the PaddleOCR model is CPU- and GPU-intensive, and the Global Interpreter Lock (GIL) may not be a bottleneck in this case (due to external library usage), you can use the threading module. Threads share memory, so the PaddleOCR model can be initialized once and reused across threads.

Example:

import time
from threading import Thread
from paddleocr import PaddleOCR

def OCR_runnable(ocr):
    start_time = time.time()
    result = ocr.ocr("./test.png", cls=False)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Processing time of OCR: {elapsed_time:.4f} [sec]")
    print("OCR Result: " + result[0][0][1][0])

if __name__ == '__main__':
    # Initialize the PaddleOCR model once
    ocr_model = PaddleOCR(use_angle_cls=False, lang='en', show_log=False)

    # Create and start a thread
    thread = Thread(target=OCR_runnable, args=(ocr_model,))
    thread.start()
    thread.join()

3. Use a Multiprocessing Pool with Lazy Initialization

If you need to use multiprocessing for scaling, consider using a multiprocessing.Pool with lazy initialization, where the PaddleOCR model is initialized only once per worker process.

Example:

import time
from multiprocessing import Pool
from paddleocr import PaddleOCR

def init_worker():
    global ocr_model
    ocr_model = PaddleOCR(use_angle_cls=False, lang='en', show_log=False)

def OCR_runnable(_):
    start_time = time.time()
    result = ocr_model.ocr("./test.png", cls=False)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Processing time of OCR: {elapsed_time:.4f} [sec]")
    print("OCR Result: " + result[0][0][1][0])

if __name__ == '__main__':
    with Pool(processes=1, initializer=init_worker) as pool:
        pool.map(OCR_runnable, [None])

Here, the init_worker function initializes the PaddleOCR model once per worker process, and the model is reused for all tasks assigned to that process.

Key Takeaways

Avoid repeated initialization: The primary reason for the slowdown is the repeated initialization of the PaddleOCR model. Refactor your code to initialize the model only once and reuse it across tasks.
Choose the right parallelization approach: Depending on your use case, threading might be more efficient than multiprocessing if the GIL does not significantly impact performance.
Consider GPU/CPU utilization: If PaddleOCR is configured to use a GPU, ensure that multiple processes or threads are not overwhelming the GPU, as this could also degrade performance.

By using these improvements, you should be able to significantly reduce the processing time when using multiprocessing for PaddleOCR.

Response generated by feifei-bot | chatgpt-4o-latest

1 reply

FRAki73 Dec 20, 2024
Author

Hi @GreatV, thank you for your support, but this advice does not address my question.
The line ocr = PaddleOCR(...) is outside the timed region,
so it cannot account for the time difference.

Only the one line below is timed.
The question is why the processing time of this line differs between the two cases:
result = ocr.ocr("./test.png", cls=False)
