Inserting QDQ has severely impacted the performance of the unquantized Myelin part. #4297

Open
zsh4614 opened this issue Dec 23, 2024 · 3 comments

zsh4614 commented Dec 23, 2024

Description

I am performing QAT on a complex model. When I insert Q/DQ nodes, following the placement rules, into the ResNet portion that I want to quantize, TensorRT runs that portion in INT8 after building. How can I ensure that the parts without Q/DQ nodes run with optimal performance in non-INT8 precision (FP16 + FP32)? I noticed that after inserting Q/DQ nodes into one part of the network, the unquantized parts run slower than they do in a pure FP16 build.

As an experiment, I inserted Q/DQ nodes before only a single convolution layer. This is the resulting build:
Image

This is the result of building the same network in FP16 mode:
Image

Why does the part within the green box perform differently?

Another question: even when the inputs and outputs of a Myelin region are exactly the same in the two exported engines, its execution time differs significantly.
Image

FP16 mode:
Image

I'm confused about how I can ensure that the unquantized parts of my model run optimally in FP16 or FP32.

Environment

TensorRT Version: 8.5.2

NVIDIA GPU: orin / 3090

NVIDIA Driver Version:

CUDA Version: 11.4

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

@lix19937

Can you upload the two logs with
trtexec --verbose --dumpProfile --dumpLayerInfo --separateProfileRun 2>&1 | tee log

zsh4614 commented Jan 2, 2025

Can you upload the two logs with trtexec --verbose --dumpProfile --dumpLayerInfo --separateProfileRun 2>&1 | tee log

Before inserting Q/DQ:
log_fp16.log

After inserting Q/DQ:
log_qdq.log

lix19937 commented Jan 6, 2025

From your logs, the QAT model has some additional reformat copy nodes.
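
One way to see which reformat nodes the Q/DQ build added is to diff the per-layer profiles of the two engines. The sketch below is only an illustration. It assumes each profile was exported with trtexec's `--exportProfile=<file>.json` option and that the JSON is a list of objects carrying `name` and `averageMs` keys, which is worth verifying on your TensorRT version. The script name, the two file names in the usage comment, and the `Reformat` name filter are hypothetical.

```python
import json
import sys


def layer_times(entries):
    """Map layer name -> average time (ms), skipping entries without both keys."""
    return {
        e["name"]: e["averageMs"]
        for e in entries
        if isinstance(e, dict) and "name" in e and "averageMs" in e
    }


def added_reformats(fp16, qdq, marker="Reformat"):
    """Layers whose name contains `marker` and that exist only in the Q/DQ profile.

    Returned as (name, avg_ms) pairs, slowest first.
    """
    extra = [(n, t) for n, t in qdq.items() if marker in n and n not in fp16]
    return sorted(extra, key=lambda item: item[1], reverse=True)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed usage: python diff_profiles.py fp16_profile.json qdq_profile.json
    profiles = []
    for path in sys.argv[1:]:
        with open(path) as f:
            profiles.append(layer_times(json.load(f)))
    for name, ms in added_reformats(*profiles):
        print(f"{ms:8.4f} ms  {name}")
```

Matching on layer names is a heuristic: a reformat that exists in both engines under the same name will not be listed, even if its cost changed.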

If you want only the ResNet part to run in INT8 and the rest in FP16/FP32, you can split the model into two parts: backbone + head.
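
A minimal sketch of such a split, assuming the model is available as an ONNX file and that the tensor names at the cut are known. It uses `onnx.utils.extract_model` to write the two sub-models and then prepares one trtexec command per half, so the Q/DQ half is built with `--int8 --fp16` and the other half with `--fp16` only. All file names and tensor names here are hypothetical placeholders, not values from this issue.

```python
def split_model(full_path, inputs, boundary, outputs,
                backbone_path="backbone.onnx", head_path="head.onnx"):
    """Cut an ONNX model at the `boundary` tensors into backbone and head files."""
    # Imported here so that trtexec_commands() below stays usable without onnx.
    import onnx.utils

    # Backbone: original inputs up to the boundary tensors.
    onnx.utils.extract_model(full_path, backbone_path, inputs, boundary)
    # Head: boundary tensors through to the original outputs.
    onnx.utils.extract_model(full_path, head_path, boundary, outputs)
    return backbone_path, head_path


def trtexec_commands(backbone_path="backbone.onnx", head_path="head.onnx"):
    """Return trtexec argument lists that build each half with its own precision flags."""
    backbone = ["trtexec", f"--onnx={backbone_path}", "--int8", "--fp16",
                "--saveEngine=backbone.engine"]
    head = ["trtexec", f"--onnx={head_path}", "--fp16",
            "--saveEngine=head.engine"]
    return backbone, head


if __name__ == "__main__":
    # Hypothetical names; replace them with the real tensors of your model:
    # split_model("model_qdq.onnx", ["images"], ["backbone_out"], ["detections"])
    for cmd in trtexec_commands():
        print(" ".join(cmd))
```

If the head also consumes tensors other than the boundary ones (for example an original model input), those names need to be added to its input list. The two engines then have to be run back to back, passing the backbone outputs to the head.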
