
Bug Report: vtgate planning performance regression for big "union" since v15 #17467

czhang1216 opened this issue Jan 6, 2025 · 11 comments


czhang1216 commented Jan 6, 2025

Overview of the Issue

Continuation of the Slack thread. After switching from v15 to v19, we are seeing a 2x-3x (in some cases 2-30x) planning performance regression on a particular query. The regression is especially significant when only a small amount of data is read.

Reproduction Steps

I have included our schema and the query in question in the attached patches. Apply them to the v15 and v19 repos and you will be able to see the performance difference.

Step-by-step reproduce instructions:

  • Please download 15.patch and 19.patch to an empty directory, then run the following in that directory:
$ git clone https://github.com/vitessio/vitess.git -b release-15.0 --depth=1 vitess15
$ cd vitess15
$ git apply ../15.patch
$ go test -benchmem -run=^$ -bench ^BenchmarkMine$ --benchtime=5s -count=12 vitess.io/vitess/go/vt/vtgate/planbuilder

$ cd ..
$ git clone https://github.com/vitessio/vitess.git -b release-19.0 --depth=1 vitess19
$ cd vitess19
$ git apply ../19.patch
$ go test -benchmem -run=^$ -bench ^BenchmarkMine$ --benchtime=5s -count=12 vitess.io/vitess/go/vt/vtgate/planbuilder
# Comparing "ns/op": v19 takes ~2x the time of v15.
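
(For context, the patches add a standard Go micro-benchmark along these lines. This is only an illustrative sketch, not the actual patch contents: loadVSchemaFixture and buildPlan are hypothetical helpers standing in for whatever schema setup and planbuilder entry point the patches really use.)

package planbuilder

import "testing"

// bigUnionQuery abbreviates the 50-subquery union shown below.
const bigUnionQuery = `select blahblah from my_table where id in (
    (select id from my_table where blahblah)
    union all
    (select id from my_table where blahblah)
    /* ... 50 subqueries in total ... */
)`

func BenchmarkMine(b *testing.B) {
    vschema := loadVSchemaFixture(b) // hypothetical: loads the schema attached in the patch
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // buildPlan is a hypothetical wrapper around the planbuilder entry point.
        if _, err := buildPlan(bigUnionQuery, vschema); err != nil {
            b.Fatal(err)
        }
    }
}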

The query in question looks roughly like the following. As you can see, it has a large UNION subquery.

select
    blahblah
from my_table
where
    id in (
        (
            select id
            from my_table
            where blahblah
        )
        UNION ALL
        (
            select id
            from my_table
            where blahblah
        )
        UNION ALL
        # ... 50 subqueries in total
    )

(Please ignore the inefficiency of the query itself; it is only meant to demonstrate the planner benchmark regression.)

Binary Version

vtgate version Version: 19.0.8 (Git revision 2665aa2a74fdd26c4606598ba68746be30bd9148 branch 'HEAD') built on Tue Dec  3 05:09:55 UTC 2024 by runner@fv-az1670-25 using go1.22.9 linux/amd64

Operating System and Environment details

Red Hat Enterprise Linux Server 7.9 

$ uname -sr
Linux 3.10.0-1160.80.1.el7.x86_64

Log Fragments

No response

czhang1216 added the Needs Triage and Type: Bug labels on Jan 6, 2025
czhang1216 (Author) commented:

Sorry I forgot to attach the patches. They are here:

15.patch
19.patch

czhang1216 (Author) commented:

I did another round of benchmarking.

  1. The v19 implementation creates such a deep stack that it actually breaks pprof when there's a 50-subquery union, so I did another round of tests using a 2-query union.

  2. A lot of time is spent in Errorf (go/vt/vterrors/vterrors.go), called by vitess.io/vitess/go/vt/vtgate/evalengine.(*astCompiler).translateColName. This is likely due to the unit test setup and should not be the culprit in production.

  3. Taking the above out, the flame graph looks like the attached 2_union_flame image.

  4. Notice that a lot of time is spent in vitess.io/vitess/go/vt/vtgate/semantics.vindexTableToColumnInfo.
    It appears that RealTable.dependencies calls getColumns(), which calls vindexTableToColumnInfo, which iterates over a lot of data.
    This ends up in a busy loop because the planner does so many rounds of rewriting.


Suggestion:
Adding a ColumnInfo cache to RealTable solves the inefficiency described above; a rough sketch follows. Can a maintainer comment on correctness (i.e., are RealTable and its members immutable)?
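
A minimal sketch of the kind of cache I mean, assuming RealTable is immutable once semantic analysis has built it. All type and field names are illustrative; the real types in go/vt/vtgate/semantics differ.

package semantics

import "sync"

// ColumnInfo stands in for the real column metadata type.
type ColumnInfo struct{ Name string }

type RealTable struct {
    // ... existing fields elided ...

    colsOnce   sync.Once    // guards the lazily computed cache
    cachedCols []ColumnInfo // filled on the first getColumns call
}

// vindexTableToColumnInfo stands in for the expensive derivation seen in the
// profile; the real function's signature differs.
func vindexTableToColumnInfo(r *RealTable) []ColumnInfo {
    // ... expensive per-column work elided ...
    return nil
}

// getColumns derives the column info once and serves all later calls from the
// cache, so repeated planner rewrites stop re-deriving the same data.
func (r *RealTable) getColumns() []ColumnInfo {
    r.colsOnce.Do(func() {
        r.cachedCols = vindexTableToColumnInfo(r)
    })
    return r.cachedCols
}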

deepthi added the Component: Query Serving and Type: Performance labels and removed the Type: Bug and Needs Triage labels on Jan 10, 2025
harshit-gangal (Member) commented:

Thank you @czhang1216 for reporting and providing all the relevant details.

systay self-assigned this on Jan 13, 2025

systay commented Jan 13, 2025

Hi there 👋

So, in release-19.0, the caching change does seem to make a nice performance difference:

        │ without_cache │             with_cache             │
        │    sec/op     │    sec/op     vs base              │
Mine-20    8.378m ± ∞ ¹   7.766m ± ∞ ¹  -7.31% (p=0.008 n=5)
¹ need >= 6 samples for confidence interval at level 0.95

        │ without_cache │              with_cache              │
        │     B/op      │     B/op       vs base               │
Mine-20   5.002Mi ± ∞ ¹   4.069Mi ± ∞ ¹  -18.65% (p=0.008 n=5)
¹ need >= 6 samples for confidence interval at level 0.95

        │ without_cache │             with_cache             │
        │   allocs/op   │  allocs/op    vs base              │
Mine-20    82.35k ± ∞ ¹   81.00k ± ∞ ¹  -1.64% (p=0.008 n=5)
¹ need >= 6 samples for confidence interval at level 0.95

However, in main I don't notice any difference:

        │ without_cache │          with_cache           │
        │    sec/op     │   sec/op     vs base          │
Mine-20     5.820m ± 0%   5.838m ± 1%  ~ (p=0.105 n=10)

        │ without_cache │           with_cache           │
        │     B/op      │     B/op      vs base          │
Mine-20    3.517Mi ± 0%   3.517Mi ± 0%  ~ (p=0.066 n=10)

        │ without_cache │          with_cache           │
        │   allocs/op   │  allocs/op   vs base          │
Mine-20     74.51k ± 0%   74.51k ± 0%  ~ (p=1.000 n=10)

And when comparing the cached release-19.0 version with the current one in main, it looks like main is even faster:

        │ r19-with-cache │              main_bench               │
        │     sec/op     │   sec/op     vs base                  │
Mine-20      7.837m ± 0%   5.781m ± 1%  -26.23% (p=0.000 n=10+9)

        │ r19-with-cache │               main_bench               │
        │      B/op      │     B/op      vs base                  │
Mine-20     4.069Mi ± 0%   3.515Mi ± 0%  -13.61% (p=0.000 n=10+9)

        │ r19-with-cache │              main_bench              │
        │   allocs/op    │  allocs/op   vs base                 │
Mine-20      81.00k ± 0%   74.51k ± 0%  -8.01% (p=0.000 n=10+9)
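
(For anyone reproducing these comparisons: the tables above are benchstat output, produced roughly as below. The file names are just examples; benchstat comes from golang.org/x/perf/cmd/benchstat.)

$ go test -benchmem -run='^$' -bench '^BenchmarkMine$' -count=10 ./go/vt/vtgate/planbuilder | tee without_cache.txt
# ... apply the candidate change ...
$ go test -benchmem -run='^$' -bench '^BenchmarkMine$' -count=10 ./go/vt/vtgate/planbuilder | tee with_cache.txt
$ go install golang.org/x/perf/cmd/benchstat@latest
$ benchstat without_cache.txt with_cache.txt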

Since we don't backport performance optimisations to older releases, I don't think there is much to do here.

systay closed this as completed on Jan 13, 2025

czhang1216 commented Jan 13, 2025

@systay I'm sorry, but I don't see a comparison between main and v15. In my tests main is slightly better than v19, but still visibly worse than v15. Could you address this?

deepthi reopened this on Jan 13, 2025

systay commented Jan 14, 2025

Hi again @czhang1216! 👋

Thank you for clarifying the difference you’re seeing between release-15.0 and main. My own measurements confirm the slowdown on your particular UNION query:

        │ ../r15-bench │             ../main_bench             │
        │    sec/op    │   sec/op     vs base                  │
Mine-20    3.940m ± 3%   5.781m ± 1%  +46.72% (p=0.000 n=10+9)

        │ ../r15-bench │             ../main_bench              │
        │     B/op     │     B/op      vs base                  │
Mine-20   2.818Mi ± 0%   3.515Mi ± 0%  +24.73% (p=0.000 n=10+9)

        │ ../r15-bench │             ../main_bench             │
        │  allocs/op   │  allocs/op   vs base                  │
Mine-20    50.46k ± 0%   74.51k ± 0%  +47.65% (p=0.000 n=10+9)

Part of the reason is that our newer planner now does more extensive analysis—especially around merging UNION sub-queries—so we end up investing additional computation time. For most workloads, we expect the plan cache to spread that cost out across many runs, making it less noticeable.

Does the plan cache help in your scenario, or are you seeing this as a persistent bottleneck?

czhang1216 changed the title from "Bug Report: vtgate planning performance regression for big "union" from v15 to v19" to "Bug Report: vtgate planning performance regression for big "union" since v15" on Jan 15, 2025
czhang1216 (Author) commented:

Hi @systay, thanks for your suggestion.

I think there are real performance problems here, even if we can mitigate them with query plan caching.

For my 50-union query on the main branch, rewriteRefOfColName is called 2447 times and copyOnRewriteRefOfColName 1665 times, even though fewer than 10 columns are involved in the query. I'd suggest carefully examining the redundant work done during planning.

Secondly, long unions create a very deep stack, and Go is not good at handling that: the default pprof stack depth is 64 frames, and Go has no tail-call optimization. I'd suggest folding UNION ALLs into a single n-ary node instead of treating them as a long chain of binary ops; a rough sketch follows.
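
A sketch of the folding idea, using made-up AST node types (the real Vitess sqlparser AST differs): collapse the left-deep chain of binary UNION nodes into one n-ary node up front, iteratively, so later rewrite passes walk a single shallow node instead of recursing 50 levels deep.

package main

import "fmt"

type Expr interface{ isExpr() }

type Select struct{ SQL string }

func (*Select) isExpr() {}

// Union is the binary UNION ALL node a parser typically produces.
type Union struct{ Left, Right Expr }

func (*Union) isExpr() {}

// UnionAll is a single n-ary node holding every branch.
type UnionAll struct{ Inputs []Expr }

func (*UnionAll) isExpr() {}

// flatten collapses a tree of binary Union nodes into one UnionAll node,
// using an explicit stack so the Go call stack stays shallow no matter how
// many branches the union has.
func flatten(e Expr) Expr {
    root, ok := e.(*Union)
    if !ok {
        return e
    }
    var inputs []Expr
    stack := []Expr{root}
    for len(stack) > 0 {
        n := stack[len(stack)-1]
        stack = stack[:len(stack)-1]
        if u, ok := n.(*Union); ok {
            // Right is pushed first so Left is popped first, preserving
            // left-to-right branch order.
            stack = append(stack, u.Right, u.Left)
            continue
        }
        inputs = append(inputs, n)
    }
    return &UnionAll{Inputs: inputs}
}

func main() {
    q := &Union{ // (q1 UNION ALL q2) UNION ALL q3
        Left:  &Union{Left: &Select{SQL: "q1"}, Right: &Select{SQL: "q2"}},
        Right: &Select{SQL: "q3"},
    }
    fmt.Println(len(flatten(q).(*UnionAll).Inputs), "branches") // prints: 3 branches
}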


deepthi commented Jan 15, 2025

@czhang1216 what made you focus on the planning time? Is it a significant proportion of overall query time?
Can you share a flame graph that shows that as well?


czhang1216 commented Jan 15, 2025

> @czhang1216 what made you focus on the planning time? Is it a significant proportion of overall query time? Can you share a flame graph that shows that as well?

Yes, the plan time is significant. In the real world, when we read ~50 rows out of ~20 shards (all the microbenchmarks above are single-shard), the vtgate planning time is 50+% of the overall query time. Since v15 it is 2x, sometimes 10x, worse, taking a large bite out of total performance.

Yes, caching mitigates the problem, but when something is untidy we should still clean it up.

The flame graph for the 50-subquery union looks like this (attached image).
Notice the big stack of Rewrite/BottomUp calls on the right; they should all be inside the testing call.
This deep stack makes looking for the culprit much harder.
(Even increasing the default pprof stack depth to 2000 by recompiling Go doesn't solve it completely: Go copies the stack into a buffer for profiling, and when the stack is this deep the buffer cannot keep up, resulting in a similarly broken flame graph.)


deepthi commented Jan 15, 2025

Thanks for the detailed info. We will continue to look into this.


systay commented Jan 17, 2025

Hi again @czhang1216

I think @deepthi asked for a flamegraph that includes the running of the query. You are showing only the planning part.
Would it be possible for you to show the flamegraph that includes both planning and query execution?
