
Bug Report: vtgate planning performance regression for big "union" since v15 #17467

czhang1216 opened this issue Jan 6, 2025 · 11 comments


czhang1216 commented Jan 6, 2025

Overview of the Issue

Continuation of the Slack thread. After switching from v15 to v19, we are seeing a 2x-3x (in some cases 2-30x) planning performance regression on a particular query. The regression is especially significant when only a small amount of data is read.

Reproduction Steps

I have included our schema and the query in question in the attached patches. Apply them to the v15 and v19 repos and you will be able to see the performance difference.

Step-by-step reproduce instructions:

  • Please download 15.patch and 19.patch to an empty directory, then run the following in that directory:
$ git clone https://github.com/vitessio/vitess.git -b release-15.0 --depth=1 vitess15
$ cd vitess15
$ git apply ../15.patch
$ go test -benchmem -run=^$ -bench ^BenchmarkMine$ --benchtime=5s -count=12 vitess.io/vitess/go/vt/vtgate/planbuilder

$ cd ..
$ git clone https://github.com/vitessio/vitess.git -b release-19.0 --depth=1 vitess19
$ cd vitess19
$ git apply ../19.patch
$ go test -benchmem -run=^$ -bench ^BenchmarkMine$ --benchtime=5s -count=12 vitess.io/vitess/go/vt/vtgate/planbuilder
# Comparing "ns/op": v19 takes ~2x the time of v15.
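
(For context, the patches add a standard Go micro-benchmark along these lines. This is only an illustrative sketch, not the actual patch contents: loadVSchemaFixture and buildPlan are hypothetical helpers standing in for whatever schema setup and planbuilder entry point the patches really use.)

package planbuilder

import "testing"

// bigUnionQuery abbreviates the 50-subquery union shown below.
const bigUnionQuery = `select blahblah from my_table where id in (
    (select id from my_table where blahblah)
    union all
    (select id from my_table where blahblah)
    /* ... 50 subqueries in total ... */
)`

func BenchmarkMine(b *testing.B) {
    vschema := loadVSchemaFixture(b) // hypothetical: loads the schema attached in the patch
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // buildPlan is a hypothetical wrapper around the planbuilder entry point.
        if _, err := buildPlan(bigUnionQuery, vschema); err != nil {
            b.Fatal(err)
        }
    }
}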

The query in question looks roughly like the following. As you can see, it has a large UNION subquery.

select
    blahblah
from my_table
where
    id in (
        (
            select id
            from my_table
            where blahblah
        )
        UNION ALL
        (
            select id
            from my_table
            where blahblah
        )
        UNION ALL
        # ... 50 subqueries in total
    )

(Please ignore the inefficiency of the query itself; it is only meant to demonstrate the planner benchmark regression.)

Binary Version

vtgate version Version: 19.0.8 (Git revision 2665aa2a74fdd26c4606598ba68746be30bd9148 branch 'HEAD') built on Tue Dec  3 05:09:55 UTC 2024 by runner@fv-az1670-25 using go1.22.9 linux/amd64

Operating System and Environment details

Red Hat Enterprise Linux Server 7.9 

$ uname -sr
Linux 3.10.0-1160.80.1.el7.x86_64

Log Fragments

No response

czhang1216 added the Needs Triage and Type: Bug labels on Jan 6, 2025
czhang1216 (Author) commented:

Sorry I forgot to attach the patches. They are here:

15.patch
19.patch

czhang1216 (Author) commented:

I did another round of benchmarking.

  1. The v19 implementation creates such a deep stack that it actually breaks pprof when there's a 50-subquery union, so I did another round of tests using a 2-query union.

  2. A lot of time is spent in Errorf (go/vt/vterrors/vterrors.go), called by vitess.io/vitess/go/vt/vtgate/evalengine.(*astCompiler).translateColName. This is likely due to the unit test setup and should not be the culprit in production.

  3. Taking the above out, the flame graph looks like the attached 2_union_flame image.

  4. Notice that a lot of time is spent in vitess.io/vitess/go/vt/vtgate/semantics.vindexTableToColumnInfo.
    It appears that RealTable.dependencies calls getColumns(), which calls vindexTableToColumnInfo, which iterates over a lot of data.
    This ends up in a busy loop because the planner does so many rounds of rewriting.


Suggestion:
Adding a ColumnInfo cache to RealTable solves the inefficiency described above; a rough sketch follows. Can a maintainer comment on correctness (i.e., are RealTable and its members immutable)?
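
A minimal sketch of the kind of cache I mean, assuming RealTable is immutable once semantic analysis has built it. All type and field names are illustrative; the real types in go/vt/vtgate/semantics differ.

package semantics

import "sync"

// ColumnInfo stands in for the real column metadata type.
type ColumnInfo struct{ Name string }

type RealTable struct {
    // ... existing fields elided ...

    colsOnce   sync.Once    // guards the lazily computed cache
    cachedCols []ColumnInfo // filled on the first getColumns call
}

// vindexTableToColumnInfo stands in for the expensive derivation seen in the
// profile; the real function's signature differs.
func vindexTableToColumnInfo(r *RealTable) []ColumnInfo {
    // ... expensive per-column work elided ...
    return nil
}

// getColumns derives the column info once and serves all later calls from the
// cache, so repeated planner rewrites stop re-deriving the same data.
func (r *RealTable) getColumns() []ColumnInfo {
    r.colsOnce.Do(func() {
        r.cachedCols = vindexTableToColumnInfo(r)
    })
    return r.cachedCols
}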

deepthi added the Component: Query Serving and Type: Performance labels and removed the Type: Bug and Needs Triage labels on Jan 10, 2025
harshit-gangal (Member) commented:

Thank you @czhang1216 for reporting and providing all the relevant details.

systay self-assigned this on Jan 13, 2025

systay commented Jan 13, 2025

Hi there 👋

So, in release-19.0, the caching change does seem to make a nice performance difference:

        │ without_cache │             with_cache             │
        │    sec/op     │    sec/op     vs base              │
Mine-20    8.378m ± ∞ ¹   7.766m ± ∞ ¹  -7.31% (p=0.008 n=5)
¹ need >= 6 samples for confidence interval at level 0.95

        │ without_cache │              with_cache              │
        │     B/op      │     B/op       vs base               │
Mine-20   5.002Mi ± ∞ ¹   4.069Mi ± ∞ ¹  -18.65% (p=0.008 n=5)
¹ need >= 6 samples for confidence interval at level 0.95

        │ without_cache │             with_cache             │
        │   allocs/op   │  allocs/op    vs base              │
Mine-20    82.35k ± ∞ ¹   81.00k ± ∞ ¹  -1.64% (p=0.008 n=5)
¹ need >= 6 samples for confidence interval at level 0.95

However, in main I don't notice any difference:

        │ without_cache │          with_cache           │
        │    sec/op     │   sec/op     vs base          │
Mine-20     5.820m ± 0%   5.838m ± 1%  ~ (p=0.105 n=10)

        │ without_cache │           with_cache           │
        │     B/op      │     B/op      vs base          │
Mine-20    3.517Mi ± 0%   3.517Mi ± 0%  ~ (p=0.066 n=10)

        │ without_cache │          with_cache           │
        │   allocs/op   │  allocs/op   vs base          │
Mine-20     74.51k ± 0%   74.51k ± 0%  ~ (p=1.000 n=10)

And when comparing the cached release-19.0 version with the current one in main, it looks like main is even faster:

        │ r19-with-cache │              main_bench               │
        │     sec/op     │   sec/op     vs base                  │
Mine-20      7.837m ± 0%   5.781m ± 1%  -26.23% (p=0.000 n=10+9)

        │ r19-with-cache │               main_bench               │
        │      B/op      │     B/op      vs base                  │
Mine-20     4.069Mi ± 0%   3.515Mi ± 0%  -13.61% (p=0.000 n=10+9)

        │ r19-with-cache │              main_bench              │
        │   allocs/op    │  allocs/op   vs base                 │
Mine-20      81.00k ± 0%   74.51k ± 0%  -8.01% (p=0.000 n=10+9)
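
(For anyone reproducing these comparisons: the tables above are benchstat output, produced roughly as below. The file names are just examples; benchstat comes from golang.org/x/perf/cmd/benchstat.)

$ go test -benchmem -run='^$' -bench '^BenchmarkMine$' -count=10 ./go/vt/vtgate/planbuilder | tee without_cache.txt
# ... apply the candidate change ...
$ go test -benchmem -run='^$' -bench '^BenchmarkMine$' -count=10 ./go/vt/vtgate/planbuilder | tee with_cache.txt
$ go install golang.org/x/perf/cmd/benchstat@latest
$ benchstat without_cache.txt with_cache.txt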

Since we don't backport performance optimisations to older releases, I don't think there is much to do here.

systay closed this as completed on Jan 13, 2025

czhang1216 commented Jan 13, 2025

@systay I'm sorry, but I don't see a comparison between main and v15. In my tests main is slightly better than v19, but still visibly worse than v15. Could you address this?

deepthi reopened this on Jan 13, 2025

systay commented Jan 14, 2025

Hi again @czhang1216! 👋

Thank you for clarifying the difference you’re seeing between release-15.0 and main. My own measurements confirm the slowdown on your particular UNION query:

        │ ../r15-bench │             ../main_bench             │
        │    sec/op    │   sec/op     vs base                  │
Mine-20    3.940m ± 3%   5.781m ± 1%  +46.72% (p=0.000 n=10+9)

        │ ../r15-bench │             ../main_bench              │
        │     B/op     │     B/op      vs base                  │
Mine-20   2.818Mi ± 0%   3.515Mi ± 0%  +24.73% (p=0.000 n=10+9)

        │ ../r15-bench │             ../main_bench             │
        │  allocs/op   │  allocs/op   vs base                  │
Mine-20    50.46k ± 0%   74.51k ± 0%  +47.65% (p=0.000 n=10+9)

Part of the reason is that our newer planner now does more extensive analysis—especially around merging UNION sub-queries—so we end up investing additional computation time. For most workloads, we expect the plan cache to spread that cost out across many runs, making it less noticeable.

Does the plan cache help in your scenario, or are you seeing this as a persistent bottleneck?

czhang1216 changed the title from "Bug Report: vtgate planning performance regression for big "union" from v15 to v19" to "Bug Report: vtgate planning performance regression for big "union" since v15" on Jan 15, 2025
czhang1216 (Author) commented:

Hi @systay, thanks for your suggestion.

I think there are real performance problems here, even if we can mitigate them with query plan caching.

For my 50-union query on the main branch, rewriteRefOfColName is called 2447 times and copyOnRewriteRefOfColName 1665 times, even though fewer than 10 columns are involved in the query. I'd suggest carefully examining the redundant work done during planning.

Secondly, long unions create a very deep stack, and Go is not good at handling that: the default pprof stack depth is 64 frames, and Go has no tail-call optimization. I'd suggest folding UNION ALLs into a single n-ary node instead of treating them as a long chain of binary ops; a rough sketch follows.
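
A sketch of the folding idea, using made-up AST node types (the real Vitess sqlparser AST differs): collapse the left-deep chain of binary UNION nodes into one n-ary node up front, iteratively, so later rewrite passes walk a single shallow node instead of recursing 50 levels deep.

package main

import "fmt"

type Expr interface{ isExpr() }

type Select struct{ SQL string }

func (*Select) isExpr() {}

// Union is the binary UNION ALL node a parser typically produces.
type Union struct{ Left, Right Expr }

func (*Union) isExpr() {}

// UnionAll is a single n-ary node holding every branch.
type UnionAll struct{ Inputs []Expr }

func (*UnionAll) isExpr() {}

// flatten collapses a tree of binary Union nodes into one UnionAll node,
// using an explicit stack so the Go call stack stays shallow no matter how
// many branches the union has.
func flatten(e Expr) Expr {
    root, ok := e.(*Union)
    if !ok {
        return e
    }
    var inputs []Expr
    stack := []Expr{root}
    for len(stack) > 0 {
        n := stack[len(stack)-1]
        stack = stack[:len(stack)-1]
        if u, ok := n.(*Union); ok {
            // Right is pushed first so Left is popped first, preserving
            // left-to-right branch order.
            stack = append(stack, u.Right, u.Left)
            continue
        }
        inputs = append(inputs, n)
    }
    return &UnionAll{Inputs: inputs}
}

func main() {
    q := &Union{ // (q1 UNION ALL q2) UNION ALL q3
        Left:  &Union{Left: &Select{SQL: "q1"}, Right: &Select{SQL: "q2"}},
        Right: &Select{SQL: "q3"},
    }
    fmt.Println(len(flatten(q).(*UnionAll).Inputs), "branches") // prints: 3 branches
}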


deepthi commented Jan 15, 2025

@czhang1216 what made you focus on the planning time? Is it a significant proportion of overall query time?
Can you share a flame graph that shows that as well?


czhang1216 commented Jan 15, 2025

> @czhang1216 what made you focus on the planning time? Is it a significant proportion of overall query time? Can you share a flame graph that shows that as well?

Yes, the plan time is significant. In the real world, when we read ~50 rows out of ~20 shards (all the microbenchmarks above are single-shard), the vtgate planning time is 50+% of the overall query time. Since v15 it is 2x, sometimes 10x, worse, taking a large bite out of total performance.

Yes, caching mitigates the problem, but when something is untidy we should still clean it up.

The flame graph for the 50-subquery union looks like this (attached image).
Notice the big stack of Rewrite/BottomUp calls on the right; they should all be inside the testing call.
This deep stack makes looking for the culprit much harder.
(Even increasing the default pprof stack depth to 2000 by recompiling Go doesn't solve it completely: Go copies the stack into a buffer for profiling, and when the stack is this deep the buffer cannot keep up, resulting in a similarly broken flame graph.)


deepthi commented Jan 15, 2025

Thanks for the detailed info. We will continue to look into this.


systay commented Jan 17, 2025

Hi again @czhang1216

I think @deepthi asked for a flamegraph that includes the running of the query. You are showing only the planning part.
Would it be possible for you to show the flamegraph that includes both planning and query execution?
