insert_overwrite (insert+replace) partitions incremental strategy #201

Merged
merged 10 commits into from
Aug 1, 2024

Conversation

Contributor

@bryzgaloff bryzgaloff commented Oct 26, 2023

Summary

This PR implements the insert+replace strategy discussed in #128, which does the following:

  • Creates a new staging table with the same structure as the target table.
  • Inserts data into the staging table.
  • Replaces partitions in the target table from the staging table.
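The three steps above can be sketched as the SQL statements such a run would issue. This is only an illustration, not the adapter's actual macro: the function name `insert_replace_statements`, the table names, and passing the affected partition IDs in as an argument are assumptions made for the example.

```python
def insert_replace_statements(target, staging, select_sql, partition_ids):
    """Return the SQL statements for one insert+replace run (illustrative only)."""
    statements = [
        # 1. Staging table with the same structure as the target table.
        f"CREATE TABLE {staging} AS {target}",
        # 2. New data goes to staging first, so a failed insert leaves the target untouched.
        f"INSERT INTO {staging} {select_sql}",
    ]
    # 3. Replace only the affected partitions in the target table.
    for partition_id in partition_ids:
        statements.append(
            f"ALTER TABLE {target} REPLACE PARTITION ID '{partition_id}' FROM {staging}"
        )
    # Clean up the staging table once the partitions are swapped.
    statements.append(f"DROP TABLE {staging}")
    return statements


if __name__ == "__main__":
    for sql in insert_replace_statements(
        "analytics.events", "analytics.events__staging",
        "SELECT * FROM analytics.events__new", ["202401", "202402"],
    ):
        print(sql)
```

Partitions that receive no new rows are never mentioned in an `ALTER TABLE` statement, which is where the cost saving over reinserting the full table comes from.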

Advantages:

  • Only the affected partitions are replaced: this is much cheaper than reinserting the full table, as the other strategies do.
  • If an insertion fails, the target table is not affected.

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@CLAassistant

CLAassistant commented Oct 26, 2023

CLA assistant check
All committers have signed the CLA.

@bryzgaloff
Contributor Author

At the time of publishing, the PR is in a WIP (work in progress) state since I need advice from the community. Thus, neither the documentation nor the tests have been updated.

@bryzgaloff bryzgaloff force-pushed the incremental-insert_replace branch from 262eaec to 143f599 Compare October 26, 2023 17:03
Comment on lines 45 to 47
inserts_only
or unique_key is none
and config.get('incremental_strategy', none) != 'insert+replace' -%}
Contributor Author

@bryzgaloff bryzgaloff Oct 26, 2023

Mixing the legacy inserts_only flag with the newer "append" strategy results in a counter-intuitive and sometimes conflicting set of configuration options in dbt-clickhouse. To avoid breaking the current condition-checking flow, I have implemented this check in a dirty way, though in general I would suggest refactoring this condition tree.

I would also introduce strict config consistency checks: e.g. inserts_only must prohibit any incremental_strategy other than "append" (emitting a warning in that case, since it is redundant) or unspecified. Please let me know if some work on this has already started: reference an issue or a PR.

To keep the review scope narrow, I won't blend those changes into this PR, so I suggest keeping this dirty-yet-working workaround here.
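The consistency check proposed in this comment could look roughly like the sketch below. It is written in Python for illustration only; `check_incremental_config` is a hypothetical helper, not an existing function in dbt-clickhouse, and treating an unspecified strategy as "append" when inserts_only is set is an assumption.

```python
import warnings


def check_incremental_config(inserts_only, incremental_strategy):
    """Validate inserts_only against incremental_strategy (hypothetical helper)."""
    if not inserts_only:
        # Nothing to reconcile: the strategy is used as configured.
        return incremental_strategy
    if incremental_strategy is None:
        # Assumption: inserts_only on its own behaves like "append".
        return "append"
    if incremental_strategy == "append":
        # Allowed, but one of the two settings is redundant.
        warnings.warn("inserts_only is redundant when incremental_strategy is 'append'")
        return "append"
    # Any other combination is contradictory and rejected up front.
    raise ValueError(
        f"inserts_only cannot be combined with incremental_strategy={incremental_strategy!r}"
    )
```

With such a check in place, the macro condition would no longer need a special case for one strategy name.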

dbt/adapters/clickhouse/impl.py Outdated Show resolved Hide resolved
@@ -234,3 +244,36 @@
{% do adapter.drop_relation(new_data_relation) %}
{{ drop_relation_if_exists(distributed_new_data_relation) }}
{% endmacro %}

{% macro clickhouse__incremental_insert_replace(existing_relation, intermediate_relation, partition_by) %}
Contributor

This strategy should also take care of cluster setups and distributed tables.
We can insert data through the Distributed table and then run the replace on the local table on each shard. This part will be mostly the same as in the clickhouse__incremental_delete_insert macro.

Contributor Author

@bryzgaloff bryzgaloff Oct 27, 2023

Should we maybe defer the distributed implementation until the next iteration? My immediate requirement was to support the strategy for single-node setups (which my team currently uses). We could leave it to others to implement and test the cluster-specific version. I don't have a quick way to test it, so I wouldn't be able to confirm that it works.

Your suggestion with system.parts works completely fine, thanks for that! I plan to cover it with tests and document it in the next couple of weeks (I am on vacation next week :) ).

My team will test the approach on our real use cases to make sure it works.

Contributor

I'm not sure about merging without Distributed support, because this is new and mostly experimental functionality. I think this is a question for @genzgd, as he is a maintainer.

But I believe it should work with clusters from the beginning, because this is core ClickHouse functionality and most production users don't run a single-server setup.

Contributor Author

@bryzgaloff bryzgaloff Nov 17, 2023

We currently use dbt with ClickHouse without a cluster, so I have not had an opportunity to test it. I suggest merging the tested version of my PR now. Once someone needs a clustered version of the strategy, they can contribute it and test it themselves.

If I add the cluster-related logic, I cannot guarantee that it works. However, if the maintainers are OK with relying on their review, I am completely fine with adding the related code snippet.

Please let me know which code snippet I have to add, if it is required. Once again, my suggestion is to proceed with the tested version, leaving room for someone else's contribution.

{%- endcall %}
{% if execute %}
{% set select_changed_partitions %}
select distinct {{ partition_by|join(', ') }}
Contributor

We can do something like this to get the partitions.
This way we are guaranteed to select all partitions in the temp relation,
and it will be faster.

{% set partitions = get_partitions(relation) %}
...
{% macro get_partitions(relation) %}
  {# Read partitions from system.parts; query every node when a cluster is configured #}
  {% set cluster = adapter.get_clickhouse_cluster_name() %}
  {% set source = 'system.parts' %}
  {% if cluster is not none %}
    {% set source = "cluster('" ~ cluster ~ "', system.parts)" %}
  {% endif %}
  {% set sql -%}
    SELECT DISTINCT partition_id
    FROM {{ source }} WHERE active AND database = '{{ relation.schema }}' AND table = '{{ relation.identifier }}'
  {%- endset -%}

  {{ return(run_query(sql)) }}
{% endmacro %}
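For readers less familiar with Jinja, the query this macro is meant to render can be written out in Python, where the string interpolation is explicit. This is a sketch only; `partitions_query` is a hypothetical name, and the schema, table, and cluster values in the usage lines are made up.

```python
def partitions_query(schema, table, cluster=None):
    """Build the system.parts lookup for a relation (illustrative only)."""
    # On a single node, system.parts is read directly; with a cluster,
    # the cluster() table function reads it from every node.
    source = "system.parts"
    if cluster is not None:
        source = f"cluster('{cluster}', system.parts)"
    return (
        "SELECT DISTINCT partition_id "
        f"FROM {source} "
        f"WHERE active AND database = '{schema}' AND table = '{table}'"
    )


if __name__ == "__main__":
    print(partitions_query("analytics", "events__staging"))
    print(partitions_query("analytics", "events__staging", cluster="main"))
```

Because only active parts of the staging table are listed, every partition that received new rows is returned without scanning the data itself.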

@bryzgaloff bryzgaloff requested a review from simpl1g November 17, 2023 20:28
@bryzgaloff
Contributor Author

Hi @simpl1g and @genzgd, I've wrapped up this PR for insert+replace strategy implementation: added some fancy documentation in the README and threw in integration tests (taking a page from lw-deletes). Motivation for the feature is given in the PR's description.

Some discussions above are unresolved, waiting on your pearls of wisdom there :)

Please take a peek at the PR when you get a chance, and hit me up with your thoughts. Hoping we can merge the feature soon and let the good times roll with the new strategy! :)

@bryzgaloff bryzgaloff changed the title [WIP] insert+replace partitions incremental strategy insert+replace partitions incremental strategy Nov 22, 2023
@bryzgaloff
Contributor Author

Hi @simpl1g and @genzgd! This is a kind reminder about my PR, which is ready for your review. We have successfully battle-tested it internally.

We currently install my version from GitHub. It would be nice if you could approve it and release it to PyPI. If any adjustments are required, please let me know! 🙏

@genzgd
Contributor

genzgd commented Dec 15, 2023

@bryzgaloff I apologize that we haven't yet had the resources to fully review this PR. As you may have noticed we've been focused on bug fixes and compatibility with the new dbt releases. Please know that we very much appreciate the contribution (especially with test cases and real world usage) and your work is next on the roadmap as we get time.

If you have a chance to resolve the conflicts over the next few weeks that would be appreciated and make the review just a bit easier.

create table {{ intermediate_relation }} as {{ existing_relation }}
{%- endcall %}
{% call statement('insert_new_data') -%}
insert into {{ intermediate_relation }} select * from {{ new_data_relation }}

@bryzgaloff as you said, you have a single-node deployment. There should be an {{ on_cluster_clause() }} here. Otherwise the error below occurs in my deployment. I'm currently working on adapting this to work on a cluster.

15:04:39    :HTTPDriver for http://10.100.0.106:8123 returned response code 400)
15:04:39     Code: 36. DB::Exception: Macro 'uuid' and empty arguments of ReplicatedMergeTree are supported only for ON CLUSTER queries with Atomic database engine. (BAD_ARGUMENTS) (version 24.2.1.2248 (official build))

I also have to add the allow_nullable_key setting to the model config, because otherwise another error occurs. Have you run into that?

14:18:56    :HTTPDriver for http://10.100.0.106:8123 returned response code 400)
14:18:56     Code: 44. DB::Exception: There was an error on [10.100.0.106:9000]: Code: 44. DB::Exception: Partition key contains nullable columns, but merge tree setting `allow_nullable_key` is disabled. (ILLEGAL_COLUMN) (version 24.2.1.2248 (official build)). (ILLEGAL_COLUMN) (version 24.2.1.2248 (official build))
{{
config(
    materialized = "incremental",
    partition_by = "transaction_date_part",
    incremental_strategy = "insert+replace",
    engine = 'ReplicatedMergeTree',
    order_by = ['city_id', 'transaction_date_part'],
    schema = "analytics",
    settings = {'allow_nullable_key': 1}
)
}}
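The effect of the ON CLUSTER clause requested in this comment can be illustrated with a small Python sketch. The function names mirror the macro mentioned above but are hypothetical stand-ins, and the unquoted cluster name and table names are assumptions.

```python
def on_cluster_clause(cluster=None):
    """Return an ON CLUSTER clause, or an empty string for single-node setups."""
    return f" ON CLUSTER {cluster}" if cluster else ""


def create_staging_sql(staging, target, cluster=None):
    """Render the staging-table DDL with an optional ON CLUSTER clause."""
    return f"CREATE TABLE {staging}{on_cluster_clause(cluster)} AS {target}"


if __name__ == "__main__":
    # Single node: the statement stays exactly as in the reviewed macro.
    print(create_staging_sql("analytics.t__staging", "analytics.t"))
    # Cluster: the DDL is distributed to every node of the cluster.
    print(create_staging_sql("analytics.t__staging", "analytics.t", cluster="main"))
```

Rendering an empty string when no cluster is configured keeps the single-node statement unchanged.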

Contributor Author

Hi @dev-mkc19, thank you for your input! I plan to get back to working on the PR next week. For now, I will keep it without cluster support as I do not have quick-to-setup infrastructure to test it. Feel free to make your own PR adding cluster support 🤝 I may review it to ensure it does not break any of the insert+replace semantics. Tag me as a reviewer once it is published!

@BentsiLeviav BentsiLeviav self-assigned this Jul 3, 2024
@BentsiLeviav
Contributor

@bryzgaloff
Thanks again for your contribution!
I would like to review your PR and merge it within the next few days. Can you sync your fork with the main repo? We upgraded dbt-core to 1.8.0.

After syncing I'll review this one right away.

@bryzgaloff
Contributor Author

Hi @BentsiLeviav thank you (and the other reviewers, of course!) for your participation. I am not actively using the plugin right now, but I may get back to handling your review feedback late next week. If the conflicts are not too critical, I might be able to resolve them quickly.

…ment insert+replace strategy

TODO: convert partition_expression to ClickHouse literals
…ting affected partitions from system.parts

This avoids translating agate.Row to ClickHouse literals.
…m.parts.partition_id -> partition

partition_id is a String field holding the internal partition ID; it cannot be used in a REPLACE PARTITION clause. The "partition" field is a string representation of the partition expression and can be used in a REPLACE PARTITION query as-is.
…ify partition by ID

The system.parts.partition field does not work for strings. ClickHouse allows manipulating partitions by referencing their IDs.
According to PR review comments
@bryzgaloff bryzgaloff force-pushed the incremental-insert_replace branch from 8940ca9 to 616ea7a Compare July 15, 2024 15:03
@bryzgaloff
Contributor Author

Hi @BentsiLeviav I have rebased my contribution onto the latest main of this repository. Also, I have renamed the strategy to insert_replace. Please have a look! 👋

@bryzgaloff
Contributor Author

bryzgaloff commented Jul 15, 2024

Does this repo have automated tests? I see there are workflows waiting to be approved.

I do not have infrastructure at hand to quickly retest the contribution after the rebase, but if there is no testing workflow, I will test it manually.

@bryzgaloff bryzgaloff requested a review from BentsiLeviav July 15, 2024 15:07
@bryzgaloff bryzgaloff changed the title insert+replace partitions incremental strategy insert_overwrite (insert+replace) partitions incremental strategy Jul 15, 2024
@bryzgaloff
Contributor Author

bryzgaloff commented Jul 22, 2024

Hi @BentsiLeviav and @genzgd, will you have a chance to review and merge the PR soon? I have updated everything according to the review feedback and all the checks have passed. I would like to avoid the need to rebase it again 😅

@BentsiLeviav
Contributor

Hi @bryzgaloff
Huge thanks for this.
I will review and update you within the next few days.

@bryzgaloff
Contributor Author

Thank you @BentsiLeviav for the approval! What are the next steps for the PR to be merged?

@BentsiLeviav
Contributor

Hi @bryzgaloff
Sure, thank you for your contribution!

Before merging this, could you please add to the doc that this feature is experimental, and wasn't tested with cluster setup?
It is crucial to highlight these 2 points.

Once we are done with that, I'll merge your PR.

Thanks again for your work!

@BentsiLeviav
Contributor

Never mind, I'll take care of it :)

@BentsiLeviav BentsiLeviav merged commit 46904a4 into ClickHouse:main Aug 1, 2024
21 checks passed
@bryzgaloff
Contributor Author

Thank you for all your help and for the merge, @BentsiLeviav! 🤝

@pheepa

pheepa commented Nov 25, 2024

@BentsiLeviav @genzgd

Hi! I’m really interested in contributing to this project, particularly with the cluster functionality.

Before diving into development, I want to ensure I write robust tests for the support of ReplicatedMergeTree and Distributed.

Here’s my plan:

• Add TestInsertReplaceIncremental (adapted for Replicated) to test_base_incremental.py.
• Add another version of TestInsertReplaceIncremental (for materialized=distributed_incremental).

Does this approach align with the project’s practices, or would you recommend any adjustments?

Thanks in advance for your guidance!

@pheepa

pheepa commented Nov 28, 2024

Here it is: Support of insert_overwrite in cluster setup #394.
I would appreciate it if you could review it.
