
[Spike] Explore integration with Dagster #3180

Open
astrojuanlu opened this issue Oct 16, 2023 · 20 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@astrojuanlu
Member

Description

I have heard from several data people that they're happy with Dagster, which is probably the only "modern", widely used orchestrator that is not mentioned in our docs.

There was a request upstream to add Kedro integration to Dagster dagster-io/dagster#2062 but it's unclear what finally happened.

@astrojuanlu astrojuanlu added Issue: Feature Request New feature or improvement to existing feature Component: Documentation 📄 Issue/PR for markdown and API documentation labels Oct 16, 2023
@stichbury
Contributor

I'm not clear what the ticket here is for. Is this documentation along the lines of #2817 ?

@datajoely
Contributor

I think it involves more of a spike to work out how it would actually work. I think Flyte (LFAI), Dagster and Metaflow all fall into the modern orchestrator space, which isn't served by Kedro. I would also push for us to address some of the fundamentals outlined in #3094 before doing this.

@stichbury
Contributor

Thanks! But in that case, it's not a docs ticket so I'll remove the label.

@stichbury stichbury removed the Component: Documentation 📄 Issue/PR for markdown and API documentation label Oct 16, 2023
@astrojuanlu
Member Author

Thanks both - yeah initially I thought about it as a docs ticket (even though the phrasing didn't match) but you're right, this should be a spike first.

And good point @datajoely on looking at Flyte and Metaflow too (let's call them Tier 3), although both have roughly 0.1x the PyPI downloads of Dagster, so I wouldn't consider them at the same level of adoption. For reference, Dagster and Prefect (Tier 2) have about the same number of downloads, and both have about 0.05x the downloads of Airflow (Tier 1). Kedro lies between Tier 2 and Tier 3 at the moment.

@astrojuanlu astrojuanlu changed the title Explore integration with Dagster [Spike] Explore integration with Dagster Oct 16, 2023
@datajoely
Contributor

Aligned - I also think Dagster is closer to Kedro than the others in terms of granularity. In recent years they've really invested in their dbt integration, and perhaps we can take inspiration from how they've done that.

@MatthiasRoels

I never explored Dagster as much as I should have; I really like the idea of software-defined assets. However, Dagster looks complicated, as it has many concepts to understand. I'm also not sure how individual tasks run (especially in a Kubernetes context).

@astrojuanlu
Member Author

@gtauzin experimenting with Kedro & Dagster! https://github.com/gtauzin/kedro-spaceflights-dagster

@gtauzin

gtauzin commented Nov 5, 2024

Hey! Thanks @astrojuanlu for pinging me, it's nice to see some interest in a dagster integration!

It seems to me kedro and dagster are nicely complementary:

  • dagster has an asset-driven perspective, which pushes you to define nodes in a graph in terms of the assets they generate. However, the node functions do not have to return the assets or take the ones they depend on as inputs. The data asset I/O is left to the user to write. This can be very confusing at first.
  • kedro has numerous data connectors and dataset factories, which help provide some structure and clarity to complex pipelines. They could be directly useful for defining dagster "asset factories" and remove the trouble of having to handle I/O.

I also feel, as @MatthiasRoels does, that dagster has a lot of concepts. Each of them separately is not necessarily complex, but the way they relate to each other is not always clear to me from the documentation (and the chatbot in there has confused me more than anything else so far).

For example, there are several ways of mapping kedro to dagster, because dagster has many concepts around generic tasks:

  • an op: a task not necessarily associated with an asset;
  • a graph of ops;
  • an asset: which is also an op;
  • a multi asset: an op that defines multiple assets;
  • an asset graph: a graph of ops that ends up defining an asset.

In practice, to map kedro nodes, I believe multi assets would make sense even in the case of a node that does not have any outputs (and therefore does not define any assets). This is because ops are second-class citizens in dagster: they do not even appear in the DAG visualization (the global asset lineage) in the UI, but are presented as a list lost somewhere in a menu. In the case of the spaceflights example, the last node, "evaluate_model_node", does not have any outputs. Defining it as a multi_asset with a corresponding intangible asset allows it to be part of the asset DAG:

[Screenshot: the spaceflights asset DAG in the Dagster UI, with evaluate_model_node included as a multi asset]
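To make the mapping concrete, here is a small pure-Python sketch of that idea. It uses plain dicts instead of Dagster's actual `multi_asset`/`AssetOut` API; the node names follow the spaceflights example, and everything else (including the `_done` placeholder naming) is illustrative, not kedro-dagster's actual behaviour:

```python
# Sketch: map kedro-style nodes onto dagster-style "multi asset" specs.
# Pure-Python stand-in; a real integration would build dagster.multi_asset
# objects instead of dicts. Placeholder naming is hypothetical.

def node_to_multi_asset_spec(node):
    """Each node becomes one multi asset; its outputs become the asset keys.
    A node with no outputs gets a placeholder ("intangible") asset so it
    still appears on the asset graph instead of being a bare op."""
    asset_keys = node["outputs"] or [f"{node['name']}_done"]  # placeholder
    return {
        "name": node["name"],
        "asset_keys": asset_keys,
        "deps": node["inputs"],  # upstream assets this multi asset consumes
    }

nodes = [
    {"name": "train_model_node", "inputs": ["X_train", "y_train"],
     "outputs": ["regressor"]},
    {"name": "evaluate_model_node", "inputs": ["regressor", "X_test", "y_test"],
     "outputs": []},  # no outputs -> would otherwise be an invisible op
]
specs = [node_to_multi_asset_spec(n) for n in nodes]
```

With this rule, `evaluate_model_node` gets the placeholder asset and stays visible on the asset lineage alongside `regressor`.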

This small project is a way for me to deepen my understanding of both kedro and dagster, and it is also something I am planning to use for work in the near future. So if you're interested or are also looking into it, don't hesitate to ping me on the kedro slack; I'd be happy to discuss it more.

@datajoely
Contributor

Super cool work!

@astrojuanlu
Member Author

New link 🔥 https://github.com/gtauzin/kedro-dagster

@fdroessler
Contributor

Love seeing this getting some traction! I'm trying to find time to look into kedro + dagster myself but miserably failing at it 😞

Just adding a few observations here of things I've seen while looking into this. The approaches that dagster and kedro take to integrate with other frameworks differ slightly. I'll try to add more detail on the dagster approach as I understand it, because it might be the less familiar one, and writing it down will help me structure my thoughts.

Before starting, I just want to mention that I don't think either approach is more legitimate than the other; I just wanted to share what I've found so far. Hope this helps others, and I'll try to put some code down or iterate on these thoughts as I go along. Also, I am sure @deepyaman will soon have a much better view on this 😄 (congrats btw).

General approaches

TLDR: The difference in approaches mostly revolves around whether the logic for converting kedro to dagster is the responsibility of kedro or dagster.

Kedro approach:

The kedro approach (as far as there is one, according to the examples in the documentation) to integrating with an orchestrator is to transpile the kedro primitives into the orchestrator primitives, using the general alignment of DAG abstractions between the two frameworks. This often takes the form of a plugin/CLI addition in kedro that, when run, creates artifacts that can be used by the orchestrator. I sometimes think of it as kedro "wrapping" the orchestrator. I think this is what @gtauzin is doing very nicely in the current kedro-dagster plugin (please correct me if I'm wrong).

Dagster approach

In contrast, the approach dagster seems to take to integrating other authoring frameworks is to wrap them (I have long hunted for a term that describes dbt and kedro nicely, and I think "authoring framework" is currently my favourite). In essence, as far as I understand it, dagster still uses the wrapped framework's CLI/functions to actually run the code. E.g. for dbt it uses the dbt build command of the dbt CLI, and for dlt it uses the dlt.run function. It then uses a translator to interpret the framework's DAG and create a multi asset plus metadata.

There are usually the following components:

  • asset decorator: This decorator is the main entry point to parsing and running the wrapped framework. The output can be a multi asset with internal asset dependencies (dbt) or an asset/multi_asset without internal asset dependencies (dlt).
  • DagsterTranslator: A class that translates from the wrapped framework to dagster entities, e.g. asset keys, tags, owners, etc.
  • ConfigurableResource: This runs the actual framework given the current dagster context.
  • event iterator: I haven't fully understood this part, but I think it allows both converting framework execution events into dagster events for logging and the ability to iterate over results.
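As a rough illustration of the translator piece, here is a minimal pure-Python sketch of mapping kedro catalog names to asset keys and tags. The class and method names are hypothetical, chosen to echo the pattern above; a real integration would subclass the relevant dagster translator base class instead:

```python
# Sketch of a translator from kedro catalog entries to dagster-style asset
# keys. Class/method names are hypothetical, not dagster's actual API.

class KedroDagsterTranslator:
    def __init__(self, key_prefix=None):
        # Optional prefix so all kedro assets land under one namespace.
        self.key_prefix = key_prefix or []

    def get_asset_key(self, dataset_name):
        # Kedro namespaces datasets with dots ("reporting.model_metrics");
        # asset keys are typically path-like lists of components.
        return self.key_prefix + dataset_name.split(".")

    def get_tags(self, dataset_name):
        # Metadata surfaced in the UI alongside the asset.
        return {"framework": "kedro", "dataset": dataset_name}

translator = KedroDagsterTranslator(key_prefix=["kedro"])
key = translator.get_asset_key("reporting.model_metrics")
```

The point is only that the name-mapping logic is small and mechanical; the asset decorator and resource pieces carry most of the integration weight.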

Random thought on dependencies:

  • Using kedro -> dagster, the kedro project would need dependencies on dagster and kedro-dagster to make things work. In the dagster approach, the kedro project could stand on its own without any additional dependencies; only the dagster code environment would need kedro + dagster-kedro as dependencies.

@astrojuanlu
Member Author

I'd like to add some color to @fdroessler's comment.

The kedro approach (as far as there is one according to the examples in the documentation) to integrating with an orchestrator is to transpile the kedro primitives into the orchestrator primitives using the general alignment of DAG abstractions between the two frameworks.
[...] In contrast it seems like the approach that dagster takes with integrating other authoring frameworks is to wrap them (I have long hunted for a term that describes dbt and kedro nicely and I think authoring framework is currently my favourite). In essence as far as I understand it dagster still uses the wrapped frameworks cli/functions to actually run the code. E.g. for dbt it uses the dbt build command of the dbt cli and for dlt it uses the dlt.run function.

Setting aside the pros and cons of the Kedro approach, historically it has made more sense for translating Kedro to "bigger" systems, like Airflow. We've never tried to call dlt or dbt from Kedro, so it's unclear whether we'd replicate the same translation approach or try something different.


Now, in my opinion, it wouldn't make sense for Kedro to be the "driver" of a Dagster project. So either translating Kedro to Dagster or making Dagster drive Kedro would make more sense to me.

@gtauzin

gtauzin commented Dec 18, 2024

Thanks for initiating this interesting discussion @fdroessler. I probably need some more time to digest it, but the distinction you make between the two approaches makes sense, and indeed kedro-dagster has kedro wrap dagster rather than the opposite. I personally did not consider something like a dagster-kedro integration, as I felt, like @astrojuanlu, that kedro is not meant to be a driver, although it can drive in some simple cases. That being said, I would be curious to explore the pros and cons of both approaches.

@fdroessler
Contributor

Now, in my opinion, I don't think it would make sense that Kedro were the "driver" of a Dagster project. So either translating Kedro to Dagster or making Dagster drive Kedro would make more sense to me.

💯% agreed @astrojuanlu and @gtauzin, I didn't mean to imply that kedro would be the driver, so apologies for any misunderstanding. The point I was trying to make (probably poorly) was that in one case kedro initiates the translation of a pipeline into the "bigger" system, and in the other case the "bigger" system knows how to read/execute a kedro project.

@deepyaman
Member

(I have long hunted for a term that describes dbt and kedro nicely and I think authoring framework is currently my favourite)

FWIW I personally use "transformation framework", although that is driven more from their position at the "T" in ELT. Authoring framework seems fine if you want to more generically view them as a place where you define some logic.

@fdroessler's understanding is aligned with mine; since Dagster is operating more as your asset-oriented data platform/single pane of glass and not just an orchestrator, it wants to have visibility into each component. In fact, Dagster is probably moving even more in this direction with a new Components initiative; the idea is to make it trivial to drop in your dbt project via the dbt component, dlt pipeline via a dlt component, and perhaps Kedro through an (official or custom) Kedro component. (The Components functionality is very new/being built.)

I think the ideal state is to have a Dagster-Kedro library, and eventually it could even live in https://github.com/dagster-io/dagster/tree/master/python_modules/libraries (however, right now it looks like community-driven integrations often live in their own repo; e.g. this is the SQLMesh integration that Dagster is tracking: https://github.com/opensource-observer/dagster-sqlmesh).

I personally did not consider something like a dagster-kedro integration as I felt as @astrojuanlu that kedro is not meant to be a driver although it can drive in some simple cases.

I don't think this is a big problem. Dagster can orchestrate a Kedro pipeline, similar to how it can orchestrate Python code. I think this fits into the view of Kedro as a micro-orchestrator, and, similar to the dbt integration, a kedro run command could result in the creation of some multi-asset.

Especially on the final point, I will admit that my knowledge of Dagster is quite tenuous right now, at best; @gtauzin and @fdroessler, you both almost certainly have more experience as users. :) I think @gtauzin's work is a great starting point in that it actually works and is demoable, rather than just being a conceptual discussion. As soon as it's "ready" and docs are updated, I'd love to try to get eyes from some of the people who are very familiar with Dagster integrations; I'm sure they could provide some thoughts and feedback! I imagine that, even if we want to structure it more like Dagster-Kedro, a lot of the logic/mapping should be the same regardless. Furthermore, I don't actually know the best model to follow from the perspective of creating a new integration, since (1) I haven't yet had the time to really dig into Dagster-dbt and (2) I've heard Dagster-dbt, while very widely used, was also one of the first integrations; some design decisions are there more for legacy and backwards-compatibility reasons, so it may not be the ideal template.

Last but not least, Dagster-Kedro is definitely the way to go where Kedro projects are part of your data stack (e.g. you use dlt for ingestion, Kedro for DE + DS, and some dashboarding tool downstream). However, there are undeniably Kedro users who just want to orchestrate their Kedro projects. Kedro doesn't have a great, ergonomic default way of doing this, so Kedro-Dagster may still end up being an ideal, Kedro-first way to get your project into production. (Ideally, I think it should be easy for Kedro users who know nothing about Dagster to get their project running using Dagster, and it should also be easy to add other components around it as your scope grows.)

@datajoely
Contributor

Don't have much to add other than: I love that this is getting momentum; it feels like a really powerful pattern.

@gtauzin

gtauzin commented Dec 19, 2024

@deepyaman Thanks for the nice answers, including the teasing about dagster Components! Looking forward to discovering what it is :)

I definitely need to give this some more thought. I do not yet fully understand how dagster integrations are typically built, so I'll have a quick look at dagster-dbt.

The way I think about kedro-dagster vs dagster-kedro at the moment is that they would both need to "read"/"translate" a kedro project, and I guess the code to do that would be very similar. They differ in how the pipeline is executed: kedro-dagster wraps kedro nodes into dagster ops and associates a dagster executor with them, whereas, from what I understand, dagster-kedro would call kedro run. Do you see any other difference?

Last but not least, Dagster-Kedro is definitely the way to go where Kedro projects are part of your data stack (e.g. you use dlt for ingestion, Kedro for DE + DS, and some dashboarding tool downstream). However, there are undeniably Kedro users who just want to be orchestrate their Kedro projects. Kedro doesn't have a great, ergonomic default way of doing this, so Kedro-Dagster may still end up being an ideal, Kedro-first way to get your project into production. (Ideally, I think it should be easy for Kedro users who know nothing about Dagster to get their project running using Dagster, and it should also be easy to add other components around it as your scope grows.)

Ok let me see if I understand properly:

  • If you are a kedro user and you are in charge of orchestrating your pipelines, kedro-dagster is a legitimate option;
  • If you are in charge of the data stack in your organization and you have to manage data assets across teams, you use dagster-kedro to include the kedro projects of the DS team.

Makes sense to me. Is this what you meant?

I think the ideal state is to have a Dagster-Kedro library, and eventually it could even live in https://github.com/dagster-io/dagster/tree/master/python_modules/libraries (however, right now it looks like community-driven integrations often live in their own repo; e.g. this is the SQLMesh integration that Dagster is tracking: https://github.com/opensource-observer/dagster-sqlmesh).

This would be great and I am very happy to support/discuss with anyone willing to work on it.

My impression is that writing a dagster integration might be a harder endeavour than writing a kedro plugin. The dagster documentation is sometimes incomplete and mixes new concepts with legacy ones, and so do the dagster codebase and the existing integrations. The dagster slack relies a lot on chatbots to answer questions, which I am not a fan of: they are trained on the imperfect documentation and legacy code, so they often cannot really help, and they generate massive amounts of text, which makes it harder to search for and learn from others' questions.

@deepyaman
Member

The way I think about kedro-dagster vs dagster-kedro at the moment is that they would both need to "read"/"translate" a kedro project. I guess the code to do that would be very similar. They differ in how the pipeline is being executed. kedro-dagster wraps kedro nodes into dagster ops and associate any dagster executor to them. From what I understand, dagster-kedro would call kedro run. Do you see any other difference?

That seems reasonable. kedro run wouldn't necessarily have to be the CLI command; it could also be invoked through the Python API.
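For reference, a minimal sketch of that Python-API route, roughly what a Dagster asset or op body could do. This assumes a recent Kedro version (the `KedroSession`/`bootstrap_project` entry points); the project path and helper name are hypothetical, and the imports are deferred so the snippet loads even where Kedro is not installed:

```python
# Sketch: run a Kedro pipeline through the Python API instead of the CLI.
# Assumes a recent Kedro version; the helper name and project path are
# illustrative. Imports are deferred so this module imports without Kedro.

def run_kedro_pipeline(project_path, pipeline_name="__default__"):
    from kedro.framework.startup import bootstrap_project
    from kedro.framework.session import KedroSession

    bootstrap_project(project_path)  # register the project's settings/pipelines
    with KedroSession.create(project_path=project_path) as session:
        # session.run executes the named pipeline within this session
        return session.run(pipeline_name=pipeline_name)

# Hypothetical usage inside an asset/op body:
# run_kedro_pipeline("/path/to/spaceflights", pipeline_name="data_science")
```

Compared to shelling out to the CLI, this keeps everything in-process, so the wrapping framework can catch exceptions and inspect results directly.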

Makes sense to me. Is this what you meant?

Think our understandings are aligned!

I think the ideal state is to have a Dagster-Kedro library, and eventually it could even live in https://github.com/dagster-io/dagster/tree/master/python_modules/libraries (however, right now it looks like community-driven integrations often live in their own repo; e.g. this is the SQLMesh integration that Dagster is tracking: https://github.com/opensource-observer/dagster-sqlmesh).

This would be great and I am very happy to support/discuss with anyone willing to work on it.

My impression is that it might be a harder endeavour to write a dagster integration compared to a kedro plugin.

I agree it would be harder. I would also love to get some input from people who work on Dagster integrations on how best to approach this. Regardless, I think your plugin is a critical starting point, since I don't think most people who work on Dagster are familiar with Kedro and how it could integrate.

@deepyaman
Member

From what I understand, dagster-kedro would call kedro run.

@gtauzin Actually, thinking about this more, maybe Dagster-Kedro could also convert Kedro nodes/create the appropriate Dagster assets based on the data catalog entries and nodes that create them.

Dagster-dbt calls the dbt CLI command because Dagster doesn't know how to execute SQL (I think?), but Dagster does know how to run Python code, so it seems feasible.

If that's the case, it might end up very close to what you have for Kedro-Dagster!

@gtauzin

gtauzin commented Dec 20, 2024

@gtauzin Actually, thinking about this more, maybe Dagster-Kedro could also convert Kedro nodes/create the appropriate Dagster assets based on the data catalog entries and nodes that create them.

If that's the case, it might end up very close to what you have for Kedro-Dagster!

Indeed, what kedro-dagster does is translate every part of the kedro project into the corresponding dagster objects and provide the user with a definitions.py to be used as a dagster code location. (Non-kedro) users might modify those as they see fit. One example might be changing the executor of a job, although this will also be possible by filling out the dagster.yml configuration file that kedro-dagster also provides.
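For illustration, an executor override in such a config file might look like the fragment below. The schema belongs to the plugin, so treat every key here as hypothetical rather than kedro-dagster's actual format:

```yaml
# Hypothetical dagster.yml fragment -- key names are illustrative,
# not the plugin's actual schema.
jobs:
  __default__:
    executor: multiprocess      # e.g. in_process, multiprocess, k8s
    executor_config:
      max_concurrent: 4
```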

Ok let me see if I understand properly:

  • If you are a kedro user and you are in charge of orchestrating your pipelines, kedro-dagster is a legitimate option;
  • If you are in charge of the data stack in your organization and you have to manage data assets across teams, you use dagster-kedro to include the kedro projects of the DS team.

I was thinking about this distinction we were making. It seems to me that, even for a non-kedro user taking the approach where you use Dagster to manage various assets in a large organisation, having a kedro project fully translated (and modifiable through definitions.py) is a more attractive option than wrapping kedro. In that case, you do not even need to understand kedro at all; you just get your typical Dagster objects.

Dagster-dbt calls dbt CLI command because Dagster doesnt know how to execute SQL (I think?), but Dagster does know how to run Python code, so it seems feasible.

I do not know much about dbt, but the more I think about it, the less I understand why, in the case of kedro, one would prefer to have Dagster call kedro run through the CLI or Python API. I feel it is indeed likely that, for dbt, they simply did not have an easier way to build the integration.
