[Spike] Explore integration with Dagster #3180
Comments
I'm not clear what the ticket here is for. Is this documentation along the lines of #2817?
I think it involves more of a spike to work out how it would actually work. I think Flyte (LFAI), Dagster and Metaflow all fall into the modern orchestrator space, which isn't served by Kedro. I would also push for us to address some of the fundamentals outlined in #3094 before doing this.
Thanks! But in that case, it's not a docs ticket, so I'll remove the label.
Thanks both - yeah, initially I thought about it as a docs ticket (even though the phrasing didn't match), but you're right, this should be a spike first. And good point @datajoely on looking at Flyte and Metaflow too (let's call them Tier 3), although both have roughly 0.1x the PyPI downloads of Dagster, so I wouldn't consider them on the same level of adoption. For reference, Dagster and Prefect (Tier 2) have about the same number of downloads, and both have roughly 0.05x the downloads of Airflow (Tier 1). Kedro lies between Tier 2 and Tier 3 at the moment.
Aligned - I also think Dagster is closer to Kedro than the others in terms of granularity. In recent years they've really invested in their dbt integration, and perhaps we can take inspiration from how they've done that.
I never explored Dagster as much as I should have; I really like the idea of software-defined assets. However, Dagster looks complicated, as it has many concepts to understand. I'm also not sure how individual tasks run (especially in a Kubernetes context).
@gtauzin experimenting with Kedro & Dagster! https://github.com/gtauzin/kedro-spaceflights-dagster |
Hey! Thanks @astrojuanlu for pinging me, it's nice to see some interest in a dagster integration! It seems to me kedro and dagster are nicely complementary:
I also feel, as @MatthiasRoels does, that dagster has a lot of concepts. Each of them separately is not necessarily complex, but the way they relate to each other is not always clear to me from the documentation (and the chatbot in there has confused me more than anything else so far). For example, there are several ways of mapping kedro to dagster, because dagster has many concepts around generic tasks:
In practice, to map kedro nodes, I believe multi assets would make sense even in the case of a node that does not have any outputs (and therefore does not define any assets). This is because ops are second-class citizens in dagster: they do not even appear on the DAG visualization (the global asset lineage) in the UI, but are presented in the form of a list lost somewhere in a menu. In the case of the spaceflights example, the last node, "evaluate_model_node", does not have any outputs. Defining it as a multi_asset with a corresponding asset that is intangible allows it to be part of the asset DAG.

This small project is a way for me to deepen my understanding of both kedro and dagster, and it is also something I am planning on using for work in the near future. So if you're interested or are also looking into it, don't hesitate to ping me on the kedro slack, I'd be happy to discuss it more.
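To make that concrete, here is a minimal, hypothetical sketch (not the actual kedro-spaceflights-dagster code) of how a no-output node such as "evaluate_model_node" could be surfaced as a multi_asset that declares one placeholder ("intangible") asset. The function body and asset names are illustrative assumptions:

```python
from dagster import AssetOut, Output, multi_asset


def evaluate_model(regressor, X_test, y_test):
    # Stand-in for the spaceflights evaluation node function: it reports a metric
    # and returns nothing, so it defines no Kedro output datasets.
    print(f"Model R^2: {regressor.score(X_test, y_test):.3f}")


@multi_asset(
    name="evaluate_model_node",
    # Declare one placeholder asset so the node shows up in the global asset
    # lineage, even though the underlying Kedro node has no outputs.
    outs={"model_evaluated": AssetOut(description="Intangible marker asset")},
)
def evaluate_model_asset(regressor, X_test, y_test):
    evaluate_model(regressor, X_test, y_test)
    # Emit a token value for the intangible asset.
    yield Output(None, output_name="model_evaluated")
```

In a real integration you would probably attach metadata or a more meaningful marker value than `None`, but the shape of the mapping would be the same.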
Super cool work!
New link 🔥 https://github.com/gtauzin/kedro-dagster
Love seeing this getting some traction! I'm trying to find time to look into kedro + dagster myself but miserably failing at it 😞 Just adding a few observations here of things I've seen while looking into this. The approaches that dagster and kedro take to integrate with other frameworks differ slightly. I'll try to add more detail on the dagster approach as I understand it, because it might be the less familiar one, and writing it down will also help me structure my thoughts. Before starting, I just want to mention that I don't think either approach is more legitimate than the other; I just wanted to share what I have found so far. Hope this helps others, and I'll try to put some code down or iterate on these thoughts as I go along. Also, I am sure @deepyaman will soon have a much better view on this 😄 (congrats btw).

General approaches

TLDR: The difference in approaches mostly revolves around whether the logic for converting kedro to dagster is the responsibility of kedro or of dagster.

Kedro approach

The kedro approach (as far as there is one, according to the examples in the documentation) to integrating with an orchestrator is to transpile the kedro primitives into the orchestrator's primitives, using the general alignment of DAG abstractions between the two frameworks. This often takes the form of a plugin/CLI addition in kedro that, when run, creates some artifacts that can be used by the orchestrator. I sometimes think about it as kedro "wrapping" the orchestrator. I think this is what I see @gtauzin doing very nicely in the current kedro-dagster plugin (please correct me if I'm wrong).

Dagster approach

In contrast, the approach that dagster takes to integrating other authoring frameworks is to wrap them (I have long hunted for a term that describes dbt and kedro nicely, and "authoring framework" is currently my favourite). In essence, as far as I understand it, dagster still uses the wrapped framework's CLI/functions to actually run the code, e.g. for dbt it uses the dbt CLI. There are usually the following components:
Random thought on dependencies:
I'd like to add some color to @fdroessler's comment.
Setting aside the pros and cons of the Kedro approach, historically it has made more sense for translating Kedro to "bigger" systems, like Airflow. We've never tried to call dlt or dbt from Kedro, so it's unclear whether we'd replicate the same translation approach or try something different. Now, in my opinion, I don't think it would make sense for Kedro to be the "driver" of a Dagster project. So either translating Kedro to Dagster or making Dagster drive Kedro would make more sense to me.
Thanks for initiating this interesting discussion @fdroessler. I probably need some more time to digest it, but it seems to me the distinction you make between the two approaches makes sense, and indeed kedro-dagster has kedro wrap dagster rather than the opposite. I personally did not consider something like a dagster-kedro integration as I felt
💯% agreed @astrojuanlu and @gtauzin, I didn't mean to imply that kedro would be the driver, so apologies for any misunderstanding. The point I was trying to make, probably poorly, was that in one case kedro is initiating the translation of a pipeline into the "bigger" system, and in the other case the "bigger" system knows how to "read/execute" a kedro project.
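To make that "the bigger system reads/executes a kedro project" direction a bit more concrete, here is a rough, hypothetical sketch of a Dagster asset that drives a Kedro project through its CLI, analogous to how dagster-dbt drives dbt. The project path and pipeline name are assumptions, and a real integration would surface far richer metadata:

```python
import subprocess

from dagster import AssetExecutionContext, asset


@asset
def kedro_data_science_pipeline(context: AssetExecutionContext) -> None:
    """Run a whole Kedro pipeline as one Dagster asset by invoking the Kedro CLI."""
    result = subprocess.run(
        ["kedro", "run", "--pipeline", "data_science"],  # assumed pipeline name
        cwd="/path/to/kedro-project",  # assumed project location
        capture_output=True,
        text=True,
    )
    context.log.info(result.stdout)
    if result.returncode != 0:
        raise RuntimeError(f"`kedro run` failed:\n{result.stderr}")
```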
FWIW I personally use "transformation framework", although that is driven more from their position at the "T" in ELT. Authoring framework seems fine if you want to more generically view them as a place where you define some logic. @fdroessler's understanding is aligned with mine; since Dagster is operating more as your asset-oriented data platform/single pane of glass and not just an orchestrator, it wants to have visibility into each component. In fact, Dagster is probably moving even more in this direction with a new Components initiative; the idea is to make it trivial to drop in your dbt project via the dbt component, dlt pipeline via a dlt component, and perhaps Kedro through an (official or custom) Kedro component. (The Components functionality is very new/being built.) I think the ideal state is to have a Dagster-Kedro library, and eventually it could even live in https://github.com/dagster-io/dagster/tree/master/python_modules/libraries (however, right now it looks like community-driven integrations often live in their own repo; e.g. this is the SQLMesh integration that Dagster is tracking: https://github.com/opensource-observer/dagster-sqlmesh).
I don't think this is a big problem. Dagster can orchestrate a Kedro pipeline, similar to how it can orchestrate Python code. I think this fits into the view of Kedro as a micro-orchestrator, and, similar to the dbt integration, a

Especially on the final point, I will admit that my knowledge of Dagster is quite tenuous right now, at best; @gtauzin and @fdroessler, you all almost certainly have more experience as users. :) I think @gtauzin's work is a great starting point in that it actually works and is demoable, rather than just being a conceptual discussion. As soon as it's "ready" and the docs are updated, I'd love to try and get eyes from some of the people who are very familiar with Dagster integrations; I'm sure they could provide some thoughts and feedback! I imagine, even if we want to structure it more like Dagster-Kedro, a lot of the logic/mapping should be the same regardless.

Furthermore, I don't actually know the best model to follow from the perspective of creating a new integration, since (1) I haven't yet had the time to really dig into Dagster-dbt and (2) I've heard Dagster-dbt, while very widely used, was also one of the first integrations; some design decisions are there more for legacy and backwards-compatibility reasons, so it may not be the ideal template.

Last but not least, Dagster-Kedro is definitely the way to go where Kedro projects are part of your data stack (e.g. you use dlt for ingestion, Kedro for DE + DS, and some dashboarding tool downstream). However, there are undeniably Kedro users who just want to orchestrate their Kedro projects. Kedro doesn't have a great, ergonomic default way of doing this, so Kedro-Dagster may still end up being an ideal, Kedro-first way to get your project into production. (Ideally, I think it should be easy for Kedro users who know nothing about Dagster to get their project running using Dagster, and it should also be easy to add other components around it as your scope grows.)
Don't have much to add other than: I love that this is getting momentum, it feels like a really powerful pattern.
@deepyaman Thanks for the nice answers, including the teasing about dagster Components! Looking forward to discovering what it is :) I definitely need to give some more thought to that. I do not yet fully understand how dagster integrations are typically built; I'll have a quick look at dagster-dbt. The way I think about kedro-dagster vs dagster-kedro at the moment is that they would both need to "read"/"translate" a kedro project. I guess the code to do that would be very similar. They differ in how the pipeline is being executed: kedro-dagster wraps kedro nodes into dagster ops and associates any dagster executor to them. From what I understand, dagster-kedro would call 'kedro run' (through the CLI or Python API) to execute the pipeline.
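For illustration only, here is a toy sketch of that "nodes as ops, plus a configurable executor" idea; the node functions and job name below are made up and greatly simplified compared to what kedro-dagster actually generates:

```python
from dagster import job, multiprocess_executor, op


@op
def preprocess_companies_op():
    # Stand-in for a Kedro node function; a real integration would load inputs
    # through the catalog / IO managers rather than hard-coding them.
    return {"rows": 1000}


@op
def create_model_input_table_op(preprocessed_companies):
    return {"rows": preprocessed_companies["rows"]}


@job(executor_def=multiprocess_executor)  # any Dagster executor could be plugged in here
def spaceflights_job():
    create_model_input_table_op(preprocess_companies_op())
```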
Ok let me see if I understand properly:
Makes sense to me. Is this what you meant?
This would be great, and I am very happy to support/discuss with anyone willing to work on it. My impression is that it might be a harder endeavour to write a dagster integration compared to a kedro plugin. The dagster documentation is sometimes incomplete and mixes new concepts with legacy ones, and so do the dagster codebase and the existing integrations. The dagster slack relies a lot on chatbots to answer questions, which I am not a fan of: they are trained on the imperfect documentation and legacy code, so they often cannot really help, and they generate massive amounts of text, which makes it harder to search/learn from others' questions.
That seems reasonable.
Think our understandings are aligned!
I agree it would be harder. I would also love to get some input from people who work on Dagster integrations on how best to approach this. Regardless, I think your plugin is a critical starting point, since I don't think most people who work on Dagster are familiar with Kedro and how it could integrate.
@gtauzin Actually, thinking about this more, maybe Dagster-Kedro could also convert Kedro nodes / create the appropriate Dagster assets based on the data catalog entries and the nodes that create them. Dagster-dbt calls the dbt CLI because Dagster doesn't know how to execute SQL (I think?), but Dagster does know how to run Python code, so it seems feasible. If that's the case, it might end up very close to what you have for Kedro-Dagster!
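As a hedged sketch of that idea, i.e. Dagster running a Kedro pipeline as ordinary Python rather than shelling out to a CLI, something like the following could work, using Kedro's session API (the project path and pipeline name are assumptions):

```python
from pathlib import Path

from dagster import asset
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


@asset
def kedro_spaceflights_run() -> None:
    """Execute a Kedro pipeline in-process via Kedro's Python API."""
    project_path = Path("/path/to/kedro-project")  # assumed location
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path) as session:
        session.run(pipeline_name="data_science")  # assumed pipeline name
```

A finer-grained integration would presumably go one step further and map individual nodes and catalog entries to assets, rather than running the whole pipeline as a single asset.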
Indeed, what kedro-dagster does is translate every part of the kedro project into the corresponding dagster objects and provide the user with a definitions.py to be used as a dagster code location. (Non-kedro) users might modify those as they see fit. One example might be to change the executor of a job, although this will also be possible by filling in the dagster.yml configuration file that kedro-dagster also provides.
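For readers unfamiliar with Dagster code locations, here is a hypothetical, heavily simplified example of the kind of definitions.py such a plugin could generate; the asset, job name and executor choice are assumptions for illustration, not kedro-dagster's actual output:

```python
# definitions.py - hypothetical, simplified sketch of a generated code location
from dagster import Definitions, asset, define_asset_job, in_process_executor


@asset
def preprocessed_companies():
    # In a generated file, assets like this would wrap the translated Kedro nodes
    # and load/save data through IO managers mapped from the Kedro catalog.
    return {"rows": 1000}


default_pipeline_job = define_asset_job(
    name="default_pipeline",
    selection="*",  # run every translated asset
)


defs = Definitions(
    assets=[preprocessed_companies],
    jobs=[default_pipeline_job],
    executor=in_process_executor,  # users could swap this, e.g. for a multiprocess executor
)
```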
I was thinking about this distinction we were making. It seemed to me that even for a non-kedro user taking the approach where you use Dagster to manage various assets in a large organisation, having a kedro project fully translated (and modifiable through definitions.py) is a more attractive option than wrapping kedro. In that case, you do not even need to understand kedro at all; you just get your typical Dagster objects.
I do not know much about dbt, but the more I think about it, the less I understand why, in the case of kedro, one would prefer to have Dagster run 'kedro run' through the CLI or Python API. I feel it is indeed likely that for dbt they simply did not have an easy way to do a dbt-dagster translation.
Description
I have heard from several data people that they're happy with Dagster, which is probably the only "modern", widely used orchestrator that is not mentioned in our docs.
There was a request upstream to add a Kedro integration to Dagster (dagster-io/dagster#2062), but it's unclear what finally happened.