CachedDataset example usage #3616
@inigohidalgo Are you using the CachedDataset in your work? And if so, what are you using it to do? We're not sure how often it's used. It's one of the older features of Kedro, so we're always open to understanding where it's helpful.
Hi @yetudada, we use it for one specific pipeline in one project. It's a very niche requirement, so I'm not surprised it isn't more widely used. As I mentioned in Slack, I consider it an antipattern that we are supporting more than anything else.

Basically, we are running some pipelines which extract some "live" data and append it to a table, and then we want to do further downstream processing with that extracted batch as the input to some market processes. We don't want to reload from the saved dataset because we are saving to a big table and our I/O isn't very fast. kedro-accelerator covered part of the same use case but went further. This pipeline could relatively trivially be reworked to not have this requirement, but CachedDataset was the exact functionality we needed in that case.

If you don't plan on supporting CachedDataset long-term, feel free to close the issue; I mostly opened it "for reference", as I mentioned in that Slack thread.

EDIT: the reason I consider it an antipattern is that the object returned by loading the dataset can be totally different from what is returned by the actual node; this is quite common with deferred-loading libraries like ibis and polars. This makes the pipeline structure very brittle.

EDIT2: maybe related #3578
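For readers landing here, a minimal sketch of the pattern described above, assuming a recent Kedro and kedro-datasets where the classes are spelled `CachedDataset` and `ParquetDataset`; the dataset name and file path are invented for illustration:

```python
from kedro.io import CachedDataset, DataCatalog
from kedro_datasets.pandas import ParquetDataset

# Wrap the on-disk dataset so that, within a single run, downstream nodes
# reuse the in-memory batch produced by the extraction node instead of
# re-reading it from the (slow) table/file it was just saved to.
catalog = DataCatalog(
    {
        "extracted_batch": CachedDataset(
            dataset=ParquetDataset(
                filepath="data/02_intermediate/extracted_batch.parquet"
            )
        ),
    }
)
```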
IMO the use case for CachedDataset is there. Consider this classic example:

```python
# A pandas DataFrame round-trip through disk
df1.to_csv("raw.csv")
df2 = pd.read_csv("raw.csv")  # skipped by CachedDataset or `kedro-accelerator`
```

Depending on what the dataframe looks like, `df2` may not even come back identical to `df1`. In a lot of cases it makes sense to use the cache because the data is already in memory; it doesn't make sense to throw it away and re-load it from disk (which takes a lot of I/O time if the data is big).
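As a concrete (made-up) illustration of the drift that round-trip can introduce: CSV carries no type information, so dtypes can change between the in-memory frame and what comes back from disk.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "id": ["001", "002"],                                  # strings with leading zeros
        "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),  # proper datetimes
    }
)
df1.to_csv("raw.csv", index=False)
df2 = pd.read_csv("raw.csv")

print(df1.dtypes)  # id: object, when: datetime64[ns]
print(df2.dtypes)  # id: int64 (leading zeros lost), when: object unless parse_dates is used
```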
Your example is a less extreme version of the problem I described. In your case you could still write a reproducible pipeline by explicitly casting the data to the correct pandas types.

(For clarity, since there is some duplicated terminology:) we have datasets whose load output is a different kind of object from what the node returns, which is the deferred-loading situation I mentioned above. Downstream nodes are written against what loading those datasets returns. So in this case, the act of skipping the load of the dataset actually hands them a totally incompatible object.
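A hedged sketch of that kind of mismatch, not taken from the thread, using polars with invented function names: the downstream node is written against the lazy API that a deferred-loading catalog entry would return on load, so short-circuiting the load with the cached eager frame breaks it.

```python
import polars as pl

def extract_batch() -> pl.DataFrame:
    # The node returns an eager, in-memory frame (what a CachedDataset would hand on).
    return pl.DataFrame({"price": [10.0, 12.5, 9.8]})

def downstream(batch: pl.LazyFrame) -> pl.DataFrame:
    # Written against what *loading* the catalog entry returns: a LazyFrame.
    return batch.filter(pl.col("price") > 10).collect()

downstream(extract_batch().lazy())  # fine: matches the type a real load would return
# downstream(extract_batch())       # breaks: an eager DataFrame is not a LazyFrame
```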
Description
While researching something else, I found the discoverability of the CachedDataset functionality to be a bit low: there isn't any reference to it in the "main" documentation.
I understand from @noklam that it's not a particularly widely-used feature though.
Documentation page (if applicable)
Context