CachedDataset example usage #3616

Open
inigohidalgo opened this issue Feb 12, 2024 · 4 comments
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@inigohidalgo
Contributor

Description

While researching something, I found that the CachedDataset functionality is hard to discover. There isn't any reference to it in the "main" documentation.

I understand from @noklam that it's not a particularly widely-used feature though.

Documentation page (if applicable)

Context

@noklam noklam added the Component: Documentation 📄 Issue/PR for markdown and API documentation label Feb 12, 2024
@yetudada
Contributor

@inigohidalgo Are you using the CachedDataset in your work? If so, what are you using it for? We're not sure how often it's used. It's one of the older features of Kedro, so we're always open to understanding how it helps users.

@inigohidalgo
Contributor Author

inigohidalgo commented Mar 11, 2024

Hi @yetudada, we use it for one specific pipeline in one project. Very niche requirement so I'm not surprised it isn't more widely used. I mentioned in Slack I consider it an antipattern we are supporting more than anything else.
https://linen-slack.kedro.org/t/16408833/hiya-is-https-github-com-deepyaman-kedro-accelerator-still-s

Basically, we run some pipelines which extract "live" data and append it to a table. We then want to do further downstream processing with that extracted batch as the input to some market processes, but we don't want to reload it from the saved dataset, since we are saving to a big table and our I/O isn't very fast. kedro-accelerator covered part of the same use case but went further.
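The pattern described above can be sketched in plain Python. This is only an illustration of the write-through caching idea, not Kedro's actual implementation: `SlowTableDataset` and `WriteThroughCache` are hypothetical names, and the in-memory list stands in for the big table.

```python
from typing import Any, Optional


class SlowTableDataset:
    """Hypothetical stand-in for a dataset backed by a large, slow table."""

    def __init__(self) -> None:
        self._rows: list[dict[str, Any]] = []
        self.load_calls = 0  # counts expensive reads, for illustration only

    def save(self, batch: list[dict[str, Any]]) -> None:
        self._rows.extend(batch)  # append the new batch to the table

    def load(self) -> list[dict[str, Any]]:
        self.load_calls += 1
        return list(self._rows)  # returns the whole table, not just one batch


class WriteThroughCache:
    """Keeps the last saved (or loaded) object in memory next to the wrapped dataset."""

    def __init__(self, dataset: SlowTableDataset) -> None:
        self._dataset = dataset
        self._cached: Optional[list[dict[str, Any]]] = None

    def save(self, data: list[dict[str, Any]]) -> None:
        self._dataset.save(data)  # the batch is still persisted
        self._cached = data  # and also kept in memory

    def load(self) -> list[dict[str, Any]]:
        if self._cached is None:
            self._cached = self._dataset.load()  # only hit storage on a cache miss
        return self._cached


table = SlowTableDataset()
table.save([{"ts": 1, "price": 10.0}])  # pre-existing history in the table

cached = WriteThroughCache(table)
batch = [{"ts": 2, "price": 10.5}]
cached.save(batch)  # node output: appended to the table and cached
downstream_input = cached.load()  # downstream node gets the batch without a read
```

In this sketch the downstream step receives only the extracted batch and the table is never read, whereas an uncached `load()` would return the full history. That difference is the source of the brittleness mentioned in the EDIT below.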

This pipeline could relatively trivially be reworked to not have this requirement, but CachedDataset was the exact functionality we needed in that case.

If you don't plan on supporting CachedDataset long term, feel free to close the issue. I mostly opened it "for reference", as I mentioned in that Slack thread.

EDIT: the reason I consider it an antipattern is that the object returned by loading the dataset can be totally different from what is returned by the actual node. This is quite common with deferred-loading libraries like ibis and polars, and it makes the pipeline structure very brittle.

EDIT2: maybe related #3578

@noklam
Contributor

noklam commented Mar 12, 2024

IMO the use cases for CachedDataset and kedro-accelerator are valid. They speed up pipelines, so it's a performance gain without much penalty. The downside, as @inigohidalgo described, is that if it's not used properly it may make the pipeline less reproducible.

Consider this classic example:

import pandas as pd

# df1 is an existing pandas DataFrame
df1.to_csv("raw.csv")
df2 = pd.read_csv("raw.csv")  # This reload is skipped by CachedDataset or `kedro-accelerator`

Depending on what the dataframe looks like, df1 may not be identical to df2, because CSV does not preserve column types. This is a lot less common with a more strongly typed format like parquet. So in this case, the caching approach may "hide" the problem until someone tries to put this into production.
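The round-trip mismatch can be reproduced with the standard library alone. This is a minimal sketch that uses the `csv` module instead of pandas, so the details differ (here every value comes back as a string, whereas pandas would infer types), but the point is the same: the reloaded data is not the object that was saved. The column names and values are made up for illustration.

```python
import csv
import io

# In-memory data as a node would produce it (hypothetical values)
rows = [{"qty": 3, "price": 2.5}, {"qty": 7, "price": 4.0}]

# Save to CSV (an in-memory buffer stands in for "raw.csv")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["qty", "price"])
writer.writeheader()
writer.writerows(rows)

# Reload from CSV: this is the step a cache would skip
buffer.seek(0)
reloaded = list(csv.DictReader(buffer))

# The type information was lost in the round trip
types_before = {type(v) for row in rows for v in row.values()}
types_after = {type(v) for row in reloaded for v in row.values()}
round_trip_is_identical = rows == reloaded
```

A pipeline developed against the cached object sees `int` and `float` values, while a run that reloads from disk sees strings, so the two runs can behave differently.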

In many cases it makes sense to use a cache: the data is already in memory, so it doesn't make sense to throw it away and reload it from disk (which takes a lot of I/O time if the data is big).

@inigohidalgo
Contributor Author

Your example is a less extreme version of the problem I described. In your case you could still write a reproducible pipeline by explicitly casting the data to the correct pandas types.

(For clarity, as there is duplicated terminology: KeDataset means a Kedro Dataset implementation, and PaDataset means the PyArrow Dataset object.)

We have a KeDataset implementation which takes a pandas dataframe and saves it to parquet using a PaDataset. KeDataset.load returns a PaDataset instance which we filter and finally convert into a pandas dataframe again.

So when loading these KeDatasets we have some nodes which are meant to specifically filter PaDatasets. If we cached that data, the object being passed into the node would be a pandas dataframe instead of the expected type which would totally break the pipeline.

So in this case, skipping the load step actually passes a totally incompatible object to the node.
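A minimal sketch of that failure mode, using only plain Python. `LazyTable` is a hypothetical stand-in for the PaDataset, `TableDataset` for the KeDataset, and a list of dicts for the pandas dataframe; none of these are real Kedro or PyArrow classes.

```python
class LazyTable:
    """Hypothetical stand-in for a lazily evaluated handle such as a PaDataset."""

    def __init__(self, rows):
        self._rows = rows

    def filter(self, predicate):
        return LazyTable([r for r in self._rows if predicate(r)])

    def to_rows(self):
        return list(self._rows)


class TableDataset:
    """Saves a list of rows, but load() hands back a LazyTable (asymmetric I/O)."""

    def __init__(self):
        self._stored = []

    def save(self, rows):
        self._stored = list(rows)

    def load(self):
        return LazyTable(self._stored)


def filter_node(table):
    # Written against the type that load() returns, not the type that was saved
    return table.filter(lambda r: r["price"] > 10).to_rows()


rows = [{"price": 5}, {"price": 20}]
dataset = TableDataset()
dataset.save(rows)

# Without caching, the node receives a LazyTable and works
uncached_result = filter_node(dataset.load())

# With caching, the saved object would be passed straight through instead
try:
    filter_node(rows)
    cached_failed = False
except AttributeError:
    cached_failed = True  # a plain list has no .filter() method
```

Here the cached path fails outright because the saved object does not expose the interface the node expects. Nothing in the node or the dataset changed; only the load step was skipped.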

3 participants