CachedDataset example usage #3616
@inigohidalgo Are you using the CachedDataset in your work? And if so, what are you using it to do? We're not sure how often it's used. It's one of the older features of Kedro, so we're always open to understanding where it's helpful.
Hi @yetudada, we use it for one specific pipeline in one project. It's a very niche requirement, so I'm not surprised it isn't more widely used. As I mentioned in Slack, I consider it an antipattern that we are supporting more than anything else.

Basically, we are running some pipelines which extract some "live" data and append it to a table, and then we want to do further downstream processing with that extracted batch as the input to some market processes. We don't want to reload from the saved dataset because we are saving to a big table and our I/O isn't very fast. kedro-accelerator covered part of the same use case but went further. This pipeline could relatively trivially be reworked to not have this requirement, but CachedDataset was the exact functionality we needed in that case.

If you don't plan on supporting CachedDataset long-term, feel free to close the issue; I mostly opened it "for reference", as I mentioned in that Slack thread.

EDIT: the reason I consider it an antipattern is that the object returned by loading the dataset can be totally different from what is returned by the actual node; this is quite common with deferred-loading libraries like ibis and polars. This makes the pipeline structure very brittle.

EDIT2: maybe related #3578
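For readers landing here, a minimal sketch of the pattern described above, assuming a recent Kedro and kedro-datasets where the classes are spelled `CachedDataset` and `ParquetDataset`; the dataset name and file path are invented for illustration:

```python
from kedro.io import CachedDataset, DataCatalog
from kedro_datasets.pandas import ParquetDataset

# Wrap the on-disk dataset so that, within a single run, downstream nodes
# reuse the in-memory batch produced by the extraction node instead of
# re-reading it from the (slow) table/file it was just saved to.
catalog = DataCatalog(
    {
        "extracted_batch": CachedDataset(
            dataset=ParquetDataset(
                filepath="data/02_intermediate/extracted_batch.parquet"
            )
        ),
    }
)
```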
IMO the use case for CachedDataset is there. Consider this classic example:

```python
# A pandas DataFrame round-trip through disk
df1.to_csv("raw.csv")
df2 = pd.read_csv("raw.csv")  # skipped by CachedDataset or `kedro-accelerator`
```

Depending on what the dataframe looks like, `df2` may not even come back identical to `df1`. In a lot of cases it makes sense to use the cache because the data is already in memory; it doesn't make sense to throw it away and re-load it from disk (which takes a lot of I/O time if the data is big).
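As a concrete (made-up) illustration of the drift that round-trip can introduce: CSV carries no type information, so dtypes can change between the in-memory frame and what comes back from disk.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "id": ["001", "002"],                                  # strings with leading zeros
        "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),  # proper datetimes
    }
)
df1.to_csv("raw.csv", index=False)
df2 = pd.read_csv("raw.csv")

print(df1.dtypes)  # id: object, when: datetime64[ns]
print(df2.dtypes)  # id: int64 (leading zeros lost), when: object unless parse_dates is used
```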
Your example is a less extreme version of the problem I described. In your case you could still write a reproducible pipeline by explicitly casting the data to the correct pandas types.

(For clarity, since there is some duplicated terminology:) we have datasets whose load output is a different kind of object from what the node returns, which is the deferred-loading situation I mentioned above. Downstream nodes are written against what loading those datasets returns. So in this case, the act of skipping the load of the dataset actually hands them a totally incompatible object.
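A hedged sketch of that kind of mismatch, not taken from the thread, using polars with invented function names: the downstream node is written against the lazy API that a deferred-loading catalog entry would return on load, so short-circuiting the load with the cached eager frame breaks it.

```python
import polars as pl

def extract_batch() -> pl.DataFrame:
    # The node returns an eager, in-memory frame (what a CachedDataset would hand on).
    return pl.DataFrame({"price": [10.0, 12.5, 9.8]})

def downstream(batch: pl.LazyFrame) -> pl.DataFrame:
    # Written against what *loading* the catalog entry returns: a LazyFrame.
    return batch.filter(pl.col("price") > 10).collect()

downstream(extract_batch().lazy())  # fine: matches the type a real load would return
# downstream(extract_batch())       # breaks: an eager DataFrame is not a LazyFrame
```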
Description
While researching something else, I found the discoverability of the CachedDataset functionality to be a bit low: there isn't any reference to it in the "main" documentation.
I understand from @noklam that it's not a particularly widely-used feature though.
Documentation page (if applicable)
Context