[KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataset #819

noklam · 2021-07-03T09:57:27Z

Description

When I have a pipeline like this, I expected once the execution of the node is finished, it should release the input to reduce memory footprint. I found this is not the case and causing Memory Error for my pipeline.
A -> B # Step1
B -> C # Step2
C -> D # Step3

For example, as soon as step1 is finished, there is no reason why we should keep A in memory anymore.

Context

It keeps unnecessary variables in memory. Originally I was trying to write some notes to dive deep into Kedro dataset management and found that the behavior is not what I would expect.

Steps to Reproduce

https://noklam.github.io/blog/posts/2021-07-02-kedro-datacatalog.html

Expected Result

For a non-cached dataset, it should exist in memory only in its node. For CachedDataSet, it should be deleted as soon as it is not needed anymore.

Actual Result

-- If you received an error, place it here.

-- Separate them if you have more than one.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

Kedro version used (pip show kedro or kedro -V):
Python version used (python -V):
Operating system and version:

The text was updated successfully, but these errors were encountered:

datajoely · 2021-07-05T08:34:34Z

See full discussion on Discord

merelcht · 2021-07-05T14:41:55Z

Hi @noklam, thanks for flagging this and writing up such a thorough explanation in your blog post 👏 😄
I will get to the bottom of why it's been implemented in this way and keep you up to date about what I find.

merelcht · 2021-07-06T07:27:22Z

Hi @noklam, I've found that this is not actually a bug but explicitly designed this way.

The main reason being that pipeline inputs and outputs are special datasets. They can be used for debugging after the run of a pipeline. So this means that the release functionality has explicitly been developed for the intermediate datasets.

On top of that, input parameters are handled as MemoryDataSets and so if the check for data_set not in pipeline.inputs() didn't exist, parameters would be released from memory as well. The code was written a couple of years ago, and we haven't further investigated whether it would be a problem for the parameters to be released, but regardless pipeline inputs and outputs should be treated differently according to our philosophy.

noklam · 2021-07-06T09:24:54Z

@MerelTheisenQB

Thanks for the explanation. If it is only for debugging purposes, there should be a flag to enable this behavior rather than setting it as default? Maybe I am missing something here, but I don't see why parameters need to be treated specially. It should be okay to release it as long as it is not needed in the remaining nodes. A function should only depend on its inputs.

In addition, CacheDataSet is just MemoryDataSet, which means CacheDataSet will stay in memory forever. The purpose of using Cache is to reduce the I/O instead of writing to disk in 1 node then reading the same data from disk immediately. This is a big problem for large pipelines.

This is a common pattern for my training pipeline.

preprocess raw data -> processed data
processed data ->Train/test split
processed data concat with some metadata or prediction label for analysis

I'll use CacheDataSet for these functions for sure because I don't want to write 10GB of data and reading it off from Disk (it could take minutes).

Thoughts?

idanov · 2021-07-06T10:12:17Z

@noklam Thanks for flagging this. This functionality was implemented long ago as @MerelTheisenQB pointed out and the main purpose is to enable interactive workflows more generally (and as extension to that debugging). E.g. the idea is that you could run a pipeline multiple times with the same catalog instance in the process of experimenting with different pipelines and if you start off with a MemoryDataSet inputs which are released after the run, that will make for a very painful experience.

Most of the time this shouldn't cause problems, since one would normally use MemoryDataSet as pipeline inputs only in interactive workflows. However you have correctly identified that CachedDataSet can be used in non-interactive workflows for inputs quite often and the desired behaviour would be to release the data once no longer needed.

For the case of pipeline outputs unconsumed by anyone after the run, there will be no reason for someone to define them as MemoryDataSet unless they work in interactive sessions (notebooks). Therefore the behaviour there is as expected.

CachedDataSet was not developed as a core Kedro component and it was done as internal contribution, so we might not have considered the issue you point out. That means that it's probably a good idea for us to revisit this and change the behaviour as follows:

Both MemoryDataSet and CachedDataSet should be released when used as intermediate datasets (Kedro already behaves like this to my knowledge) ✅
MemoryDataSet's used as pipeline inputs or outputs should not be released after runs to support interactive workflows (Kedro already behaves like this as claimed in this GitHub issue) ✅
CachedDataSet should be released after being used as pipeline input, but not as pipeline output (this is where Kedro fails user expectation) ❌

Would such behaviour suit your needs @noklam ?

noklam · 2021-07-06T10:35:07Z

@idanov Thanks, I rarely use Kedro in interactive mode and I can now see why this was implemented this way.

Yes, I think this match my expectation. Any datasets that are not in remanining nodes inputs should be released.

datajoely · 2021-07-07T12:53:49Z

Created ticket on backlog

acnazarejr · 2021-07-09T20:57:32Z

I am facing the same problem, however, my dataset is a pandas.ParquetDataSet. Even after the node is terminated and the dataframe is saved to disk, the data is still in memory. The memory is only released after all pipeline execution.

Maybe I missed something, but should I indicate something to kedro so it doesn't keep the data in memory?

datajoely · 2021-07-12T08:57:34Z

@acnazarejr - to confirm is this also using the CachedDataSet? Persisted writes to Parquet should discard the memory and be garbage collected by Python.

acnazarejr · 2021-07-14T13:39:55Z

@datajoely, sorry for the delayed response. I'm not using the CachedDataSet.

datajoely · 2021-07-14T13:56:04Z

Hi @acnazarejr I don't think it should be behaving this way - how do you know the memory is not being released?

stale · 2021-09-12T14:19:42Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2021-12-05T11:18:30Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

merelcht · 2023-12-13T17:38:59Z

This ticket hasn't had any recent activity, so I'm closing it.

noklam · 2024-03-21T12:08:38Z

Reopen as this issue still exists, though usage of CacheDataset is not known. Partly due to this is an internal contribution and no docs has ever been written about it other than the API docs #3616.

E.g. the idea is that you could run a pipeline multiple times with the same catalog instance in the process of experimenting with different pipelines and if you start off with a MemoryDataSet inputs which are released after the run, that will make for a very painful experience.

This is a very interesting point and I would love to re-visit this at some points. It seems that years ago Kedro focus more on the interactive flow witht he idea of re-run pipeline with the same DataCatalog, which pretty much is now blocked by the KedroSession unless users use Runner ,DataCatalog directly.

Both MemoryDataSet and CachedDataSet should be released when used as intermediate datasets (Kedro already behaves like this to my knowledge) ✅

MemoryDataSet's used as pipeline inputs or outputs should not be released after runs to support interactive workflows (Kedro already behaves like this as claimed in this GitHub issue) ✅

CachedDataSet should be released after being used as pipeline input, but not as pipeline output (this is where Kedro fails user expectation) ❌

The last point is the real problem here. We may not need to fix this issue immediately, but I think it's important if we think about re-running pipeline in a notebook.

datajoely · 2024-03-21T12:47:38Z

Could I also ask - why not make MemoryDatasets cache by default?

noklam · 2024-03-21T13:25:12Z

@datajoely Can you clarify a bit? MemoryDataset do "cache" in the sense that it only get load once only when passing around nodes. CacheDataset is slightly different because it is used as a wrapper dataset, under the hood it use MemoryDataset as its cache.

datajoely · 2024-03-21T14:25:59Z

I guess my question is - if we can, why not make regular datasets cache by default?

noklam · 2024-03-21T15:13:12Z

@datajoely I think it's because there are 1% edge case that is not solved. I very much welcome this to be changed.

df.to_csv("my_csv.csv")
df2 = pd.read_csv("my_csv.csv")

In most case they will be identical, in badly typed cases they are different, read/load separately is more closed to an orchestrator mode I guess. This is very bad because depends where you start running your pipeline, you get different outputs.

datajoely · 2024-03-21T15:20:25Z

Got it! They say cache invalidation is the 2nd hardest problem for a reason!

noklam added the Issue: Bug Report 🐞 Bug that needs to be fixed label Jul 3, 2021

merelcht changed the title ~~Memory Leakage - Unexpected Caching behavior with CacheDataSet~~ [KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataSet Jul 7, 2021

stale bot added the stale label Sep 12, 2021

stale bot closed this as completed Sep 19, 2021

datajoely reopened this Oct 6, 2021

stale bot removed the stale label Oct 6, 2021

stale bot added the stale label Dec 5, 2021

stale bot closed this as completed Dec 12, 2021

idanov added this to Kedro Framework Feb 14, 2022

merelcht reopened this Mar 7, 2022

merelcht removed the stale label Mar 7, 2022

merelcht added this to the Redesign Catalog and Datasets milestone Feb 6, 2023

merelcht closed this as not planned Won't fix, can't repro, duplicate, stale Dec 13, 2023

github-project-automation bot moved this to Done in Kedro Framework Dec 13, 2023

astrojuanlu removed this from Kedro Framework Jan 24, 2024

merelcht removed this from the Redesign the API for IO (catalog) milestone Feb 2, 2024

noklam reopened this Mar 21, 2024

noklam changed the title ~~[KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataSet~~ [KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataset Mar 21, 2024

noklam mentioned this issue Mar 21, 2024

Stored in memory and saved data are not the same #3236

Closed

merelcht added this to Kedro Framework Jul 12, 2024

merelcht moved this to To Do in Kedro Framework Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataset #819

[KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataset #819

noklam commented Jul 3, 2021 •

edited

Loading

datajoely commented Jul 5, 2021

merelcht commented Jul 5, 2021

merelcht commented Jul 6, 2021

noklam commented Jul 6, 2021

idanov commented Jul 6, 2021

noklam commented Jul 6, 2021

datajoely commented Jul 7, 2021

acnazarejr commented Jul 9, 2021 •

edited

Loading

datajoely commented Jul 12, 2021

acnazarejr commented Jul 14, 2021

datajoely commented Jul 14, 2021

stale bot commented Sep 12, 2021

stale bot commented Dec 5, 2021

merelcht commented Dec 13, 2023

noklam commented Mar 21, 2024 •

edited

Loading

datajoely commented Mar 21, 2024

noklam commented Mar 21, 2024

datajoely commented Mar 21, 2024

noklam commented Mar 21, 2024

datajoely commented Mar 21, 2024

[KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataset #819

[KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataset #819

Comments

noklam commented Jul 3, 2021 • edited Loading

Description

Context

Steps to Reproduce

Expected Result

Actual Result

Your Environment

datajoely commented Jul 5, 2021

merelcht commented Jul 5, 2021

merelcht commented Jul 6, 2021

noklam commented Jul 6, 2021

idanov commented Jul 6, 2021

noklam commented Jul 6, 2021

datajoely commented Jul 7, 2021

acnazarejr commented Jul 9, 2021 • edited Loading

datajoely commented Jul 12, 2021

acnazarejr commented Jul 14, 2021

datajoely commented Jul 14, 2021

stale bot commented Sep 12, 2021

stale bot commented Dec 5, 2021

merelcht commented Dec 13, 2023

noklam commented Mar 21, 2024 • edited Loading

datajoely commented Mar 21, 2024

noklam commented Mar 21, 2024

datajoely commented Mar 21, 2024

noklam commented Mar 21, 2024

datajoely commented Mar 21, 2024

noklam commented Jul 3, 2021 •

edited

Loading

acnazarejr commented Jul 9, 2021 •

edited

Loading

noklam commented Mar 21, 2024 •

edited

Loading