[KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataset #819
Comments
Hi @noklam, thanks for flagging this and writing up such a thorough explanation in your blog post 👏 😄
Hi @noklam, I've found that this is not actually a bug but explicitly designed this way. The main reason is that pipeline inputs and outputs are deliberately kept in memory so they can be inspected after a run, for debugging purposes.
@MerelTheisenQB Thanks for the explanation. If it is only for debugging purposes, shouldn't there be a flag to enable this behaviour rather than making it the default? Maybe I am missing something here, but I don't see why data should stay in memory once no remaining node needs it. In addition, CachedDataSet is essentially a MemoryDataSet wrapped around another dataset. This is a common pattern for my training pipeline: I'll use CachedDataSet for these functions for sure, because I don't want to write 10GB of data and then read it back from disk (it could take minutes). Thoughts?
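For reference, a minimal sketch of that pattern, assuming Kedro 0.17-era APIs and an illustrative file path:

```python
from kedro.io import CachedDataSet
from kedro.extras.datasets.pandas import ParquetDataSet

# Wrap the large intermediate output so downstream nodes reuse the
# in-memory copy instead of reading ~10GB back from disk.
features = CachedDataSet(
    dataset=ParquetDataSet(filepath="data/04_feature/features.parquet")
)
```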
@noklam Thanks for flagging this. This functionality was implemented long ago, as @MerelTheisenQB pointed out, and the main purpose is to enable interactive workflows more generally (and, as an extension of that, debugging). E.g. the idea is that you could run a pipeline multiple times with the same catalog instance while experimenting with different pipelines, and if you start off with a CachedDataSet, later runs can reuse the data that is already loaded. Most of the time this shouldn't cause problems, since one would normally use CachedDataSet only for data that comfortably fits in memory. For the case of pipeline outputs unconsumed by anyone after the run, there is no reason to define them as CachedDataSet, so we could release every cached dataset as soon as no remaining node in the run needs it.
Would such behaviour suit your needs @noklam?
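A rough sketch of the proposed rule, not Kedro's actual runner code: after each node runs, release every dataset that no remaining node lists as an input. `SimpleNode` and `run_with_release` are hypothetical; the catalog is assumed to have all names registered, as Kedro's runners ensure by adding default MemoryDataSets:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimpleNode:
    func: Callable
    inputs: List[str]
    outputs: List[str]


def run_with_release(nodes: List[SimpleNode], catalog) -> None:
    """Run nodes in order, releasing datasets no remaining node consumes."""
    for i, node in enumerate(nodes):
        results = node.func(*(catalog.load(name) for name in node.inputs))
        if len(node.outputs) == 1:
            results = [results]
        for name, data in zip(node.outputs, results):
            catalog.save(name, data)
        # Release anything that no later node lists as an input.
        still_needed = {name for later in nodes[i + 1:] for name in later.inputs}
        for name in set(node.inputs) | set(node.outputs):
            if name not in still_needed:
                catalog.release(name)  # DataCatalog.release frees in-memory copies
```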
@idanov Thanks, I rarely use Kedro in interactive mode, and I can now see why this was implemented this way. Yes, I think this matches my expectation: any datasets that are not inputs to the remaining nodes should be released.
Created ticket on backlog.
I am facing the same problem; however, my dataset is a pandas.ParquetDataSet. Even after the node has finished and the dataframe has been saved to disk, the data is still in memory. The memory is only released after the whole pipeline has finished executing. Maybe I missed something, but do I need to tell Kedro something so it doesn't keep the data in memory?
@acnazarejr - to confirm, is this also using the CachedDataSet?
@datajoely, sorry for the delayed response. I'm not using the CachedDataSet.
Hi @acnazarejr, I don't think it should be behaving this way - how do you know the memory is not being released?
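One way to check, as a hypothetical diagnostic: log the process's resident memory from a hook after each node. `psutil` and Kedro's `hook_impl` are real APIs; the hook class itself is illustrative and would be registered via `HOOKS` in `settings.py`:

```python
import psutil
from kedro.framework.hooks import hook_impl


class MemoryLoggingHooks:
    @hook_impl
    def after_node_run(self, node):
        # Resident set size of the current process, in megabytes.
        rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
        print(f"RSS after {node.name}: {rss_mb:.0f} MB")
```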
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This ticket hasn't had any recent activity, so I'm closing it.
Reopening, as this issue still exists, though usage of CachedDataSet appears to be rare.
This is a very interesting point and I would love to revisit it at some point. It seems that years ago Kedro focused more on the interactive flow, with the idea of re-running a pipeline with the same DataCatalog; that is now pretty much blocked by KedroSession, unless users bypass the session and call a runner with the same catalog themselves.

The last point is the real problem here. We may not need to fix this issue immediately, but I think it's important if we think about re-running pipelines in a notebook.
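For context, the interactive flow being described looks roughly like this; a sketch assuming `pipeline` and `catalog` are already constructed, deliberately bypassing KedroSession:

```python
from kedro.runner import SequentialRunner

runner = SequentialRunner()
# First run populates any CachedDataSet entries in the catalog.
runner.run(pipeline, catalog)
# Re-running with the same catalog instance reuses the cached data
# instead of recomputing or reloading it.
runner.run(pipeline, catalog)
```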
Could I also ask - why not make MemoryDatasets cache by default?
@datajoely Can you clarify a bit?
I guess my question is - if we can, why not make regular datasets cache by default?
@datajoely I think it's because there is a 1% edge case that is not solved. I would very much welcome this to be changed.

```python
df.to_csv("my_csv.csv")
df2 = pd.read_csv("my_csv.csv")
```

In most cases they will be identical; in badly typed cases they are different. Saving and re-loading separately is closer to an orchestrator mode, I guess. This is very bad because, depending on where you start running your pipeline, you get different outputs.
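One concrete instance of that edge case, for illustration: a CSV round trip silently changes dtypes, e.g. datetimes come back as plain strings.

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2021-07-02", "2021-07-03"])})
df.to_csv("my_csv.csv", index=False)
df2 = pd.read_csv("my_csv.csv")

print(df["ts"].dtype)   # datetime64[ns]
print(df2["ts"].dtype)  # object -- read_csv does not parse dates by default
```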
Got it! They say cache invalidation is the 2nd hardest problem for a reason!
Description
When I have a pipeline like this, I expected that once a node finishes executing, its inputs would be released to reduce the memory footprint. I found this is not the case, and it causes a MemoryError for my pipeline.
```
A -> B  # Step1
B -> C  # Step2
C -> D  # Step3
```
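Expressed as a Kedro pipeline, this might look as follows (the node functions are hypothetical placeholders):

```python
from kedro.pipeline import Pipeline, node


def step1(a):
    return a  # placeholder transform


def step2(b):
    return b


def step3(c):
    return c


pipeline = Pipeline([
    node(step1, inputs="A", outputs="B", name="step1"),
    node(step2, inputs="B", outputs="C", name="step2"),
    node(step3, inputs="C", outputs="D", name="step3"),
])
```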
For example, as soon as Step1 is finished, there is no reason why we should keep A in memory anymore.

Context
It keeps unnecessary variables in memory. Originally I was trying to write some notes to dive deep into Kedro's dataset management, and I found that the behaviour is not what I would expect.
Steps to Reproduce
https://noklam.github.io/blog/posts/2021-07-02-kedro-datacatalog.html
Expected Result
For a non-cached dataset, the data should exist in memory only while its node needs it. For a CachedDataSet, the data should be released as soon as it is not needed anymore.
Actual Result
Datasets remain in memory until the entire pipeline run has finished.
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`):
- Python version used (`python -V`):