Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add remote (cloud) options for --conf-source #3982

Open
OlegPodlipalin opened this issue Jul 2, 2024 · 7 comments
Open

Add remote (cloud) options for --conf-source #3982

OlegPodlipalin opened this issue Jul 2, 2024 · 7 comments
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@OlegPodlipalin
Copy link

Description

Currently, we deploy Kedro projects as part of our SaaS solutions within Docker containers, managed by a Prefect server on AWS ECS. These containers are configured to handle all possible configurations for various tenants. However, as we transition to an on-prem solution, we face a challenge where exposing all tenant configurations within a single container poses security and relevance issues since the client will have access to configurations that are not pertinent to them.

We're looking for a way to modularize configuration management without exposing unnecessary data. While Kedro supports cloud storage in the data catalog via the "protocol" option with the fsspec library, it does not support a similar mechanism for the configuration source. Enhancing this functionality would significantly streamline the deployment process for our on-prem solutions.

Context

This change is crucial for us as it would allow for more secure and relevant configurations per tenant, without the risk of exposing data not intended for them. Implementing this feature would also align configuration management with how data sources are currently handled, making our systems more robust and flexible. This feature could benefit other users who require more granular control over configuration management in environments where host file systems are not a viable option.

Possible Implementation

A potential implementation could involve extending the existing functionality of Kedro's OmegaConfLoader to recognize and handle various storage protocols using the _parse_filepath function from kedro/io/core.py lines 690-728:

def _parse_filepath(filepath: str) -> dict[str, str]:
    """Split filepath on protocol and path. Based on `fsspec.utils.infer_storage_options`.

    Args:
        filepath: Either local absolute file path or URL (s3://bucket/file.csv)

    Returns:
        Parsed filepath.
    """
    if (
        re.match(r"^[a-zA-Z]:[\\/]", filepath)
        or re.match(r"^[a-zA-Z0-9]+://", filepath) is None
    ):
        return {"protocol": "file", "path": filepath}

    parsed_path = urlsplit(filepath)
    protocol = parsed_path.scheme or "file"

    if protocol in HTTP_PROTOCOLS:
        return {"protocol": protocol, "path": filepath}

    path = parsed_path.path
    if protocol == "file":
        windows_path = re.match(r"^/([a-zA-Z])[:|]([\\/].*)$", path)
        if windows_path:
            path = ":".join(windows_path.groups())

    options = {"protocol": protocol, "path": path}

    if parsed_path.netloc and protocol in CLOUD_PROTOCOLS:
        host_with_port = parsed_path.netloc.rsplit("@", 1)[-1]
        host = host_with_port.rsplit(":", 1)[0]
        options["path"] = host + options["path"]
        # Azure Data Lake Storage Gen2 URIs can store the container name in the
        # 'username' field of a URL (@ syntax), so we need to add it to the path
        if protocol == "abfss" and parsed_path.username:
            options["path"] = parsed_path.username + "@" + options["path"]

    return options

integrating this logic into the OmegaConfLoader kedro/config/omegaconf_config.py lines 127-138 under the else statement, for example:

file_mimetype, _ = mimetypes.guess_type(conf_source)
if file_mimetype == "application/x-tar":
    self._protocol = "tar"
elif file_mimetype in (
    "application/zip",
    "application/x-zip-compressed",
    "application/zip-compressed",
):
    self._protocol = "zip"
else:
    self._protocol = "file"
self._fs = fsspec.filesystem(protocol=self._protocol, fo=conf_source)

will enable the use of fsspec protocols for configuration files would allow configurations to be fetched dynamically from specified storages, whether they are local filesystems or cloud-based.
Credentials in this case can probably be provided via environmental variables, which are set up during the starting of the container.

@OlegPodlipalin OlegPodlipalin added the Issue: Feature Request New feature or improvement to existing feature label Jul 2, 2024
@astrojuanlu
Copy link
Member

Thanks @OlegPodlipalin for opening this issue.

@deepyaman, @bpmeek and others were discussing about something like this a few days ago https://linen-slack.kedro.org/t/19613568/hey-everyone-a-while-ago-i-took-part-in-a-kedro-user-group-a#ceca4bdb-8037-4ec3-80e3-7eee0476bc6b and in fact @bpmeek open sourced https://github.com/bpmeek/Kedro-Universal-Catalog prompted by some past discussions.

Something similar was already proposed in #1239 and probably in other places.

We will add this to our backlog and discuss it shortly.

@astrojuanlu
Copy link
Member

@datajoely I quickly scanned #3094 and I didn't see any mentions to config, do you have any extra insight? Also @DimedS @ankatiyar do you think this would help with, say, Airflow deployments?

@datajoely
Copy link
Contributor

Yes, IIRC we mention how the introduction of --conf-source is already a good step in the right direction, it was much worse before!

I actually really want this to happen, we actually support this kind of pattern in the micropackaging piece we're about to remove... so we should just steal the already written implementation.

image

@DimedS
Copy link
Member

DimedS commented Jul 3, 2024

@datajoely I quickly scanned #3094 and I didn't see any mentions to config, do you have any extra insight? Also @DimedS @ankatiyar do you think this would help with, say, Airflow deployments?

Yes, it will be quite useful for Airflow deployment for the same reason. It won't be necessary to pack the configuration into the same container if you can retrieve it from remote storage. However, I'm trying to understand how this will increase security. Could you please comment on this, @OlegPodlipalin? If someone gains access to the container, wouldn't they be able to download all configurations stored in cloud services?

@OlegPodlipalin
Copy link
Author

I'm trying to understand how this will increase security. Could you please comment on this, @OlegPodlipalin?

@DimedS
It will increase security while adding an on-prem solution to existing SaaS. I our case we plan to mirror our ECR with all it's images to a client's infrastructure. The intension is not to build another set of images. The configuration for a new on-prem client will be created separately anyways and it will be the only set available for them, whereas all the rest of the configurations (for all the SaaS) will be stored in our s3.

@ElenaKhaustova
Copy link
Contributor

@OlegPodlipalin, thank you for the issue created! We've discussed it and we think is a good idea, so we would like to move forward with it. Do you want to create a PR?

@OlegPodlipalin
Copy link
Author

@ElenaKhaustova
I'm ready to get started on this. I'll fork the repo and work on the solution. For accessing the cloud with the conf folder, I'm thinking of using environment variables for credentials.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Status: To Do
Status: Todo
Development

No branches or pull requests

6 participants