Configurable Fidelity Image Understanding (Image Detail) in v0.4 / v0.2 #4844

Open
Leon0402 opened this issue Dec 28, 2024 · 2 comments
@Leon0402
Contributor

What feature would you like to be added?

It would be great to be able to specify the image detail level (low- or high-fidelity image understanding, see https://platform.openai.com/docs/guides/vision#low-or-high-fidelity-image-understanding). Setting the detail parameter to low can significantly reduce costs.

There is a TODO for this in v0.4 (https://github.com/microsoft/autogen/blob/6e0f65b7d18c9f72d4053d84151b4bb9a6027698/python/packages/autogen-ext/src/autogen_ext/models/openai/_openai_client.py#L181), but I could not find an existing issue.

Here is a quick patch for v0.4 for those who need it already:

from typing import Literal

from autogen_core import Image
from openai.types.chat import ChatCompletionContentPartImageParam


def patch_to_openai_format(image_detail: Literal["auto", "low", "high"]) -> None:
    """Monkey-patch Image.to_openai_format so every image is sent with image_detail."""

    def to_openai_format_patched(
        self, detail: Literal["auto", "low", "high"] = "auto"
    ) -> ChatCompletionContentPartImageParam:
        # The per-call `detail` argument is deliberately ignored: the value given to
        # patch_to_openai_format always wins.
        return {"type": "image_url", "image_url": {"url": self.data_uri, "detail": image_detail}}

    Image.to_openai_format = to_openai_format_patched

Call this BEFORE anything else, e.g.:

patch_to_openai_format(image_detail=config.image_quality)
from your_main_script import your_main

For v0.2 it is:

from autogen.agentchat import utils
import autogen.agentchat.contrib.img_utils
from autogen.agentchat.contrib.img_utils import get_pil_image, get_image_data, convert_base64_to_data_uri


def patch_gpt4v_formatter(image_detail: str):
    def gpt4v_formatter_patched(prompt: str, img_format: str = "uri") -> list[str | dict]:
        assert img_format in ["uri", "url", "pil"]

        output = []
        last_index = 0
        image_count = 0

        for parsed_tag in utils.parse_tags_from_content("img", prompt):
            image_location = parsed_tag["attr"]["src"]
            try:
                if img_format == "pil":
                    img_data = get_pil_image(image_location)
                elif img_format == "uri":
                    img_data = get_image_data(image_location)
                    img_data = convert_base64_to_data_uri(img_data)
                elif img_format == "url":
                    img_data = image_location
                else:
                    raise ValueError(f"Unknown image format {img_format}")
            except Exception as e:
                print(f"Warning! Unable to load image from {image_location}, because {e}")
                continue

            output.append({"type": "text", "text": prompt[last_index : parsed_tag["match"].start()]})

            # Here we patch the method and add the image detail!
            output.append({"type": "image_url", "image_url": {"url": img_data, "detail": image_detail}})

            last_index = parsed_tag["match"].end()
            image_count += 1

        output.append({"type": "text", "text": prompt[last_index:]})
        return output

    autogen.agentchat.contrib.img_utils.gpt4v_formatter = gpt4v_formatter_patched
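
A note on why the patch has to run before other imports: replacing a module attribute only affects code that looks the function up on the module afterwards. Any code that already ran `from ... import gpt4v_formatter` keeps its own reference to the original function. The sketch below illustrates this with a stand-in module; the module object and the simplified function bodies are hypothetical and only mirror the names used above.

```python
import types

# Hypothetical stand-in for a library module that exposes a formatter function.
img_utils = types.ModuleType("img_utils")


def gpt4v_formatter(prompt: str) -> str:
    return "original"


img_utils.gpt4v_formatter = gpt4v_formatter

# Equivalent to a consumer doing `from img_utils import gpt4v_formatter`
# BEFORE the patch: it binds the original function object.
early_reference = img_utils.gpt4v_formatter


def gpt4v_formatter_patched(prompt: str) -> str:
    return "patched"


# Apply the patch by replacing the module attribute.
img_utils.gpt4v_formatter = gpt4v_formatter_patched

# Equivalent to a consumer importing the name AFTER the patch.
late_reference = img_utils.gpt4v_formatter

print(early_reference("x"))  # original
print(late_reference("x"))   # patched
```

If the patch seems to have no effect, an import that ran before it is the first thing to check.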

Why is this needed?

OpenAI supports this parameter, so it should be configurable.
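
To give a feel for the potential savings, here is a rough estimate of image input tokens per detail level. It assumes the tile-based accounting described in OpenAI's vision guide for GPT-4-class models: a flat 85 tokens for low detail, and for high detail 85 base tokens plus 170 tokens per 512 px tile after downscaling. The function name `estimate_image_tokens` is made up for this sketch, and the constants are assumptions that may not hold for every model.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens under the assumed tile-based rules."""
    if detail == "low":
        return 85  # assumed flat cost, independent of image size

    # Step 1: scale down to fit within a 2048 x 2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: scale down so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count 512 px tiles; assumed 170 tokens per tile plus 85 base tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024, "low"))   # 85
print(estimate_image_tokens(1024, 1024, "high"))  # 765
print(estimate_image_tokens(2048, 4096, "high"))  # 1105
```

Under these assumptions a 1024 x 1024 image costs about nine times more tokens in high detail than in low detail.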

@ekzhu ekzhu added this to the 0.4.1 milestone Dec 29, 2024
@ekzhu
Collaborator

ekzhu commented Dec 29, 2024

This is an important feature. I included it in the 0.4.1 milestone.

My first thought is to include this as a custom extra_create_args in ChatCompletionClient's create and create_stream. However, I feel this is best associated with the data. Perhaps the Image object can contain a default resolution setting. cc @jackgerrits. It should be a simple change.
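
A minimal sketch of what "associated with the data" could look like. `ImageWithDetail` is a hypothetical stand-in, not the actual `autogen_core.Image` class; only the payload shape is taken from the patch in the issue description.

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal

Detail = Literal["auto", "low", "high"]


@dataclass
class ImageWithDetail:
    """Hypothetical image wrapper that carries its own detail setting."""

    data_uri: str
    detail: Detail = "auto"  # default resolution setting stored with the data

    def to_openai_format(self) -> Dict[str, Any]:
        # Same payload shape as the patch above, but the detail comes from the image.
        return {"type": "image_url", "image_url": {"url": self.data_uri, "detail": self.detail}}


default_part = ImageWithDetail("data:image/png;base64,AAAA").to_openai_format()
low_part = ImageWithDetail("data:image/png;base64,AAAA", detail="low").to_openai_format()
print(default_part["image_url"]["detail"])  # auto
print(low_part["image_url"]["detail"])      # low
```

With this shape, the setting travels with each image, so it survives serialization and conversion between message formats.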

@Leon0402
Contributor Author

I think making it an attribute of ChatCompletionClient or its base classes makes the most sense. Some thoughts:

  • The information should then be stored in the Image class, so it can be retrieved later (think of "Convert AgentChat v0.4 messages to v0.2 format", #4833)
  • Perhaps it makes sense to have this as a capability somehow? I am not sure whether this is OpenAI-specific or something other models have already adopted or will adopt in the future. But I guess at least what low / high means might be very OpenAI-specific (e.g. that it processes the images 7 times then, ...)
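
One way the two ideas above could combine, as a sketch only: the client holds a default detail and a capability flag, while an image may carry its own override. `SketchClient`, `SketchImage`, `default_image_detail`, and `supports_image_detail` are all hypothetical names, not part of the AutoGen API.

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal, Optional

Detail = Literal["auto", "low", "high"]


@dataclass
class SketchImage:
    data_uri: str
    detail: Optional[Detail] = None  # None means "use the client's default"


@dataclass
class SketchClient:
    default_image_detail: Detail = "auto"
    supports_image_detail: bool = True  # hypothetical capability flag

    def format_image(self, image: SketchImage) -> Dict[str, Any]:
        image_url: Dict[str, Any] = {"url": image.data_uri}
        if self.supports_image_detail:
            # A per-image setting takes precedence over the client default.
            image_url["detail"] = image.detail or self.default_image_detail
        return {"type": "image_url", "image_url": image_url}


client = SketchClient(default_image_detail="low")
uri = "data:image/png;base64,AAAA"
from_default = client.format_image(SketchImage(uri))
from_override = client.format_image(SketchImage(uri, detail="high"))
unsupported = SketchClient(supports_image_detail=False).format_image(SketchImage(uri))

print(from_default["image_url"]["detail"])   # low
print(from_override["image_url"]["detail"])  # high
print("detail" in unsupported["image_url"])  # False
```

Clients for models without such a setting would simply omit the field, which keeps the Image data portable across providers.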
