Configurable Fidelity Image Understanding (Image Detail) in v0.4 / v0.2 #4844

Leon0402 · 2024-12-28T11:20:43Z

What feature would you like to be added?

It would be great to be able to set the image detail level (see https://platform.openai.com/docs/guides/vision#low-or-high-fidelity-image-understanding). Setting this parameter to low can significantly reduce costs.
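To make the cost difference concrete, here is a rough estimator. It assumes the token accounting that OpenAI's vision guide described for GPT-4-class models around the time of this issue: a flat 85 tokens for low detail, and 85 tokens plus 170 per 512px tile for high detail after the image is scaled down. The function name and the downscale-only behavior are assumptions for illustration, and the numbers may differ for other models.

```python
import math
from typing import Literal


def estimate_image_tokens(width: int, height: int, detail: Literal["low", "high"] = "high") -> int:
    """Rough token estimate for one image (assumed 85 / 170-per-tile accounting)."""
    if detail == "low":
        # Low detail is billed as a flat amount, regardless of image size.
        return 85

    w, h = float(width), float(height)

    # Step 1: scale down (never up) so the image fits inside 2048 x 2048.
    fit = min(1.0, 2048 / max(w, h))
    w, h = w * fit, h * fit

    # Step 2: scale down (never up) so the shortest side is at most 768px.
    shrink = min(1.0, 768 / min(w, h))
    w, h = w * shrink, h * shrink

    # Step 3: count the 512px tiles needed to cover the scaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024, "high"))  # 765
print(estimate_image_tokens(1024, 1024, "low"))   # 85
```

Under these assumptions, a 1024 x 1024 image costs about nine times more tokens at high detail than at low detail.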

There is a TODO for this in v0.4 (https://github.com/microsoft/autogen/blob/6e0f65b7d18c9f72d4053d84151b4bb9a6027698/python/packages/autogen-ext/src/autogen_ext/models/openai/_openai_client.py#L181), but I could not find an existing issue.

Here is a quick patch for v0.4 for those who need it already:

from typing import Literal

from autogen_core import Image
from openai.types.chat import ChatCompletionContentPartImageParam


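def patch_to_openai_format(image_detail: Literal["auto", "low", "high"]) -> None:
    """Monkey-patch Image.to_openai_format so every image is sent with image_detail."""

    def to_openai_format_patched(
        self, detail: Literal["auto", "low", "high"] = "auto"
    ) -> ChatCompletionContentPartImageParam:
        # The per-call `detail` argument is ignored on purpose; the value
        # captured from the outer function is always used instead.
        return {"type": "image_url", "image_url": {"url": self.data_uri, "detail": image_detail}}

    # Replace the method on the class so all Image instances pick it up.
    Image.to_openai_format = to_openai_format_patched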

Call this BEFORE anything else, for example:

patch_to_openai_format(image_detail=config.image_quality)
from your_main_script import your_main

For v0.2 it is:

from autogen.agentchat import utils
import autogen.agentchat.contrib.img_utils
from autogen.agentchat.contrib.img_utils import get_pil_image, get_image_data, convert_base64_to_data_uri


def patch_gpt4v_formatter(image_detail: str):
    def gpt4v_formatter_patched(prompt: str, img_format: str = "uri") -> list[str | dict]:
        assert img_format in ["uri", "url", "pil"]

        output = []
        last_index = 0
        image_count = 0

        for parsed_tag in utils.parse_tags_from_content("img", prompt):
            image_location = parsed_tag["attr"]["src"]
            try:
                if img_format == "pil":
                    img_data = get_pil_image(image_location)
                elif img_format == "uri":
                    img_data = get_image_data(image_location)
                    img_data = convert_base64_to_data_uri(img_data)
                elif img_format == "url":
                    img_data = image_location
                else:
                    raise ValueError(f"Unknown image format {img_format}")
            except Exception as e:
                print(f"Warning! Unable to load image from {image_location}, because {e}")
                continue

            output.append({"type": "text", "text": prompt[last_index : parsed_tag["match"].start()]})

            # Here we patch the method and add the image detail!
            output.append({"type": "image_url", "image_url": {"url": img_data, "detail": image_detail}})

            last_index = parsed_tag["match"].end()
            image_count += 1

        output.append({"type": "text", "text": prompt[last_index:]})
        return output

    autogen.agentchat.contrib.img_utils.gpt4v_formatter = gpt4v_formatter_patched
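One likely reason the patch has to run before other imports: assigning to a module attribute only affects later lookups through the module. Code that already ran `from ... import gpt4v_formatter` keeps its reference to the original function. The sketch below demonstrates this with a throwaway module object; `lib` and `formatter` are hypothetical stand-ins, not autogen names.

```python
import types

# Hypothetical stand-in for a library module such as img_utils.
lib = types.ModuleType("lib")
lib.formatter = lambda prompt: f"original:{prompt}"

# Simulates `from lib import formatter` executed by other code BEFORE the patch.
early_bound = lib.formatter

# The patch: replace the attribute on the module object.
lib.formatter = lambda prompt: f"patched:{prompt}"

# Simulates `from lib import formatter` executed AFTER the patch.
late_bound = lib.formatter

print(early_bound("x"))  # original:x
print(late_bound("x"))   # patched:x
```

Any module that imported the function by name before the patch was applied would therefore still send images without the detail setting.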

Why is this needed?

OpenAI supports it, so it should be configurable.


ekzhu · 2024-12-29T05:57:13Z

This is an important feature. I included it in the 0.4.1 milestone.

My first thought was to pass this as a custom extra_create_args entry to ChatCompletionClient's create and create_stream. However, I feel it is best associated with the data: perhaps the Image object can carry a default resolution setting. cc @jackgerrits. It should be a simple change.
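The "default setting on the Image object" idea could look roughly like the sketch below. `ImageWithDetail` is a hypothetical stand-in written for illustration; it is not the actual autogen_core.Image class, and the per-call override is an assumption about how such an API might behave.

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal, Optional

Detail = Literal["auto", "low", "high"]


@dataclass
class ImageWithDetail:
    """Hypothetical image wrapper that stores a default detail level with the data."""

    data_uri: str
    detail: Detail = "auto"

    def to_openai_format(self, detail: Optional[Detail] = None) -> Dict[str, Any]:
        # A per-call value, if given, overrides the default stored on the image.
        return {
            "type": "image_url",
            "image_url": {"url": self.data_uri, "detail": detail or self.detail},
        }


img = ImageWithDetail("data:image/png;base64,AAAA", detail="low")
print(img.to_openai_format()["image_url"]["detail"])        # low
print(img.to_openai_format("high")["image_url"]["detail"])  # high
```

Keeping the value on the image means it travels with the message, so it remains available when messages are stored or converted later.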

Leon0402 · 2024-12-29T07:57:23Z

I think making it an attribute of ChatCompletionClient or its base classes makes the most sense. Some thoughts:

The information should then also be stored in the Image class, so it can be retrieved later (think about "Convert AgentChat v0.4 messages to v0.2 format" #4833).
Perhaps it makes sense to model this as a capability somehow? I am not sure whether this is OpenAI-specific or something other models have already adopted or will adopt in the future. At the least, what low / high actually means is probably very OpenAI-specific (e.g. that it then processes the image 7 times, ...).
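A capability-based variant might look like the following sketch. The `image_detail` flag, `CapabilitiesSketch`, and `image_part` are all hypothetical names introduced here for illustration; nothing like this exists in autogen as far as this thread shows.

```python
from typing import Any, Dict, Literal, TypedDict


class CapabilitiesSketch(TypedDict):
    """Hypothetical capability flags for a model client."""

    vision: bool
    image_detail: bool  # assumed flag: does the model accept a detail hint?


def image_part(
    data_uri: str,
    detail: Literal["auto", "low", "high"],
    caps: CapabilitiesSketch,
) -> Dict[str, Any]:
    """Build an image content part, adding `detail` only if the model supports it."""
    image_url: Dict[str, Any] = {"url": data_uri}
    if caps["image_detail"]:
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}


supported: CapabilitiesSketch = {"vision": True, "image_detail": True}
unsupported: CapabilitiesSketch = {"vision": True, "image_detail": False}

print(image_part("data:image/png;base64,AAAA", "low", supported))
print(image_part("data:image/png;base64,AAAA", "low", unsupported))
```

With this shape, the detail hint is silently dropped for models that do not declare support, rather than being sent and possibly rejected.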

github-actions bot added the needs-triage label Dec 28, 2024

ekzhu added this to the 0.4.1 milestone Dec 29, 2024

ekzhu added proj-extensions and removed needs-triage labels Dec 29, 2024
