Support for new formats of content #21

Open
hellohejinyu opened this issue May 16, 2024 · 2 comments

@hellohejinyu

```javascript
import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What’s in this image?" },
          {
            type: "image_url",
            image_url: {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0]);
}
main();
```

OpenAI supports passing images and text in the same message, but the token cost of an image depends on the image size and the processing (detail) mode. So I think the caller needs to supply these two parameters, the image size and the detail mode, manually so that the image's tokens can be calculated.
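For reference, the processing mode here corresponds to the optional `detail` field on the `image_url` part (`"low"`, `"high"`, or `"auto"`). Below is a minimal sketch of a helper that always sets it explicitly, so a token counter would not have to guess. `ImageInput` and `toImagePart` are hypothetical names, and the width and height are only carried along for counting; they are not sent to the API.

```typescript
type Detail = "low" | "high" | "auto";

// Hypothetical shape: what a token counter would need to know about each image.
interface ImageInput {
  url: string;
  width: number;
  height: number;
  detail: Detail;
}

// Builds the image content part of a chat message with an explicit detail mode.
function toImagePart(image: ImageInput) {
  return {
    type: "image_url" as const,
    image_url: { url: image.url, detail: image.detail },
  };
}
```

The returned object can be placed in the `content` array next to the `{ type: "text", ... }` part.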


@sean-nicholas

I guess you could add a library that extracts the size from images, like https://www.npmjs.com/package/image-size
It should be pretty easy to fetch an image, or create a buffer from base64, and pass that to image-size. But currently this won't work in Cloudflare Workers: image-size/image-size#405
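Where a dependency is not an option, the size can also be read by hand for simple formats. This is a swapped-in technique, not how image-size works internally: a rough sketch that reads the width and height from a PNG header using only `Uint8Array` and `DataView`. `readPngSize` is a hypothetical helper, and it assumes PNG input only; JPEG, WebP and others would need their own parsing.

```typescript
// Reads width and height from the IHDR chunk of a PNG file.
// Assumption: the input is a PNG; other formats are rejected.
function readPngSize(bytes: Uint8Array): { width: number; height: number } {
  const signature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
  if (bytes.length < 24 || signature.some((b, i) => bytes[i] !== b)) {
    throw new Error("Not a PNG file");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // IHDR is always the first chunk: width at byte 16, height at byte 20, big-endian.
  return { width: view.getUint32(16), height: view.getUint32(20) };
}
```

The bytes could come from `new Uint8Array(await (await fetch(url)).arrayBuffer())` or from a decoded base64 string; only the first 24 bytes are needed.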

I'm not sure you can tell which detail level is chosen when you don't send one (that is, in auto mode), but from the documentation I would guess that an image smaller than 512px in both directions gets low, and anything else gets high.
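Written out as code, that guess would look roughly like this. To be clear, this is only the guess from this comment, not documented behaviour, and `guessAutoDetail` is a hypothetical name.

```typescript
type ResolvedDetail = "low" | "high";

// Guess, not documented behaviour: "auto" resolves to "low" only when
// both sides are smaller than 512px, otherwise to "high".
function guessAutoDetail(width: number, height: number): ResolvedDetail {
  return width < 512 && height < 512 ? "low" : "high";
}
```

Sending `detail` explicitly avoids relying on this guess at all.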

Funny, there are two descriptions of the costs in the official docs: the one that you posted and this one: https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding
In this section they say the cost is 65 tokens per crop:

  • low will enable the "low res" mode. The model will receive a low-res 512px x 512px version of the image, and represent the image with a budget of 65 tokens. This allows the API to return faster responses and consume fewer input tokens for use cases that do not require high detail.
  • high will enable "high res" mode, which first allows the model to see the low res image and then creates detailed crops of input images as 512px squares based on the input image size. Each of the detailed crops uses twice the token budget (65 tokens) for a total of 129 tokens.

I'm not quite sure why two times 65 should be 129, but hey 🤷‍♂️😁

@hellohejinyu
Author

```typescript
function calculateHighDetailTokens(width: number, height: number): number {
  // First, check if the image needs to be scaled to fit within the 2048 x 2048 size limit
  if (width > 2048 || height > 2048) {
    const aspectRatio = width / height;
    if (width > height) {
      width = 2048;
      height = Math.round(2048 / aspectRatio);
    } else {
      height = 2048;
      width = Math.round(2048 * aspectRatio);
    }
  }

  // Next, scale the image so that the shortest side is 768px
  const minSideLength = 768;
  const currentMinSide = Math.min(width, height);
  if (currentMinSide > minSideLength) {
    const scaleFactor = minSideLength / currentMinSide;
    width = Math.round(width * scaleFactor);
    height = Math.round(height * scaleFactor);
  }

  // Calculate how many 512px tiles the image is composed of
  const tilesWide = Math.ceil(width / 512);
  const tilesHigh = Math.ceil(height / 512);
  const totalTiles = tilesWide * tilesHigh;

  // The token cost for each tile is 170, with an additional 85 tokens added at the end
  const totalTokens = totalTiles * 170 + 85;

  return totalTokens;
}

// Example usage
console.log(calculateHighDetailTokens(1024, 1024)); // Should output 765
console.log(calculateHighDetailTokens(2048, 4096)); // Should output 1105
```

In our project we actually only use high mode, and the front end already knows the image width and height when the image is uploaded. So I asked GPT to write code that calculates the token count in high mode. This temporarily solves the problem of calculating tokens for image messages. 😂
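For anyone who also needs low mode, here is a sketch that follows the same steps as the function above and adds a `detail` argument. It assumes low mode costs a flat 85 tokens regardless of size, which is how I read the docs at the time; `calculateImageTokens` is a hypothetical name and the constants may not hold for other models.

```typescript
type Detail = "low" | "high";

const BASE_TOKENS = 85; // flat cost; assumed to be the whole cost in low mode
const TILE_TOKENS = 170; // cost per 512px tile in high mode

function calculateImageTokens(width: number, height: number, detail: Detail): number {
  if (detail === "low") return BASE_TOKENS;

  // Fit within 2048 x 2048, preserving the aspect ratio (never upscale).
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = Math.round(width * fit);
  let h = Math.round(height * fit);

  // Shrink so the shortest side is at most 768px (never upscale).
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Count 512px tiles and add the base cost.
  return Math.ceil(w / 512) * Math.ceil(h / 512) * TILE_TOKENS + BASE_TOKENS;
}

console.log(calculateImageTokens(1024, 1024, "high")); // 765
console.log(calculateImageTokens(2048, 4096, "high")); // 1105
console.log(calculateImageTokens(4096, 4096, "low")); // 85
```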
