Quaintitative

Snippets: OpenAI Charges for Words, So I Sent It a Picture Instead


Format engineering instead of prompt engineering?

Inspiration: What Prompted My Experiment

We’ve all likely encountered the tip that to save money on audio transcription with services like OpenAI’s Whisper, you can simply speed up the audio beforehand. It’s a surprisingly effective trick: shorter audio, fewer processing minutes, and often negligible loss in transcription quality. This got me thinking about other ways we interact with large language models (LLMs) and whether similar “format shifts” could yield unexpected efficiencies.
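The economics of that trick are easy to sketch. Assuming Whisper-style per-minute pricing (the $0.006/min rate below is illustrative and may not match current OpenAI pricing), speeding audio up directly divides the bill:

```python
# Rough cost arithmetic for the sped-up-audio trick.
# The $0.006/min rate is an assumption for illustration, not a quote
# of current OpenAI pricing.

def transcription_cost(duration_min: float, speedup: float = 1.0,
                       rate_per_min: float = 0.006) -> float:
    """Cost of transcribing audio after speeding it up by `speedup`x."""
    return (duration_min / speedup) * rate_per_min

baseline = transcription_cost(60)             # one hour at normal speed
faster = transcription_cost(60, speedup=1.5)  # same hour, played at 1.5x
print(f"baseline: ${baseline:.3f}, 1.5x: ${faster:.3f}")
```

A 1.5x speedup cuts the cost by a third, and in practice transcription quality often holds up.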

The Counter-Intuitive Question: Text to Image for Cheaper Answers?

OpenAI, like many LLM providers, charges based on token usage for text input and output. This naturally leads to strategies around prompt optimization and keeping input contexts concise. But what if we fundamentally changed the nature of the input itself?

My slightly mad idea:

Could we turn a text document into an image and then ask a multimodal LLM to “read” it and answer questions, potentially using fewer tokens and thus costing less?

It sounds counter-intuitive. We’re essentially taking structured data (text) and converting it into a less structured format (an image of that text). However, the pricing models for vision capabilities in LLMs are different. Some models charge based on the resolution or tiling of the image, which might, under certain conditions, be cheaper than the equivalent number of text tokens.
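To see when an image could come out cheaper, it helps to estimate image token costs. The sketch below follows the tile-based counting rule OpenAI has documented for gpt-4o-class vision input (a flat base cost plus a per-tile cost after resizing); the exact constants vary by model and may have changed since writing:

```python
import math

# Estimate vision input tokens under the tile-based rule documented
# for gpt-4o-class models: 85 base tokens + 170 per 512px tile, after
# fitting the image within 2048x2048 and scaling the shortest side
# to 768. These constants are assumptions that differ by model.

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85  # flat cost, regardless of resolution
    # Step 1: fit within a 2048x2048 box.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512px tiles.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(image_tokens(1024, 1024))  # 4 tiles after resizing -> 765 tokens
```

If a ~2,000-word story runs to roughly 2,700 text tokens, an image that counts as only a few tiles can undercut it, which is exactly the gap this experiment probes.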

Diving into the Experiment

To explore this, I designed a simple Question Answering (QA) task using a short Python script. Here’s the gist of the setup:

  • The Subject Matter: A ~2000-word fictional story (“The Keeper of Lost Tides”).

  • The Challenge: A specific question requiring the AI to find and synthesize information from two distinct parts of the text: “Based on the clues embedded in the text, what was the true identity of the village blacksmith, Mr. Graeme?”

  • The Models: Several OpenAI models with vision capabilities (gpt-4o-mini, gpt-4o, gpt-4.1) and the o1 model.

  • Text-Based QA (Baseline): Sending the raw text of the story and the question directly to the model.

  • Image-Based QA (The Experiment): Rendering the entire story into an image using different fonts and sizes, then sending that image (along with the same question) to the model’s vision endpoint.

  • The Metrics: Cost of the API call and the accuracy of the answer.
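The rendering step can be sketched with Pillow (an assumption on my part; the actual script's fonts, sizes, and layout logic may differ):

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont

# Illustrative text-to-image rendering, assuming Pillow. The default
# bitmap font, wrap heuristic, and dimensions are placeholders; a real
# script would load specific .ttf fonts at each point size tested.

def render_text_to_image(text: str, font_size: int = 12,
                         width: int = 1024, margin: int = 20) -> Image.Image:
    font = ImageFont.load_default()  # swap in a .ttf to vary fonts
    # Rough character-count wrap; a real script would measure glyphs.
    chars_per_line = (width - 2 * margin) // max(1, font_size // 2)
    lines = textwrap.wrap(text, width=chars_per_line)
    line_height = font_size + 4
    height = 2 * margin + line_height * max(1, len(lines))
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line,
                  fill="black", font=font)
    return img

img = render_text_to_image("Once upon a tide... " * 50, font_size=12)
print(img.size)
```

Note that font size matters twice: it affects legibility for the vision model and, through the rendered image's dimensions, the tile count that determines cost.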

The script systematically generated images of the text in various fonts and sizes, then compared the cost and accuracy of getting the answer via the traditional text method versus the image-based method for each model.
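For context, an image-based QA request to the Chat Completions API might be assembled along these lines (the model name, question, and image bytes below are placeholders, not the experiment's exact values):

```python
import base64

# Sketch of an image-based QA request for the OpenAI Chat Completions
# API. Model, detail level, and payload contents are illustrative.

def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "gpt-4o", detail: str = "high") -> dict:
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}",
                               "detail": detail}},
            ],
        }],
    }

req = build_vision_request(b"<png bytes here>",
                           "What was the true identity of Mr. Graeme?")
# client.chat.completions.create(**req) would then run the actual query.
print(req["model"])
```

The `detail` field is the lever behind the "low detail" results discussed below: it caps the token cost of the image at the expense of resolution.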

Glimmers of Possibility: Unexpected Efficiencies

The results, as often happens in these kinds of explorations, weren’t a simple slam dunk. Here are some key observations:

  • The “Low Detail” Trap: Using the “low detail” setting for image analysis was consistently cheaper but also consistently resulted in incorrect or nonsensical answers. High-quality input remains crucial.

  • Model Matters (A Lot): The performance and cost-effectiveness of the image-based approach varied significantly between models.

  • gpt-4o-mini: The image-based approach was consistently more expensive for this model, and its answers were always wrong.

  • gpt-4o: Was sometimes cheaper with images, but its accuracy was less reliable.

  • gpt-4.1 and o1: Here’s where things got interesting. For certain legible font sizes (8pt and 12pt), the image-based QA was not only cheaper than the text-based baseline but also yielded accurate answers in a significant number of test runs. In some of these successful instances with o1, the savings reached up to 47% compared to its text-based cost for the same task!

A snapshot of the results

Format Engineering: A New Lens on Optimization?

This small experiment hints at a potentially new dimension in how we think about interacting with LLMs. We’ve spent years focused on prompt engineering – crafting the ideal text input to guide the model’s output. Could we now be entering an era of format engineering?

Perhaps the key to optimizing cost and performance isn’t just about what we say to the AI, but how we present the information. Just as speeding up audio can be a form of format optimization for transcription, converting text to an image might, under specific circumstances and with the right models, offer a more efficient pathway for certain tasks.

Caveats and Further Exploration

It’s crucial to remember that this was a quick experiment with a specific task and a limited set of parameters. The optimal font, size, and model will likely vary with document complexity and the nature of the questions being asked. Factors like image compression and the underlying vision model architectures also play a significant role, and since LLM outputs are non-deterministic, your results may vary.

Much more research is needed to confirm the best way to optimize this approach.

However, the findings suggest that the relationship between input format, model pricing, and performance is more nuanced than simply counting text tokens. It opens up intriguing avenues for further exploration:

  • Investigating the cost-effectiveness of this approach with different types of documents (e.g., tables, code).

  • Exploring other input format transformations.

  • Understanding the underlying mechanisms in vision models that can lead to these cost efficiencies.

Explore the Code and Data

For those interested in digging deeper, the Python script used for this experiment is available at https://github.com/playgrdstar/snippet_text_2_image_compression.

The raw data and formatted results are also available in the same repository as experiment_results.csv and experiment_results.xlsx.

This was a small foray into a larger field. It seems that as multimodal AI continues to evolve, so too will our understanding of the most effective and efficient ways to interact with it. Perhaps sometimes, to save on words, the answer is a picture.