Harnessing Claude 3's Vision for Structured Data Extraction
Chapter 1: Introduction to Claude 3's Vision Capabilities
Claude 3 has revolutionized the way we convert images into useful structured outputs. This model family by Anthropic excels not only in reasoning and coding but also in generating structured data, making it an effective tool for various applications.
Until recently, OpenAI's models were considered the gold standard for producing structured JSON outputs and function calls. The introduction of Claude 3, however, has changed the landscape significantly: these models generate accurate JSON while demonstrating impressive reasoning abilities, which is essential for function-calling scenarios. I recently developed a Python package called claudetools to facilitate function calling with the Claude 3 models; for more details, feel free to check out my blog post about claudetools.
P.S.: claudetools can easily replace OpenAI function calling with minimal adjustments.
Vision Capabilities Overview
All models in the Claude 3 family come equipped with vision capabilities, enabling exciting multimodal interactions. In fact, their vision features rival those of the GPT-4-Vision model and, according to Anthropic's published benchmark results, even outperform it on several vision tasks.
Thanks to these sophisticated capabilities, Claude 3 can process various visual formats, such as images, charts, graphs, and technical diagrams. You can use the claudetools package for function calling with image inputs without switching to a different model version.
Use Case: Building a RAG Bot
Imagine being tasked with creating a Retrieval-Augmented Generation (RAG) bot to answer questions based on a knowledge base derived from PDF documents. Each page of these PDFs could contain tables, images, or text, which may include charts, diagrams, or multiple objects.
Rather than relying on existing parsers (some paid, some free) to extract structured information from complex documents, we can leverage Claude 3's vision, reasoning, and data-generation capabilities for this task. The claudetools package simplifies this process, and we will build out this use case in the following sections.
You can access the claudetools-vision-example GitHub repository for the code we will implement.
Extraction Structure
We will instruct the Claude 3 model to extract content in blocks, where each block is text, an image, or a table. Text is extracted verbatim, while tables are output as lists of dictionaries, with each dictionary mapping column names to that row's values. Because the model's object detection capabilities are still maturing, we will prompt it to describe images or transcribe any text they contain.
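For instance, a simple two-row table would be emitted as (illustrative values, not taken from a real document):

[
    {"Quarter": "Q1", "Revenue": "4.1M"},
    {"Quarter": "Q2", "Revenue": "3.2M"}
]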
The Pydantic schema for extracting text content from document page images is as follows:
from pydantic import BaseModel, Field
from typing import List, Dict, Union, Literal

class TextContent(BaseModel):
    content_label: Literal["text"]
    content: str = Field(..., description="Text content on the page.")

class TableContent(BaseModel):
    content_label: Literal["table"]
    table_description: Union[str, None] = Field(
        ...,
        description="Table name, table caption, or table description if available."
    )
    content: List[Dict] = Field(
        ...,
        description="A list of dictionaries where each dictionary is a column name-row value pair."
    )

class ImageContent(BaseModel):
    content_label: Literal["image"]
    content: str = Field(
        ...,
        description="Explain the content in the image if textual. If not, describe the image's content."
    )

class PageMetaInfo(BaseModel):
    meta_info: str = Field(
        ...,
        description="Meta info of the page image content. Include an analysis of the document: count of paragraphs, tables, and images."
    )

class ExtractedContent(BaseModel):
    page_meta_info: PageMetaInfo
    content: List[Union[TextContent, TableContent, ImageContent]] = Field(
        ...,
        description="Content extracted from the page image document in the order they appear."
    )
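As a quick sanity check (a minimal sketch with made-up values; not part of the extraction pipeline itself), the schema can be exercised directly with Pydantic v2's model_validate:

sample = {
    "page_meta_info": {"meta_info": "The page contains 1 paragraph, 1 table, and 0 images."},
    "content": [
        {"content_label": "text", "content": "Quarterly revenue grew 12% year over year."},
        {
            "content_label": "table",
            "table_description": "Revenue by quarter",
            "content": [{"Quarter": "Q1", "Revenue": "4.1M"}]
        }
    ]
}

# Raises a ValidationError if the payload deviates from the schema;
# the content_label literals let Pydantic resolve each union member.
extracted = ExtractedContent.model_validate(sample)
print(type(extracted.content[1]).__name__)  # TableContent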
Rationale for Page Meta Info
Including a page_meta_info key at the top of the schema allows the model to take a moment to analyze the document image. Just as we reflect before making decisions, this step enables the model to assess how many paragraphs, tables, and images are present before proceeding to extract the detailed content.
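For a typical report page, this field might come back as something like (illustrative only):

{"meta_info": "The page contains 3 paragraphs, 1 table, and 2 images, arranged in a two-column layout."}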
Crafting the Extraction Prompt
To facilitate the basic parsing of document images and extract content, we can use a straightforward prompt:
You are a powerful content extraction engine. Given an image of the document, please extract the content in a structured manner as defined by the provided schema.
I recommend organizing prompts in a dedicated folder with a structured format.
├── prompts
│   ├── __init__.py
│   ├── prompts
│   │   └── extract_content.txt
│   └── system_prompts.py
The prompts/system_prompts.py file will read all prompt files from the prompts/prompts folder.
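One possible implementation (a sketch; the actual file in the repository may differ) loads every .txt prompt into a simple namespace so each prompt can be referenced as an attribute, e.g. SYSTEM_PROMPTS.EXTRACT_CONTENT:

import os
from types import SimpleNamespace

# Directory containing the raw .txt prompt files (prompts/prompts).
_PROMPTS_DIR = os.path.join(os.path.dirname(__file__), "prompts")

def _load_prompts() -> SimpleNamespace:
    prompts = {}
    for file_name in os.listdir(_PROMPTS_DIR):
        if file_name.endswith(".txt"):
            # extract_content.txt -> EXTRACT_CONTENT
            key = os.path.splitext(file_name)[0].upper()
            with open(os.path.join(_PROMPTS_DIR, file_name), "r") as f:
                prompts[key] = f.read().strip()
    return SimpleNamespace(**prompts)

SYSTEM_PROMPTS = _load_prompts()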
Building the Extraction Engine
Now that we have the necessary schema and prompt, let's create the content extraction engine using claudetools.
from claudetools.tools.tool import AsyncTool
from prompts.system_prompts import SYSTEM_PROMPTS
from schemas import ExtractedContent
from configs import ANTHROPIC_API_KEY

tool = AsyncTool(ANTHROPIC_API_KEY)

async def extractContentFromImage(image_str: str, media_type: str):
    # Pass the base64-encoded page image together with the extraction prompt.
    user_messages = [{
        "role": "user",
        "content": [{
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": image_str
            }
        }, {
            "type": "text",
            "text": SYSTEM_PROMPTS.EXTRACT_CONTENT
        }]
    }]
    # Describe the extraction "function" using the Pydantic schema defined above.
    tools = [{
        "name": "extractedContent",
        "description": "Extract content from page image",
        "parameters": ExtractedContent.model_json_schema()
    }]
    # Force the model to call our extraction function.
    tool_choice = {"name": "extractedContent"}
    return await tool("claude-3-opus-20240229",
                      user_messages,
                      tools=tools,
                      tool_choice=tool_choice,
                      max_tokens=4095,
                      temperature=0.2)
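For a quick standalone check (a minimal sketch; the image path is a placeholder), the coroutine can be driven with asyncio:

import asyncio
import base64

async def main():
    # Hypothetical local page image; any PNG or JPEG page scan works.
    with open("./images/page_1.png", "rb") as f:
        image_str = base64.b64encode(f.read()).decode("utf-8")
    result = await extractContentFromImage(image_str, "image/png")
    print(result)

asyncio.run(main())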
Exposing the Function via API
Let’s quickly set up a FastAPI server to expose the extraction function through a POST API that accepts a base64 image string and its media type.
import uvicorn
from pydantic import BaseModel
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from engine import extractContentFromImage

app = FastAPI()

origins = ["*"]
app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

class ImageInput(BaseModel):
    image_str: str
    media_type: str

@app.get("/")
async def index():
    return {"ok": True}

@app.post("/api/extract")
async def api_extract(payload: ImageInput):
    return await extractContentFromImage(payload.image_str, payload.media_type)

if __name__ == "__main__":
    uvicorn.run("app:app", port=8890, host="0.0.0.0", reload=True)
Testing the API
With the FastAPI server operational, we can create a test file to read an image from our system, convert it to a base64 string, and determine its media type before sending it to the API.
import json
import base64
import requests
import mimetypes
from typing import Dict

def get_payload(image_path: str):
    # Infer the media type (e.g. image/png) from the file extension.
    media_type, _ = mimetypes.guess_type(image_path)
    if not media_type:
        raise Exception(f"Unable to determine the media type for {image_path}")
    try:
        with open(image_path, "rb") as image_file:
            image_data = image_file.read()
    except IOError:
        raise Exception(f"Error opening image file {image_path}")
    image_base64 = base64.b64encode(image_data).decode('utf-8')
    return {"image_str": image_base64, "media_type": media_type}

def make_call(payload: Dict):
    response = requests.post("http://localhost:8890/api/extract", json=payload)
    data = response.json()
    return data

if __name__ == "__main__":
    payload = get_payload("./images/test_img_LR_3.png")
    print(json.dumps(make_call(payload), indent=4))
Expected Output
Running the test file should return the model's analysis of the document image, with each content block identified, categorized, and ordered as it appears on the page.
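For a page containing a heading, one table, and one chart, the response might look roughly like this (illustrative values, not actual model output):

{
    "page_meta_info": {
        "meta_info": "The page contains 2 paragraphs, 1 table, and 1 image."
    },
    "content": [
        {"content_label": "text", "content": "Annual Report 2023 ..."},
        {
            "content_label": "table",
            "table_description": "Revenue by region",
            "content": [
                {"Region": "EMEA", "Revenue": "4.1M"},
                {"Region": "APAC", "Revenue": "3.2M"}
            ]
        },
        {"content_label": "image", "content": "A bar chart comparing quarterly revenue across regions."}
    ]
}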
Conclusion
In this blog post, we explored how to use the Claude 3 family's reasoning, vision, and structured-output features for function calls involving image data through the claudetools library. We also highlighted how simple it is to build structured output generation and function-calling workflows with Pydantic and claudetools.