Harnessing Claude 3's Vision for Structured Data Extraction
Chapter 1: Introduction to Claude 3's Vision Capabilities
Claude 3 has revolutionized the way we convert images into useful structured outputs. This model family by Anthropic excels not only in reasoning and coding but also in generating structured data, making it an effective tool for various applications.
Until recently, OpenAI's models were considered the gold standard for producing structured JSON outputs and function calls. The introduction of Claude 3, however, has changed the landscape significantly: these models generate accurate JSON while demonstrating impressive reasoning abilities, which is essential for function-calling scenarios. I recently developed a Python package called claudetools to facilitate function calling with the Claude 3 models; for more details, feel free to check out my blog post about claudetools.
P.S.: claudetools can easily replace OpenAI function calling with minimal adjustments.
Vision Capabilities Overview
All models in the Claude 3 family come equipped with vision capabilities, enabling exciting multimodal interactions. In fact, their vision features rival those of the GPT-4-Vision model and, according to Anthropic's published benchmark results, even outperform it on several vision tasks.
Thanks to these sophisticated capabilities, Claude 3 can process various visual formats, such as images, charts, graphs, and technical diagrams. You can use the claudetools package for function calling with image inputs without switching to a different model version.
Use Case: Building a RAG Bot
Imagine being tasked with creating a Retrieval-Augmented Generation (RAG) bot to answer questions based on a knowledge base derived from PDF documents. Each page of these PDFs could contain tables, images, or text, which may include charts, diagrams, or multiple objects.
Rather than relying on existing parsers (some paid, some free) to extract structured information from complex documents, we can leverage Claude 3's vision, reasoning, and data-generation capabilities for this task. The claudetools package simplifies this process, and we will build out this use case in the following sections.
You can access the claudetools-vision-example GitHub repository for the code we will implement.
Extraction Structure
We will instruct the Claude 3 model to extract content in blocks, where each block is text, an image, or a table. Text is extracted verbatim, while tables are output as lists of dictionaries, with each dictionary mapping column names to that row's values. Because the model's object detection capabilities are still maturing, we will prompt it to describe images or transcribe any text they contain.
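For instance, a simple two-row table would be emitted as (illustrative values, not taken from a real document):

[
    {"Quarter": "Q1", "Revenue": "4.1M"},
    {"Quarter": "Q2", "Revenue": "3.2M"}
]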
The Pydantic schema for extracting text content from document page images is as follows:
from pydantic import BaseModel, Field
from typing import List, Dict, Union, Literal

class TextContent(BaseModel):
    content_label: Literal["text"]
    content: str = Field(..., description="Text content on the page.")

class TableContent(BaseModel):
    content_label: Literal["table"]
    table_description: Union[str, None] = Field(
        ...,
        description="Table name, table caption, or table description if available."
    )
    content: List[Dict] = Field(
        ...,
        description="A list of dictionaries where each dictionary is a column name-row value pair."
    )

class ImageContent(BaseModel):
    content_label: Literal["image"]
    content: str = Field(
        ...,
        description="Explain the content in the image if textual. If not, describe the image's content."
    )

class PageMetaInfo(BaseModel):
    meta_info: str = Field(
        ...,
        description="Meta info of the page image content. Include an analysis of the document: count of paragraphs, tables, and images."
    )

class ExtractedContent(BaseModel):
    page_meta_info: PageMetaInfo
    content: List[Union[TextContent, TableContent, ImageContent]] = Field(
        ...,
        description="Content extracted from the page image document in the order they appear."
    )
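As a quick sanity check (a minimal sketch with made-up values; not part of the extraction pipeline itself), the schema can be exercised directly with Pydantic v2's model_validate:

sample = {
    "page_meta_info": {"meta_info": "The page contains 1 paragraph, 1 table, and 0 images."},
    "content": [
        {"content_label": "text", "content": "Quarterly revenue grew 12% year over year."},
        {
            "content_label": "table",
            "table_description": "Revenue by quarter",
            "content": [{"Quarter": "Q1", "Revenue": "4.1M"}]
        }
    ]
}

# Raises a ValidationError if the payload deviates from the schema;
# the content_label literals let Pydantic resolve each union member.
extracted = ExtractedContent.model_validate(sample)
print(type(extracted.content[1]).__name__)  # TableContent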
Rationale for Page Meta Info
Including a page_meta_info key at the top of the schema allows the model to take a moment to analyze the document image. Just as we reflect before making decisions, this step enables the model to assess how many paragraphs, tables, and images are present before proceeding to extract the detailed content.
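For a typical report page, this field might come back as something like (illustrative only):

{"meta_info": "The page contains 3 paragraphs, 1 table, and 2 images, arranged in a two-column layout."}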
Crafting the Extraction Prompt
To facilitate the basic parsing of document images and extract content, we can use a straightforward prompt:
You are a powerful content extraction engine. Given an image of the document, please extract the content in a structured manner as defined by the provided schema.
I recommend organizing prompts in a dedicated folder with a structured format.
├── prompts
│   ├── __init__.py
│   ├── prompts
│   │   └── extract_content.txt
│   └── system_prompts.py
The prompts/system_prompts.py file will read all prompt files from the prompts/prompts folder.
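One possible implementation (a sketch; the actual file in the repository may differ) loads every .txt prompt into a simple namespace so each prompt can be referenced as an attribute, e.g. SYSTEM_PROMPTS.EXTRACT_CONTENT:

import os
from types import SimpleNamespace

# Directory containing the raw .txt prompt files (prompts/prompts).
_PROMPTS_DIR = os.path.join(os.path.dirname(__file__), "prompts")

def _load_prompts() -> SimpleNamespace:
    prompts = {}
    for file_name in os.listdir(_PROMPTS_DIR):
        if file_name.endswith(".txt"):
            # extract_content.txt -> EXTRACT_CONTENT
            key = os.path.splitext(file_name)[0].upper()
            with open(os.path.join(_PROMPTS_DIR, file_name), "r") as f:
                prompts[key] = f.read().strip()
    return SimpleNamespace(**prompts)

SYSTEM_PROMPTS = _load_prompts()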
Building the Extraction Engine
Now that we have the necessary schema and prompt, let's create the content extraction engine using claudetools.
from claudetools.tools.tool import AsyncTool
from prompts.system_prompts import SYSTEM_PROMPTS
from schemas import ExtractedContent
from configs import ANTHROPIC_API_KEY

tool = AsyncTool(ANTHROPIC_API_KEY)

async def extractContentFromImage(image_str: str, media_type: str):
    # Pass the base64-encoded page image together with the extraction prompt.
    user_messages = [{
        "role": "user",
        "content": [{
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": image_str
            }
        }, {
            "type": "text",
            "text": SYSTEM_PROMPTS.EXTRACT_CONTENT
        }]
    }]
    # Describe the extraction "function" using the Pydantic schema defined above.
    tools = [{
        "name": "extractedContent",
        "description": "Extract content from page image",
        "parameters": ExtractedContent.model_json_schema()
    }]
    # Force the model to call our extraction function.
    tool_choice = {"name": "extractedContent"}
    return await tool("claude-3-opus-20240229",
                      user_messages,
                      tools=tools,
                      tool_choice=tool_choice,
                      max_tokens=4095,
                      temperature=0.2)
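For a quick standalone check (a minimal sketch; the image path is a placeholder), the coroutine can be driven with asyncio:

import asyncio
import base64

async def main():
    # Hypothetical local page image; any PNG or JPEG page scan works.
    with open("./images/page_1.png", "rb") as f:
        image_str = base64.b64encode(f.read()).decode("utf-8")
    result = await extractContentFromImage(image_str, "image/png")
    print(result)

asyncio.run(main())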
Exposing the Function via API
Let’s quickly set up a FastAPI server to expose the extraction function through a POST API that accepts a base64 image string and its media type.
import uvicorn
from pydantic import BaseModel
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from engine import extractContentFromImage

app = FastAPI()

origins = ["*"]
app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

class ImageInput(BaseModel):
    image_str: str
    media_type: str

@app.get("/")
async def index():
    return {"ok": True}

@app.post("/api/extract")
async def api_extract(payload: ImageInput):
    return await extractContentFromImage(payload.image_str, payload.media_type)

if __name__ == "__main__":
    uvicorn.run("app:app", port=8890, host="0.0.0.0", reload=True)
Testing the API
With the FastAPI server operational, we can create a test file to read an image from our system, convert it to a base64 string, and determine its media type before sending it to the API.
import json
import base64
import requests
import mimetypes
from typing import Dict

def get_payload(image_path: str):
    # Infer the media type (e.g. image/png) from the file extension.
    media_type, _ = mimetypes.guess_type(image_path)
    if not media_type:
        raise Exception(f"Unable to determine the media type for {image_path}")
    try:
        with open(image_path, "rb") as image_file:
            image_data = image_file.read()
    except IOError:
        raise Exception(f"Error opening image file {image_path}")
    image_base64 = base64.b64encode(image_data).decode('utf-8')
    return {"image_str": image_base64, "media_type": media_type}

def make_call(payload: Dict):
    response = requests.post("http://localhost:8890/api/extract", json=payload)
    data = response.json()
    return data

if __name__ == "__main__":
    payload = get_payload("./images/test_img_LR_3.png")
    print(json.dumps(make_call(payload), indent=4))
Expected Output
Running the test file should return the model's analysis of the document image, with each content block identified, categorized, and ordered as it appears on the page.
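For a page containing a heading, one table, and one chart, the response might look roughly like this (illustrative values, not actual model output):

{
    "page_meta_info": {
        "meta_info": "The page contains 2 paragraphs, 1 table, and 1 image."
    },
    "content": [
        {"content_label": "text", "content": "Annual Report 2023 ..."},
        {
            "content_label": "table",
            "table_description": "Revenue by region",
            "content": [
                {"Region": "EMEA", "Revenue": "4.1M"},
                {"Region": "APAC", "Revenue": "3.2M"}
            ]
        },
        {"content_label": "image", "content": "A bar chart comparing quarterly revenue across regions."}
    ]
}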
Conclusion
In this blog post, we explored how to use the Claude 3 family's reasoning, vision, and structured-output features for function calls involving image data through the claudetools library. We also highlighted how simple it is to build structured output generation and function-calling workflows with Pydantic and claudetools.