Harnessing Claude 3's Vision for Structured Data Extraction

Chapter 1: Introduction to Claude 3's Vision Capabilities

Claude 3 has revolutionized the way we convert images into useful structured outputs. This model family by Anthropic excels not only in reasoning and coding but also in generating structured data, making it an effective tool for various applications.

Up until recently, OpenAI's models were considered the gold standard for producing structured JSON outputs and function calls. However, the introduction of Claude 3 has changed the landscape significantly. These models can generate accurate JSON outputs while demonstrating impressive reasoning abilities, which is essential for function calling scenarios. I recently developed a Python package called claudetools to facilitate function calling using the Claude 3 models. For more information, feel free to check out my blog post about Claudetools.

P.S.: Claudetools can easily replace OpenAI function calling with minimal adjustments.

Vision Capabilities Overview

All models in the Claude 3 family come equipped with vision capabilities, enabling exciting multimodal interactions. In fact, their vision features rival those of the GPT-4-Vision model and even outperform it on certain benchmarks, as shown in Anthropic's published comparison table.

Thanks to these sophisticated capabilities, Claude 3 can process various visual formats, such as images, charts, graphs, and technical diagrams. You can utilize the Claudetools package for function calling with image input without needing a different model version.

Use Case: Building a RAG Bot

Imagine being tasked with creating a Retrieval-Augmented Generation (RAG) bot to answer questions based on a knowledge base derived from PDF documents. Each page of these PDFs could contain tables, images, or text, which may include charts, diagrams, or multiple objects.

Rather than relying on existing parsers—some paid and some free—to extract structured information from complex documents, we can leverage Claude 3's vision, reasoning, and data generation capabilities for this task. The Claudetools package simplifies this process, and we will develop this use case further in the following sections.

You can access the claudetools-vision-example GitHub repository for the code we will implement.

Extraction Structure

We will instruct the Claude 3 model to extract content in blocks, which can consist of text, images, or tables. Text will be extracted verbatim, while table outputs will be formatted as lists of dictionaries, with each dictionary representing a column name and its corresponding row value. Although the model is still developing its object detection capabilities, we will prompt it to describe images or provide any text contained within them.
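To make the table convention concrete, here is a small illustration (with made-up data) of how a two-column table on a page would be serialized as a list of row dictionaries:

```python
# Hypothetical example: how a simple table on a page is serialized.
# Each row becomes a dictionary mapping column name -> cell value.
table = [
    ["Product", "Price"],  # header row
    ["Notebook", "4.99"],
    ["Pen", "1.25"],
]

header, *rows = table
serialized = [dict(zip(header, row)) for row in rows]

print(serialized)
# [{'Product': 'Notebook', 'Price': '4.99'}, {'Product': 'Pen', 'Price': '1.25'}]
```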

The Pydantic schema for extracting text content from document page images is as follows:

from pydantic import BaseModel, Field
from typing import List, Dict, Union, Literal


class TextContent(BaseModel):
    content_label: Literal["text"]
    content: str = Field(..., description="Text content on the page.")


class TableContent(BaseModel):
    content_label: Literal["table"]
    table_description: Union[str, None] = Field(
        ...,
        description="Table name, table caption, or table description if available."
    )
    content: List[Dict] = Field(
        ...,
        description="A list of dictionaries where each dictionary is a column name-row value pair."
    )


class ImageContent(BaseModel):
    content_label: Literal["image"]
    content: str = Field(
        ...,
        description="Explain the content in the image if textual. If not, describe the image's content."
    )


class PageMetaInfo(BaseModel):
    meta_info: str = Field(
        ...,
        description="Meta info of the page image content. Include an analysis of the document: count of paragraphs, tables, and images."
    )


class ExtractedContent(BaseModel):
    page_meta_info: PageMetaInfo
    content: List[Union[TextContent, TableContent, ImageContent]] = Field(
        ...,
        description="Content extracted from the page image document in the order they appear."
    )

Rationale for Page Meta Info

Including a page_meta_info key at the top of the schema allows the model to take a moment to analyze the document image. Just as we reflect before making decisions, this step enables the model to assess how many paragraphs, tables, and images are present before proceeding to extract the detailed content.

Crafting the Extraction Prompt

To facilitate the basic parsing of document images and extract content, we can use a straightforward prompt:

You are a powerful content extraction engine. Given an image of the document, please extract the content in a structured manner as defined by the provided schema.

I recommend organizing prompts in a dedicated folder with a structured format.

├── prompts
│   ├── __init__.py
│   ├── prompts
│   │   └── extract_content.txt
│   └── system_prompts.py

The prompts/system_prompts.py file will read all prompt files from the prompts/prompts folder.
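One way to implement that loader is sketched below; the actual system_prompts.py in the repository may differ. Each .txt file in the prompts folder becomes an uppercase attribute (e.g. extract_content.txt becomes SYSTEM_PROMPTS.EXTRACT_CONTENT). The demo uses a throwaway directory standing in for prompts/prompts/:

```python
# Sketch of prompts/system_prompts.py: expose each .txt prompt file
# as an UPPERCASE attribute named after the file.
import os
import tempfile


class Prompts:
    """Load every .txt file in prompt_dir into an attribute."""

    def __init__(self, prompt_dir: str):
        for fname in sorted(os.listdir(prompt_dir)):
            if fname.endswith(".txt"):
                # extract_content.txt -> EXTRACT_CONTENT
                name = os.path.splitext(fname)[0].upper()
                with open(os.path.join(prompt_dir, fname)) as f:
                    setattr(self, name, f.read().strip())


# Demo with a temporary directory standing in for prompts/prompts/.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "extract_content.txt"), "w") as f:
        f.write("You are a powerful content extraction engine. ...")
    SYSTEM_PROMPTS = Prompts(d)

print(SYSTEM_PROMPTS.EXTRACT_CONTENT)
```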

Building the Extraction Engine

Now that we have the necessary schema and prompt, let's create the content extraction engine using Claudetools.

from claudetools.tools.tool import AsyncTool

from prompts.system_prompts import SYSTEM_PROMPTS
from schemas import ExtractedContent
from configs import ANTHROPIC_API_KEY

tool = AsyncTool(ANTHROPIC_API_KEY)


async def extractContentFromImage(image_str: str, media_type: str):
    user_messages = [{
        "role": "user",
        "content": [{
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": image_str
            }
        }, {
            "type": "text",
            "text": SYSTEM_PROMPTS.EXTRACT_CONTENT
        }]
    }]
    tools = [{
        "name": "extractedContent",
        "description": "Extract content from page image",
        "parameters": ExtractedContent.model_json_schema()
    }]
    tool_choice = {"name": "extractedContent"}
    return await tool(
        "claude-3-opus-20240229",
        user_messages,
        tools=tools,
        tool_choice=tool_choice,
        max_tokens=4095,
        temperature=0.2
    )

Exposing the Function via API

Let’s quickly set up a FastAPI server to expose the extraction function through a POST API that accepts a base64 image string and its media type.

import uvicorn
from pydantic import BaseModel
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from engine import extractContentFromImage

app = FastAPI()

origins = ["*"]
app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class ImageInput(BaseModel):
    image_str: str
    media_type: str


@app.get("/")
async def index():
    return {"ok": True}


@app.post("/api/extract")
async def api_extract(payload: ImageInput):
    return await extractContentFromImage(payload.image_str, payload.media_type)


if __name__ == "__main__":
    uvicorn.run("app:app", port=8890, host="0.0.0.0", reload=True)

Testing the API

With the FastAPI server operational, we can create a test file to read an image from our system, convert it to a base64 string, and determine its media type before sending it to the API.

import json
import base64
import mimetypes
from typing import Dict

import requests


def get_payload(image_path: str):
    media_type, _ = mimetypes.guess_type(image_path)
    if not media_type:
        raise Exception(f"Unable to determine the media type for {image_path}")
    try:
        with open(image_path, "rb") as image_file:
            image_data = image_file.read()
    except IOError:
        raise Exception(f"Error opening image file {image_path}")
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    return {"image_str": image_base64, "media_type": media_type}


def make_call(payload: Dict):
    response = requests.post("http://localhost:8890/api/extract", json=payload)
    return response.json()


if __name__ == "__main__":
    payload = get_payload("./images/test_img_LR_3.png")
    print(json.dumps(make_call(payload), indent=4))

Expected Output

When running the test file, we should expect results that include the model's accurate analysis of the document image, identifying and categorizing the content blocks appropriately.

Conclusion

In this blog post, we explored how to utilize the Claude 3 family's reasoning, vision, and structured output generation features for function calls involving image data through the Claudetools library. We also highlighted the simplicity of developing structured output generation and function calling workflows using Pydantic and Claudetools. With that, we conclude our discussion.
