
What is Schema Projection?

Schema Projection is DocIntell’s core differentiator: instead of receiving the entire raw OCR output, you define exactly which fields you need and get back only that structured data.

The Problem with Traditional OCR

Traditional OCR APIs return everything they extract - bounding boxes, confidence scores, page coordinates - resulting in massive payloads:
// Traditional OCR: 50-page invoice → 45MB response
{
  "pages": [
    {
      "page_number": 1,
      "text_annotations": [
        {
          "description": "Invoice",
          "bounding_poly": {"vertices": [...]},
          "confidence": 0.99
        },
        // ... thousands more annotations
      ]
    },
    // ... 49 more pages
  ]
}

DocIntell’s Approach: Schema Projection

With DocIntell, you define which fields matter and get back structured data:
// DocIntell: Same 50-page invoice → 2KB response (20-2000x smaller)
{
  "document_id": "0194e123-4567-7890-abcd-ef1234567890",
  "document_type": "invoice",
  "view": "accounting_v1",
  "data": {
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "due_date": "2024-02-15",
    "vendor_name": "Acme Corporation",
    "total_amount": 15432.50,
    "line_items": [
      {
        "description": "Professional Services",
        "quantity": 80,
        "unit_price": 192.91,
        "total": 15432.50
      }
    ]
  }
}
Key Benefits:
  • 20-2000x smaller payloads - Only the data you need, nothing more
  • Ingest once, query many ways - Create multiple views for the same document
  • Type-safe schemas - Well-defined field types with validation
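
Conceptually, a projection keeps only the named fields from a full extraction result. The sketch below illustrates the idea locally. The `project` function and the sample record are hypothetical and are not DocIntell's server-side implementation.

```python
import json


def project(record: dict, fields: list) -> dict:
    """Return only the requested fields, in the order they were requested."""
    return {name: record[name] for name in fields if name in record}


# Hypothetical full extraction result for one invoice.
full_record = {
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "due_date": "2024-02-15",
    "vendor_name": "Acme Corporation",
    "vendor_address": "123 Main St, Anytown, CA 94000",
    "total_amount": 15432.50,
    "currency": "USD",
}

# A two-field "summary" projection of the same record.
summary = project(full_record, ["invoice_number", "total_amount"])
print(json.dumps(summary))
```

The full record is stored once, and each consumer asks for its own subset, which is the "ingest once, query many ways" idea.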

Discover Available Document Types

Before creating views, discover what document types DocIntell supports and what fields are available for extraction.

List All Document Types

Get a high-level overview of all supported document types:
curl -X GET https://api.docintell.com/v1/schemas \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response:
{
  "schemas": [
    {
      "document_type": "capital_call",
      "name": "Capital Call Notice",
      "category": "fund_operations",
      "description": "Capital calls for fund contributions",
      "schema_version": "v1",
      "field_count": 12
    },
    {
      "document_type": "invoice",
      "name": "Invoice",
      "category": "accounting",
      "description": "Vendor invoices and bills",
      "schema_version": "v1",
      "field_count": 18
    },
    {
      "document_type": "k1",
      "name": "Schedule K-1",
      "category": "tax",
      "description": "IRS Schedule K-1 tax forms",
      "schema_version": "v1",
      "field_count": 24
    }
  ]
}

Get Full Schema Definition

Retrieve the complete field definitions for a specific document type:
curl -X GET https://api.docintell.com/v1/schemas/invoice \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response:
{
  "document_type": "invoice",
  "name": "Invoice",
  "category": "accounting",
  "description": "Vendor invoices and bills",
  "schema_version": "v1",
  "fields": [
    {
      "field_name": "invoice_number",
      "field_type": "string",
      "severity": "hard",
      "is_nullable": false,
      "description": "Unique invoice identifier",
      "pattern": null
    },
    {
      "field_name": "invoice_date",
      "field_type": "date",
      "severity": "hard",
      "is_nullable": false,
      "description": "Date invoice was issued",
      "pattern": null
    },
    {
      "field_name": "due_date",
      "field_type": "date",
      "severity": "soft",
      "is_nullable": true,
      "description": "Payment due date",
      "pattern": null
    },
    {
      "field_name": "vendor_name",
      "field_type": "string",
      "severity": "hard",
      "is_nullable": false,
      "description": "Name of the vendor/supplier",
      "pattern": null
    },
    {
      "field_name": "vendor_address",
      "field_type": "string",
      "severity": "soft",
      "is_nullable": true,
      "description": "Vendor's billing address",
      "pattern": null
    },
    {
      "field_name": "total_amount",
      "field_type": "monetary",
      "severity": "hard",
      "is_nullable": false,
      "description": "Total invoice amount including tax",
      "pattern": null
    },
    {
      "field_name": "subtotal",
      "field_type": "monetary",
      "severity": "soft",
      "is_nullable": true,
      "description": "Subtotal before tax",
      "pattern": null
    },
    {
      "field_name": "tax_amount",
      "field_type": "monetary",
      "severity": "soft",
      "is_nullable": true,
      "description": "Total tax amount",
      "pattern": null
    },
    {
      "field_name": "currency",
      "field_type": "string",
      "severity": "soft",
      "is_nullable": true,
      "description": "Currency code (e.g., USD, EUR)",
      "pattern": "^[A-Z]{3}$"
    },
    {
      "field_name": "line_items",
      "field_type": "array",
      "severity": "soft",
      "is_nullable": true,
      "description": "Invoice line items with descriptions and amounts",
      "pattern": null
    }
  ],
  "validations": [
    {
      "name": "total_equals_subtotal_plus_tax",
      "severity": "soft",
      "message": "Total should equal subtotal plus tax",
      "fields_involved": ["total_amount", "subtotal", "tax_amount"]
    }
  ]
}
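
The `validations` entry above describes a cross-field rule. As a rough illustration of what such a rule checks, the sketch below compares the three fields client-side. The function name, the sample subtotal and tax values, and the one-cent tolerance are assumptions; how DocIntell evaluates the rule internally is not specified here.

```python
from decimal import Decimal


def total_matches(subtotal, tax_amount, total_amount, tolerance=Decimal("0.01")):
    """Return True when subtotal + tax is within `tolerance` of the total."""
    # Convert via str() so float inputs do not carry binary rounding noise.
    parts = Decimal(str(subtotal)) + Decimal(str(tax_amount))
    return abs(parts - Decimal(str(total_amount))) <= tolerance


# Hypothetical values: 14000.00 + 1432.50 equals the 15432.50 total.
print(total_matches("14000.00", "1432.50", "15432.50"))

# A mismatch of 32.50 exceeds the tolerance.
print(total_matches("14000.00", "1400.00", "15432.50"))
```

Because the rule has severity `soft`, a mismatch is a signal to review the document rather than a reason to reject it.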

Understanding Field Definitions

Field         Description
field_name    Field identifier (snake_case); use this in views
field_type    Data type: string, decimal, date, monetary, boolean, integer, array
severity      hard = required field (extraction fails if missing); soft = optional field (extraction continues if missing)
is_nullable   Whether the field can be null even if present
description   Human-readable explanation of the field
pattern       Regex validation pattern (if applicable)
Field Severity Matters:
  • Hard fields are critical and must be present for extraction to succeed
  • Soft fields are nice-to-have and won’t fail extraction if missing
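
To see at a glance which fields can fail an extraction, you can split a schema response by severity. This is a minimal sketch that assumes the response shape shown above. The `split_by_severity` helper and the trimmed sample schema are illustrative only.

```python
def split_by_severity(schema: dict) -> tuple:
    """Return (hard_field_names, soft_field_names) from a schema response."""
    hard = [f["field_name"] for f in schema["fields"] if f["severity"] == "hard"]
    soft = [f["field_name"] for f in schema["fields"] if f["severity"] == "soft"]
    return hard, soft


# Trimmed copy of the invoice schema response (only the keys used here).
schema = {
    "fields": [
        {"field_name": "invoice_number", "severity": "hard"},
        {"field_name": "due_date", "severity": "soft"},
        {"field_name": "total_amount", "severity": "hard"},
        {"field_name": "currency", "severity": "soft"},
    ]
}

hard, soft = split_by_severity(schema)
print("hard:", hard)
print("soft:", soft)
```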

Create Custom Views

Views define which fields you want to retrieve when querying document data. Think of a view as the column list of a SQL SELECT statement: it projects the extracted data down to the fields you name.

Why Use Views?

Multiple Use Cases

Create different views for accounting, compliance, and auditing teams - all from the same extraction.

Reduced Payload Size

Only retrieve the fields you need. A “quick summary” view might return 5 fields instead of 50.

Separation of Concerns

Different teams see different data without re-processing the document.

Version Control

Name views like “accounting_v1” and “accounting_v2” to manage schema evolution.

Creating a View

Create a view by specifying the document type and which fields to include:
curl -X POST https://api.docintell.com/v1/views \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type": "invoice",
    "name": "accounting_v1",
    "description": "Fields needed for accounts payable processing",
    "fields": [
      "invoice_number",
      "invoice_date",
      "due_date",
      "vendor_name",
      "total_amount",
      "currency"
    ],
    "is_default": true
  }'
Response:
{
  "view_id": "0194e456-7890-7abc-def0-123456789abc",
  "document_type": "invoice",
  "name": "accounting_v1",
  "description": "Fields needed for accounts payable processing",
  "fields": [
    "invoice_number",
    "invoice_date",
    "due_date",
    "vendor_name",
    "total_amount",
    "currency"
  ],
  "is_default": true,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:00Z"
}

Default Views

Set is_default: true to make a view the default for its document type. When you query document data without specifying a view, the default view is used.
Only one default view per document type. Setting a new default automatically unsets the previous one.

List Your Views

See all views you’ve created:
curl -X GET https://api.docintell.com/v1/views \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
Filter by document type:
curl -X GET "https://api.docintell.com/v1/views?document_type=invoice" \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"

Update a View

Modify an existing view (fields, description, or default status):
curl -X PUT https://api.docintell.com/v1/views/0194e456-7890-7abc-def0-123456789abc \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "fields": [
      "invoice_number",
      "invoice_date",
      "vendor_name",
      "total_amount",
      "line_items"
    ],
    "description": "Updated to include line items for detailed analysis"
  }'
View names cannot be changed after creation. If you need a different name, create a new view and delete the old one.

Delete a View

Remove a view you no longer need:
curl -X DELETE https://api.docintell.com/v1/views/0194e456-7890-7abc-def0-123456789abc \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response: 204 No Content

Query Data with Views

Once you’ve created views, use them to retrieve extracted document data filtered to exactly the fields you need.

Query with a Specific View

Retrieve document data using a named view:
curl -X GET "https://api.docintell.com/v1/documents/0194e123-4567-7890-abcd-ef1234567890/data?view=accounting_v1" \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response:
{
  "document_id": "0194e123-4567-7890-abcd-ef1234567890",
  "document_type": "invoice",
  "view": "accounting_v1",
  "data": {
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "due_date": "2024-02-15",
    "vendor_name": "Acme Corporation",
    "total_amount": 15432.50,
    "currency": "USD"
  },
  "field_metadata": null
}

Query with Default View

If you don’t specify a view, the default view for the document type is used:
# Uses the default view for the document type
curl -X GET "https://api.docintell.com/v1/documents/0194e123-4567-7890-abcd-ef1234567890/data" \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
If no default view exists, all fields are returned.
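
The view resolution rules described so far (explicit view, then default view, then all fields) can be modelled in a few lines. This is a local sketch of the documented behavior, not the service's code. `resolve_fields` and the in-memory `views` mapping are hypothetical.

```python
def resolve_fields(all_fields, views, requested=None):
    """Pick the field list for a query.

    views maps a view name to {"fields": [...], "is_default": bool}.
    """
    if requested is not None:
        if requested not in views:
            # Mirrors the 404 "View not found" response.
            raise KeyError(f"View not found: {requested!r}")
        return views[requested]["fields"]
    for view in views.values():
        if view["is_default"]:
            return view["fields"]
    # No view requested and no default defined: return every field.
    return list(all_fields)


all_fields = ["invoice_number", "invoice_date", "vendor_name", "total_amount", "currency"]
views = {
    "accounting_v1": {"fields": ["invoice_number", "total_amount", "currency"], "is_default": True},
    "summary_v1": {"fields": ["invoice_number", "total_amount"], "is_default": False},
}

print(resolve_fields(all_fields, views, requested="summary_v1"))
print(resolve_fields(all_fields, views))
print(resolve_fields(all_fields, {}))
```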

Include Field Metadata

Get additional metadata for each field (confidence scores, page numbers, etc.):
curl -X GET "https://api.docintell.com/v1/documents/0194e123-4567-7890-abcd-ef1234567890/data?view=accounting_v1&include_metadata=true" \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response:
{
  "document_id": "0194e123-4567-7890-abcd-ef1234567890",
  "document_type": "invoice",
  "view": "accounting_v1",
  "data": {
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "vendor_name": "Acme Corporation",
    "total_amount": 15432.50
  },
  "field_metadata": {
    "invoice_number": {
      "confidence": 0.99,
      "page_number": 1,
      "bounding_box": {
        "x": 450,
        "y": 120,
        "width": 180,
        "height": 24
      }
    },
    "total_amount": {
      "confidence": 0.97,
      "page_number": 1,
      "bounding_box": {
        "x": 650,
        "y": 800,
        "width": 120,
        "height": 20
      }
    }
  }
}
Field metadata is returned only when you pass include_metadata=true. It is omitted by default (field_metadata is null) to reduce payload size.
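
One common use of field metadata is flagging low-confidence values for human review. The sketch below assumes the response shape shown above. The `low_confidence_fields` helper and the 0.98 threshold are arbitrary choices for illustration.

```python
def low_confidence_fields(response: dict, threshold: float = 0.98) -> list:
    """Return names of fields whose confidence falls below `threshold`."""
    # field_metadata is null when include_metadata was not requested.
    metadata = response.get("field_metadata") or {}
    return [name for name, meta in metadata.items() if meta["confidence"] < threshold]


# Trimmed copy of the metadata response above.
response = {
    "data": {"invoice_number": "INV-2024-001", "total_amount": 15432.50},
    "field_metadata": {
        "invoice_number": {"confidence": 0.99, "page_number": 1},
        "total_amount": {"confidence": 0.97, "page_number": 1},
    },
}

print(low_confidence_fields(response))
```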

Query the Same Document with Different Views

This is where Schema Projection shines - query the same document multiple ways:
Accounting View (6 fields for AP processing):
curl -X GET "https://api.docintell.com/v1/documents/{id}/data?view=accounting_v1" \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
{
  "view": "accounting_v1",
  "data": {
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "due_date": "2024-02-15",
    "vendor_name": "Acme Corporation",
    "total_amount": 15432.50,
    "currency": "USD"
  }
}

Compliance View (8 fields for audit trail):
curl -X GET "https://api.docintell.com/v1/documents/{id}/data?view=compliance_v1" \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
{
  "view": "compliance_v1",
  "data": {
    "invoice_number": "INV-2024-001",
    "vendor_name": "Acme Corporation",
    "vendor_address": "123 Main St, Anytown, CA 94000",
    "vendor_tax_id": "12-3456789",
    "payment_terms": "Net 30",
    "purchase_order": "PO-2024-056",
    "approved_by": "Jane Smith",
    "approval_date": "2024-01-14"
  }
}
Same document, same extraction, different views - no re-processing.

Best Practices

1. Create Views for Each Use Case

Don’t use a single “all fields” view for everything. Create specific views for each team or workflow:

Accounting Team

accounting_v1: invoice_number, vendor_name, total_amount, due_date

Compliance Team

compliance_v1: vendor_tax_id, payment_terms, approved_by, approval_date

Audit Team

audit_v1: All financial fields + approval workflow fields

Quick Summary

summary_v1: Just 3-5 key fields for dashboards

2. Use Version Suffixes in View Names

Plan for schema evolution by versioning your views:
accounting_v1  → First version
accounting_v2  → Added line_items field
accounting_v3  → Added tax_breakdown field
This allows you to:
  • Migrate gradually - New code uses v2, old code continues using v1
  • A/B test schema changes - Compare v1 vs v2 side-by-side
  • Roll back if needed - Switch back to v1 if v2 has issues
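
If client code needs to find the newest version of a view, the suffix can be parsed numerically. This is a sketch that assumes names follow the `<base>_v<number>` pattern used above. `latest_view` is a hypothetical helper, and the list of names would come from GET /v1/views.

```python
import re

# Matches names such as "accounting_v2" and captures the base and the number.
_VERSIONED = re.compile(r"^(?P<base>.+)_v(?P<version>\d+)$")


def latest_view(names, base):
    """Return the highest-numbered view name for `base`, or None."""
    best = None
    for name in names:
        match = _VERSIONED.match(name)
        if match and match.group("base") == base:
            version = int(match.group("version"))
            if best is None or version > best[0]:
                best = (version, name)
    return best[1] if best else None


names = ["accounting_v1", "accounting_v2", "accounting_v10", "compliance_v1"]
print(latest_view(names, "accounting"))
```

Comparing the suffix as an integer matters once versions reach two digits: a plain string sort would rank `accounting_v2` above `accounting_v10`.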

3. Set Default Views for Common Queries

Make your most common view the default:
{
  "name": "accounting_v1",
  "is_default": true
}
This simplifies client code:
# No need to specify a view - the default is used.
# BASE_URL and headers are defined as in the complete example.
response = requests.get(f"{BASE_URL}/documents/{doc_id}/data", headers=headers)

4. Validate Fields Before Creating Views

Always check the schema first to ensure your fields exist:
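import requests

BASE_URL = "https://api.docintell.com/v1"
headers = {"Authorization": "Bearer dk_live_YOUR_API_KEY"}

# 1. Get the schema and collect its field names
schema = requests.get(f"{BASE_URL}/schemas/invoice", headers=headers).json()
available_fields = {f["field_name"] for f in schema["fields"]}

# 2. Check the fields you want against the schema
desired_fields = ["invoice_number", "total_amount", "line_items"]
invalid_fields = [f for f in desired_fields if f not in available_fields]

if invalid_fields:
    print(f"Invalid fields: {invalid_fields}")
else:
    # 3. Create the view
    requests.post(f"{BASE_URL}/views", headers=headers, json={
        "document_type": "invoice",
        "fields": desired_fields,
        # ... remaining view properties (see Creating a View)
    })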

5. Use include_metadata Sparingly

Only request field metadata when you actually need it (e.g., for quality review):
# ❌ Always including metadata adds unnecessary payload size
data = get_document_data(doc_id, view="accounting_v1", include_metadata=True)

# ✅ Only request metadata when needed
if needs_quality_review:
    data = get_document_data(doc_id, view="accounting_v1", include_metadata=True)
else:
    data = get_document_data(doc_id, view="accounting_v1")
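
The snippet above calls a `get_document_data` helper that is not defined in this guide. One possible shape for it is sketched below using only the standard library. The helper names and the 30-second timeout are assumptions; the URL and query parameters follow the curl examples earlier on this page.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.docintell.com/v1"
API_KEY = "dk_live_YOUR_API_KEY"


def build_data_url(doc_id, view=None, include_metadata=False):
    """Build the /documents/{id}/data URL, adding only the parameters in use."""
    params = {}
    if view is not None:
        params["view"] = view
    if include_metadata:
        params["include_metadata"] = "true"
    url = f"{BASE_URL}/documents/{urllib.parse.quote(doc_id, safe='')}/data"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def get_document_data(doc_id, view=None, include_metadata=False):
    """Fetch projected document data; metadata is off unless requested."""
    request = urllib.request.Request(
        build_data_url(doc_id, view, include_metadata),
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Building the URL needs no network access, so it is safe to run as-is.
print(build_data_url("0194e123-4567-7890-abcd-ef1234567890", view="accounting_v1"))
```

Defaulting `include_metadata` to False keeps the small payload as the normal case and makes the larger one an explicit opt-in at each call site.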

6. Document Your Views

Maintain a mapping of views to use cases in your documentation:
# DocIntell Views

## Invoices

- **accounting_v1** - AP processing (6 fields)
- **compliance_v1** - Vendor verification (8 fields)
- **audit_v1** - Full audit trail (15 fields)
- **summary_v1** - Dashboard display (3 fields)

## Capital Calls

- **fund_ops_v1** - Fund operations (10 fields)
- ...

Error Handling

Invalid Fields

If you try to create a view with fields that don’t exist in the schema:
{
  "error": "invalid_fields",
  "message": "The following fields are not available for document type 'invoice': invalid_field, another_bad_field",
  "invalid_fields": ["invalid_field", "another_bad_field"],
  "available_fields": [
    "invoice_number",
    "invoice_date",
    "vendor_name",
    "total_amount",
    "..."
  ]
}
HTTP Status: 400 Bad Request
Fix: Check the schema (GET /v1/schemas/invoice) for valid field names.

View Not Found

If you query with a view that doesn’t exist:
{
  "detail": "View not found: 'nonexistent_view' for document type 'invoice'"
}
HTTP Status: 404 Not Found
Fix: Check your view name or create the view first (POST /v1/views).

Document Type Not Found

If you try to create a view for an unsupported document type:
{
  "detail": "Document type not found: 'unsupported_type'"
}
HTTP Status: 404 Not Found
Fix: List available document types (GET /v1/schemas).

Document Not Ready

If you query data before extraction completes:
{
  "error": "document_not_ready",
  "message": "Document extraction not completed. Current status: processing",
  "status": "processing"
}
HTTP Status: 400 Bad Request
Fix: Wait for extraction to complete (check job status with GET /v1/jobs/{job_id}).
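
The error bodies above come in two shapes: one with `error` and `message` keys, and one with a single `detail` key. A client can normalize both into one exception type. The sketch below is one way to do that. The `DocIntellError` class, the derived `not_found` code, and the retry rule are assumptions, not part of the API.

```python
class DocIntellError(Exception):
    """Client-side wrapper for a non-2xx API response (hypothetical)."""

    def __init__(self, status: int, code: str, message: str):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message


def parse_error(status: int, body: dict) -> DocIntellError:
    """Normalize the two error body shapes into one exception."""
    if "error" in body:
        # Shape 1: {"error": "...", "message": "...", ...}
        return DocIntellError(status, body["error"], body.get("message", ""))
    # Shape 2: {"detail": "..."} carries no code, so derive one from the status.
    code = "not_found" if status == 404 else "unknown_error"
    return DocIntellError(status, code, body.get("detail", ""))


def should_retry(error: DocIntellError) -> bool:
    """Only a document that is still processing is worth retrying."""
    return error.code == "document_not_ready"


not_ready = parse_error(400, {
    "error": "document_not_ready",
    "message": "Document extraction not completed. Current status: processing",
    "status": "processing",
})
missing = parse_error(404, {"detail": "View not found: 'nonexistent_view' for document type 'invoice'"})

print(not_ready.code, should_retry(not_ready))
print(missing.code, should_retry(missing))
```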

Complete Example: End-to-End Workflow

Here’s a complete example showing schema discovery, view creation, and data querying:
import requests

API_KEY = "dk_live_YOUR_API_KEY"
BASE_URL = "https://api.docintell.com/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

# 1. Discover available document types
schemas = requests.get(f"{BASE_URL}/schemas", headers=headers).json()
print(f"Available document types: {[s['document_type'] for s in schemas['schemas']]}")

# 2. Get full schema for invoices
invoice_schema = requests.get(f"{BASE_URL}/schemas/invoice", headers=headers).json()
print(f"Invoice fields: {[f['field_name'] for f in invoice_schema['fields']]}")

# 3. Create an accounting view
accounting_view = requests.post(
    f"{BASE_URL}/views",
    headers=headers,
    json={
        "document_type": "invoice",
        "name": "accounting_v1",
        "description": "Fields for AP processing",
        "fields": [
            "invoice_number",
            "invoice_date",
            "due_date",
            "vendor_name",
            "total_amount",
            "currency"
        ],
        "is_default": True
    }
).json()
print(f"Created view: {accounting_view['view_id']}")

# 4. Create a compliance view
compliance_view = requests.post(
    f"{BASE_URL}/views",
    headers=headers,
    json={
        "document_type": "invoice",
        "name": "compliance_v1",
        "description": "Fields for vendor verification",
        "fields": [
            "invoice_number",
            "vendor_name",
            "vendor_address",
            "vendor_tax_id",
            "payment_terms",
            "approved_by"
        ],
        "is_default": False
    }
).json()
print(f"Created view: {compliance_view['view_id']}")

# 5. Upload a document (returns immediately with job_id)
with open("invoice.pdf", "rb") as f:
    upload_response = requests.post(
        f"{BASE_URL}/documents",
        headers=headers,
        files={"file": f},
        data={"retention_years": 7, "document_type": "invoice"}
    ).json()

document_id = upload_response["document_id"]
job_id = upload_response["job_id"]
print(f"Document uploaded: {document_id}, job: {job_id}")

# 6. Poll for job completion (in production, use webhooks)
import time
while True:
    job = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=headers).json()
    if job["status"] == "completed":
        print("Extraction completed!")
        break
    elif job["status"] == "failed":
        # SystemExit with a message prints it to stderr and exits with status 1
        raise SystemExit(f"Extraction failed: {job.get('error_message')}")
    time.sleep(2)

# 7. Query with accounting view
accounting_data = requests.get(
    f"{BASE_URL}/documents/{document_id}/data?view=accounting_v1",
    headers=headers
).json()
print(f"Accounting data: {accounting_data['data']}")

# 8. Query with compliance view (same document, different fields!)
compliance_data = requests.get(
    f"{BASE_URL}/documents/{document_id}/data?view=compliance_v1",
    headers=headers
).json()
print(f"Compliance data: {compliance_data['data']}")
Output:
Available document types: ['capital_call', 'invoice', 'k1', ...]
Invoice fields: ['invoice_number', 'invoice_date', 'vendor_name', ...]
Created view: 0194e456-7890-7abc-def0-123456789abc
Created view: 0194e456-7890-7abc-def0-123456789def
Document uploaded: 0194e123-4567-7890-abcd-ef1234567890, job: 0194e123-4567-7890-abcd-ef1234567891
Extraction completed!
Accounting data: {'invoice_number': 'INV-2024-001', 'total_amount': 15432.50, ...}
Compliance data: {'invoice_number': 'INV-2024-001', 'vendor_tax_id': '12-3456789', ...}
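
Step 6 above polls every two seconds with no time limit. If you poll rather than use webhooks, a bounded wait with increasing delays is usually safer. The sketch below shows one way to write it. `wait_for_job`, its default timings, and the injected `get_status` callable are all assumptions; in real use `get_status` would wrap the GET /v1/jobs/{job_id} call.

```python
import time


def wait_for_job(get_status, timeout=300.0, initial_delay=1.0, max_delay=30.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the job completes or fails, or time runs out."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        job = get_status()
        if job["status"] in ("completed", "failed"):
            return job
        if clock() + delay > deadline:
            raise TimeoutError(f"Job still {job['status']!r} after {timeout} seconds")
        sleep(delay)
        # Double the delay after each attempt, up to max_delay.
        delay = min(delay * 2, max_delay)


# Simulated job that completes on the third poll; sleep is stubbed out
# so the example runs instantly.
statuses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "completed"},
])
job = wait_for_job(lambda: next(statuses), sleep=lambda seconds: None)
print(job["status"])
```

Passing `sleep` and `clock` as parameters keeps the function testable without real waiting. Callers can leave both at their defaults.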

Next Steps