Re-processing Documents

Documents sometimes need to be re-processed. This guide covers when and how to re-extract data from documents that have already been uploaded.

When to Re-process

You should re-process a document when:

Classification Issues

The document was classified as OTHER or unknown and you want to provide a type hint

Low Confidence

The extraction had low confidence scores and you want to retry with explicit hints

Model Improvements

New extraction models are available and you want better results

Audit Trail

You want to compare extraction accuracy across multiple runs for quality assurance

Re-processing creates a new extraction job but uses the same document in the Cold Vault. You don’t need to re-upload the PDF - the original file is immutable and always available.

Re-process a Document

Re-process an existing document by creating a new extraction job:

# Basic re-processing (no hint)
curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

# Re-processing with document type hint
curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type": "capital_call"
  }'

Response (202 Accepted):

{
  "document_id": "019370ab-c123-7def-8901-234567890abc",
  "job_id": "019370ab-d456-7def-8901-234567890def",
  "status": "pending",
  "vault_uri": "gs://docintel-vault/019370ab-c123-7def-8901-234567890abc.pdf",
  "created_at": "2024-01-15T14:30:00Z"
}
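For callers working in Python rather than curl, the same request can be sketched with the standard library. This is a minimal sketch, not an official SDK: the helper names `build_reprocess_request` and `send` are hypothetical, and only the endpoint, headers, and `document_type` field are taken from the curl examples above.

```python
import json
import urllib.request

API_BASE = "https://api.docintell.com/v1"


def build_reprocess_request(document_id, api_key, document_type=None):
    """Build (but do not send) a POST request that creates a new extraction job.

    Hypothetical helper that mirrors the curl examples above.
    """
    # An empty JSON object means "no hint"; include document_type only when given.
    payload = {} if document_type is None else {"document_type": document_type}
    return urllib.request.Request(
        url=f"{API_BASE}/documents/{document_id}/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON body (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_reprocess_request(document_id, "dk_live_YOUR_API_KEY", "capital_call"))` would then submit the job. Keeping request construction separate from sending makes the payload easy to inspect before anything goes over the network.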

Deduplication Behavior

DocIntell automatically prevents duplicate jobs for the same document:

If a pending or processing job already exists, the existing job is returned
No duplicate extraction jobs are created
This prevents accidental re-submission from retry logic or double-clicks

Example: Deduplication Response

If you call POST /v1/documents/{id}/jobs while a job is already in progress:

{
  "document_id": "019370ab-c123-7def-8901-234567890abc",
  "job_id": "019370ab-d456-7def-8901-234567890def",
  "status": "processing",
  "vault_uri": "gs://docintel-vault/019370ab-c123-7def-8901-234567890abc.pdf",
  "created_at": "2024-01-15T14:28:00Z"
}

Notice the job_id and created_at match the existing job (not a new one).
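Because a deduplicated response reuses the existing job_id, a client that remembers the job IDs it has already seen can tell whether a call created a new job. A minimal sketch, assuming the client keeps its own set of known IDs (the helper name `was_deduplicated` is hypothetical):

```python
def was_deduplicated(response, known_job_ids):
    """Return True if the API returned a job this client already knew about."""
    return response["job_id"] in known_job_ids


# Job IDs recorded from earlier responses for this document.
known_job_ids = {"019370ab-d456-7def-8901-234567890def"}

# The deduplication response shown above.
response = {
    "document_id": "019370ab-c123-7def-8901-234567890abc",
    "job_id": "019370ab-d456-7def-8901-234567890def",
    "status": "processing",
    "created_at": "2024-01-15T14:28:00Z",
}

if was_deduplicated(response, known_job_ids):
    print("Existing job returned; no new extraction was started")
else:
    known_job_ids.add(response["job_id"])
```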

Supported Document Types

You can provide a document_type hint to improve classification accuracy:

Type	Description
`capital_call`	Capital call notices
`distribution`	Distribution notices
`k1`	K-1 tax forms
`nav_statement`	NAV statements
`audited_financial`	Audited financial statements
`invoice`	Commercial invoices
`trade_confirmation`	Trade confirmations
`wire_confirmation`	Wire transfer confirmations
`subscription_agreement`	Subscription agreements
`insurance`	Insurance documents
`investor_letter`	Investor letters
`unknown`	Let the system auto-classify

If you’re unsure of the document type, omit document_type entirely. The extraction engine will auto-classify the document based on its content.
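The guidance above can be enforced client-side before a request is sent. The sketch below is illustrative only: the `SUPPORTED_HINTS` set is copied from the table, and the helper name `build_job_payload` is hypothetical. It treats both a missing hint and "unknown" as "send no hint".

```python
# Type hints copied from the table above ("unknown" is handled separately).
SUPPORTED_HINTS = frozenset({
    "capital_call",
    "distribution",
    "k1",
    "nav_statement",
    "audited_financial",
    "invoice",
    "trade_confirmation",
    "wire_confirmation",
    "subscription_agreement",
    "insurance",
    "investor_letter",
})


def build_job_payload(document_type=None):
    """Return the JSON body for a re-processing request.

    Omits document_type when no meaningful hint is available, so the
    extraction engine auto-classifies the document.
    """
    if document_type is None or document_type == "unknown":
        return {}
    if document_type not in SUPPORTED_HINTS:
        raise ValueError(f"Unsupported document_type hint: {document_type!r}")
    return {"document_type": document_type}
```

Rejecting unrecognized values locally avoids spending a rate-limited request on a hint the API may not accept.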

View Job History

List all extraction attempts for a document to see the full re-processing history:

curl -X GET https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"

Response:

{
  "jobs": [
    {
      "job_id": "019370ab-d456-7def-8901-234567890def",
      "document_id": "019370ab-c123-7def-8901-234567890abc",
      "status": "completed",
      "created_at": "2024-01-15T14:30:00Z",
      "processing_completed_at": "2024-01-15T14:30:45Z",
      "processing_time_seconds": 45.2
    },
    {
      "job_id": "019370ab-d123-7def-8901-234567890abc",
      "document_id": "019370ab-c123-7def-8901-234567890abc",
      "status": "completed",
      "created_at": "2024-01-15T10:00:00Z",
      "processing_completed_at": "2024-01-15T10:00:52Z",
      "processing_time_seconds": 52.1
    }
  ],
  "total": 2,
  "page": 1,
  "per_page": 20
}

Pagination

Job history supports pagination for documents with many re-processing attempts:

# Get page 2 with 10 items per page
curl -X GET "https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs?page=2&per_page=10" \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"

Parameter	Type	Default	Description
`page`	integer	`1`	Page number (1-indexed)
`per_page`	integer	`20`	Items per page (max 100)

Jobs are ordered newest first. The first item (page 1, index 0) is always the most recent extraction attempt.
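To walk the full history, a client can keep requesting pages until it has collected `total` jobs. The generator below is a sketch under that assumption; `iter_all_jobs` and the caller-supplied `fetch_page(page, per_page)` callable are hypothetical names, and `fetch_page` is expected to return a dict shaped like the job history response above.

```python
def iter_all_jobs(fetch_page, per_page=20):
    """Yield every job, newest first, requesting pages until `total` is reached.

    fetch_page(page, per_page) must return a dict containing "jobs" and "total".
    """
    page = 1  # pages are 1-indexed
    seen = 0
    while True:
        body = fetch_page(page, per_page)
        jobs = body["jobs"]
        for job in jobs:
            yield job
        seen += len(jobs)
        # Stop once everything is fetched, or defensively if a page is empty.
        if not jobs or seen >= body["total"]:
            return
        page += 1
```

In practice `fetch_page` would issue the GET request shown above with the `page` and `per_page` query parameters and return the parsed JSON. Passing it in as a callable keeps the paging logic testable without network access.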

Common Re-processing Scenarios

Scenario 1: Document Classified as OTHER

Your document was uploaded but classified as OTHER or unknown:

Check the current classification

curl -X GET https://api.docintell.com/v1/jobs/019370ab-d123-7def-8901-234567890abc \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"

Response shows:

{
  "document_type": "unknown",
  "classification_reasoning": "Unable to classify document"
}

Re-process with document type hint

curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_type": "capital_call"}'

Compare results

Retrieve both jobs and compare extraction quality:

# Original job (auto-classified as unknown)
curl -X GET https://api.docintell.com/v1/jobs/019370ab-d123-7def-8901-234567890abc/results \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"

# New job (with capital_call hint)
curl -X GET https://api.docintell.com/v1/jobs/019370ab-d456-7def-8901-234567890def/results \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"

Scenario 2: Low Confidence Extraction

Initial extraction had low confidence scores for critical fields:

Review confidence scores

curl -X GET https://api.docintell.com/v1/jobs/019370ab-d123-7def-8901-234567890abc/results \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"

Response shows low confidence:

{
  "extraction": {
    "data": {
      "total_amount": 50000.00,
      "due_date": "2024-02-01"
    },
    "field_metadata": {
      "total_amount": {
        "confidence": 0.45,
        "reasoning": "Ambiguous - multiple dollar amounts found"
      }
    }
  }
}

Re-process with correct document type

Providing the correct document type often improves confidence:

curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_type": "invoice"}'

Verify improved confidence

{
  "extraction": {
    "data": {
      "total_amount": 50000.00,
      "due_date": "2024-02-01"
    },
    "field_metadata": {
      "total_amount": {
        "confidence": 0.92,
        "reasoning": "Found in 'Total Amount Due' section"
      }
    }
  }
}

Confidence scores are self-reported by the LLM and are directionally useful but not calibrated. A 90% confidence score does NOT mean 90% accuracy. Use scores for relative comparison between fields, not as absolute probabilities.
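Since the scores are most useful for relative comparison, one option is to line up per-field confidence from two runs side by side. This is a sketch that assumes both results follow the `extraction.field_metadata` shape shown in this scenario; the helper name `confidence_changes` is hypothetical.

```python
def confidence_changes(before, after):
    """Map each field present in both runs to its (before, after) confidence."""
    old = before["extraction"]["field_metadata"]
    new = after["extraction"]["field_metadata"]
    return {
        field: (old[field]["confidence"], new[field]["confidence"])
        for field in old
        if field in new
    }


# Trimmed copies of the two responses shown above.
original_run = {"extraction": {"field_metadata": {"total_amount": {"confidence": 0.45}}}}
hinted_run = {"extraction": {"field_metadata": {"total_amount": {"confidence": 0.92}}}}

print(confidence_changes(original_run, hinted_run))
# → {'total_amount': (0.45, 0.92)}
```

A higher score in the second run suggests the hint helped, but the extracted value itself should still be checked.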

Scenario 3: Comparing Extraction Runs

Track extraction accuracy improvements over time:

import requests

headers = {"Authorization": "Bearer dk_live_YOUR_API_KEY"}
document_id = "019370ab-c123-7def-8901-234567890abc"

# Get all jobs for the document
jobs_response = requests.get(
    f"https://api.docintell.com/v1/documents/{document_id}/jobs",
    headers=headers
)
jobs = jobs_response.json()["jobs"]

# Compare results across all jobs
for job in jobs:
    if job["status"] == "completed":
        results = requests.get(
            f"https://api.docintell.com/v1/jobs/{job['job_id']}/results",
            headers=headers
        ).json()

        print(f"Job {job['job_id']}:")
        print(f"  Created: {job['created_at']}")
        print(f"  Type: {results['classification']['document_type']}")
        print(f"  Confidence: {results['classification']['confidence']}")
        print(f"  Processing Time: {job['processing_time_seconds']}s")
        print()

Best Practices

Don't Re-process Pending Jobs

Why: Deduplication prevents duplicate jobs, but checking status first avoids unnecessary API calls.

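import requests

headers = {"Authorization": "Bearer dk_live_YOUR_API_KEY"}
document_id = "019370ab-c123-7def-8901-234567890abc"

# Check status before re-processing
status = requests.get(
    f"https://api.docintell.com/v1/documents/{document_id}/status",
    headers=headers
).json()

if status["status"] in ["pending", "processing"]:
    print("Job already in progress")
else:
    # Safe to re-process: create a new extraction job (empty body = no type hint)
    requests.post(
        f"https://api.docintell.com/v1/documents/{document_id}/jobs",
        headers=headers,
        json={}
    )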

Use Webhooks for Completion

Why: Polling for job completion wastes API quota and adds latency. Set up webhooks once and receive notifications for all jobs:

curl -X POST https://api.docintell.com/v1/webhooks \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://yourapp.com/webhooks/docintel",
    "events": ["document.processing.completed"]
  }'

Track Job IDs for Comparison

Why: Each extraction creates a unique job_id. Track them to compare results.

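import requests

headers = {"Authorization": "Bearer dk_live_YOUR_API_KEY"}
job_mapping = {
    "original": "019370ab-d123-7def-8901-234567890abc",
    "with_hint": "019370ab-d456-7def-8901-234567890def"
}

def get_job_results(job_id):
    # Fetch the extraction results for a single job
    url = f"https://api.docintell.com/v1/jobs/{job_id}/results"
    return requests.get(url, headers=headers).json()

# Compare confidence scores
for label, job_id in job_mapping.items():
    results = get_job_results(job_id)
    print(f"{label}: {results['classification']['confidence']}")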

Use Meaningful Document Type Hints

Why: Type hints significantly improve extraction accuracy for known document types.

DO: Provide document_type if you know the document type

DON’T: Use "unknown" as a hint - omit document_type instead

Rate Limiting

Re-processing is subject to the document ingestion rate limit: 100 documents/hour per tenant. Rate limit headers are returned in responses:

HTTP/1.1 202 Accepted
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1705329600

If you exceed the rate limit, you’ll receive a 429 Too Many Requests error:

{
  "error": "rate_limit_exceeded",
  "message": "Document ingestion rate limit exceeded (100 documents/hour)",
  "details": {
    "limit": 100,
    "remaining": 0,
    "reset_at": "2024-01-15T15:00:00Z"
  }
}
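A client that receives a 429 can use the rate limit headers to decide how long to pause before retrying. The sketch below assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, which the sample value suggests but the text does not state; the helper name `seconds_until_reset` is hypothetical.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds remain until the rate-limit window resets.

    Assumes X-RateLimit-Reset holds a Unix timestamp in seconds.
    The result is never negative.
    """
    now = time.time() if now is None else now
    reset_at = int(headers["X-RateLimit-Reset"])
    return max(0.0, reset_at - now)


# Headers from the sample response above, evaluated at a fixed "current" time.
sample_headers = {
    "X-RateLimit-Limit": "100",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1705329600",
}
wait = seconds_until_reset(sample_headers, now=1705329000)
print(f"Retry in {wait:.0f} seconds")
```

Passing `now` explicitly keeps the function deterministic in tests; in production code it can be omitted and the result handed to `time.sleep` before retrying.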

Next Steps

Job Status

Monitor extraction job progress

Extraction Results

Retrieve full extraction data with metadata

Webhook Setup

Get real-time notifications for job completion

Error Handling

Handle extraction failures gracefully
