Documents sometimes need to be re-processed. This guide covers when and how to re-extract data from documents that have already been uploaded.

When to Re-process

You should re-process a document when:

Classification Issues

Document was classified as OTHER or unknown and you want to provide a type hint

Low Confidence

Extraction had low confidence scores and you want to retry with explicit hints

Model Improvements

New extraction models are available and you want better results

Audit Trail

Compare extraction accuracy across multiple runs for quality assurance
Re-processing creates a new extraction job but reuses the document already stored in the Cold Vault. You don’t need to re-upload the PDF: the original file is immutable and always available.

Re-process a Document

Re-process an existing document by creating a new extraction job:
# Basic re-processing (no hint)
curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

# Re-processing with document type hint
curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type": "capital_call"
  }'
Response (202 Accepted):
{
  "document_id": "019370ab-c123-7def-8901-234567890abc",
  "job_id": "019370ab-d456-7def-8901-234567890def",
  "status": "pending",
  "vault_uri": "gs://docintel-vault/019370ab-c123-7def-8901-234567890abc.pdf",
  "created_at": "2024-01-15T14:30:00Z"
}

Deduplication Behavior

DocIntell automatically prevents duplicate jobs for the same document:
  • If a pending or processing job already exists, the existing job is returned
  • No duplicate extraction jobs are created
  • This prevents accidental re-submission from retry logic or double-clicks
If you call POST /v1/documents/{id}/jobs while a job is already in progress, the response describes the existing job:
{
  "document_id": "019370ab-c123-7def-8901-234567890abc",
  "job_id": "019370ab-d456-7def-8901-234567890def",
  "status": "processing",
  "vault_uri": "gs://docintel-vault/019370ab-c123-7def-8901-234567890abc.pdf",
  "created_at": "2024-01-15T14:28:00Z"
}
Notice the job_id and created_at match the existing job (not a new one).
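A client can detect deduplication by remembering the job IDs it has already received. The following is a minimal sketch of that idea; `record_job` and the `known_job_ids` set are hypothetical names, and the only fact taken from this guide is that a deduplicated response carries the existing job_id.

```python
def record_job(response: dict, known_job_ids: set) -> bool:
    """Remember the job_id from a create-job response.

    Returns True when the job_id has not been seen before (a new job was
    created) and False when the API returned a job this client already
    knows about (the request was deduplicated).
    """
    job_id = response["job_id"]
    is_new = job_id not in known_job_ids
    known_job_ids.add(job_id)
    return is_new
```

This only recognizes jobs the same client process has seen. Comparing created_at against the time of the request is another option, under the assumption that client and server clocks roughly agree.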

Supported Document Types

You can provide a document_type hint to improve classification accuracy:
| Type | Description |
| --- | --- |
| capital_call | Capital call notices |
| distribution | Distribution notices |
| k1 | K-1 tax forms |
| nav_statement | NAV statements |
| audited_financial | Audited financial statements |
| invoice | Commercial invoices |
| trade_confirmation | Trade confirmations |
| wire_confirmation | Wire transfer confirmations |
| subscription_agreement | Subscription agreements |
| insurance | Insurance documents |
| investor_letter | Investor letters |
| unknown | Let the system auto-classify |
If you’re unsure of the document type, omit document_type entirely. The extraction engine will auto-classify the document based on its content.
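One way to apply this advice in client code is to build the request body through a small helper that drops unusable hints. The sketch below is illustrative only: `build_reprocess_payload` and `SUPPORTED_HINTS` are hypothetical names, and the client-side validation is an assumption rather than documented API behavior.

```python
from typing import Optional

# Type values copied from the table of supported document types.
# "unknown" is left out on purpose: the guide recommends omitting the hint.
SUPPORTED_HINTS = frozenset({
    "capital_call", "distribution", "k1", "nav_statement",
    "audited_financial", "invoice", "trade_confirmation",
    "wire_confirmation", "subscription_agreement", "insurance",
    "investor_letter",
})


def build_reprocess_payload(document_type: Optional[str] = None) -> dict:
    """Return the JSON body for POST /v1/documents/{id}/jobs.

    An empty body lets the extraction engine auto-classify the document.
    """
    if document_type is None or document_type == "unknown":
        return {}
    if document_type not in SUPPORTED_HINTS:
        raise ValueError(f"Unsupported document_type hint: {document_type!r}")
    return {"document_type": document_type}
```

Rejecting unrecognized values locally catches typos such as "capital-call" before a request is sent.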

View Job History

List all extraction attempts for a document to see the full re-processing history:
curl -X GET https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response:
{
  "jobs": [
    {
      "job_id": "019370ab-d456-7def-8901-234567890def",
      "document_id": "019370ab-c123-7def-8901-234567890abc",
      "status": "completed",
      "created_at": "2024-01-15T14:30:00Z",
      "processing_completed_at": "2024-01-15T14:30:45Z",
      "processing_time_seconds": 45.2
    },
    {
      "job_id": "019370ab-d123-7def-8901-234567890abc",
      "document_id": "019370ab-c123-7def-8901-234567890abc",
      "status": "completed",
      "created_at": "2024-01-15T10:00:00Z",
      "processing_completed_at": "2024-01-15T10:00:52Z",
      "processing_time_seconds": 52.1
    }
  ],
  "total": 2,
  "page": 1,
  "per_page": 20
}
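Given the `jobs` array from this response, client code often needs the most recent attempt that actually finished. A minimal sketch, assuming the field names shown in the response above; `latest_completed_job` is a hypothetical helper, not part of any SDK.

```python
from typing import Optional


def latest_completed_job(jobs: list) -> Optional[dict]:
    """Return the most recently created job whose status is "completed".

    Returns None when no job has completed yet.
    """
    completed = [job for job in jobs if job["status"] == "completed"]
    # UTC timestamps in the same ISO 8601 format sort correctly as strings.
    return max(completed, key=lambda job: job["created_at"], default=None)
```

Sorting by created_at rather than trusting list position keeps the helper correct even if jobs from several pages are merged in arbitrary order.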

Pagination

Job history supports pagination for documents with many re-processing attempts:
# Get page 2 with 10 items per page
curl -X GET "https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs?page=2&per_page=10" \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| page | integer | 1 | Page number (1-indexed) |
| per_page | integer | 20 | Items per page (max 100) |
Jobs are ordered newest first. The first item (page 1, index 0) is always the most recent extraction attempt.
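To walk the full history, a client can keep requesting pages until it has seen `total` jobs. The sketch below assumes that `total` counts jobs across all pages, which the sample response suggests but does not state. `fetch_page` is a hypothetical callable that performs the GET request with the given `page` and `per_page` values and returns the parsed JSON body.

```python
from typing import Callable, Iterator


def iter_all_jobs(
    fetch_page: Callable[[int, int], dict], per_page: int = 20
) -> Iterator[dict]:
    """Yield every job for a document, page by page, newest first."""
    page = 1
    seen = 0
    while True:
        body = fetch_page(page, per_page)
        jobs = body["jobs"]
        if not jobs:
            return  # An empty page means there is nothing left to read.
        yield from jobs
        seen += len(jobs)
        if seen >= body["total"]:
            return  # All jobs reported by the API have been yielded.
        page += 1
```

Passing the HTTP call in as a function keeps the paging logic separate from authentication and transport, so it can be exercised with a stub.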

Common Re-processing Scenarios

Scenario 1: Document Classified as OTHER

Your document was uploaded but classified as OTHER or unknown:
1. Check the current classification

curl -X GET https://api.docintell.com/v1/jobs/019370ab-d123-7def-8901-234567890abc \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response shows:
{
  "document_type": "unknown",
  "classification_reasoning": "Unable to classify document"
}
2. Re-process with document type hint

curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_type": "capital_call"}'
3. Compare results

Retrieve both jobs and compare extraction quality:
# Original job (auto-classified as unknown)
curl -X GET https://api.docintell.com/v1/jobs/019370ab-d123-7def-8901-234567890abc/results \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"

# New job (with capital_call hint)
curl -X GET https://api.docintell.com/v1/jobs/019370ab-d456-7def-8901-234567890def/results \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
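Once both result bodies are fetched, the comparison itself can be a field-by-field diff. This is a rough sketch that assumes each results body contains an `extraction.data` object, the shape shown in the low-confidence scenario of this guide; `diff_extracted_data` is a hypothetical helper.

```python
def diff_extracted_data(original: dict, rerun: dict) -> dict:
    """Map each differing field to an (original_value, rerun_value) tuple.

    A field missing from one run is reported with None on that side.
    """
    old = original["extraction"]["data"]
    new = rerun["extraction"]["data"]
    changed = {}
    for field in sorted(old.keys() | new.keys()):
        if old.get(field) != new.get(field):
            changed[field] = (old.get(field), new.get(field))
    return changed
```

An empty result means the hint did not change any extracted value, which is useful to know before switching downstream systems to the new job.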

Scenario 2: Low Confidence Extraction

Initial extraction had low confidence scores for critical fields:
1. Review confidence scores

curl -X GET https://api.docintell.com/v1/jobs/019370ab-d123-7def-8901-234567890abc/results \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response shows low confidence:
{
  "extraction": {
    "data": {
      "total_amount": 50000.00,
      "due_date": "2024-02-01"
    },
    "field_metadata": {
      "total_amount": {
        "confidence": 0.45,
        "reasoning": "Ambiguous - multiple dollar amounts found"
      }
    }
  }
}
2. Re-process with correct document type

Providing the correct document type often improves confidence:
curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_type": "invoice"}'
3. Verify improved confidence

{
  "extraction": {
    "data": {
      "total_amount": 50000.00,
      "due_date": "2024-02-01"
    },
    "field_metadata": {
      "total_amount": {
        "confidence": 0.92,
        "reasoning": "Found in 'Total Amount Due' section"
      }
    }
  }
}
Confidence scores are self-reported by the LLM and are directionally useful but not calibrated. A 90% confidence score does NOT mean 90% accuracy. Use scores for relative comparison between fields, not as absolute probabilities.
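Because the scores are best used for relative comparison, a simple approach is to flag fields below a cutoff and then look at how each field's score moved between two runs. The sketch below assumes the `extraction.field_metadata` shape shown in the responses above. The helper names and the 0.6 default cutoff are illustrative assumptions, not values recommended by DocIntell.

```python
def low_confidence_fields(results: dict, threshold: float = 0.6) -> list:
    """Return the names of fields whose confidence is below the threshold."""
    metadata = results["extraction"]["field_metadata"]
    return sorted(
        name for name, meta in metadata.items()
        if meta["confidence"] < threshold
    )


def confidence_deltas(before: dict, after: dict) -> dict:
    """Return the per-field confidence change for fields present in both runs."""
    old = before["extraction"]["field_metadata"]
    new = after["extraction"]["field_metadata"]
    return {
        name: round(new[name]["confidence"] - old[name]["confidence"], 4)
        for name in old.keys() & new.keys()
    }
```

A positive delta shows that the model reports more confidence after the hint. It does not prove the extracted value is correct, so spot-check critical fields against the source document.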

Scenario 3: Comparing Extraction Runs

Track extraction accuracy improvements over time:
import requests

headers = {"Authorization": "Bearer dk_live_YOUR_API_KEY"}
document_id = "019370ab-c123-7def-8901-234567890abc"

# Get all jobs for the document
jobs_response = requests.get(
    f"https://api.docintell.com/v1/documents/{document_id}/jobs",
    headers=headers
)
jobs = jobs_response.json()["jobs"]

# Compare results across all jobs
for job in jobs:
    if job["status"] == "completed":
        results = requests.get(
            f"https://api.docintell.com/v1/jobs/{job['job_id']}/results",
            headers=headers
        ).json()

        print(f"Job {job['job_id']}:")
        print(f"  Created: {job['created_at']}")
        print(f"  Type: {results['classification']['document_type']}")
        print(f"  Confidence: {results['classification']['confidence']}")
        print(f"  Processing Time: {job['processing_time_seconds']}s")
        print()

Best Practices

Don't Re-process Pending Jobs

Why: Deduplication prevents duplicate jobs, but checking status first avoids unnecessary API calls.
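import requests

headers = {"Authorization": "Bearer dk_live_YOUR_API_KEY"}
document_id = "019370ab-c123-7def-8901-234567890abc"

# Check status before re-processing
status = requests.get(
    f"https://api.docintell.com/v1/documents/{document_id}/status",
    headers=headers
).json()

if status["status"] in ["pending", "processing"]:
    print("Job already in progress")
else:
    # Safe to re-process: create a new extraction job (no type hint)
    response = requests.post(
        f"https://api.docintell.com/v1/documents/{document_id}/jobs",
        headers=headers,
        json={}
    )
    print(f"Created job {response.json()['job_id']}")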

Use Webhooks for Completion

Why: Polling for job completion wastes API quota and adds latency. Set up webhooks once and receive notifications for all jobs:
curl -X POST https://api.docintell.com/v1/webhooks \
  -H "Authorization: Bearer dk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://yourapp.com/webhooks/docintel",
    "events": ["document.processing.completed"]
  }'

Track Job IDs for Comparison

Why: Each extraction creates a unique job_id. Track them to compare results.
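import requests

headers = {"Authorization": "Bearer dk_live_YOUR_API_KEY"}

def get_job_results(job_id):
    # Fetch extraction results for a single job
    url = f"https://api.docintell.com/v1/jobs/{job_id}/results"
    return requests.get(url, headers=headers).json()

job_mapping = {
    "original": "019370ab-d123-7def-8901-234567890abc",
    "with_hint": "019370ab-d456-7def-8901-234567890def"
}

# Compare confidence scores
for label, job_id in job_mapping.items():
    results = get_job_results(job_id)
    print(f"{label}: {results['classification']['confidence']}")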

Use Meaningful Document Type Hints

Why: Type hints significantly improve extraction accuracy for known document types.
DO: Provide document_type if you know the document type
DON’T: Use "unknown" as a hint - omit document_type instead

Rate Limiting

Re-processing is subject to the document ingestion rate limit: 100 documents/hour per tenant. Rate limit headers are returned in responses:
HTTP/1.1 202 Accepted
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1705329600
If you exceed the rate limit, you’ll receive a 429 Too Many Requests error:
{
  "error": "rate_limit_exceeded",
  "message": "Document ingestion rate limit exceeded (100 documents/hour)",
  "details": {
    "limit": 100,
    "remaining": 0,
    "reset_at": "2024-01-15T15:00:00Z"
  }
}
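A client that receives this error can wait until the reported reset time instead of retrying immediately. A minimal sketch, assuming the error body shape shown above with `details.reset_at` as an ISO 8601 UTC timestamp; `seconds_until_reset` is a hypothetical helper.

```python
from datetime import datetime, timezone


def seconds_until_reset(error_body: dict, now: datetime) -> float:
    """Return how long to wait before retrying, never less than zero.

    `now` must be timezone-aware, for example datetime.now(timezone.utc).
    """
    raw = error_body["details"]["reset_at"]
    # Replace the trailing "Z" so fromisoformat accepts it on Python < 3.11.
    reset_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return max(0.0, (reset_at - now).total_seconds())
```

In practice the result would be passed to `time.sleep` before re-submitting the request, using `datetime.now(timezone.utc)` as the current time.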

Next Steps