Documents sometimes need to be re-processed. This guide covers when and how to re-extract data from documents that have already been uploaded.
When to Re-process
You should re-process a document when:
Classification Issues: Document was classified as OTHER or unknown and you want to provide a type hint
Low Confidence: Extraction had low confidence scores and you want to retry with explicit hints
Model Improvements: New extraction models are available and you want better results
Audit Trail: Compare extraction accuracy across multiple runs for quality assurance
Re-processing creates a new extraction job but uses the same document in the Cold Vault.
You don’t need to re-upload the PDF - the original file is immutable and always available.
Re-process a Document
Re-process an existing document by creating a new extraction job:
# Basic re-processing (no hint)
curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
-H "Authorization: Bearer dk_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{}'
# Re-processing with document type hint
curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
-H "Authorization: Bearer dk_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"document_type": "capital_call"
}'
Response (202 Accepted):
{
  "document_id": "019370ab-c123-7def-8901-234567890abc",
  "job_id": "019370ab-d456-7def-8901-234567890def",
  "status": "pending",
  "vault_uri": "gs://docintel-vault/019370ab-c123-7def-8901-234567890abc.pdf",
  "created_at": "2024-01-15T14:30:00Z"
}
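The same request can be assembled in Python. The sketch below only builds the URL, headers, and JSON body used in the curl examples above and does not send anything; `build_reprocess_request` and `API_BASE` are hypothetical names chosen for illustration, not part of any DocIntell SDK.

```python
from typing import Optional

API_BASE = "https://api.docintell.com/v1"


def build_reprocess_request(document_id: str, api_key: str,
                            document_type: Optional[str] = None) -> dict:
    """Return the parts of a POST /documents/{id}/jobs call (hypothetical helper)."""
    # An empty JSON object means "re-process without a type hint".
    body = {} if document_type is None else {"document_type": document_type}
    return {
        "method": "POST",
        "url": f"{API_BASE}/documents/{document_id}/jobs",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": body,
    }


request = build_reprocess_request(
    "019370ab-c123-7def-8901-234567890abc",
    "dk_live_YOUR_API_KEY",
    document_type="capital_call",
)
print(request["url"])
print(request["json"])
```

If the `requests` library is installed, the returned dictionary can be passed on as `requests.request(**request)`; a 202 response would then carry the new `job_id`.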
Deduplication Behavior
DocIntell automatically prevents duplicate jobs for the same document:
If a pending or processing job already exists, the existing job is returned
No duplicate extraction jobs are created
This prevents accidental re-submission from retry logic or double-clicks
Example: Deduplication Response
If you call POST /v1/documents/{id}/jobs while a job is already in progress:
{
  "document_id": "019370ab-c123-7def-8901-234567890abc",
  "job_id": "019370ab-d456-7def-8901-234567890def",
  "status": "processing",
  "vault_uri": "gs://docintel-vault/019370ab-c123-7def-8901-234567890abc.pdf",
  "created_at": "2024-01-15T14:28:00Z"
}
Notice the job_id and created_at match the existing job (not a new one).
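Because a deduplicated call returns the existing job, a client can detect it by remembering the job IDs it has already seen. This is a minimal client-side sketch; `record_job` is a hypothetical helper, and it assumes only that responses contain the `job_id` field shown above.

```python
def record_job(response: dict, seen_job_ids: set) -> bool:
    """Record the returned job and report whether it is new to this client.

    Returns False when the job_id was already seen, which suggests the API
    returned an existing in-progress job instead of creating a new one.
    """
    job_id = response["job_id"]
    is_new = job_id not in seen_job_ids
    seen_job_ids.add(job_id)
    return is_new


seen = set()
first = {"job_id": "019370ab-d456-7def-8901-234567890def", "status": "pending"}
retry = {"job_id": "019370ab-d456-7def-8901-234567890def", "status": "processing"}

print(record_job(first, seen))  # True: first time this job_id is seen
print(record_job(retry, seen))  # False: the same job was returned again
```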
Supported Document Types
You can provide a document_type hint to improve classification accuracy:
| Type | Description |
| --- | --- |
| capital_call | Capital call notices |
| distribution | Distribution notices |
| k1 | K-1 tax forms |
| nav_statement | NAV statements |
| audited_financial | Audited financial statements |
| invoice | Commercial invoices |
| trade_confirmation | Trade confirmations |
| wire_confirmation | Wire transfer confirmations |
| subscription_agreement | Subscription agreements |
| insurance | Insurance documents |
| investor_letter | Investor letters |
| unknown | Let the system auto-classify |
If you’re unsure of the document type, omit document_type entirely. The extraction engine
will auto-classify the document based on its content.
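One way to apply this guidance in client code is to validate the hint before sending it and to drop it when the type is not known. The sketch below is an assumption-laden illustration: `hint_body` and `SUPPORTED_TYPES` are hypothetical names, the type list is copied from the table above, and whether the API itself rejects other values is not stated in this guide.

```python
from typing import Optional

# Types copied from the table above; "unknown" is left out on purpose,
# because omitting document_type has the same auto-classify effect.
SUPPORTED_TYPES = frozenset({
    "capital_call", "distribution", "k1", "nav_statement",
    "audited_financial", "invoice", "trade_confirmation",
    "wire_confirmation", "subscription_agreement", "insurance",
    "investor_letter",
})


def hint_body(document_type: Optional[str] = None) -> dict:
    """Build the JSON body for a re-processing request (hypothetical helper)."""
    if document_type is None or document_type == "unknown":
        return {}  # no hint: let the engine auto-classify
    if document_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported document_type: {document_type!r}")
    return {"document_type": document_type}


print(hint_body("k1"))
print(hint_body("unknown"))
```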
View Job History
List all extraction attempts for a document to see the full re-processing history:
curl -X GET https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
-H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response:
{
  "jobs": [
    {
      "job_id": "019370ab-d456-7def-8901-234567890def",
      "document_id": "019370ab-c123-7def-8901-234567890abc",
      "status": "completed",
      "created_at": "2024-01-15T14:30:00Z",
      "processing_completed_at": "2024-01-15T14:30:45Z",
      "processing_time_seconds": 45.2
    },
    {
      "job_id": "019370ab-d123-7def-8901-234567890abc",
      "document_id": "019370ab-c123-7def-8901-234567890abc",
      "status": "completed",
      "created_at": "2024-01-15T10:00:00Z",
      "processing_completed_at": "2024-01-15T10:00:52Z",
      "processing_time_seconds": 52.1
    }
  ],
  "total": 2,
  "page": 1,
  "per_page": 20
}
Job history supports pagination for documents with many re-processing attempts:
# Get page 2 with 10 items per page
curl -X GET "https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs?page=2&per_page=10" \
-H "Authorization: Bearer dk_live_YOUR_API_KEY"
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| page | integer | 1 | Page number (1-indexed) |
| per_page | integer | 20 | Items per page (max 100) |

Jobs are ordered newest first. The first item (page 1, index 0) is always the most recent extraction attempt.
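To walk the whole history, a client can keep requesting pages until it has seen `total` jobs. The generator below is a sketch under that assumption; `iter_jobs` and `fetch_page` are hypothetical names, and the page fetcher is injected so the loop can be exercised without network access.

```python
from typing import Callable, Iterator


def iter_jobs(fetch_page: Callable[[int, int], dict],
              per_page: int = 20) -> Iterator[dict]:
    """Yield every job, newest first, by following 1-indexed pages."""
    page = 1
    seen = 0
    while True:
        payload = fetch_page(page, per_page)
        jobs = payload["jobs"]
        yield from jobs
        seen += len(jobs)
        # Stop on an empty page or once all reported jobs were yielded.
        if not jobs or seen >= payload["total"]:
            break
        page += 1


def fake_fetch(page: int, per_page: int) -> dict:
    """Stand-in for the HTTP call, serving five made-up jobs."""
    all_jobs = [{"job_id": f"job-{i}"} for i in range(5)]
    start = (page - 1) * per_page
    return {
        "jobs": all_jobs[start:start + per_page],
        "total": len(all_jobs),
        "page": page,
        "per_page": per_page,
    }


print([job["job_id"] for job in iter_jobs(fake_fetch, per_page=2)])
```

In real use, `fetch_page` would wrap the GET request shown above, passing `page` and `per_page` as query parameters and returning the parsed JSON.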
Common Re-processing Scenarios
Scenario 1: Document Classified as OTHER
Your document was uploaded but classified as OTHER or unknown:
Check the current classification
curl -X GET https://api.docintell.com/v1/jobs/019370ab-d123-7def-8901-234567890abc \
-H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response shows:
{
  "document_type": "unknown",
  "classification_reasoning": "Unable to classify document"
}
Re-process with document type hint
curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
-H "Authorization: Bearer dk_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"document_type": "capital_call"}'
Compare results
Retrieve both jobs and compare extraction quality:
# Original job (auto-classified as unknown)
curl -X GET https://api.docintell.com/v1/jobs/019370ab-d123-7def-8901-234567890abc/results \
-H "Authorization: Bearer dk_live_YOUR_API_KEY"
# New job (with capital_call hint)
curl -X GET https://api.docintell.com/v1/jobs/019370ab-d456-7def-8901-234567890def/results \
-H "Authorization: Bearer dk_live_YOUR_API_KEY"
Scenario 2: Low Confidence Extraction
The initial extraction had low confidence scores for critical fields:
Review confidence scores
curl -X GET https://api.docintell.com/v1/jobs/019370ab-d123-7def-8901-234567890abc/results \
-H "Authorization: Bearer dk_live_YOUR_API_KEY"
Response shows low confidence:
{
  "extraction": {
    "data": {
      "total_amount": 50000.00,
      "due_date": "2024-02-01"
    },
    "field_metadata": {
      "total_amount": {
        "confidence": 0.45,
        "reasoning": "Ambiguous - multiple dollar amounts found"
      }
    }
  }
}
Re-process with correct document type
Providing the correct document type often improves confidence:
curl -X POST https://api.docintell.com/v1/documents/019370ab-c123-7def-8901-234567890abc/jobs \
-H "Authorization: Bearer dk_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"document_type": "invoice"}'
Verify improved confidence
{
  "extraction": {
    "data": {
      "total_amount": 50000.00,
      "due_date": "2024-02-01"
    },
    "field_metadata": {
      "total_amount": {
        "confidence": 0.92,
        "reasoning": "Found in 'Total Amount Due' section"
      }
    }
  }
}
Confidence scores are self-reported by the LLM and are directionally useful but not calibrated.
A 90% confidence score does NOT mean 90% accuracy. Use scores for relative comparison between fields,
not as absolute probabilities.
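In keeping with that caveat, a comparison between two runs is best treated as relative. The sketch below flags fields under a cutoff and pairs up confidence values from two result payloads; the function names and the 0.6 threshold are illustrative assumptions, and only the `extraction.field_metadata` shape shown above is relied on.

```python
def low_confidence_fields(results: dict, threshold: float = 0.6) -> list:
    """Names of fields whose self-reported confidence is below the cutoff."""
    metadata = results["extraction"]["field_metadata"]
    return sorted(
        name for name, meta in metadata.items()
        if meta["confidence"] < threshold
    )


def confidence_changes(before: dict, after: dict) -> dict:
    """Map field name to (old, new) confidence for fields in both runs."""
    old = before["extraction"]["field_metadata"]
    new = after["extraction"]["field_metadata"]
    return {
        name: (old[name]["confidence"], new[name]["confidence"])
        for name in sorted(old.keys() & new.keys())
    }


# Trimmed copies of the two responses shown above.
before = {"extraction": {"field_metadata": {"total_amount": {"confidence": 0.45}}}}
after = {"extraction": {"field_metadata": {"total_amount": {"confidence": 0.92}}}}

print(low_confidence_fields(before))      # ['total_amount']
print(confidence_changes(before, after))  # {'total_amount': (0.45, 0.92)}
```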
Track extraction accuracy improvements over time:
import requests

headers = {"Authorization": "Bearer dk_live_YOUR_API_KEY"}
document_id = "019370ab-c123-7def-8901-234567890abc"

# Get all jobs for the document (newest first)
jobs_response = requests.get(
    f"https://api.docintell.com/v1/documents/{document_id}/jobs",
    headers=headers,
)
jobs = jobs_response.json()["jobs"]

# Compare results across all completed jobs
for job in jobs:
    if job["status"] == "completed":
        results = requests.get(
            f"https://api.docintell.com/v1/jobs/{job['job_id']}/results",
            headers=headers,
        ).json()
        print(f"Job {job['job_id']}:")
        print(f"  Created: {job['created_at']}")
        print(f"  Type: {results['classification']['document_type']}")
        print(f"  Confidence: {results['classification']['confidence']}")
        print(f"  Processing Time: {job['processing_time_seconds']}s")
        print()
Best Practices
Don't Re-process Pending Jobs
Why: Deduplication prevents duplicate jobs, but checking status first avoids unnecessary API calls.
# Check status before re-processing
status = requests.get(
    f"https://api.docintell.com/v1/documents/{document_id}/status",
    headers=headers,
).json()
if status["status"] in ["pending", "processing"]:
    print("Job already in progress")
else:
    # Safe to re-process: POST to /v1/documents/{document_id}/jobs
    requests.post(...)
Use Webhooks for Completion
Why: Polling for job completion wastes API quota and adds latency.
Set up webhooks once and receive notifications for all jobs:
curl -X POST https://api.docintell.com/v1/webhooks \
-H "Authorization: Bearer dk_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://yourapp.com/webhooks/docintel",
"events": ["document.processing.completed"]
}'
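Track Job IDs for Comparison
Why: Each extraction creates a unique job_id. Track them to compare results.
# Reuses `requests` and `headers` from the audit script above
def get_job_results(job_id):
    # Fetch the extraction results for a single job
    url = f"https://api.docintell.com/v1/jobs/{job_id}/results"
    return requests.get(url, headers=headers).json()

job_mapping = {
    "original": "019370ab-d123-7def-8901-234567890abc",
    "with_hint": "019370ab-d456-7def-8901-234567890def",
}

# Compare confidence scores
for label, job_id in job_mapping.items():
    results = get_job_results(job_id)
    print(f"{label}: {results['classification']['confidence']}")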
Use Meaningful Document Type Hints
Why: Type hints significantly improve extraction accuracy for known document types.
DO: Provide document_type if you know the document type
DON’T: Use "unknown" as a hint - omit document_type instead
Rate Limiting
Re-processing is subject to the document ingestion rate limit: 100 documents/hour per tenant.
Rate limit headers are returned in responses:
HTTP/1.1 202 Accepted
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1705329600
If you exceed the rate limit, you’ll receive a 429 Too Many Requests error:
{
  "error": "rate_limit_exceeded",
  "message": "Document ingestion rate limit exceeded (100 documents/hour)",
  "details": {
    "limit": 100,
    "remaining": 0,
    "reset_at": "2024-01-15T15:00:00Z"
  }
}
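A client can use the reset header to decide how long to pause before retrying. The sketch below assumes that `X-RateLimit-Reset` is a Unix timestamp in seconds (the sample value above looks like one) and that 429 responses include the same headers; neither point is confirmed in this guide, and `seconds_until_reset` is a hypothetical helper.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str],
                        now: Optional[float] = None) -> float:
    """Seconds to wait before retrying, never negative.

    Assumes X-RateLimit-Reset holds Unix epoch seconds.
    """
    current = time.time() if now is None else now
    reset_at = float(headers["X-RateLimit-Reset"])
    return max(0.0, reset_at - current)


sample_headers = {
    "X-RateLimit-Limit": "100",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1705329600",
}

# Ten minutes before the reset time, the helper reports 600 seconds.
print(seconds_until_reset(sample_headers, now=1705329000))
```

With the `requests` library, `response.headers` is case-insensitive and can be passed directly; a plain dictionary must use the exact header spelling shown here.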
Next Steps