Process Documents via API

This guide covers the full document processing lifecycle: uploading documents, starting processing, polling for completion, and retrieving results from export files. You'll learn how to use both the one-step and multi-step workflows, and how to target different datasets.

Prerequisites

  • An API key with access to your tenant (see Authentication)
  • A project with workflow settings configured (extraction fields, document classes, etc.). See the Quick Start guide to set up your first project.
  • Your tenant ID and project ID (you can find these in the UI):
    • Tenant ID: Click your user icon in the sidebar and look under Current Tenant
    • Project ID: Go to Library, find your project and look in Project Properties

One-Step API

Submit documents and start processing in a single call. This is the easiest way to process documents.

POST /tenants/{tenant_id}/projects/{project_id}/transactions/process

Example: Process an Invoice

curl -X POST "https://app.docaifabric.com/tenants/{tenant_id}/projects/{project_id}/transactions/process" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{
  "source_files": [
    {
      "filename": "invoice.pdf",
      "base64_data": "JVBERi0xLjQK..."
    }
  ],
  "correlation_id": "my-tracking-id-001",
  "metadata": {
    "source": "email_import",
    "batch": "daily-invoices"
  }
}'

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing",
  "estimated_completion_seconds": 30,
  "status_url": "/transactions/550e8400-e29b-41d4-a716-446655440000",
  "poll_url": "/transactions/550e8400-e29b-41d4-a716-446655440000?timeout=30"
}

One-Step API Fields

| Field | Type | Required | Description |
|---|---|---|---|
| source_files | array | Yes* | List of files with filename and base64_data |
| correlation_id | string | No | Your own tracking identifier for this transaction |
| metadata | object | No | Arbitrary key-value metadata attached to the transaction |
| priority | integer | No | Processing priority (higher = more urgent, default: 0) |

* At least one of source_files, documentText, or ocrData is required.
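
The request body can also be assembled in code. The following is a minimal sketch rather than an official client: build_process_payload is a hypothetical helper name, and it only sets the fields listed in the table above, leaving out any optional field you did not provide.

```python
import base64
from typing import Optional


def build_process_payload(filename: str,
                          content: bytes,
                          correlation_id: Optional[str] = None,
                          metadata: Optional[dict] = None,
                          priority: Optional[int] = None) -> dict:
    """Build the JSON body for POST .../transactions/process (hypothetical helper)."""
    payload = {
        "source_files": [{
            "filename": filename,
            # base64_data is sent as text, so decode the encoded bytes to str
            "base64_data": base64.b64encode(content).decode("ascii"),
        }]
    }
    # Only include optional fields that were actually supplied
    if correlation_id is not None:
        payload["correlation_id"] = correlation_id
    if metadata is not None:
        payload["metadata"] = metadata
    if priority is not None:
        payload["priority"] = priority
    return payload
```

Pass the returned dict as the JSON body of the request, for example with the json= argument of requests.post.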


Multi-Step API (For Advanced Workflows)

The multi-step API separates transaction creation from processing, giving you more control. This approach lets you:

  • Upload large or multiple files separately
  • Set a page limit for partial processing
  • Attach metadata and a correlation ID for tracking
  • Choose which dataset to submit to

Step 1: Create the Transaction

POST /tenants/{tenant_id}/projects/{project_id}/transactions

curl -X POST "https://app.docaifabric.com/tenants/{tenant_id}/projects/{project_id}/transactions" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{
  "dataset_id": "production",
  "correlation_id": "my-tracking-id-001",
  "metadata": {
    "source": "email_import",
    "batch": "daily-invoices"
  }
}'

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440001",
  "tenant_id": "tenant-acme",
  "project_id": "project-invoices",
  "status": "created",
  "correlation_id": "my-tracking-id-001",
  "created_at": "2025-01-15T10:35:00Z",
  "files_url": "/transactions/550e.../source-files",
  "start_url": "/transactions/550e.../start"
}

Create Transaction Fields

| Field | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | No | Target dataset: "production" (default), "playground", "memory", or an evaluation dataset UUID |
| correlation_id | string | No | Your own tracking identifier for this transaction |
| metadata | object | No | Arbitrary key-value metadata attached to the transaction |
| page_limit | integer | No | Process only the first N pages (full file always stored) |
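
Because dataset_id accepts either a named dataset or an evaluation dataset UUID, it can help to check the value before sending it. The sketch below is illustrative only: build_create_body is a hypothetical helper, and the client-side checks (UUID parsing, page_limit of at least 1) are assumptions about what counts as valid input, not documented server rules.

```python
import uuid
from typing import Optional

# Named datasets listed in the table above
NAMED_DATASETS = {"production", "playground", "memory"}


def build_create_body(dataset_id: str = "production",
                      correlation_id: Optional[str] = None,
                      metadata: Optional[dict] = None,
                      page_limit: Optional[int] = None) -> dict:
    """Build the JSON body for the create-transaction call (hypothetical helper)."""
    if dataset_id not in NAMED_DATASETS:
        # Anything else is expected to be an evaluation dataset UUID;
        # uuid.UUID raises ValueError if the string is malformed
        uuid.UUID(dataset_id)
    if page_limit is not None and page_limit < 1:
        # Assumption: a page limit below 1 is never meaningful
        raise ValueError("page_limit must be a positive integer")

    body = {"dataset_id": dataset_id}
    if correlation_id is not None:
        body["correlation_id"] = correlation_id
    if metadata is not None:
        body["metadata"] = metadata
    if page_limit is not None:
        body["page_limit"] = page_limit
    return body
```

Catching a typo such as "Playground" locally avoids creating a transaction in the wrong place or failing after files have been prepared.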

Step 2: Upload Source Files

POST /transactions/{transaction_id}/source-files

This endpoint uses multipart/form-data. Upload files directly under the files field:

curl -X POST "https://app.docaifabric.com/transactions/{transaction_id}/source-files" \
-H "X-API-Key: your-api-key" \
-F "files=@invoice_001.pdf" \
-F "files=@invoice_002.pdf"

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440001",
  "total_files": 2,
  "uploaded_count": 2,
  "failed_count": 0,
  "uploaded_files": [
    {
      "file_name": "invoice_001.pdf",
      "filename": "invoice_001.pdf",
      "content_type": "application/pdf",
      "index": 0,
      "size": 125430,
      "uploaded_at": "2025-01-15T10:35:05Z"
    },
    {
      "file_name": "invoice_002.pdf",
      "filename": "invoice_002.pdf",
      "content_type": "application/pdf",
      "index": 1,
      "size": 98210,
      "uploaded_at": "2025-01-15T10:35:05Z"
    }
  ],
  "failed_files": []
}

Note: Files are assigned indices (0-based) in the order they appear in the request. If a file with the same name already exists, it will be automatically renamed with a numeric suffix (e.g., file_1.pdf, file_2.pdf).

Upload Fields

| Field | In | Type | Required | Description |
|---|---|---|---|---|
| files | form-data | file(s) | Yes | One or more files. Repeat the field to upload several in one request. |
| upload_tokens | form-data | string(s) | No | Idempotency token per file, aligned by position with files. Re-uploading the same token is a no-op (see below). |
| file_index | query | integer | No | 0-based starting index for the files in this request. Use when uploading one file at a time to preserve ordering. |

Idempotent retries

If an upload request times out or the connection drops, you can't tell whether the server stored the file. If the first attempt did succeed, a blind retry creates a duplicate copy of the file in the transaction.

To make a retry safe, generate a stable token per file (e.g. a UUID) and send it as upload_tokens. If the server already recorded that token, it returns the original index and stores nothing new, so you can retry as many times as you like:

curl -X POST "https://app.docaifabric.com/transactions/{transaction_id}/source-files" \
-H "X-API-Key: your-api-key" \
-F "files=@invoice_001.pdf" \
-F "upload_tokens=2f1c8e90-6a3d-4b1e-9c77-0a1b2c3d4e5f" \
-F "files=@invoice_002.pdf" \
-F "upload_tokens=8b7a6c54-3210-49fe-8dcb-a9f8e7d6c5b4"

Two genuinely different files keep distinct tokens, so the same-name auto-rename above still applies to them.

For a fully retry-safe submission flow, also handle the other two steps:

  • Create is not idempotent: a retried create makes a second transaction. Create the transaction once and reuse its transaction_id on retries (correlation_id is for your own tracking only, not server-side dedup).
  • Start is not idempotent, but a repeated call returns 400 "Cannot start transaction in … state". Treat that specific error as success, since the prior call already started processing.
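
The two client-side rules above (one stable token per file, and treating the repeated-start error as success) can be sketched as follows. Both function names are hypothetical, the token store is a plain dict keyed by local file path (you would persist it if retries can span process restarts), and the error check matches the message text quoted above, which is an assumption about its exact wording.

```python
import uuid
from typing import Dict, List, Tuple


def assign_upload_tokens(paths: List[str],
                         token_store: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (path, token) pairs, reusing any token already stored for a path."""
    pairs = []
    for path in paths:
        if path not in token_store:
            # Generated once per file; every retry sends this same value
            token_store[path] = str(uuid.uuid4())
        pairs.append((path, token_store[path]))
    return pairs


def start_call_succeeded(status_code: int, body_text: str) -> bool:
    """Decide whether a start call (or a retry of it) can be treated as done."""
    if 200 <= status_code < 300:
        return True
    # A repeated start returns 400 "Cannot start transaction in ... state":
    # the earlier call already started processing, so count it as success
    return status_code == 400 and "Cannot start transaction in" in body_text
```

Send the tokens in the same order as the files so that each upload_tokens value lines up with its files entry.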

Step 3: Start Processing

POST /transactions/{transaction_id}/start

curl -X POST "https://app.docaifabric.com/transactions/{transaction_id}/start" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{}'

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440001",
  "status": "processing",
  "message": "Processing started",
  "status_url": "/transactions/550e8400-e29b-41d4-a716-446655440001",
  "poll_url": "/transactions/550e8400-e29b-41d4-a716-446655440001?timeout=30"
}

Datasets

By default, both the one-step and multi-step APIs create transactions in the Production dataset. To submit to Playground, Memory, or Evaluation datasets, use the dataset_id field in the Multi-Step API. See Datasets for details.


Getting Results: Poll or Webhook

Once a transaction is submitted, there are two ways to know when it's done and to fetch the results:

  1. Polling (covered below): call GET /transactions/{id} until status is completed, then list and download the export files. Simple, works from any client, no inbound HTTP required.
  2. Webhooks: DocAI Fabric POSTs a notification to a URL of your choice the moment the workflow finishes (or fails), with download links to the export files included. No polling, lower latency. Requires a publicly reachable HTTPS endpoint.

For high-volume or production integrations, webhooks are recommended. See Webhooks - Get Results Without Polling for the full guide. Every new project ships with a Notification activity already wired in at the end of its workflow; you just need to fill in the webhook URL.

The rest of this article covers the polling path.

Checking Transaction Status

After submitting a transaction, poll for completion using the transaction ID:

GET /transactions/{transaction_id}

Basic Status Check

curl "https://app.docaifabric.com/transactions/{transaction_id}" \
-H "X-API-Key: your-api-key"

Use the timeout parameter to wait for the transaction to complete instead of polling repeatedly:

# Wait up to 30 seconds for completion
curl "https://app.docaifabric.com/transactions/{transaction_id}?timeout=30" \
-H "X-API-Key: your-api-key"

The server holds the connection and returns immediately when the status changes, or after the timeout expires.
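
A long-polling client loop might look like the sketch below. fetch_status is a hypothetical callable that performs GET /transactions/{transaction_id}?timeout=N and returns the parsed JSON; the overall deadline (max_wait_seconds) is an assumption added so the loop cannot run forever, not an API feature. The loop stops on completed, failed, or awaiting_input, since none of those will change without action on your side.

```python
import time
from typing import Callable

# Statuses at which further polling is pointless without outside action
STOP_STATUSES = {"completed", "failed", "awaiting_input"}


def wait_for_transaction(fetch_status: Callable[[int], dict],
                         poll_timeout: int = 30,
                         max_wait_seconds: float = 600.0,
                         clock: Callable[[], float] = time.monotonic) -> dict:
    """Long-poll until the transaction reaches a stop status or the deadline passes."""
    deadline = clock() + max_wait_seconds
    while True:
        # Each call blocks server-side for up to poll_timeout seconds,
        # so no client-side sleep is needed between iterations
        status = fetch_status(poll_timeout)
        if status.get("status") in STOP_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError("transaction did not finish within the allowed time")
```

With the requests library, fetch_status could be a small lambda that calls requests.get with params={"timeout": t} and returns .json().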

Lightweight Polling with Field Filtering

Use the fields parameter to request only the fields you need. This reduces the response from 200+ KB to just 2-3 KB:

# Poll with minimal payload
curl "https://app.docaifabric.com/transactions/{transaction_id}?fields=status,created_at,metadata.step_progress" \
-H "X-API-Key: your-api-key"
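
To make the dot notation concrete, here is a local sketch of the kind of selection the fields parameter describes. It is an illustration only: select_fields is a hypothetical helper, and the server's actual filtering (for example, how it treats paths that do not exist) may differ.

```python
def select_fields(document: dict, fields: str) -> dict:
    """Keep only the comma-separated, dot-notated paths from a nested dict."""
    result = {}
    for path in fields.split(","):
        keys = [key for key in path.strip().split(".") if key]
        if not keys:
            continue
        # Walk down the document; give up on this path if any part is missing
        value = document
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break
            value = value[key]
        else:
            # Path resolved: rebuild the same nesting in the result
            target = result
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = value
    return result
```

Applied to a full status response with fields set to status,metadata.step_progress, this keeps the status string and the step_progress object nested under metadata, and drops everything else.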

Batch Status Check

If you're tracking multiple transactions, use the batch endpoint instead of polling each one individually:

curl -X POST "https://app.docaifabric.com/transactions/batch-status" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{
  "transaction_ids": ["id-001", "id-002", "id-003"],
  "fields": "status,created_at,metadata.progress"
}'

Response:

{
  "statuses": {
    "id-001": { "status": "completed", "created_at": "..." },
    "id-002": { "status": "processing", "created_at": "..." },
    "id-003": { "status": "queued", "created_at": "..." }
  }
}

Up to 100 transaction IDs can be checked in a single call.
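
If you track more than 100 transactions, split the IDs into several requests. A minimal sketch, assuming you simply send one batch-status call per chunk and merge the statuses objects yourself (batch_status_bodies is a hypothetical helper name):

```python
from typing import Iterator, List

MAX_BATCH_IDS = 100  # limit per call, as stated above


def batch_status_bodies(transaction_ids: List[str],
                        fields: str = "status") -> Iterator[dict]:
    """Yield one batch-status request body per chunk of at most 100 IDs."""
    for start in range(0, len(transaction_ids), MAX_BATCH_IDS):
        yield {
            "transaction_ids": transaction_ids[start:start + MAX_BATCH_IDS],
            "fields": fields,
        }
```

Each yielded dict can be posted as-is to POST /transactions/batch-status.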

Query Parameters

| Parameter | Type | Description |
|---|---|---|
| timeout | integer | Long-polling: server waits up to N seconds for a status change before responding. Recommended: 30. |
| fields | string | Comma-separated list of fields to return (supports dot notation, e.g., metadata.step_progress). Reduces payload size significantly. |
| include_request | boolean | Include the original request data in the response (only applies to completed transactions) |

Status Values

| Status | Description |
|---|---|
| created | Transaction created, awaiting files or start |
| queued | Transaction is queued and waiting for a worker to pick it up |
| processing | Actively processing (conversion → OCR → split → extraction) |
| awaiting_input | Workflow is paused waiting for external input (e.g., human review) |
| completed | All processing phases completed successfully |
| failed | Processing failed (check error_message in the response) |

Completed Response

This endpoint is for status, not data delivery. Once the transaction reaches completed, it confirms completion with timing and progress:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440000",
  "tenant_id": "your-tenant-id",
  "project_id": "your-project-id",
  "status": "completed",
  "created_at": "2025-01-15T10:35:00Z",
  "started_at": "2025-01-15T10:35:01Z",
  "completed_at": "2025-01-15T10:35:13Z",
  "processing_time_seconds": 12.5,
  "metadata": {
    "progress": {
      "total_documents": 1,
      "completed_documents": 1,
      "completion_percentage": 100.0
    },
    "step_progress": {
      "current_step": "extract",
      "current_step_name": "Extraction",
      "steps": {}
    }
  }
}

Key fields to know

  • metadata.progress: Track overall completion with total_documents, completed_documents, and completion_percentage.
  • metadata.step_progress: Track which workflow step is currently running.
  • review_metadata: Present when status is awaiting_input, contains review step details like step_name, instructions, and timeout_hours.

Retrieving the extracted data

To read the results, don't parse the status response. Use the export files. Once status is completed, fetch the structured JSON (and any PDF / XLSX) as described in Retrieving Export Results below. The export file is the stable, canonical results format; the status endpoint only tells you when the data is ready.


Retrieving Export Results

Export files are how you retrieve transaction data. A project's export profiles render results into downloadable files; new projects ship with a JSON profile by default, and you can add PDF, XLSX, CSV, or XML profiles.

Exports run automatically as the final workflow step: when a transaction reaches completed, its export files are ready (no separate API call triggers them).

The default JSON file

Every project is created with a JSON export profile that produces a transaction-level file named {transaction_id}_data.json. This is the canonical, structured results file (see JSON Results Schema below). The profile is editable in Project Settings → Export: change its parameters to reshape the JSON, or remove it if you only need PDF/XLSX.

Step 1: List the Export Manifest

GET /transactions/{transaction_id}/exports

curl "https://app.docaifabric.com/transactions/{transaction_id}/exports" \
-H "X-API-Key: your-api-key"

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440000",
  "profiles": {
    "invoice-json": {
      "profile_id": "invoice-json",
      "profile_name": "Invoice JSON",
      "output_format": "json",
      "scope_mode": "per_document",
      "files": [
        {
          "filename": "INV-12345_data.json",
          "path": "exports/INV-12345_data.json",
          "size_bytes": 4821,
          "document_ids": ["doc-uuid-001"],
          "created_at": "2025-01-15T10:35:13Z"
        }
      ]
    },
    "invoice-pdf": {
      "profile_id": "invoice-pdf",
      "profile_name": "Invoice PDF",
      "output_format": "pdf",
      "scope_mode": "per_document",
      "files": [
        {
          "filename": "INV-12345.pdf",
          "path": "exports/INV-12345.pdf",
          "size_bytes": 128430,
          "document_ids": ["doc-uuid-001"],
          "created_at": "2025-01-15T10:35:13Z"
        }
      ]
    }
  },
  "total_files": 2,
  "total_size_bytes": 133251
}

Each export profile appears under profiles, keyed by profile ID. The files array lists the generated files. If a profile fails, its entry has an empty files array and an error string. If the project has no export profiles, the response is {"profiles": {}, "total_files": 0, "total_size_bytes": 0, ...}.
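
Since a failed profile still appears in the manifest (with an empty files array and an error string), it is worth separating usable files from failures before downloading. A minimal sketch, where summarize_manifest is a hypothetical helper operating on the parsed manifest JSON:

```python
from typing import Dict, List, Tuple


def summarize_manifest(manifest: dict) -> Tuple[List[Tuple[str, str, str]], Dict[str, str]]:
    """Split a manifest into downloadable files and per-profile errors.

    Returns (files, errors): files is a list of
    (profile_id, output_format, filename) tuples, errors maps
    profile_id to its error string.
    """
    files = []
    errors = {}
    for profile_id, profile in manifest.get("profiles", {}).items():
        if profile.get("error"):
            errors[profile_id] = profile["error"]
        for entry in profile.get("files", []):
            files.append((profile_id, profile.get("output_format"), entry["filename"]))
    return files, errors
```

An empty files list together with an empty errors dict means the project has no export profiles at all.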

Step 2: Download a File

GET /transactions/{transaction_id}/exports/download/{filename}

Use the filename value from the manifest (it may contain subfolders, e.g. inv123/scan.pdf):

# Download the JSON export
curl "https://app.docaifabric.com/transactions/{transaction_id}/exports/download/INV-12345_data.json" \
-H "X-API-Key: your-api-key" -OJ

# Download the PDF export
curl "https://app.docaifabric.com/transactions/{transaction_id}/exports/download/INV-12345.pdf" \
-H "X-API-Key: your-api-key" -OJ

The response is the raw file bytes with a Content-Type matching the file extension (application/json, application/pdf, text/csv, application/xml, or the spreadsheet type for .xlsx) and a Content-Disposition: attachment header. Add ?inline=true to render in-browser instead of forcing a download.

| Parameter | Type | Description |
|---|---|---|
| inline | boolean | true serves the file inline (Content-Disposition: inline); default false forces a download |

Note: Export endpoints require the transaction.download_results permission on your API key. The download path is read from the manifest's filename field, not path (which carries the internal exports/ prefix the endpoint adds for you).
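
Because filename may contain subfolders, both the download URL and the local save path need a little care. The sketch below shows one way to handle this; the helper names are hypothetical, and percent-encoding each path segment while keeping the slashes is an assumption based on ordinary URL rules rather than a documented requirement.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import quote

BASE_URL = "https://app.docaifabric.com"


def export_download_url(transaction_id: str, filename: str, inline: bool = False) -> str:
    """Build the download URL from the manifest's filename field (not path)."""
    # Keep "/" unescaped so subfolders remain separate path segments
    url = f"{BASE_URL}/transactions/{transaction_id}/exports/download/{quote(filename, safe='/')}"
    if inline:
        url += "?inline=true"
    return url


def local_target(download_dir: Path, filename: str) -> Path:
    """Map a manifest filename to a path inside download_dir."""
    relative = PurePosixPath(filename)
    # Refuse names that would escape the download directory
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"unsafe export filename: {filename!r}")
    return download_dir.joinpath(*relative.parts)
```

Before writing the file, create the parent directory, for example with target.parent.mkdir(parents=True, exist_ok=True).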

JSON Results Schema

The default JSON export ({transaction_id}_data.json) is a single transaction-level file. With the profile's default settings it looks like this:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440000",
  "tenant_id": "your-tenant-id",
  "project_id": "your-project-id",
  "transaction": {
    "exported_at": "2025-01-15T10:35:14Z",
    "created_at": "2025-01-15T10:35:00Z",
    "source_file_names": ["invoice.pdf"],
    "project_version": 12,
    "correlation_id": "my-tracking-id-001",
    "source": "api",
    "last_human_review": { "reviewed": false },
    "metadata": { "batch": "daily-invoices" }
  },
  "validation": {
    "overall_status": "success",
    "transaction_rule_results": []
  },
  "documents": [
    {
      "document_id": "doc-uuid-001",
      "document_index": 1,
      "document_type": "Invoice",
      "classification_confidence": 0.98,
      "classification": {
        "class_id": "3620f15d",
        "type": "Invoice",
        "confidence": 0.98,
        "reasoning": "Matched layout and 'INVOICE' header.",
        "alternatives": [],
        "confirmed": null,
        "review_required": false
      },
      "split": {},
      "fields": {
        "717dd8c1": {
          "name": "Invoice Number",
          "value": "INV-12345",
          "confidence": 0.95,
          "bounding_boxes": [
            { "coordinates": [571.0, 661.0, 1163.0, 699.0], "page_index": 0 }
          ],
          "reasoning": "Shown top-right as 'Invoice No. INV-12345'.",
          "review_required": false,
          "is_confirmed": null
        }
      }
    }
  ]
}

| Path | Description |
|---|---|
| transaction | Transaction metadata: timing, source files, your correlation_id and metadata, and last_human_review (the final review action, or {"reviewed": false}) |
| validation | Transaction-level business-rule results; overall_status is success, warning, or error |
| documents[] | One entry per document. fields is keyed by field ID, where each value has name (display name), value, confidence, bounding_boxes, reasoning, review_required, and is_confirmed |
| documents[].validation | Per-document rule results, present only when the document has warnings or errors |

This shape is controlled by the JSON export profile in Project Settings → Export. Common options: include_validation, include_pages, include_extraction_metadata (value-only vs. full field detail), field_ids (export a subset of fields), and scope (one file per transaction vs. one file per document). Adjusting the profile changes this file for every subsequent transaction: it is the single source of truth for your JSON results.
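
Once the JSON export is parsed, a common next step is to pick out fields that deserve a second look. The sketch below assumes the default profile settings shown above (full field detail, so every field is an object with confidence and review_required); fields_needing_review and the 0.9 threshold are hypothetical choices, not part of the API.

```python
from typing import List, Tuple


def fields_needing_review(results: dict,
                          min_confidence: float = 0.9) -> List[Tuple[str, str, str, object]]:
    """List (document_id, field_id, name, value) for fields worth checking."""
    flagged = []
    for doc in results.get("documents", []):
        # fields is keyed by field ID; the display name lives inside each entry
        for field_id, field in doc.get("fields", {}).items():
            confidence = field.get("confidence")
            low_confidence = confidence is not None and confidence < min_confidence
            if field.get("review_required") or low_confidence:
                flagged.append(
                    (doc.get("document_id"), field_id, field.get("name"), field.get("value"))
                )
    return flagged
```

If the export profile is switched to value-only output, the field entries no longer carry these keys and this helper would need adjusting.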


Listing Transactions

Retrieve transactions from a specific dataset:

GET /transactions?dataset_id={dataset_id}

# List production transactions
curl "https://app.docaifabric.com/transactions?dataset_id=production&limit=20" \
-H "X-API-Key: your-api-key"

# List playground transactions
curl "https://app.docaifabric.com/transactions?dataset_id=playground&limit=20" \
-H "X-API-Key: your-api-key"

Filtering

| Parameter | Description |
|---|---|
| dataset_id | Dataset to list from (default: "production") |
| status | Filter by status: created, queued, processing, awaiting_input, completed, failed |
| limit | Maximum results to return (default: 100) |
| offset | Pagination offset |
| search | Free-text search across transaction ID, correlation ID, file names, document types, and index fields |
| sort_by | Field to sort by (e.g., created_at, status, processing_time_seconds, transaction_id, or an index field like idx_vendor) |
| sort_order | "asc" or "desc" |
| step | Filter by current workflow step name |
| date_from | ISO datetime: only return transactions created at or after this time |
| date_to | ISO datetime: only return transactions created at or before this time |
| index_filter | JSON filter on index fields, e.g., {"idx_vendor": ["Acme"]} |
| filter_expression | Advanced expression tree (JSON). Overrides status, step, date_from, date_to, and index_filter when provided |
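
The filters can be combined into a single set of query parameters. The sketch below is illustrative: build_list_params and next_page are hypothetical helpers, and sending index_filter as a JSON-encoded string in the query is an assumption about how a JSON-valued parameter is passed.

```python
import json
from typing import Optional


def build_list_params(dataset_id: str = "production",
                      status: Optional[str] = None,
                      search: Optional[str] = None,
                      index_filter: Optional[dict] = None,
                      limit: int = 100,
                      offset: int = 0) -> dict:
    """Build query parameters for GET /transactions."""
    params = {"dataset_id": dataset_id, "limit": limit, "offset": offset}
    if status is not None:
        params["status"] = status
    if search is not None:
        params["search"] = search
    if index_filter is not None:
        # Assumption: JSON-valued parameters travel as a JSON-encoded string
        params["index_filter"] = json.dumps(index_filter)
    return params


def next_page(params: dict) -> dict:
    """Return a copy of params advanced by one page (offset += limit)."""
    return {**params, "offset": params["offset"] + params["limit"]}
```

Pass the resulting dict as the query parameters of the request and call next_page to move through the results one page at a time.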

Complete Example: Python

End-to-end example using Python to submit a document and retrieve results:

import base64
import json
import os

import requests

BASE_URL = "https://app.docaifabric.com"
API_KEY = "your-api-key"
TENANT_ID = "your-tenant-id"
PROJECT_ID = "your-project-id"

headers = {"X-API-Key": API_KEY}

# 1. Read and encode the file
with open("invoice.pdf", "rb") as f:
    file_data = base64.b64encode(f.read()).decode("utf-8")

# 2. Submit for processing (one-step)
response = requests.post(
    f"{BASE_URL}/tenants/{TENANT_ID}/projects/{PROJECT_ID}/transactions/process",
    headers={**headers, "Content-Type": "application/json"},
    json={
        "source_files": [
            {"filename": "invoice.pdf", "base64_data": file_data}
        ]
    }
)
response.raise_for_status()  # stop here if the submission was rejected
transaction_id = response.json()["transaction_id"]
print(f"Transaction created: {transaction_id}")

# 3. Poll for completion (status only, lightweight)
while True:
    status = requests.get(
        f"{BASE_URL}/transactions/{transaction_id}",
        headers=headers,
        params={"timeout": 30, "fields": "status,metadata.progress"}
    ).json()

    print(f"Status: {status['status']}")

    # awaiting_input also ends the loop: the workflow is paused for external input
    if status["status"] in ("completed", "failed", "awaiting_input"):
        break

# 4. Retrieve results from the export files
if status["status"] == "completed":
    manifest = requests.get(
        f"{BASE_URL}/transactions/{transaction_id}/exports",
        headers=headers
    ).json()

    for profile in manifest.get("profiles", {}).values():
        for file in profile.get("files", []):
            content = requests.get(
                f"{BASE_URL}/transactions/{transaction_id}"
                f"/exports/download/{file['filename']}",
                headers=headers
            ).content
            # filename may contain subfolders, so create them before writing
            os.makedirs(os.path.dirname(file["filename"]) or ".", exist_ok=True)
            with open(file["filename"], "wb") as out:
                out.write(content)
            print(f"Saved {file['filename']} ({file['size_bytes']} bytes)")

    # Parse the default JSON export to read extracted fields
    json_file = next(
        (f["filename"]
         for p in manifest["profiles"].values() if p["output_format"] == "json"
         for f in p["files"]),
        None
    )
    if json_file:
        with open(json_file) as f:
            data = json.load(f)
        for doc in data["documents"]:
            print(f"\nDocument: {doc.get('document_type', 'Unknown')}")
            for field_id, field in doc["fields"].items():
                print(f"  {field['name']}: {field['value']} "
                      f"(confidence: {field['confidence']})")

Complete Example: Multi-Step with Dataset Selection

Submit a document to the Playground dataset using the multi-step API:

import requests

BASE_URL = "https://app.docaifabric.com"
API_KEY = "your-api-key"
TENANT_ID = "your-tenant-id"
PROJECT_ID = "your-project-id"

headers = {"X-API-Key": API_KEY}

# 1. Create transaction in the Playground dataset
tx = requests.post(
    f"{BASE_URL}/tenants/{TENANT_ID}/projects/{PROJECT_ID}/transactions",
    headers={**headers, "Content-Type": "application/json"},
    json={
        "dataset_id": "playground",
        "correlation_id": "test-run-001"
    }
).json()

transaction_id = tx["transaction_id"]
print(f"Created transaction: {transaction_id}")

# 2. Upload files (multipart/form-data)
with open("invoice.pdf", "rb") as f:
    requests.post(
        f"{BASE_URL}/transactions/{transaction_id}/source-files",
        headers=headers,
        files=[("files", ("invoice.pdf", f, "application/pdf"))]
    )

# 3. Start processing
requests.post(
    f"{BASE_URL}/transactions/{transaction_id}/start",
    headers={**headers, "Content-Type": "application/json"},
    json={}
)

print(f"Processing started. Poll status at: GET /transactions/{transaction_id}?timeout=30")

Tips

  • Use correlation IDs: Set correlation_id to your internal document ID for easy cross-referencing.
  • Make uploads retry-safe: Send a stable upload_tokens value per file so a retry after a network drop doesn't duplicate the file. See Step 2.
  • Poll for status, not data: GET /transactions/{id} tells you when a transaction is done; read the actual results from the export files. Add ?timeout=30 for long-polling and ?fields=status,metadata.progress to keep poll responses small.
  • Start with Playground: Test your integration against the Playground dataset before switching to Production.
  • Page limits for testing: Set page_limit to process only the first few pages of large documents during development.
  • Batch status checks: When tracking many transactions, use POST /transactions/batch-status instead of polling each one individually.
  • Exports are ready at completed: Export files are produced by the final workflow step, so list them with GET /transactions/{id}/exports as soon as the status is completed, with no separate trigger call needed.
  • Skip polling with webhooks: If your receiver is publicly reachable, configure a Notification activity in the project workflow and DocAI Fabric will POST the results manifest to you the moment the transaction completes. See Webhooks.