Process Documents via API
This guide covers the full document processing lifecycle: uploading documents, starting processing, polling for completion, and retrieving results from export files. You'll learn how to use both the one-step and multi-step workflows, and how to target different datasets.
Prerequisites
- An API key with access to your tenant (see Authentication)
- A project with workflow settings configured (extraction fields, document classes, etc.). See the Quick Start guide to set up your first project
- Your tenant ID and project ID (you can find these in the UI):
  - Tenant ID: Click your user icon in the sidebar and look under Current Tenant
  - Project ID: Go to Library, find your project, and look in Project Properties
One-Step API (Recommended for Simple Use Cases)
Submit documents and start processing in a single call. This is the easiest way to process documents.
POST /tenants/{tenant_id}/projects/{project_id}/transactions/process
Example: Process an Invoice
curl -X POST "https://app.docaifabric.com/tenants/{tenant_id}/projects/{project_id}/transactions/process" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{
"source_files": [
{
"filename": "invoice.pdf",
"base64_data": "JVBERi0xLjQK..."
}
],
"correlation_id": "my-tracking-id-001",
"metadata": {
"source": "email_import",
"batch": "daily-invoices"
}
}'
Response:
{
"transaction_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "processing",
"estimated_completion_seconds": 30,
"status_url": "/transactions/550e8400-e29b-41d4-a716-446655440000",
"poll_url": "/transactions/550e8400-e29b-41d4-a716-446655440000?timeout=30"
}
One-Step API Fields
| Field | Type | Required | Description |
|---|---|---|---|
| source_files | array | Yes* | List of files with filename and base64_data |
| correlation_id | string | No | Your own tracking identifier for this transaction |
| metadata | object | No | Arbitrary key-value metadata attached to the transaction |
| priority | integer | No | Processing priority (higher = more urgent, default: 0) |
* At least one of source_files, documentText, or ocrData is required.
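The request body can also be assembled in code rather than by hand. The following is a minimal sketch that uses only the field names from the table above; `build_process_payload` is a hypothetical helper name (not part of any DocAI Fabric SDK), and it covers only the source_files variant of the request.

```python
import base64
import json


def build_process_payload(files, correlation_id=None, metadata=None, priority=None):
    """Build the JSON body for POST .../transactions/process.

    `files` is a list of (filename, raw_bytes) pairs. Optional fields are
    included only when set, so server-side defaults apply otherwise.
    This hypothetical helper handles the source_files variant only.
    """
    if not files:
        raise ValueError("at least one source file is required")
    payload = {
        "source_files": [
            {
                "filename": name,
                "base64_data": base64.b64encode(data).decode("ascii"),
            }
            for name, data in files
        ]
    }
    if correlation_id is not None:
        payload["correlation_id"] = correlation_id
    if metadata is not None:
        payload["metadata"] = metadata
    if priority is not None:
        payload["priority"] = priority
    return payload


body = build_process_payload(
    [("invoice.pdf", b"%PDF-1.4\n")],
    correlation_id="my-tracking-id-001",
    metadata={"source": "email_import", "batch": "daily-invoices"},
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the `json=` argument of an HTTP client call to the one-step endpoint.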
Multi-Step API (For Advanced Workflows)
The multi-step API separates transaction creation from processing, giving you more control. This approach lets you:
- Upload large or multiple files separately
- Set a page limit for partial processing
- Attach metadata and a correlation ID for tracking
- Choose which dataset to submit to
Step 1: Create the Transaction
POST /tenants/{tenant_id}/projects/{project_id}/transactions
curl -X POST "https://app.docaifabric.com/tenants/{tenant_id}/projects/{project_id}/transactions" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{
"dataset_id": "production",
"correlation_id": "my-tracking-id-001",
"metadata": {
"source": "email_import",
"batch": "daily-invoices"
}
}'
Response:
{
"transaction_id": "550e8400-e29b-41d4-a716-446655440001",
"tenant_id": "tenant-acme",
"project_id": "project-invoices",
"status": "created",
"correlation_id": "my-tracking-id-001",
"created_at": "2025-01-15T10:35:00Z",
"files_url": "/transactions/550e.../source-files",
"start_url": "/transactions/550e.../start"
}
Create Transaction Fields
| Field | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | No | Target dataset: "production" (default), "playground", "memory", or an evaluation dataset UUID |
| correlation_id | string | No | Your own tracking identifier for this transaction |
| metadata | object | No | Arbitrary key-value metadata attached to the transaction |
| page_limit | integer | No | Process only the first N pages (full file always stored) |
Step 2: Upload Source Files
POST /transactions/{transaction_id}/source-files
This endpoint uses multipart/form-data. Upload files directly under the files field:
curl -X POST "https://app.docaifabric.com/transactions/{transaction_id}/source-files" \
-H "X-API-Key: your-api-key" \
-F "files=@invoice_001.pdf" \
-F "files=@invoice_002.pdf"
Response:
{
"transaction_id": "550e8400-e29b-41d4-a716-446655440001",
"total_files": 2,
"uploaded_count": 2,
"failed_count": 0,
"uploaded_files": [
{
"file_name": "invoice_001.pdf",
"filename": "invoice_001.pdf",
"content_type": "application/pdf",
"index": 0,
"size": 125430,
"uploaded_at": "2025-01-15T10:35:05Z"
},
{
"file_name": "invoice_002.pdf",
"filename": "invoice_002.pdf",
"content_type": "application/pdf",
"index": 1,
"size": 98210,
"uploaded_at": "2025-01-15T10:35:05Z"
}
],
"failed_files": []
}
Files are assigned indices (0-based) in the order they appear in the request. If a file with the same name already exists, it will be automatically renamed with a numeric suffix (e.g., file_1.pdf, file_2.pdf).
Upload Fields
| Field | In | Type | Required | Description |
|---|---|---|---|---|
| files | form-data | file(s) | Yes | One or more files. Repeat the field to upload several in one request. |
| upload_tokens | form-data | string(s) | No | Idempotency token per file, aligned by position with files. Re-uploading the same token is a no-op (see below). |
| file_index | query | integer | No | 0-based starting index for the files in this request. Use when uploading one file at a time to preserve ordering. |
If an upload request times out or the connection drops, you can't tell whether the server stored the file. Retrying blindly would create a duplicate copy of the file in the transaction.
To make a retry safe, generate a stable token per file (e.g. a UUID) and send it as upload_tokens. If the server already recorded that token, it returns the original index and stores nothing new, so you can retry as many times as you like:
curl -X POST "https://app.docaifabric.com/transactions/{transaction_id}/source-files" \
-H "X-API-Key: your-api-key" \
-F "files=@invoice_001.pdf" \
-F "upload_tokens=2f1c8e90-6a3d-4b1e-9c77-0a1b2c3d4e5f" \
-F "files=@invoice_002.pdf" \
-F "upload_tokens=8b7a6c54-3210-49fe-8dcb-a9f8e7d6c5b4"
Two genuinely different files keep distinct tokens, so the same-name auto-rename above still applies to them.
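One way to keep tokens stable across retries is to derive them deterministically instead of generating a random UUID each time. The sketch below is an illustration under assumptions: `upload_token` and `build_upload_parts` are hypothetical helper names, and deriving the token from the transaction ID, position, and filename is one possible scheme, not something the API prescribes. It returns lists shaped for the `files=` and `data=` arguments of the Python requests library.

```python
import uuid


def upload_token(transaction_id, position, filename):
    """Derive a stable token: the same inputs always give the same UUID."""
    name = f"{transaction_id}/{position}/{filename}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))


def build_upload_parts(transaction_id, named_files):
    """Return (files, data) lists for requests.post(files=..., data=...).

    `named_files` is a list of (filename, bytes) pairs. Tokens are emitted
    in the same order as the files so the two lists align by position.
    """
    files, data = [], []
    for position, (filename, content) in enumerate(named_files):
        files.append(("files", (filename, content)))
        data.append(("upload_tokens", upload_token(transaction_id, position, filename)))
    return files, data


TX_ID = "550e8400-e29b-41d4-a716-446655440001"
BATCH = [("invoice_001.pdf", b"%PDF-1.4\n"), ("invoice_002.pdf", b"%PDF-1.4\n")]

files, data = build_upload_parts(TX_ID, BATCH)
_, retry_data = build_upload_parts(TX_ID, BATCH)
print(data == retry_data)  # a retry reuses exactly the same tokens
```

Because the position is part of the derivation, two files with the same name in one batch still get distinct tokens, but reordering the batch between retries would change them. Note that requests serialises the `data` fields before the file parts, whereas the curl example interleaves them; that the server aligns tokens and files by per-field order in that case is an assumption worth verifying.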
For a fully retry-safe submission flow, also handle the other two steps:
- Create is not idempotent: a retried create makes a second transaction. Generate the transaction once and reuse its transaction_id on retries (correlation_id is for your own tracking only, not server-side dedup).
- Start is not idempotent, but a repeated call returns 400 "Cannot start transaction in … state". Treat that specific error as success, since the prior call already started processing.
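The rule for retried start calls can be captured in a small classifier. This is a sketch, assuming the error message appears somewhere in the response body text (the exact error body shape is not documented here); `start_outcome` is a hypothetical helper name.

```python
def start_outcome(status_code, body_text):
    """Classify a response from POST /transactions/{id}/start.

    Returns "started", "already_started", or "error". Assumes the 400
    message text is present verbatim somewhere in the response body.
    """
    if 200 <= status_code < 300:
        return "started"
    if status_code == 400 and "Cannot start transaction in" in body_text:
        # A previous attempt already moved the transaction forward.
        return "already_started"
    return "error"


print(start_outcome(200, '{"status": "processing"}'))
print(start_outcome(400, "Cannot start transaction in processing state"))
print(start_outcome(400, "Transaction has no source files"))
```

A retry wrapper would treat both "started" and "already_started" as success and surface only "error" to the caller.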
Step 3: Start Processing
POST /transactions/{transaction_id}/start
curl -X POST "https://app.docaifabric.com/transactions/{transaction_id}/start" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{}'
Response:
{
"transaction_id": "550e8400-e29b-41d4-a716-446655440001",
"status": "processing",
"message": "Processing started",
"status_url": "/transactions/550e8400-e29b-41d4-a716-446655440001",
"poll_url": "/transactions/550e8400-e29b-41d4-a716-446655440001?timeout=30"
}
By default, both the one-step and multi-step APIs create transactions in the Production dataset. To submit to Playground, Memory, or Evaluation datasets, use the dataset_id field in the Multi-Step API. See Datasets for details.
Getting Results: Poll or Webhook
Once a transaction is submitted, there are two ways to know when it's done and to fetch the results:
- Polling (covered below): call GET /transactions/{id} until status is completed, then list and download the export files. Simple, works from any client, no inbound HTTP required.
- Webhooks: DocAI Fabric POSTs a notification to a URL of your choice the moment the workflow finishes (or fails), with download links to the export files included. No polling, lower latency. Requires a publicly reachable HTTPS endpoint.
For high-volume or production integrations, webhooks are recommended. See Webhooks - Get Results Without Polling for the full guide. Every new project ships with a Notification activity already wired in at the end of its workflow; you just need to fill in the webhook URL.
The rest of this article covers the polling path.
Checking Transaction Status
After submitting a transaction, poll for completion using the transaction ID:
GET /transactions/{transaction_id}
Basic Status Check
curl "https://app.docaifabric.com/transactions/{transaction_id}" \
-H "X-API-Key: your-api-key"
Long-Polling (Recommended)
Use the timeout parameter to wait for the transaction to complete instead of polling repeatedly:
# Wait up to 30 seconds for completion
curl "https://app.docaifabric.com/transactions/{transaction_id}?timeout=30" \
-H "X-API-Key: your-api-key"
The server holds the connection and returns immediately when the status changes, or after the timeout expires.
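A long-polling loop still benefits from an overall deadline and a guard against spinning if a response comes back immediately. The sketch below keeps the HTTP call out of the loop so the logic can be exercised without a network; `wait_for_transaction`, its parameters, and the 600-second default are hypothetical choices for illustration.

```python
import time

# Statuses after which polling should stop; awaiting_input needs human action.
STOP_STATUSES = {"completed", "failed", "awaiting_input"}


def wait_for_transaction(fetch_status, max_wait_seconds=600.0, min_interval=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call fetch_status() until the transaction stops progressing.

    fetch_status is any zero-argument callable that performs one long-poll
    request and returns the decoded JSON body as a dict.
    """
    deadline = clock() + max_wait_seconds
    while True:
        started = clock()
        body = fetch_status()
        if body["status"] in STOP_STATUSES:
            return body
        if clock() >= deadline:
            raise TimeoutError(f"still {body['status']} after {max_wait_seconds}s")
        # If the server answered quickly, pause so the loop cannot spin.
        elapsed = clock() - started
        if elapsed < min_interval:
            sleep(min_interval - elapsed)


# Simulated server responses stand in for real HTTP calls here.
responses = iter([{"status": "queued"}, {"status": "processing"}, {"status": "completed"}])
result = wait_for_transaction(lambda: next(responses), min_interval=0.0)
print(result["status"])
```

In a real client, fetch_status would wrap a GET to /transactions/{transaction_id}?timeout=30 with a client-side timeout somewhat longer than 30 seconds, so the client does not give up before the server responds.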
Lightweight Polling with Field Filtering
Use the fields parameter to request only the fields you need. This reduces the response from 200+ KB to just 2-3 KB:
# Poll with minimal payload
curl "https://app.docaifabric.com/transactions/{transaction_id}?fields=status,created_at,metadata.step_progress" \
-H "X-API-Key: your-api-key"
Batch Status Check
If you're tracking multiple transactions, use the batch endpoint instead of polling each one individually:
curl -X POST "https://app.docaifabric.com/transactions/batch-status" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{
"transaction_ids": ["id-001", "id-002", "id-003"],
"fields": "status,created_at,metadata.progress"
}'
Response:
{
"statuses": {
"id-001": { "status": "completed", "created_at": "..." },
"id-002": { "status": "processing", "created_at": "..." },
"id-003": { "status": "queued", "created_at": "..." }
}
}
Up to 100 transaction IDs can be checked in a single call.
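When tracking more than 100 transactions, the ID list has to be split across several calls. A minimal sketch, assuming the request and response shapes shown above; `batch_status_bodies` and `merge_statuses` are hypothetical helper names.

```python
def batch_status_bodies(transaction_ids, fields="status", max_ids=100):
    """Yield one request body per batch of at most max_ids transaction IDs."""
    unique_ids = list(dict.fromkeys(transaction_ids))  # drop duplicates, keep order
    for start in range(0, len(unique_ids), max_ids):
        yield {
            "transaction_ids": unique_ids[start:start + max_ids],
            "fields": fields,
        }


def merge_statuses(responses):
    """Combine the "statuses" maps of several batch responses into one dict."""
    merged = {}
    for response in responses:
        merged.update(response.get("statuses", {}))
    return merged


ids = [f"id-{n:03d}" for n in range(250)]
bodies = list(batch_status_bodies(ids, fields="status,created_at"))
print([len(b["transaction_ids"]) for b in bodies])

combined = merge_statuses([
    {"statuses": {"id-001": {"status": "completed"}}},
    {"statuses": {"id-002": {"status": "processing"}}},
])
print(sorted(combined))
```

Each yielded body can be sent as the JSON payload of one POST /transactions/batch-status call.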
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| timeout | integer | Long-polling: server waits up to N seconds for a status change before responding. Recommended: 30. |
| fields | string | Comma-separated list of fields to return (supports dot notation, e.g., metadata.step_progress). Reduces payload size significantly. |
| include_request | boolean | Include the original request data in the response (only applies to completed transactions) |
Status Values
| Status | Description |
|---|---|
| created | Transaction created, awaiting files or start |
| queued | Transaction is queued and waiting for a worker to pick it up |
| processing | Actively processing (conversion → OCR → split → extraction) |
| awaiting_input | Workflow is paused waiting for external input (e.g., human review) |
| completed | All processing phases completed successfully |
| failed | Processing failed (check error_message in the response) |
Completed Response
This endpoint is for status, not data delivery. Once the transaction reaches completed, it confirms completion with timing and progress:
{
"transaction_id": "550e8400-e29b-41d4-a716-446655440000",
"tenant_id": "your-tenant-id",
"project_id": "your-project-id",
"status": "completed",
"created_at": "2025-01-15T10:35:00Z",
"started_at": "2025-01-15T10:35:01Z",
"completed_at": "2025-01-15T10:35:13Z",
"processing_time_seconds": 12.5,
"metadata": {
"progress": {
"total_documents": 1,
"completed_documents": 1,
"completion_percentage": 100.0
},
"step_progress": {
"current_step": "extract",
"current_step_name": "Extraction",
"steps": {}
}
}
}
- metadata.progress: Track overall completion with total_documents, completed_documents, and completion_percentage.
- metadata.step_progress: Track which workflow step is currently running.
- review_metadata: Present when status is awaiting_input, contains review step details like step_name, instructions, and timeout_hours.
To read the results, don't parse the status response. Use the export files. Once status is completed, fetch the structured JSON (and any PDF / XLSX) as described in Retrieving Export Results below. The export file is the stable, canonical results format; the status endpoint only tells you when the data is ready.
Retrieving Export Results
Export files are how you retrieve transaction data. A project's export profiles render results into downloadable files; new projects ship with a JSON profile by default, and you can add PDF, XLSX, CSV, or XML profiles.
Exports run automatically as the final workflow step: when a transaction reaches completed, its export files are ready (no separate API call triggers them).
Every project is created with a JSON export profile that produces a transaction-level file named {transaction_id}_data.json. This is the canonical, structured results file (see JSON Results Schema below). The profile is editable in Project Settings → Export: change its parameters to reshape the JSON, or remove it if you only need PDF/XLSX.
Step 1: List the Export Manifest
GET /transactions/{transaction_id}/exports
curl "https://app.docaifabric.com/transactions/{transaction_id}/exports" \
-H "X-API-Key: your-api-key"
Response:
{
"transaction_id": "550e8400-e29b-41d4-a716-446655440000",
"profiles": {
"invoice-json": {
"profile_id": "invoice-json",
"profile_name": "Invoice JSON",
"output_format": "json",
"scope_mode": "per_document",
"files": [
{
"filename": "INV-12345_data.json",
"path": "exports/INV-12345_data.json",
"size_bytes": 4821,
"document_ids": ["doc-uuid-001"],
"created_at": "2025-01-15T10:35:13Z"
}
]
},
"invoice-pdf": {
"profile_id": "invoice-pdf",
"profile_name": "Invoice PDF",
"output_format": "pdf",
"scope_mode": "per_document",
"files": [
{
"filename": "INV-12345.pdf",
"path": "exports/INV-12345.pdf",
"size_bytes": 128430,
"document_ids": ["doc-uuid-001"],
"created_at": "2025-01-15T10:35:13Z"
}
]
}
},
"total_files": 2,
"total_size_bytes": 133251
}
Each export profile appears under profiles, keyed by profile ID. The files array lists the generated files. If a profile fails, its entry has an empty files array and an error string. If the project has no export profiles, the response is {"profiles": {}, "total_files": 0, "total_size_bytes": 0, ...}.
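Since a failed profile shows up as an empty files array plus an error string, it helps to separate downloadable files from failures before fetching anything. A minimal sketch based on the manifest shape above; `summarize_manifest` is a hypothetical helper name, and the sample error text is invented for illustration.

```python
def summarize_manifest(manifest):
    """Split a manifest into downloadable files and failed profiles.

    Returns (files, failures): files is a list of
    (profile_id, output_format, filename) tuples, failures maps a
    profile ID to its error string.
    """
    files, failures = [], {}
    for profile_id, profile in manifest.get("profiles", {}).items():
        if profile.get("error"):
            failures[profile_id] = profile["error"]
        for entry in profile.get("files", []):
            files.append((profile_id, profile.get("output_format"), entry["filename"]))
    return files, failures


sample_manifest = {
    "transaction_id": "550e8400-e29b-41d4-a716-446655440000",
    "profiles": {
        "invoice-json": {
            "output_format": "json",
            "files": [{"filename": "INV-12345_data.json", "size_bytes": 4821}],
        },
        # Illustrative failure entry; the error wording is made up.
        "invoice-pdf": {"output_format": "pdf", "files": [], "error": "render failed"},
    },
}

files, failures = summarize_manifest(sample_manifest)
print(files)
print(failures)
```

An empty manifest (a project with no export profiles) yields two empty collections, so callers can handle that case without special checks.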
Step 2: Download a File
GET /transactions/{transaction_id}/exports/download/{filename}
Use the filename value from the manifest (it may contain subfolders, e.g. inv123/scan.pdf):
# Download the JSON export
curl "https://app.docaifabric.com/transactions/{transaction_id}/exports/download/INV-12345_data.json" \
-H "X-API-Key: your-api-key" -OJ
# Download the PDF export
curl "https://app.docaifabric.com/transactions/{transaction_id}/exports/download/INV-12345.pdf" \
-H "X-API-Key: your-api-key" -OJ
The response is the raw file bytes with a Content-Type matching the file extension (for example application/json, application/pdf, text/csv, or application/xml, and likewise for .xlsx) and a Content-Disposition: attachment header. Add ?inline=true to render in-browser instead of forcing a download.
| Parameter | Type | Description |
|---|---|---|
| inline | boolean | true serves the file inline (Content-Disposition: inline); default false forces a download |
Export endpoints require the transaction.download_results permission on your API key. The download path is read from the manifest's filename field, not path (which carries the internal exports/ prefix the endpoint adds for you).
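Because manifest filenames may contain subfolders, it is worth building the download URL and the local target path carefully. The sketch below is one way to do that; `export_download_url` and `local_target` are hypothetical helper names, and percent-encoding the filename while keeping slashes is standard URL practice rather than a documented requirement of this API.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def export_download_url(base_url, transaction_id, filename, inline=False):
    """Build the download URL from the manifest's filename (not its path)."""
    encoded = quote(filename, safe="/")  # keep subfolder separators intact
    url = f"{base_url}/transactions/{transaction_id}/exports/download/{encoded}"
    return (url + "?inline=true") if inline else url


def local_target(download_dir, filename):
    """Map a manifest filename to a path inside download_dir.

    Rejects absolute paths and ".." segments so a filename can never
    point outside the chosen directory.
    """
    relative = PurePosixPath(filename)
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"unsafe export filename: {filename!r}")
    return str(PurePosixPath(download_dir, *relative.parts))


TX_ID = "550e8400-e29b-41d4-a716-446655440000"
print(export_download_url("https://app.docaifabric.com", TX_ID, "inv 123/scan.pdf"))
print(local_target("downloads", "inv123/scan.pdf"))
```

Before writing the file, create the parent directory of the returned path (for example with os.makedirs and exist_ok=True).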
JSON Results Schema
The default JSON export ({transaction_id}_data.json) is a single transaction-level file. With the profile's default settings it looks like this:
{
"transaction_id": "550e8400-e29b-41d4-a716-446655440000",
"tenant_id": "your-tenant-id",
"project_id": "your-project-id",
"transaction": {
"exported_at": "2025-01-15T10:35:14Z",
"created_at": "2025-01-15T10:35:00Z",
"source_file_names": ["invoice.pdf"],
"project_version": 12,
"correlation_id": "my-tracking-id-001",
"source": "api",
"last_human_review": { "reviewed": false },
"metadata": { "batch": "daily-invoices" }
},
"validation": {
"overall_status": "success",
"transaction_rule_results": []
},
"documents": [
{
"document_id": "doc-uuid-001",
"document_index": 1,
"document_type": "Invoice",
"classification_confidence": 0.98,
"classification": {
"class_id": "3620f15d",
"type": "Invoice",
"confidence": 0.98,
"reasoning": "Matched layout and 'INVOICE' header.",
"alternatives": [],
"confirmed": null,
"review_required": false
},
"split": {},
"fields": {
"717dd8c1": {
"name": "Invoice Number",
"value": "INV-12345",
"confidence": 0.95,
"bounding_boxes": [
{ "coordinates": [571.0, 661.0, 1163.0, 699.0], "page_index": 0 }
],
"reasoning": "Shown top-right as 'Invoice No. INV-12345'.",
"review_required": false,
"is_confirmed": null
}
}
}
]
}
| Path | Description |
|---|---|
| transaction | Transaction metadata: timing, source files, your correlation_id and metadata, and last_human_review (the final review action, or {"reviewed": false}) |
| validation | Transaction-level business-rule results; overall_status is success, warning, or error |
| documents[] | One entry per document. fields is keyed by field ID, where each value has name (display name), value, confidence, bounding_boxes, reasoning, review_required, and is_confirmed |
| documents[].validation | Per-document rule results, present only when the document has warnings or errors |
This shape is controlled by the JSON export profile in Project Settings → Export. Common options: include_validation, include_pages, include_extraction_metadata (value-only vs. full field detail), field_ids (export a subset of fields), and scope (one file per transaction vs. one file per document). Adjusting the profile changes this file for every subsequent transaction: it is the single source of truth for your JSON results.
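Downstream code usually wants the extracted fields as flat rows rather than nested by document and field ID. A minimal sketch, assuming the default profile settings shown above (full field detail, one file per transaction); `flatten_fields` is a hypothetical helper name and the 0.9 confidence threshold is an arbitrary illustration, not a recommended value.

```python
def flatten_fields(export, min_confidence=0.9):
    """Turn a JSON export into one row per extracted field.

    A row is flagged needs_attention when the field is marked
    review_required or its confidence is missing or below min_confidence.
    """
    rows = []
    for doc in export.get("documents", []):
        for field_id, field in doc.get("fields", {}).items():
            confidence = field.get("confidence")
            low = confidence is None or confidence < min_confidence
            rows.append({
                "document_id": doc.get("document_id"),
                "document_type": doc.get("document_type"),
                "field_id": field_id,
                "name": field.get("name"),
                "value": field.get("value"),
                "confidence": confidence,
                "needs_attention": bool(field.get("review_required")) or low,
            })
    return rows


sample_export = {
    "documents": [{
        "document_id": "doc-uuid-001",
        "document_type": "Invoice",
        "fields": {
            "717dd8c1": {
                "name": "Invoice Number",
                "value": "INV-12345",
                "confidence": 0.95,
                "review_required": False,
            }
        },
    }]
}

for row in flatten_fields(sample_export):
    print(row["name"], row["value"], row["needs_attention"])
```

If the export profile is switched to value-only output or per-document scope, the field entries and file layout differ and this helper would need adjusting.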
Listing Transactions
Retrieve transactions from a specific dataset:
GET /transactions?dataset_id={dataset_id}
# List production transactions
curl "https://app.docaifabric.com/transactions?dataset_id=production&limit=20" \
-H "X-API-Key: your-api-key"
# List playground transactions
curl "https://app.docaifabric.com/transactions?dataset_id=playground&limit=20" \
-H "X-API-Key: your-api-key"
Filtering
| Parameter | Description |
|---|---|
| dataset_id | Dataset to list from (default: "production") |
| status | Filter by status: created, queued, processing, awaiting_input, completed, failed |
| limit | Maximum results to return (default: 100) |
| offset | Pagination offset |
| search | Free-text search across transaction ID, correlation ID, file names, document types, and index fields |
| sort_by | Field to sort by (e.g., created_at, status, processing_time_seconds, transaction_id, or an index field like idx_vendor) |
| sort_order | "asc" or "desc" |
| step | Filter by current workflow step name |
| date_from | ISO datetime: only return transactions created at or after this time |
| date_to | ISO datetime: only return transactions created at or before this time |
| index_filter | JSON filter on index fields, e.g., {"idx_vendor": ["Acme"]} |
| filter_expression | Advanced expression tree (JSON). Overrides status, step, date_from, date_to, and index_filter when provided |
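Several of these filters can be combined in one query. The sketch below builds the query parameters for a list call; `build_list_params` is a hypothetical helper name, and sending index_filter as a JSON-encoded string in the query is an assumption based on the table's description, so check it against the API reference.

```python
import json
from urllib.parse import urlencode


def build_list_params(dataset_id="production", status=None, limit=100, offset=0,
                      date_from=None, date_to=None, sort_by=None, sort_order=None,
                      index_filter=None):
    """Build query parameters for GET /transactions, skipping unset filters."""
    params = {"dataset_id": dataset_id, "limit": limit, "offset": offset}
    optional = {
        "status": status,
        "date_from": date_from,
        "date_to": date_to,
        "sort_by": sort_by,
        "sort_order": sort_order,
    }
    params.update({key: value for key, value in optional.items() if value is not None})
    if index_filter is not None:
        # Assumption: the JSON filter travels as a compact JSON string.
        params["index_filter"] = json.dumps(index_filter, separators=(",", ":"))
    return params


params = build_list_params(
    status="completed",
    limit=20,
    date_from="2025-01-15T00:00:00Z",
    sort_by="created_at",
    sort_order="desc",
    index_filter={"idx_vendor": ["Acme"]},
)
print(urlencode(params))
```

To page through results, call the helper again with offset increased by limit until a page comes back with fewer entries than requested.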
Complete Example: Python
End-to-end example using Python to submit a document and retrieve results:
import base64
import json
import os

import requests

BASE_URL = "https://app.docaifabric.com"
API_KEY = "your-api-key"
TENANT_ID = "your-tenant-id"
PROJECT_ID = "your-project-id"
headers = {"X-API-Key": API_KEY}

# 1. Read and encode the file
with open("invoice.pdf", "rb") as f:
    file_data = base64.b64encode(f.read()).decode("utf-8")

# 2. Submit for processing (one-step)
response = requests.post(
    f"{BASE_URL}/tenants/{TENANT_ID}/projects/{PROJECT_ID}/transactions/process",
    headers={**headers, "Content-Type": "application/json"},
    json={
        "source_files": [
            {"filename": "invoice.pdf", "base64_data": file_data}
        ]
    }
)
response.raise_for_status()  # stop early on HTTP errors
transaction_id = response.json()["transaction_id"]
print(f"Transaction created: {transaction_id}")

# 3. Poll for completion (status only, lightweight)
while True:
    status = requests.get(
        f"{BASE_URL}/transactions/{transaction_id}",
        headers=headers,
        params={"timeout": 30, "fields": "status,metadata.progress"}
    ).json()
    print(f"Status: {status['status']}")
    # awaiting_input is not final, but nothing changes until someone acts
    if status["status"] in ("completed", "failed", "awaiting_input"):
        break

# 4. Retrieve results from the export files
if status["status"] == "completed":
    manifest = requests.get(
        f"{BASE_URL}/transactions/{transaction_id}/exports",
        headers=headers
    ).json()
    for profile in manifest.get("profiles", {}).values():
        for file in profile.get("files", []):
            content = requests.get(
                f"{BASE_URL}/transactions/{transaction_id}"
                f"/exports/download/{file['filename']}",
                headers=headers
            ).content
            # Filenames may contain subfolders (e.g. inv123/scan.pdf)
            folder = os.path.dirname(file["filename"])
            if folder:
                os.makedirs(folder, exist_ok=True)
            with open(file["filename"], "wb") as out:
                out.write(content)
            print(f"Saved {file['filename']} ({file['size_bytes']} bytes)")

    # Parse the default JSON export to read extracted fields
    json_file = next(
        (f["filename"]
         for p in manifest.get("profiles", {}).values()
         if p.get("output_format") == "json"
         for f in p.get("files", [])),
        None
    )
    if json_file:
        with open(json_file) as f:
            data = json.load(f)
        for doc in data["documents"]:
            print(f"\nDocument: {doc.get('document_type', 'Unknown')}")
            for field_id, field in doc["fields"].items():
                print(f"  {field['name']}: {field['value']} "
                      f"(confidence: {field['confidence']})")
Complete Example: Multi-Step with Dataset Selection
Submit a document to the Playground dataset using the multi-step API:
import requests

BASE_URL = "https://app.docaifabric.com"
API_KEY = "your-api-key"
TENANT_ID = "your-tenant-id"
PROJECT_ID = "your-project-id"
headers = {"X-API-Key": API_KEY}

# 1. Create transaction in the Playground dataset
response = requests.post(
    f"{BASE_URL}/tenants/{TENANT_ID}/projects/{PROJECT_ID}/transactions",
    headers={**headers, "Content-Type": "application/json"},
    json={
        "dataset_id": "playground",
        "correlation_id": "test-run-001"
    }
)
response.raise_for_status()  # stop early on HTTP errors
tx = response.json()
transaction_id = tx["transaction_id"]
print(f"Created transaction: {transaction_id}")

# 2. Upload files (multipart/form-data)
# Do not set Content-Type here; requests adds the multipart boundary itself
with open("invoice.pdf", "rb") as f:
    requests.post(
        f"{BASE_URL}/transactions/{transaction_id}/source-files",
        headers=headers,
        files=[("files", ("invoice.pdf", f, "application/pdf"))]
    ).raise_for_status()

# 3. Start processing
requests.post(
    f"{BASE_URL}/transactions/{transaction_id}/start",
    headers={**headers, "Content-Type": "application/json"},
    json={}
).raise_for_status()
print(f"Processing started. Poll status at: GET /transactions/{transaction_id}?timeout=30")
Tips
- Use correlation IDs: Set correlation_id to your internal document ID for easy cross-referencing.
- Make uploads retry-safe: Send a stable upload_tokens value per file so a retry after a network drop doesn't duplicate the file. See Step 2.
- Poll for status, not data: GET /transactions/{id} tells you when a transaction is done; read the actual results from the export files. Add ?timeout=30 for long-polling and ?fields=status,metadata.progress to keep poll responses small.
- Start with Playground: Test your integration against the Playground dataset before switching to Production.
- Page limits for testing: Set page_limit to process only the first few pages of large documents during development.
- Batch status checks: When tracking many transactions, use POST /transactions/batch-status instead of polling each one individually.
- Exports are ready at completed: Export files are produced by the final workflow step, so list them with GET /transactions/{id}/exports as soon as the status is completed, with no separate trigger call needed.
- Skip polling with webhooks: If your receiver is publicly reachable, configure a Notification activity in the project workflow and DocAI Fabric will POST the results manifest to you the moment the transaction completes. See Webhooks.