Process Documents via API

This guide covers the full document processing lifecycle: uploading documents, starting processing, polling for completion, and retrieving results from export files. You'll learn how to use both the one-step and multi-step workflows, and how to target different datasets.

Prerequisites

- An API key with access to your tenant (see Authentication)
- A project with workflow settings configured (extraction fields, document classes, etc.). See the Quick Start guide to set up your first project.
- Your tenant ID and project ID (you can find these in the UI):
  - Tenant ID: Click your user icon in the sidebar and look under Current Tenant
  - Project ID: Go to Library, find your project, and look in Project Properties

One-Step API (Recommended for Simple Use Cases)

Submit documents and start processing in a single call. This is the easiest way to process documents.

POST /tenants/{tenant_id}/projects/{project_id}/transactions/process

Example: Process an Invoice

curl -X POST "https://app.docaifabric.com/tenants/{tenant_id}/projects/{project_id}/transactions/process" \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "source_files": [
      {
        "filename": "invoice.pdf",
        "base64_data": "JVBERi0xLjQK..."
      }
    ],
    "correlation_id": "my-tracking-id-001",
    "metadata": {
      "source": "email_import",
      "batch": "daily-invoices"
    }
  }'

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing",
  "estimated_completion_seconds": 30,
  "status_url": "/transactions/550e8400-e29b-41d4-a716-446655440000",
  "poll_url": "/transactions/550e8400-e29b-41d4-a716-446655440000?timeout=30"
}

One-Step API Fields

Field	Type	Required	Description
`source_files`	array	Yes*	List of files with `filename` and `base64_data`
`correlation_id`	string	No	Your own tracking identifier for this transaction
`metadata`	object	No	Arbitrary key-value metadata attached to the transaction
`priority`	integer	No	Processing priority (higher = more urgent, default: `0`)

* At least one of source_files, documentText, or ocrData is required.
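
As a rough sketch, the request body above could be assembled in Python like this. `build_process_payload` is a hypothetical helper name (not part of any SDK), and it only covers the `source_files` input path. Optional fields are omitted unless set, so the server defaults apply.

```python
import base64


def build_process_payload(files, correlation_id=None, metadata=None, priority=None):
    """Build the JSON body for POST .../transactions/process.

    `files` maps a filename to its raw bytes (hypothetical helper signature).
    """
    if not files:
        raise ValueError("at least one file is required")
    payload = {
        "source_files": [
            {"filename": name, "base64_data": base64.b64encode(data).decode("ascii")}
            for name, data in files.items()
        ]
    }
    # Only send optional fields the caller actually set.
    optional = {
        "correlation_id": correlation_id,
        "metadata": metadata,
        "priority": priority,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


payload = build_process_payload(
    {"invoice.pdf": b"%PDF-1.4\n"},
    correlation_id="my-tracking-id-001",
)
print(payload["source_files"][0]["base64_data"])
```

The resulting dictionary can be passed as the `json=` argument of an HTTP client such as `requests.post`.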

Multi-Step API (For Advanced Workflows)

The multi-step API separates transaction creation from processing, giving you more control. This approach lets you:

- Upload large or multiple files separately
- Set a page limit for partial processing
- Attach metadata and a correlation ID for tracking
- Choose which dataset to submit to

Step 1: Create the Transaction

POST /tenants/{tenant_id}/projects/{project_id}/transactions

curl -X POST "https://app.docaifabric.com/tenants/{tenant_id}/projects/{project_id}/transactions" \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "production",
    "correlation_id": "my-tracking-id-001",
    "metadata": {
      "source": "email_import",
      "batch": "daily-invoices"
    }
  }'

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440001",
  "tenant_id": "tenant-acme",
  "project_id": "project-invoices",
  "status": "created",
  "correlation_id": "my-tracking-id-001",
  "created_at": "2025-01-15T10:35:00Z",
  "files_url": "/transactions/550e.../source-files",
  "start_url": "/transactions/550e.../start"
}

Create Transaction Fields

Field	Type	Required	Description
`dataset_id`	string	No	Target dataset: `"production"` (default), `"playground"`, `"memory"`, or an evaluation dataset UUID
`correlation_id`	string	No	Your own tracking identifier for this transaction
`metadata`	object	No	Arbitrary key-value metadata attached to the transaction
`page_limit`	integer	No	Process only the first N pages (full file always stored)
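
A minimal sketch of building the create-transaction body with some client-side checks, based on the field table above. `build_create_body` is a hypothetical helper, and the rule that `page_limit` must be at least 1 is an assumption rather than documented behavior.

```python
import uuid

NAMED_DATASETS = {"production", "playground", "memory"}


def build_create_body(dataset_id="production", correlation_id=None,
                      metadata=None, page_limit=None):
    """Build the JSON body for POST .../transactions (hypothetical helper)."""
    if dataset_id not in NAMED_DATASETS:
        # Anything else is expected to be an evaluation dataset UUID.
        uuid.UUID(dataset_id)  # raises ValueError when the string is malformed
    if page_limit is not None and page_limit < 1:
        # Assumption: a page limit below 1 is never meaningful.
        raise ValueError("page_limit must be a positive integer")
    body = {"dataset_id": dataset_id}
    optional = {
        "correlation_id": correlation_id,
        "metadata": metadata,
        "page_limit": page_limit,
    }
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


print(build_create_body("playground", correlation_id="test-run-001", page_limit=3))
```

Validating `dataset_id` before sending catches typos such as `"prod"` locally instead of relying on a server error.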

Step 2: Upload Source Files

POST /transactions/{transaction_id}/source-files

This endpoint accepts multipart/form-data. Upload files directly under the files field:

curl -X POST "https://app.docaifabric.com/transactions/{transaction_id}/source-files" \
  -H "X-API-Key: your-api-key" \
  -F "files=@invoice_001.pdf" \
  -F "files=@invoice_002.pdf"

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440001",
  "total_files": 2,
  "uploaded_count": 2,
  "failed_count": 0,
  "uploaded_files": [
    {
      "file_name": "invoice_001.pdf",
      "filename": "invoice_001.pdf",
      "content_type": "application/pdf",
      "index": 0,
      "size": 125430,
      "uploaded_at": "2025-01-15T10:35:05Z"
    },
    {
      "file_name": "invoice_002.pdf",
      "filename": "invoice_002.pdf",
      "content_type": "application/pdf",
      "index": 1,
      "size": 98210,
      "uploaded_at": "2025-01-15T10:35:05Z"
    }
  ],
  "failed_files": []
}

Info: Files are assigned indices (0-based) in the order they appear in the request. If a file with the same name already exists, it will be automatically renamed with a numeric suffix (e.g., file_1.pdf, file_2.pdf).

Upload Fields

Field	In	Type	Required	Description
`files`	form-data	file(s)	Yes	One or more files. Repeat the field to upload several in one request.
`upload_tokens`	form-data	string(s)	No	Idempotency token per file, aligned by position with `files`. Re-uploading the same token is a no-op (see below).
`file_index`	query	integer	No	0-based starting index for the files in this request. Use when uploading one file at a time to preserve ordering.
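
When uploading one file per request, the `file_index` query parameter from the table above keeps the original order. The sketch below only plans the requests; `one_at_a_time` is a hypothetical helper name, and sending each request is left to your HTTP client.

```python
def one_at_a_time(filenames, first_index=0):
    """Yield (query_params, filename) pairs, one per single-file upload request."""
    for offset, name in enumerate(filenames):
        yield {"file_index": first_index + offset}, name


plan = list(one_at_a_time(["invoice_001.pdf", "invoice_002.pdf"]))
for params, name in plan:
    print(params, name)
```

Each pair maps to one POST to the source-files endpoint, with `params` sent as the query string and the named file sent under the `files` form field.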

Idempotent retries

If an upload request times out or the connection drops, you can't tell whether the server stored the file. Retrying blindly would create a duplicate copy of the file in the transaction.

To make a retry safe, generate a stable token per file (e.g. a UUID) and send it as upload_tokens. If the server already recorded that token, it returns the original index and stores nothing new, so you can retry as many times as you like:

curl -X POST "https://app.docaifabric.com/transactions/{transaction_id}/source-files" \
  -H "X-API-Key: your-api-key" \
  -F "files=@invoice_001.pdf" \
  -F "upload_tokens=2f1c8e90-6a3d-4b1e-9c77-0a1b2c3d4e5f" \
  -F "files=@invoice_002.pdf" \
  -F "upload_tokens=8b7a6c54-3210-49fe-8dcb-a9f8e7d6c5b4"

Two genuinely different files keep distinct tokens, so the same-name auto-rename above still applies to them.
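
One way to keep tokens stable across retries is to generate each token once, when the upload is first queued, and reuse the same object on every attempt. The sketch below builds arguments in the shape the `requests` library accepts for `files=` and `data=`. `PendingUpload` and `multipart_args` are hypothetical names. Note that `requests` writes all `data` fields before the file parts, so this assumes the server aligns tokens with files by their order within each field, not by interleaving as in the curl example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PendingUpload:
    filename: str
    content: bytes
    # Created once per file and reused unchanged on every retry.
    token: str = field(default_factory=lambda: str(uuid.uuid4()))


def multipart_args(uploads):
    """Return (files, data) lists with tokens in the same order as the files."""
    files = [("files", (upload.filename, upload.content)) for upload in uploads]
    data = [("upload_tokens", upload.token) for upload in uploads]
    return files, data


queue = [
    PendingUpload("invoice_001.pdf", b"first file bytes"),
    PendingUpload("invoice_002.pdf", b"second file bytes"),
]
first_attempt = multipart_args(queue)
retry_attempt = multipart_args(queue)  # same tokens, so a retry is a no-op server-side
```

With `requests`, the pair would be sent as `requests.post(url, headers=headers, files=files, data=data)`.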

For a fully retry-safe submission flow, also handle the other two steps:

- Create is not idempotent: a retried create makes a second transaction. Generate the transaction once and reuse its transaction_id on retries (correlation_id is for your own tracking only, not server-side dedup).
- Start is not idempotent, but a repeated call returns 400 "Cannot start transaction in … state". Treat that specific error as success, since the prior call already started processing.
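
The rule for retrying the start call can be sketched as a small predicate. `start_succeeded` is a hypothetical helper, and matching on the message prefix assumes the error text appears verbatim somewhere in the response body.

```python
ALREADY_STARTED_MARKER = "Cannot start transaction in"


def start_succeeded(status_code, body_text):
    """Decide whether a (possibly retried) start call can be treated as done."""
    if 200 <= status_code < 300:
        return True
    # A 400 with this message means an earlier attempt already started processing.
    return status_code == 400 and ALREADY_STARTED_MARKER in body_text


print(start_succeeded(200, '{"status": "processing"}'))
print(start_succeeded(400, '{"detail": "Cannot start transaction in processing state"}'))
print(start_succeeded(400, '{"detail": "No source files uploaded"}'))
```

Any other 4xx or 5xx response still counts as a failure, so unrelated errors are not silently swallowed. The third message in the example is invented purely for illustration.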

Step 3: Start Processing

POST /transactions/{transaction_id}/start

curl -X POST "https://app.docaifabric.com/transactions/{transaction_id}/start" \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{}'

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440001",
  "status": "processing",
  "message": "Processing started",
  "status_url": "/transactions/550e8400-e29b-41d4-a716-446655440001",
  "poll_url": "/transactions/550e8400-e29b-41d4-a716-446655440001?timeout=30"
}

Datasets

By default, both the one-step and multi-step APIs create transactions in the Production dataset. To submit to Playground, Memory, or Evaluation datasets, use the dataset_id field in the Multi-Step API. See Datasets for details.

Getting Results: Poll or Webhook

Once a transaction is submitted, there are two ways to know when it's done and to fetch the results:

- Polling (covered below): call GET /transactions/{id} until status is completed, then list and download the export files. Simple, works from any client, no inbound HTTP required.
- Webhooks: DocAI Fabric POSTs a notification to a URL of your choice the moment the workflow finishes (or fails), with download links to the export files included. No polling, lower latency. Requires a publicly reachable HTTPS endpoint.

For high-volume or production integrations, webhooks are recommended. See Webhooks - Get Results Without Polling for the full guide. Every new project ships with a Notification activity already wired in at the end of its workflow; you just need to fill in the webhook URL.

The rest of this article covers the polling path.

Checking Transaction Status

After submitting a transaction, poll for completion using the transaction ID:

GET /transactions/{transaction_id}

Basic Status Check

curl "https://app.docaifabric.com/transactions/{transaction_id}" \
  -H "X-API-Key: your-api-key"

Long-Polling (Recommended)

Use the timeout parameter to wait for the transaction to complete instead of polling repeatedly:

# Wait up to 30 seconds for completion
curl "https://app.docaifabric.com/transactions/{transaction_id}?timeout=30" \
  -H "X-API-Key: your-api-key"

The server holds the connection and returns immediately when the status changes, or after the timeout expires.
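
A long-polling loop usually also needs an overall deadline so a stuck transaction does not block the client forever. The sketch below takes the HTTP call as a parameter, which keeps it independent of any client library. `wait_for_transaction`, `fetch_status`, and the 600-second default are all assumptions for illustration.

```python
import time

# Statuses at which polling should stop (awaiting_input needs external action).
STOP_STATUSES = {"completed", "failed", "awaiting_input"}


def wait_for_transaction(fetch_status, max_wait_seconds=600,
                         long_poll_seconds=30, clock=time.monotonic):
    """Long-poll until the transaction stops progressing or the deadline passes.

    `fetch_status(timeout)` is supplied by the caller: it should perform
    GET /transactions/{id}?timeout=<timeout> and return the parsed JSON body.
    """
    deadline = clock() + max_wait_seconds
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("transaction did not finish before the deadline")
        # Never ask the server to hold the connection past our own deadline.
        timeout = max(1, min(long_poll_seconds, int(remaining)))
        body = fetch_status(timeout)
        if body["status"] in STOP_STATUSES:
            return body
```

In practice `fetch_status` would wrap your HTTP client, with a client-side socket timeout slightly longer than the long-poll value so the connection is not cut before the server answers.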

Lightweight Polling with Field Filtering

Use the fields parameter to request only the fields you need. For a full transaction record, this can shrink the response from 200+ KB to 2-3 KB:

# Poll with minimal payload
curl "https://app.docaifabric.com/transactions/{transaction_id}?fields=status,created_at,metadata.step_progress" \
  -H "X-API-Key: your-api-key"

Batch Status Check

If you're tracking multiple transactions, use the batch endpoint instead of polling each one individually:

curl -X POST "https://app.docaifabric.com/transactions/batch-status" \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "transaction_ids": ["id-001", "id-002", "id-003"],
    "fields": "status,created_at,metadata.progress"
  }'

Response:

{
  "statuses": {
    "id-001": { "status": "completed", "created_at": "..." },
    "id-002": { "status": "processing", "created_at": "..." },
    "id-003": { "status": "queued", "created_at": "..." }
  }
}

Up to 100 transaction IDs can be checked in a single call.
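
When tracking more than 100 transactions, the IDs have to be split across several batch calls. A minimal sketch, with hypothetical helper names, that prepares the request bodies and merges the responses:

```python
MAX_IDS_PER_CALL = 100


def batch_status_bodies(transaction_ids, fields="status"):
    """Split IDs into request bodies of at most 100 IDs each."""
    ids = list(transaction_ids)
    return [
        {"transaction_ids": ids[start:start + MAX_IDS_PER_CALL], "fields": fields}
        for start in range(0, len(ids), MAX_IDS_PER_CALL)
    ]


def merge_statuses(responses):
    """Combine the `statuses` maps from several batch-status responses."""
    merged = {}
    for response in responses:
        merged.update(response.get("statuses", {}))
    return merged


bodies = batch_status_bodies([f"id-{n:03d}" for n in range(250)])
print([len(body["transaction_ids"]) for body in bodies])
```

Each body is posted to `/transactions/batch-status`, and `merge_statuses` then gives a single ID-to-status map.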

Query Parameters

Parameter	Type	Description
`timeout`	integer	Long-polling: server waits up to N seconds for a status change before responding. Recommended: `30`.
`fields`	string	Comma-separated list of fields to return (supports dot notation, e.g., `metadata.step_progress`). Reduces payload size significantly.
`include_request`	boolean	Include the original request data in the response (only applies to completed transactions)

Status Values

Status	Description
`created`	Transaction created, awaiting files or start
`queued`	Transaction is queued and waiting for a worker to pick it up
`processing`	Actively processing (conversion → OCR → split → extraction)
`awaiting_input`	Workflow is paused waiting for external input (e.g., human review)
`completed`	All processing phases completed successfully
`failed`	Processing failed (check `error_message` in the response)

Completed Response

This endpoint is for status, not data delivery. Once the transaction reaches completed, it confirms completion with timing and progress:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440000",
  "tenant_id": "your-tenant-id",
  "project_id": "your-project-id",
  "status": "completed",
  "created_at": "2025-01-15T10:35:00Z",
  "started_at": "2025-01-15T10:35:01Z",
  "completed_at": "2025-01-15T10:35:13Z",
  "processing_time_seconds": 12.5,
  "metadata": {
    "progress": {
      "total_documents": 1,
      "completed_documents": 1,
      "completion_percentage": 100.0
    },
    "step_progress": {
      "current_step": "extract",
      "current_step_name": "Extraction",
      "steps": {}
    }
  }
}

Key fields to know

- metadata.progress: Track overall completion with total_documents, completed_documents, and completion_percentage.
- metadata.step_progress: Track which workflow step is currently running.
- review_metadata: Present when status is awaiting_input, contains review step details like step_name, instructions, and timeout_hours.
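
The fields above can be folded into a one-line progress summary for logs. This is a sketch only: `summarize_status` is a hypothetical helper, it assumes `review_metadata` sits at the top level of the response, and it tolerates missing keys because a `fields` filter may strip them.

```python
def summarize_status(body):
    """Render a status response as a short, log-friendly string."""
    metadata = body.get("metadata") or {}
    progress = metadata.get("progress") or {}
    step = metadata.get("step_progress") or {}
    parts = [body.get("status", "unknown")]
    if progress:
        parts.append(
            f"{progress.get('completed_documents', 0)}/"
            f"{progress.get('total_documents', 0)} documents "
            f"({progress.get('completion_percentage', 0.0):.0f}%)"
        )
    if step.get("current_step_name"):
        parts.append(f"step: {step['current_step_name']}")
    review = body.get("review_metadata") or {}
    if review.get("step_name"):
        parts.append(f"review: {review['step_name']}")
    return " | ".join(parts)


completed = {
    "status": "completed",
    "metadata": {
        "progress": {"total_documents": 1, "completed_documents": 1,
                     "completion_percentage": 100.0},
        "step_progress": {"current_step": "extract",
                          "current_step_name": "Extraction", "steps": {}},
    },
}
print(summarize_status(completed))
```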

Retrieving the extracted data

To read the results, don't parse the status response. Use the export files. Once status is completed, fetch the structured JSON (and any PDF / XLSX) as described in Retrieving Export Results below. The export file is the stable, canonical results format; the status endpoint only tells you when the data is ready.

Retrieving Export Results

Export files are how you retrieve transaction data. A project's export profiles render results into downloadable files; new projects ship with a JSON profile by default, and you can add PDF, XLSX, CSV, or XML profiles.

Exports run automatically as the final workflow step: when a transaction reaches completed, its export files are ready (no separate API call triggers them).

The default JSON file

Every project is created with a JSON export profile that produces a transaction-level file named {transaction_id}_data.json. This is the canonical, structured results file (see JSON Results Schema below). The profile is editable in Project Settings → Export: change its parameters to reshape the JSON, or remove it if you only need PDF/XLSX.

Step 1: List the Export Manifest

GET /transactions/{transaction_id}/exports

curl "https://app.docaifabric.com/transactions/{transaction_id}/exports" \
  -H "X-API-Key: your-api-key"

Response:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440000",
  "profiles": {
    "invoice-json": {
      "profile_id": "invoice-json",
      "profile_name": "Invoice JSON",
      "output_format": "json",
      "scope_mode": "per_document",
      "files": [
        {
          "filename": "INV-12345_data.json",
          "path": "exports/INV-12345_data.json",
          "size_bytes": 4821,
          "document_ids": ["doc-uuid-001"],
          "created_at": "2025-01-15T10:35:13Z"
        }
      ]
    },
    "invoice-pdf": {
      "profile_id": "invoice-pdf",
      "profile_name": "Invoice PDF",
      "output_format": "pdf",
      "scope_mode": "per_document",
      "files": [
        {
          "filename": "INV-12345.pdf",
          "path": "exports/INV-12345.pdf",
          "size_bytes": 128430,
          "document_ids": ["doc-uuid-001"],
          "created_at": "2025-01-15T10:35:13Z"
        }
      ]
    }
  },
  "total_files": 2,
  "total_size_bytes": 133251
}

Each export profile appears under profiles, keyed by profile ID. The files array lists the generated files. If a profile fails, its entry has an empty files array and an error string. If the project has no export profiles, the response is {"profiles": {}, "total_files": 0, "total_size_bytes": 0, ...}.
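
A small sketch of walking the manifest: it collects the downloadable files and separately records profiles that failed. The helper name `split_manifest` is hypothetical, and the key name `error` for the failure string is an assumption, since the text above only says the entry carries an error string.

```python
def split_manifest(manifest):
    """Return (files, errors) from an export manifest response."""
    files, errors = [], {}
    for profile_id, profile in manifest.get("profiles", {}).items():
        if profile.get("error"):  # assumed key name for the failure message
            errors[profile_id] = profile["error"]
        for entry in profile.get("files", []):
            files.append((profile_id, profile.get("output_format"), entry["filename"]))
    return files, errors


manifest = {
    "profiles": {
        "invoice-json": {"output_format": "json",
                         "files": [{"filename": "INV-12345_data.json"}]},
        "invoice-pdf": {"output_format": "pdf", "files": [],
                        "error": "render failed"},
    },
    "total_files": 1,
}
files, errors = split_manifest(manifest)
print(files)
print(errors)
```

Checking `errors` before downloading makes a missing PDF or XLSX visible, instead of it silently not appearing in the output folder.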

Step 2: Download a File

GET /transactions/{transaction_id}/exports/download/{filename}

Use the filename value from the manifest (it may contain subfolders, e.g. inv123/scan.pdf):

# Download the JSON export
curl "https://app.docaifabric.com/transactions/{transaction_id}/exports/download/INV-12345_data.json" \
  -H "X-API-Key: your-api-key" -OJ

# Download the PDF export
curl "https://app.docaifabric.com/transactions/{transaction_id}/exports/download/INV-12345.pdf" \
  -H "X-API-Key: your-api-key" -OJ

The response is the raw file bytes with a Content-Type matching the file extension (for example application/json, application/pdf, text/csv, or application/xml, and the corresponding spreadsheet type for .xlsx) and a Content-Disposition: attachment header. Add ?inline=true to render in-browser instead of forcing a download.

Parameter	Type	Description
`inline`	boolean	`true` serves the file inline (`Content-Disposition: inline`); default `false` forces a download

Note: Export endpoints require the transaction.download_results permission on your API key. The download path is read from the manifest's filename field, not path (which carries the internal exports/ prefix the endpoint adds for you).
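
Because `filename` may contain subfolders, it helps to build the download URL and the local save path deliberately. The sketch below uses only the standard library. Both helper names are hypothetical, and percent-encoding the filename while leaving `/` intact is an assumption about what the endpoint accepts.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def download_url(base_url, transaction_id, filename, inline=False):
    """Build the export download URL from the manifest's `filename` value."""
    encoded = quote(filename, safe="/")  # keep subfolder separators as-is
    url = f"{base_url.rstrip('/')}/transactions/{transaction_id}/exports/download/{encoded}"
    if inline:
        url += "?inline=true"
    return url


def safe_relative_path(filename):
    """Reject names that would escape the local output folder."""
    path = PurePosixPath(filename)
    if path.is_absolute() or ".." in path.parts:
        raise ValueError(f"unsafe export filename: {filename!r}")
    return path


print(download_url("https://app.docaifabric.com", "tx-1", "inv123/scan.pdf"))
print(safe_relative_path("inv123/scan.pdf").parent)
```

Before writing, create the parent folder (for example with `pathlib.Path.mkdir(parents=True, exist_ok=True)`) so nested filenames save correctly.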

JSON Results Schema

The default JSON export ({transaction_id}_data.json) is a single transaction-level file. With the profile's default settings it looks like this:

{
  "transaction_id": "550e8400-e29b-41d4-a716-446655440000",
  "tenant_id": "your-tenant-id",
  "project_id": "your-project-id",
  "transaction": {
    "exported_at": "2025-01-15T10:35:14Z",
    "created_at": "2025-01-15T10:35:00Z",
    "source_file_names": ["invoice.pdf"],
    "project_version": 12,
    "correlation_id": "my-tracking-id-001",
    "source": "api",
    "last_human_review": { "reviewed": false },
    "metadata": { "batch": "daily-invoices" }
  },
  "validation": {
    "overall_status": "success",
    "transaction_rule_results": []
  },
  "documents": [
    {
      "document_id": "doc-uuid-001",
      "document_index": 1,
      "document_type": "Invoice",
      "classification_confidence": 0.98,
      "classification": {
        "class_id": "3620f15d",
        "type": "Invoice",
        "confidence": 0.98,
        "reasoning": "Matched layout and 'INVOICE' header.",
        "alternatives": [],
        "confirmed": null,
        "review_required": false
      },
      "split": {},
      "fields": {
        "717dd8c1": {
          "name": "Invoice Number",
          "value": "INV-12345",
          "confidence": 0.95,
          "bounding_boxes": [
            { "coordinates": [571.0, 661.0, 1163.0, 699.0], "page_index": 0 }
          ],
          "reasoning": "Shown top-right as 'Invoice No. INV-12345'.",
          "review_required": false,
          "is_confirmed": null
        }
      }
    }
  ]
}

Path	Description
`transaction`	Transaction metadata: timing, source files, your `correlation_id` and `metadata`, and `last_human_review` (the final review action, or `{"reviewed": false}`)
`validation`	Transaction-level business-rule results; `overall_status` is `success`, `warning`, or `error`
`documents[]`	One entry per document. `fields` is keyed by field ID, where each value has `name` (display name), `value`, `confidence`, `bounding_boxes`, `reasoning`, `review_required`, and `is_confirmed`
`documents[].validation`	Per-document rule results, present only when the document has warnings or errors

This shape is controlled by the JSON export profile in Project Settings → Export. Common options: include_validation, include_pages, include_extraction_metadata (value-only vs. full field detail), field_ids (export a subset of fields), and scope (one file per transaction vs. one file per document). Adjusting the profile changes this file for every subsequent transaction: it is the single source of truth for your JSON results.
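
As an illustration of reading this file, the sketch below flattens the per-document `fields` maps into rows and picks out values that may deserve a second look. It assumes the default full-detail shape shown above (not a value-only profile). The helper names and the 0.9 confidence threshold are hypothetical choices, not product defaults.

```python
def extracted_fields(export):
    """Flatten documents[].fields into one row per extracted field."""
    rows = []
    for doc in export.get("documents", []):
        for field_id, field in doc.get("fields", {}).items():
            rows.append({
                "document_id": doc.get("document_id"),
                "field_id": field_id,
                "name": field.get("name"),
                "value": field.get("value"),
                "confidence": field.get("confidence"),
                "review_required": bool(field.get("review_required")),
            })
    return rows


def needs_attention(rows, min_confidence=0.9):
    """Rows flagged for review or scored below the chosen threshold."""
    return [
        row for row in rows
        if row["review_required"]
        or (row["confidence"] is not None and row["confidence"] < min_confidence)
    ]


export = {
    "documents": [{
        "document_id": "doc-uuid-001",
        "fields": {
            "717dd8c1": {"name": "Invoice Number", "value": "INV-12345",
                         "confidence": 0.95, "review_required": False},
            "a1b2c3d4": {"name": "Total", "value": "118.00",
                         "confidence": 0.62, "review_required": False},
        },
    }]
}
rows = extracted_fields(export)
print([row["name"] for row in needs_attention(rows)])
```

The second field (`a1b2c3d4`, "Total") is made-up sample data, added only to show the low-confidence path.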

Listing Transactions

Retrieve transactions from a specific dataset:

GET /transactions?dataset_id={dataset_id}

# List production transactions
curl "https://app.docaifabric.com/transactions?dataset_id=production&limit=20" \
  -H "X-API-Key: your-api-key"

# List playground transactions
curl "https://app.docaifabric.com/transactions?dataset_id=playground&limit=20" \
  -H "X-API-Key: your-api-key"

Filtering

Parameter	Description
`dataset_id`	Dataset to list from (default: `"production"`)
`status`	Filter by status: `created`, `queued`, `processing`, `awaiting_input`, `completed`, `failed`
`limit`	Maximum results to return (default: 100)
`offset`	Pagination offset
`search`	Free-text search across transaction ID, correlation ID, file names, document types, and index fields
`sort_by`	Field to sort by (e.g., `created_at`, `status`, `processing_time_seconds`, `transaction_id`, or an index field like `idx_vendor`)
`sort_order`	`"asc"` or `"desc"`
`step`	Filter by current workflow step name
`date_from`	ISO datetime: only return transactions created at or after this time
`date_to`	ISO datetime: only return transactions created at or before this time
`index_filter`	JSON filter on index fields, e.g., `{"idx_vendor": ["Acme"]}`
`filter_expression`	Advanced expression tree (JSON). Overrides `status`, `step`, `date_from`, `date_to`, and `index_filter` when provided
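
A sketch of assembling the query parameters for a filtered, paginated listing. It assumes `index_filter` is sent as a JSON-encoded string in the query string, which the table implies but does not state. `list_params` and `page_offsets` are hypothetical helper names.

```python
import json


def list_params(dataset_id="production", status=None, limit=100, offset=0,
                date_from=None, date_to=None, index_filter=None):
    """Build query parameters for GET /transactions."""
    params = {"dataset_id": dataset_id, "limit": limit, "offset": offset}
    optional = {"status": status, "date_from": date_from, "date_to": date_to}
    params.update({key: value for key, value in optional.items() if value is not None})
    if index_filter:
        # Assumption: the JSON filter travels as a compact JSON string.
        params["index_filter"] = json.dumps(index_filter, separators=(",", ":"))
    return params


def page_offsets(total, limit):
    """Offsets needed to walk `total` results in pages of `limit`."""
    return list(range(0, total, limit))


params = list_params(status="completed", limit=20,
                     index_filter={"idx_vendor": ["Acme"]})
print(params)
print(page_offsets(45, 20))
```

An HTTP client can take `params` directly (for example `requests.get(url, params=params)`), which also handles URL encoding of the JSON string.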

Complete Example: Python

End-to-end example using Python to submit a document and retrieve results:

import base64
import json
from pathlib import Path

import requests

BASE_URL = "https://app.docaifabric.com"
API_KEY = "your-api-key"
TENANT_ID = "your-tenant-id"
PROJECT_ID = "your-project-id"

headers = {"X-API-Key": API_KEY}

# 1. Read and encode the file
with open("invoice.pdf", "rb") as f:
    file_data = base64.b64encode(f.read()).decode("utf-8")

# 2. Submit for processing (one-step)
response = requests.post(
    f"{BASE_URL}/tenants/{TENANT_ID}/projects/{PROJECT_ID}/transactions/process",
    headers={**headers, "Content-Type": "application/json"},
    json={
        "source_files": [
            {"filename": "invoice.pdf", "base64_data": file_data}
        ]
    },
    timeout=60,
)
response.raise_for_status()  # stop early on 4xx/5xx instead of failing on a missing key
transaction_id = response.json()["transaction_id"]
print(f"Transaction created: {transaction_id}")

# 3. Poll for completion (status only, lightweight)
while True:
    poll = requests.get(
        f"{BASE_URL}/transactions/{transaction_id}",
        headers=headers,
        params={"timeout": 30, "fields": "status,metadata.progress"},
        timeout=45,  # client timeout must exceed the 30-second long-poll window
    )
    poll.raise_for_status()
    status = poll.json()

    print(f"Status: {status['status']}")

    # Stop on terminal states and on awaiting_input (paused for external input)
    if status["status"] in ("completed", "failed", "awaiting_input"):
        break

# 4. Retrieve results from the export files
if status["status"] == "completed":
    manifest_response = requests.get(
        f"{BASE_URL}/transactions/{transaction_id}/exports",
        headers=headers,
        timeout=60,
    )
    manifest_response.raise_for_status()
    manifest = manifest_response.json()

    for profile in manifest.get("profiles", {}).values():
        for file in profile.get("files", []):
            download = requests.get(
                f"{BASE_URL}/transactions/{transaction_id}"
                f"/exports/download/{file['filename']}",
                headers=headers,
                timeout=60,
            )
            download.raise_for_status()
            # filename may contain subfolders (e.g. inv123/scan.pdf)
            out_path = Path(file["filename"])
            out_path.parent.mkdir(parents=True, exist_ok=True)
            out_path.write_bytes(download.content)
            print(f"Saved {file['filename']} ({file['size_bytes']} bytes)")

    # Parse the default JSON export to read extracted fields
    json_file = next(
        (f["filename"]
         for p in manifest["profiles"].values() if p["output_format"] == "json"
         for f in p["files"]),
        None
    )
    if json_file:
        with open(json_file, encoding="utf-8") as f:
            data = json.load(f)
        for doc in data["documents"]:
            print(f"\nDocument: {doc.get('document_type', 'Unknown')}")
            for field_id, field in doc["fields"].items():
                print(f"  {field['name']}: {field['value']} "
                      f"(confidence: {field['confidence']})")
else:
    print(f"Transaction ended with status {status['status']}; no exports downloaded")

Complete Example: Multi-Step with Dataset Selection

Submit a document to the Playground dataset using the multi-step API:

import requests

BASE_URL = "https://app.docaifabric.com"
API_KEY = "your-api-key"
TENANT_ID = "your-tenant-id"
PROJECT_ID = "your-project-id"

headers = {"X-API-Key": API_KEY}

# 1. Create transaction in the Playground dataset
create = requests.post(
    f"{BASE_URL}/tenants/{TENANT_ID}/projects/{PROJECT_ID}/transactions",
    headers={**headers, "Content-Type": "application/json"},
    json={
        "dataset_id": "playground",
        "correlation_id": "test-run-001"
    },
    timeout=60,
)
create.raise_for_status()
tx = create.json()

transaction_id = tx["transaction_id"]
print(f"Created transaction: {transaction_id}")

# 2. Upload files (multipart/form-data; requests sets the Content-Type itself)
with open("invoice.pdf", "rb") as f:
    upload = requests.post(
        f"{BASE_URL}/transactions/{transaction_id}/source-files",
        headers=headers,
        files=[("files", ("invoice.pdf", f, "application/pdf"))],
        timeout=120,
    )
upload.raise_for_status()

# 3. Start processing
start = requests.post(
    f"{BASE_URL}/transactions/{transaction_id}/start",
    headers={**headers, "Content-Type": "application/json"},
    json={},
    timeout=60,
)
start.raise_for_status()

print(f"Processing started. Poll status at: GET /transactions/{transaction_id}?timeout=30")

Tips

- Use correlation IDs: Set correlation_id to your internal document ID for easy cross-referencing.
- Make uploads retry-safe: Send a stable upload_tokens value per file so a retry after a network drop doesn't duplicate the file. See Step 2.
- Poll for status, not data: GET /transactions/{id} tells you when a transaction is done; read the actual results from the export files. Add ?timeout=30 for long-polling and ?fields=status,metadata.progress to keep poll responses small.
- Start with Playground: Test your integration against the Playground dataset before switching to Production.
- Page limits for testing: Set page_limit to process only the first few pages of large documents during development.
- Batch status checks: When tracking many transactions, use POST /transactions/batch-status instead of polling each one individually.
- Exports are ready at completed: Export files are produced by the final workflow step, so list them with GET /transactions/{id}/exports as soon as the status is completed, with no separate trigger call needed.
- Skip polling with webhooks: If your receiver is publicly reachable, configure a Notification activity in the project workflow and DocAI Fabric will POST the results manifest to you the moment the transaction completes. See Webhooks.
