Document Processing

Under Review

This article is currently under review. Some content may be incomplete or inaccurate.

Document processing is the core function of DocAI Fabric. This guide explains how documents flow through the system and how each processing step works.

Processing Pipeline

Upload → Import → Convert → OCR → Split → Classify → Extract

Import

The import activity ingests uploaded files into the system. Files are stored and registered in the transaction's document index.

Conversion

Converts uploaded files into a standard format for processing:

PDF files → Individual page images
Images → Normalized and optimized for OCR
Multi-page documents → Split into individual page images

Uses pdf2image, a Python wrapper around poppler-utils, for PDF conversion.

Page Limit

If a page_limit was set when creating the transaction, only the first N pages of each source file are converted to images. All downstream steps (OCR, split, classification, extraction) only see those pages. The original source file is always preserved in full.
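
The page limit can be illustrated with pdf2image's own page-range arguments. The following is a minimal sketch, not the DocAI Fabric implementation: the helper names conversion_kwargs and convert_pdf are hypothetical, and the 200 DPI value is simply pdf2image's default. Only convert_from_path and its dpi, first_page, and last_page parameters come from pdf2image itself.

```python
from typing import Any, Optional


def conversion_kwargs(page_limit: Optional[int], dpi: int = 200) -> dict[str, Any]:
    """Build keyword arguments for pdf2image.convert_from_path (hypothetical helper)."""
    if page_limit is not None and page_limit < 1:
        raise ValueError("page_limit must be a positive integer")
    kwargs: dict[str, Any] = {"dpi": dpi}
    if page_limit is not None:
        # pdf2image page numbers are 1-based and last_page is inclusive,
        # so this selects exactly the first `page_limit` pages.
        kwargs["first_page"] = 1
        kwargs["last_page"] = page_limit
    return kwargs


def convert_pdf(path: str, page_limit: Optional[int] = None) -> list:
    """Render a PDF to a list of PIL images, one per converted page."""
    # Imported lazily so conversion_kwargs stays usable without poppler installed.
    from pdf2image import convert_from_path

    return convert_from_path(path, **conversion_kwargs(page_limit))
```

Because the limit is applied only when rendering, the source file on disk is never modified, which matches the guarantee that the original is preserved in full.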

OCR (Optical Character Recognition)

Extracts text from page images using Azure AI Vision (Microsoft OCR):

Full page text extraction
Word-level bounding boxes for precise data location
Support for multiple languages
Configurable parallel processing (controlled by MAX_PARALLEL_OCR)

Split (Document Boundary Detection)

Detects where one document ends and another begins in multi-document uploads:

Uses AI to analyze page content and detect boundaries
Creates individual documents from a batch upload
Assigns pages to the correct document
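
Once boundaries are known, assigning pages to documents is a simple grouping step. The sketch below assumes the boundary detector yields one boolean per page meaning "this page starts a new document". That representation and the function name group_pages are illustrative assumptions, not the product's actual data model.

```python
def group_pages(starts_new_document: list[bool]) -> list[list[int]]:
    """Group 1-based page numbers into documents from per-page boundary flags."""
    documents: list[list[int]] = []
    for page_number, is_start in enumerate(starts_new_document, start=1):
        # The first page always opens a document, whatever its flag says.
        if is_start or not documents:
            documents.append([])
        documents[-1].append(page_number)
    return documents


# Five uploaded pages where pages 1 and 3 each begin a new document.
print(group_pages([True, False, True, False, False]))
```

Each inner list is one document, so every page belongs to exactly one document and page order within the upload is preserved.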

Classification

Categorizes each document by type using Azure OpenAI:

Compares document content against project-defined document classes
Returns a classification with a confidence score
Supports multi-class classification

Extraction

Extracts structured data from each document using Azure OpenAI:

Uses the document class's field definitions as a schema
Returns field values with confidence scores
Provides bounding box coordinates showing where data was found on the page
Supports complex field types including tables and lists
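
To make the shape of an extraction result more concrete, here is a hedged sketch of how a field with its confidence and bounding box might be represented, and how a result could be checked against a class's field definitions. The ExtractedField structure, the bounding box layout, the confidence range, and the sample field names are all assumptions made for illustration. The actual response format may differ.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ExtractedField:
    name: str
    value: Any         # scalar, list, or table rows, depending on the field type
    confidence: float  # assumed to lie between 0.0 and 1.0
    page: int          # assumed 1-based page number
    bbox: tuple[float, float, float, float]  # assumed (x, y, width, height)


def check_against_schema(
    schema: list[str], fields: list[ExtractedField]
) -> dict[str, list[str]]:
    """Compare returned field names with the document class's field definitions."""
    returned = {field.name for field in fields}
    return {
        "missing": [name for name in schema if name not in returned],
        "unexpected": sorted(returned - set(schema)),
    }


# Hypothetical document class with two fields; only one was returned.
fields = [ExtractedField("invoice_number", "INV-001", 0.97, 1, (120.0, 80.0, 140.0, 18.0))]
report = check_against_schema(["invoice_number", "total"], fields)
print(report)
```

Treating the field definitions as the source of truth makes gaps in an extraction result easy to spot before the data is used downstream.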

Parallel Processing

DocAI Fabric processes documents in parallel where possible:

OCR runs in parallel across pages (configurable via MAX_PARALLEL_OCR)
Classification and extraction run in parallel across documents (configurable via MAX_PARALLEL_LLM)
Activities within a workflow run sequentially (each depends on the previous)
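
A common way to implement this kind of capped parallelism is a semaphore around each unit of work. The sketch below uses Python's asyncio for that purpose. It is an illustration of the pattern, not the system's actual code: map_bounded is a hypothetical helper, and its limit argument stands in for a setting such as MAX_PARALLEL_OCR or MAX_PARALLEL_LLM. How those settings are read is not shown.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_bounded(
    worker: Callable[[T], Awaitable[R]], items: list[T], limit: int
) -> list[R]:
    """Run `worker` over `items` with at most `limit` calls in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(run_one(item) for item in items))
```

The same helper could serve both cases: called once per page for OCR, and once per document for classification or extraction, with a different limit each time.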

Error Handling

Failed activities can be retried individually
The system includes circuit breakers and rate limiting for external API calls
Detailed error information is stored with each activity execution
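
For readers unfamiliar with circuit breakers, the following is a minimal sketch of the general pattern: after a number of consecutive failures, calls are rejected immediately for a cool-down period instead of hitting the external API again. The class names, the threshold, and the cool-down are illustrative assumptions and do not describe DocAI Fabric's actual configuration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure reopens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit fully.
        self.failures = 0
        return result
```

Wrapping each external call in breaker.call(...) lets a failed activity surface its error quickly, so it can be retried later once the external service has recovered.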
