Document Processing
This article is currently under review. Some content may be incomplete or inaccurate.
Document processing is the core function of DocAI Fabric. This guide explains how documents flow through the system and how each processing step works.
Processing Pipeline
Upload → Import → Convert → OCR → Split → Classify → Extract
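The sequential hand-off between steps can be pictured with a small runner. This is only an illustrative sketch, not DocAI Fabric's workflow engine: the `run_pipeline` function, the `Activity` type, and the dictionary-based state are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical activity signature: takes the current state, returns the new state.
Activity = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_pipeline(activities: List[Tuple[str, Activity]], state: Dict[str, Any]) -> Dict[str, Any]:
    """Run activities in order, passing each one the state left by its predecessor."""
    for name, activity in activities:
        state = activity(state)
        # Record which activities have finished, in execution order.
        state.setdefault("completed", []).append(name)
    return state
```

Because each activity receives the output of the one before it, a failure in an early step (for example Convert) leaves nothing for later steps to work on.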
Import
The import activity ingests uploaded files into the system. Files are stored and registered in the transaction's document index.
Conversion
Converts uploaded files into a standard format for processing:
- PDF files → Individual page images
- Images → Normalized and optimized for OCR
- Multi-page documents → Split into individual page images
PDF conversion uses pdf2image, a Python wrapper around poppler-utils.
If a page_limit was set when creating the transaction, only the first N pages of each source file are converted to images. All downstream steps (OCR, split, classification, extraction) only see those pages. The original source file is always preserved in full.
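The page limit maps naturally onto pdf2image's `first_page` and `last_page` arguments. The sketch below shows one way this could look; it is not the product's actual code. The names `page_range_kwargs` and `convert_pdf` and the 200 DPI default are assumptions.

```python
from typing import Optional


def page_range_kwargs(page_limit: Optional[int]) -> dict:
    """Build the page-range arguments for pdf2image (hypothetical helper)."""
    if page_limit is None:
        return {}  # no limit: convert every page
    if page_limit < 1:
        raise ValueError("page_limit must be a positive integer")
    # pdf2image page numbers are 1-based and inclusive.
    return {"first_page": 1, "last_page": page_limit}


def convert_pdf(pdf_path: str, page_limit: Optional[int] = None, dpi: int = 200):
    """Convert a PDF to a list of PIL images, honouring an optional page limit."""
    # Imported lazily so the helper above can be used without poppler installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi, **page_range_kwargs(page_limit))
```

Only the rendered images are limited here; the source PDF on disk is not modified, which matches the guarantee that the original file is preserved in full.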
OCR (Optical Character Recognition)
Extracts text from page images using Azure AI Vision (Microsoft OCR):
- Full page text extraction
- Word-level bounding boxes for precise data location
- Support for multiple languages
- Configurable parallel processing (controlled by MAX_PARALLEL_OCR)
Split (Document Boundary Detection)
Detects where one document ends and another begins in multi-document uploads:
- Uses AI to analyze page content and detect boundaries
- Creates individual documents from a batch upload
- Assigns pages to the correct document
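Once boundaries are known, assigning pages to documents is a simple grouping step. The sketch below assumes the boundary detector yields one flag per page saying whether that page starts a new document. That output shape and the `group_pages` name are assumptions; the AI analysis itself is out of scope here.

```python
from typing import List


def group_pages(boundary_flags: List[bool]) -> List[List[int]]:
    """Group 1-based page numbers into documents.

    boundary_flags[i] is True when page i + 1 starts a new document.
    The first page always starts a document, whatever its flag says.
    """
    documents: List[List[int]] = []
    for index, starts_new in enumerate(boundary_flags):
        page_number = index + 1
        if starts_new or not documents:
            documents.append([page_number])
        else:
            documents[-1].append(page_number)
    return documents
```

For a five-page upload with flags `[True, False, False, True, False]`, this produces two documents: pages 1 to 3 and pages 4 to 5.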
Classification
Categorizes each document by type using Azure OpenAI:
- Compares document content against project-defined document classes
- Returns a classification with a confidence score
- Supports multi-class classification
Extraction
Extracts structured data from each document using Azure OpenAI:
- Uses the document class's field definitions as a schema
- Returns field values with confidence scores
- Provides bounding box coordinates showing where data was found on the page
- Supports complex field types including tables and lists
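To make the shape of an extraction result concrete, here is one possible in-memory representation. The class names, field names, coordinate convention, and the 0.8 review threshold are all assumptions for illustration; the actual response format may differ.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class BoundingBox:
    page: int      # 1-based page number (assumed convention)
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


@dataclass
class ExtractedField:
    name: str
    value: Any                         # scalar, list, or table rows
    confidence: float                  # assumed to lie in [0.0, 1.0]
    box: Optional[BoundingBox] = None  # where the value was found, if known


def fields_needing_review(fields: List[ExtractedField], threshold: float = 0.8) -> List[str]:
    """Return the names of fields whose confidence falls below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]
```

A confidence filter like this is one way a reviewer could be pointed at uncertain values, with the bounding box showing where on the page to look.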
Parallel Processing
DocAI Fabric processes documents in parallel where possible:
- OCR runs in parallel across pages (configurable via MAX_PARALLEL_OCR)
- Classification & Extraction run in parallel across documents (configurable via MAX_PARALLEL_LLM)
- Activities within a workflow run sequentially (each depends on the previous)
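A bounded-concurrency pattern of this kind can be sketched with an `asyncio.Semaphore`. This is a generic illustration, not the product's implementation. The `run_bounded` helper is hypothetical, and reading MAX_PARALLEL_OCR from an environment variable with a default of 4 is an assumption.

```python
import asyncio
import os
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(items: List[T], worker: Callable[[T], Awaitable[R]], limit: int) -> List[R]:
    """Run worker over items with at most `limit` calls in flight.

    Results are returned in the same order as the input items.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


# Assumption: the setting is an environment variable; the default of 4 is invented.
MAX_PARALLEL_OCR = int(os.environ.get("MAX_PARALLEL_OCR", "4"))
```

The same helper could serve both limits: pages would be passed with `MAX_PARALLEL_OCR`, and documents with the value of `MAX_PARALLEL_LLM`.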
Error Handling
- Failed activities can be retried individually
- The system includes circuit breakers and rate limiting for external API calls
- Detailed error information is stored with each activity execution
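For readers unfamiliar with circuit breakers, the following is a minimal generic sketch of the pattern. It does not reflect how DocAI Fabric implements it: the class names, the threshold of 3 failures, and the 30-second cool-down are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip the breaker
            raise
        self._failures = 0  # a success closes the circuit fully
        return result
```

While the circuit is open, calls fail fast instead of hitting an external API that is already struggling. After the cool-down, a single trial call decides whether the circuit closes again.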