Document Processing

Under Review

This article is currently under review. Some content may be incomplete or inaccurate.

Document processing is the core function of DocAI Fabric. This guide explains how documents flow through the system and how each processing step works.

Processing Pipeline

Upload → Import → Convert → OCR → Split → Classify → Extract
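The stage order above can be sketched as a simple sequential runner. This is a minimal illustration under stated assumptions, not DocAI Fabric's implementation: `run_pipeline`, `make_noop_stage`, and the `context` dictionary are hypothetical names, and upload is assumed to happen before the workflow starts.

```python
from typing import Any, Callable, Dict

# Stage order taken from the pipeline above (upload is assumed to precede it).
STAGE_ORDER = ["import", "convert", "ocr", "split", "classify", "extract"]

# A stage receives the accumulated context and returns an updated one.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_pipeline(context: Dict[str, Any], stages: Dict[str, Stage]) -> Dict[str, Any]:
    """Run each stage in order, passing the accumulated context forward."""
    for name in STAGE_ORDER:
        context = stages[name](context)
        context.setdefault("completed", []).append(name)
    return context


def make_noop_stage(name: str) -> Stage:
    """Hypothetical placeholder stage that only records that it ran."""
    def stage(context: Dict[str, Any]) -> Dict[str, Any]:
        return {**context, name: "done"}
    return stage


if __name__ == "__main__":
    stages = {name: make_noop_stage(name) for name in STAGE_ORDER}
    print(run_pipeline({"files": ["invoice.pdf"]}, stages)["completed"])
```

Each stage only sees what earlier stages produced, which mirrors the dependency between consecutive activities.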

Import

The import activity ingests uploaded files into the system. Files are stored and registered in the transaction's document index.

Conversion

Converts uploaded files into a standard format for processing:

  • PDF files → Individual page images
  • Images → Normalized and optimized for OCR
  • Multi-page documents → Split into individual page images

PDF conversion uses pdf2image, a Python wrapper around poppler-utils.

Page Limit

If a page_limit was set when creating the transaction, only the first N pages of each source file are converted to images. All downstream steps (OCR, split, classification, extraction) only see those pages. The original source file is always preserved in full.
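One way a page limit could map onto pdf2image is through the `first_page` and `last_page` arguments of `convert_from_path`. The sketch below assumes that mapping. The helper names `conversion_kwargs` and `convert_pdf` and the default DPI of 200 are illustrative, and how DocAI Fabric actually applies `page_limit` is not documented here.

```python
from typing import Any, Dict, Optional


def conversion_kwargs(page_limit: Optional[int] = None, dpi: int = 200) -> Dict[str, Any]:
    """Build keyword arguments for pdf2image.convert_from_path.

    With no page_limit, every page is converted. With a limit of N,
    only pages 1..N are requested (pdf2image page numbers are 1-based).
    """
    kwargs: Dict[str, Any] = {"dpi": dpi}
    if page_limit is not None:
        if page_limit < 1:
            raise ValueError("page_limit must be a positive integer")
        kwargs["first_page"] = 1
        kwargs["last_page"] = page_limit
    return kwargs


def convert_pdf(path: str, page_limit: Optional[int] = None):
    """Convert a PDF into a list of PIL images, honoring an optional limit."""
    # Imported lazily so conversion_kwargs can be used without poppler installed.
    from pdf2image import convert_from_path

    return convert_from_path(path, **conversion_kwargs(page_limit))
```

Because the limit is applied at conversion time, the source file itself is never modified, which is consistent with the original being preserved in full.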

OCR (Optical Character Recognition)

Extracts text from page images using Azure AI Vision (Microsoft OCR):

  • Full page text extraction
  • Word-level bounding boxes for precise data location
  • Support for multiple languages
  • Configurable parallel processing (controlled by MAX_PARALLEL_OCR)

Split (Document Boundary Detection)

Detects where one document ends and another begins in multi-document uploads:

  • Uses AI to analyze page content and detect boundaries
  • Creates individual documents from a batch upload
  • Assigns pages to the correct document
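Once boundaries are known, assigning pages to documents is a simple grouping step. The sketch below assumes the boundary detector yields one boolean per page ("this page starts a new document"). That representation and the `group_pages` name are hypothetical, and the AI-based detection itself is not shown.

```python
from typing import List


def group_pages(starts_new_document: List[bool]) -> List[List[int]]:
    """Group 1-based page numbers into documents.

    starts_new_document[i] is True when page i + 1 begins a new document.
    The first page always starts a document, whatever its flag says.
    """
    documents: List[List[int]] = []
    for index, is_start in enumerate(starts_new_document):
        page_number = index + 1
        if is_start or not documents:
            documents.append([page_number])
        else:
            documents[-1].append(page_number)
    return documents


if __name__ == "__main__":
    # A five-page upload where a second document starts on page 4.
    print(group_pages([True, False, False, True, False]))
```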

Classification

Categorizes each document by type using Azure OpenAI:

  • Compares document content against project-defined document classes
  • Returns a classification with a confidence score
  • Supports multi-class classification

Extraction

Extracts structured data from each document using Azure OpenAI:

  • Uses the document class's field definitions as a schema
  • Returns field values with confidence scores
  • Provides bounding box coordinates showing where data was found on the page
  • Supports complex field types including tables and lists
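To make the schema idea concrete, the sketch below models field definitions and extracted fields as dataclasses and checks one against the other. All names (`FieldDefinition`, `ExtractedField`, `check_against_schema`), the field type strings, the 0.0 to 1.0 confidence range, and the bounding box layout are assumptions for illustration. The actual response format is not specified in this guide.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class FieldDefinition:
    name: str
    field_type: str  # hypothetical type names, e.g. "string", "table", "list"


@dataclass
class ExtractedField:
    name: str
    value: Any
    confidence: float  # assumed to lie between 0.0 and 1.0
    page: Optional[int] = None
    # Assumed layout: (x, y, width, height) on the page image.
    bbox: Optional[Tuple[float, float, float, float]] = None


def check_against_schema(
    schema: List[FieldDefinition], extracted: List[ExtractedField]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) field names relative to the schema."""
    expected = {definition.name for definition in schema}
    found = {field.name for field in extracted}
    return sorted(expected - found), sorted(found - expected)
```

A check like this would flag both fields the model failed to return and fields it returned that the document class does not define.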

Parallel Processing

DocAI Fabric processes documents in parallel where possible:

  • OCR runs in parallel across pages (configurable via MAX_PARALLEL_OCR)
  • Classification and extraction run in parallel across documents (configurable via MAX_PARALLEL_LLM)
  • Activities within a workflow run sequentially (each depends on the output of the previous one)
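Bounded parallelism of this kind is commonly built with a semaphore. The sketch below shows the general technique with `asyncio`. It assumes MAX_PARALLEL_OCR is read from an environment variable with a fallback of 4, and `map_bounded` and `configured_limit` are hypothetical names. None of this is confirmed as how DocAI Fabric implements its limits.

```python
import asyncio
import os
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def configured_limit(name: str, default: int = 4) -> int:
    """Read a concurrency limit from the environment (assumed mechanism)."""
    return max(1, int(os.environ.get(name, default)))


async def map_bounded(
    worker: Callable[[T], Awaitable[R]], items: Iterable[T], limit: int
) -> List[R]:
    """Run worker over items with at most `limit` calls in flight.

    Results are returned in the same order as the input items.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

The same helper could serve both cases: pages with the OCR limit, and documents with the limit set by MAX_PARALLEL_LLM.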

Error Handling

  • Failed activities can be retried individually
  • The system includes circuit breakers and rate limiting for external API calls
  • Detailed error information is stored with each activity execution
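For readers unfamiliar with the pattern, a circuit breaker stops calling a failing external service for a cool-down period instead of retrying it continuously. The following is a generic, minimal sketch of that pattern. The class names, thresholds, and timings are illustrative assumptions and do not describe DocAI Fabric's internal implementation.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure reopens the circuit.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success resets the failure count.
        self._failures = 0
        return result
```

Wrapping each external API call in `breaker.call(...)` means that a retried activity fails fast while the service is known to be down, rather than adding load to it.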