Definition

A document ingestion pipeline is a specialized implementation of the general data ingestion architecture (Normalize → Chunk → Embed) designed for unstructured file formats such as PDFs, images, and office documents.

While the abstract stages are identical to plain-text pipelines, the implementation requires "layout-to-logic" transformation to preserve semantic meaning.

Structural Normalization

Normalization for documents is the process of converting visual layout into a machine-readable semantic format, typically Markdown or JSON.

Unlike plain-text normalization (which is largely whitespace stripping), document normalization involves three additional steps:

  1. De-noising: Removing artifacts such as page numbers, running headers, footers, and watermarks.
  2. Reading Order Reconstruction: Reordering text blocks from the raw PDF stream into a logical human reading sequence.
  3. Format Conversion: Transforming complex elements like tables, mathematical formulas, and lists into structured representations (e.g., Markdown tables or LaTeX).
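As a minimal sketch of the de-noising step, assuming a known running header and simple page-number artifacts (the page text and header string below are invented for illustration):

```python
import re

# Toy page text containing a running header and a page-number artifact.
PAGE = """ACME Corp Annual Report 2023
Revenue grew 12% year over year, driven by
the cloud segment.
Page 17
"""

HEADER = "ACME Corp Annual Report 2023"

def denoise(page_text: str, running_header: str) -> str:
    """Strip a known running header and page-number-only lines."""
    lines = [ln for ln in page_text.splitlines() if ln.strip() != running_header]
    # Drop lines that consist solely of a page number, e.g. "Page 17" or "17".
    lines = [ln for ln in lines if not re.fullmatch(r"(Page\s+)?\d+", ln.strip())]
    return "\n".join(lines).strip()

clean = denoise(PAGE, HEADER)
```

A production pipeline would detect the running header statistically (text repeated at the same position across pages) rather than hard-coding it.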

Layout-Aware Chunking

Chunking for documents must respect structural boundaries to prevent semantic loss.

Strategies

  1. Structural Chunking: Using document metadata (H1, H2 tags) or layout analysis to ensure chunks do not break across sections or chapters.
  2. Table-Preserving Chunking: Treating a table as a single atomic unit or chunking it by row while prepending the header to each row vector to maintain context.
  3. Overlap with Context: Including the section title or previous paragraph summary in every chunk's metadata to ensure the embedding model captures the broader context.
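Strategy 2 (table-preserving chunking by row) can be sketched as follows, assuming the table has already been normalized to Markdown; the table content is invented:

```python
def chunk_table_by_row(markdown_table: str) -> list[str]:
    """Split a Markdown table into per-row chunks, repeating the header
    and separator so each chunk embeds with its column context intact."""
    header, separator, *rows = markdown_table.strip().splitlines()
    return [f"{header}\n{separator}\n{row}" for row in rows]

TABLE = """| Year | Revenue |
| ---- | ------- |
| 2022 | $1.2B   |
| 2023 | $1.4B   |"""

chunks = chunk_table_by_row(TABLE)
```

Each resulting chunk is a self-describing miniature table, so the embedding of "| 2023 | $1.4B |" still carries the meaning of its columns.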

Vector Representation of Large Documents

Large documents (multi-page, multi-topic) extend the general geometric theory by introducing Point Clouds and Manifolds into the vector space.

The Point Cloud Concept

A large document is not a single point; it is a set of discrete vectors representing its constituent chunks.

  • Example: In an Annual Report PDF, the "CEO Letter" chunk will land in a region of the vector space associated with corporate strategy, while the "Balance Sheet" chunk will land in a region associated with quantitative accounting data.
  • Topological Spread: These chunks are distributed across the vector space based on their specific content, even though they belong to the same file.
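The spread can be illustrated with a toy point cloud. The 3-d vectors below are invented stand-ins for real embeddings (whose dimensions are not interpretable like this), but they show how two chunks of the same file answer to different queries:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy 3-d "embeddings"; axes loosely stand for (strategy, finance, legal).
doc_cloud = {
    "ceo_letter":    (0.9, 0.1, 0.2),   # strategy-heavy chunk
    "balance_sheet": (0.1, 0.9, 0.1),   # quantitative accounting chunk
}

strategy_query = (1.0, 0.0, 0.0)
accounting_query = (0.0, 1.0, 0.0)
```

A strategy query lands nearest the CEO-letter chunk, an accounting query nearest the balance-sheet chunk, even though both chunks belong to the same file.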

The Document as a Manifold

In sequential documents, the sequence of chunk vectors forms a manifold—a "path" through the vector space.

  • Example: In a legal contract, Clause 1.1 (Definitions) provides the semantic foundation for Clause 4.2 (Indemnification).
  • Sequential Retrieval: While standard KNN identifies the specific Indemnification coordinate, the system often also retrieves the "neighboring" Definitions chunk from the document sequence to ensure the LLM has the necessary context to interpret the result.
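A common way to implement sequential retrieval is window expansion: after KNN returns hit positions in the chunk sequence, include the adjacent chunks as well. A minimal sketch (function name and chunk counts are hypothetical):

```python
def expand_with_neighbors(hit_indices, num_chunks, window=1):
    """Given KNN hit positions in a document's chunk sequence, include
    the adjacent chunks so the LLM sees the surrounding context."""
    expanded = set()
    for i in hit_indices:
        for j in range(i - window, i + window + 1):
            if 0 <= j < num_chunks:
                expanded.add(j)
    return sorted(expanded)

# Hypothetical 6-chunk contract; KNN matched chunk 4 (Indemnification).
selected = expand_with_neighbors([4], num_chunks=6)
```

Note that window expansion only recovers *adjacent* context; a distant dependency like the Definitions clause usually requires a separate mechanism, such as always prepending key sections or storing cross-references in chunk metadata.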

Retrieval Dynamics for Documents

Retrieval does not return "the document" but rather the subset of the point cloud that lies closest to the query vector.

  • Input: Single query vector (Point).
  • Output: Selection of the top-K closest document points (Point Set).
  • Synthesis: The original text from these points is aggregated to form the context for the LLM response.
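The point-to-point-set step can be sketched with a brute-force nearest-neighbor search (Euclidean distance over a toy 2-d cloud; real systems use cosine similarity over an approximate-nearest-neighbor index, and the chunk ids below are invented):

```python
from math import dist  # Euclidean distance, Python 3.8+

def top_k(query, point_cloud, k=2):
    """Return the ids of the k chunks whose vectors lie closest to the
    query point."""
    ranked = sorted(point_cloud, key=lambda cid: dist(query, point_cloud[cid]))
    return ranked[:k]

# Toy 2-d point cloud for one document's chunks.
cloud = {
    "intro":   (0.9, 0.1),
    "methods": (0.5, 0.5),
    "tables":  (0.1, 0.9),
}
hits = top_k((1.0, 0.0), cloud, k=2)
# Synthesis step: look up the original text of `hits` and concatenate
# it into the LLM prompt context.
```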

PDF Parsing

PDF parsing is the technical step of identifying text blocks and reconstructing their visual hierarchy.

Extraction Challenges

  1. Multi-column layouts: Distinguishing between independent columns vs. full-width blocks.
  2. Layered Objects: Handling PDFs where text is hidden behind images or where the text layer does not match the visual representation.
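For challenge 1, a naive top-to-bottom sort interleaves the columns; reading-order reconstruction must sort column-first. The sketch below operates on tuples shaped like PyMuPDF's page.get_text("blocks") output, i.e. (x0, y0, x1, y1, text, block_no, block_type), but uses hand-written block data and a hard-coded two-column assumption rather than a real parse:

```python
def reading_order(blocks, page_width):
    """Order blocks column-first for a two-column page: left column
    top-to-bottom, then right column. A real parser would detect the
    column layout instead of assuming two columns split at the midline."""
    mid = page_width / 2
    def key(block):
        x0, y0 = block[0], block[1]
        column = 0 if x0 < mid else 1
        return (column, y0)
    return sorted(blocks, key=key)

# Two-column page: the raw PDF stream order interleaves the columns.
blocks = [
    (310, 100, 590, 120, "Left col, para 1"[:0] + "Right col, para 1", 0, 0),
    (20, 100, 290, 120, "Left col, para 1", 1, 0),
    (20, 300, 290, 320, "Left col, para 2", 2, 0),
    (310, 300, 590, 320, "Right col, para 2", 3, 0),
]
ordered = [b[4] for b in reading_order(blocks, page_width=612)]
```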

Table Extraction

Table extraction converts visual grids into structured data.

Methodologies

  1. Vision-based: Using deep learning models (e.g., Table Transformer) to detect boundaries.
  2. LLM-mediated: Passing image crops to Vision-Language Models (VLMs) for direct structured output.
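Whichever methodology detects the cells, the final step is serializing the recovered grid into a structured representation. A minimal sketch for the Markdown target (the grid data is invented):

```python
def grid_to_markdown(grid):
    """Render a detected cell grid (list of rows) as a Markdown table,
    treating the first row as the header."""
    header, *rows = grid
    out = ["| " + " | ".join(header) + " |"]
    out.append("| " + " | ".join("---" for _ in header) + " |")
    for row in rows:
        out.append("| " + " | ".join(row) + " |")
    return "\n".join(out)

md = grid_to_markdown([["Year", "Revenue"], ["2023", "$1.4B"]])
```

Real extractors additionally have to resolve merged cells and multi-line cell text before the grid is this clean.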

OCR (Optical Character Recognition)

OCR is the transformation of pixel data into text, required for scans or documents without a text layer.

Layout Analysis and Vision-to-Text

Layout analysis identifies structural components (title, paragraph, list, image, caption).

Emerging Patterns

  • Vision-First Parsing: Models like Docling or Marker process the document as an image to output structured Markdown directly.
  • Multimodal Ingestion: Passing entire pages to multimodal models to skip intermediate parsing steps.

Examples and Tools

  • Node.js/TypeScript: pdf-parse, unstructured-client, tesseract.js.
  • Python-based (Local/Server): Docling, Marker, PyMuPDF.
  • Enterprise APIs: Google Document AI, AWS Textract, Azure AI Document Intelligence.