docpipe
Docling · GLM-OCR · LangExtract · LangChain

Unstructured docs to answers in one pipeline

Parse, extract, ingest, and query with six RAG strategies. Run docpipe serve for /health, /metrics, and optional OTEL — composable pipelines, no lock-in.
$ pip install "docpipe-sdk[all]"
Python 3.10+ · MIT · PyPI v0.5.2 · GitHub v0.5.2 · OpenAPI
7 workflows · 6 RAG strategies
OTEL · /health · /metrics
MIT · zero vendor lock-in
API
POST /ingest · chunk + embed
POST /search · similarity
POST /rag/query · 6 strategies
POST /rag/stream · SSE tokens
DELETE /ingest · by source
GET /health · deps + plugins
GET /metrics · Prometheus
POST /parse · Docling / GLM-OCR

Quickstart

Parse, ingest, and query from the CLI — or run the API server with Docker.

# Parse a document
$ docpipe parse invoice.pdf --format markdown

# Ingest into your vector DB
$ docpipe ingest report.pdf \
    --db "postgresql://..." \
    --table docs \
    --embedding-provider openai \
    --embedding-model text-embedding-3-small \
    --incremental

# Start API server (install [server] or [server,observability] for OTEL)
$ docpipe serve --port 8000

# Health & metrics (no auth on /metrics)
# curl http://localhost:8000/health
# curl http://localhost:8000/metrics

Capabilities at a glance

From parse through evaluate: composable pipelines built around a single SDK.


Composable Pipelines

Seven workflows — four core stages plus extract-only, full chain, and observability. Use each independently or chain them together. Your data, your DB, your LLM.

Documents: PDF, DOCX, images...
Parse: Docling · GLM-OCR
Extract: LangExtract · LangChain
Ingest: pgvector · turbovec (optional)
RAG Query: 6 strategies · stream
Observe: OTEL · /health · metrics

1. Parse Only

Convert any document to clean text or markdown. Choose Docling or GLM-OCR.

import docpipe

# Default: Docling
doc = docpipe.parse("report.pdf")

# GLM-OCR: state-of-the-art OCR
doc = docpipe.parse("scan.pdf", parser="glm-ocr")
print(doc.markdown)
2. Extract Only (LangExtract)

Extract structured entities from any text with LLMs.

text = "Alice is 34 and Bob is 29."  # illustrative input; any string works
schema = docpipe.ExtractionSchema(
    description="Extract people and ages",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(text, schema)
3. Parse + Extract

Full pipeline: document to structured data in one call.

result = docpipe.run(
    "invoice.pdf", schema
)
print(result.extractions)
4. Parse + Ingest

Parse a document and ingest vectors into pgvector (default) or local turbovec file indices.

config = docpipe.IngestionConfig(
    connection_string="postgresql://...",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("report.pdf", config=config)

# Optional: local turbovec index (pip install "docpipe-sdk[turbovec]")
# config.vector_backend = "turbovec"  # → .docpipe/indices/docs/
5. Full Pipeline

Parse, extract, and ingest, all in one call.

result = docpipe.run(
    "contract.pdf", schema,
    ingestion_config=config,
)
6. RAG Query

Ask questions against your ingested documents with grounded answers and source citations.

rag_cfg = docpipe.RAGConfig(
    connection_string="postgresql://...",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
)
result = docpipe.query(
    "What is the invoice total?",
    config=rag_cfg,
)
print(result.answer)   # grounded answer with citations
print(result.sources)  # ["invoice.pdf"]
print(result.usage)    # TokenUsage when provider reports counts
7. Observability

OTLP traces, JSON logs, /health dependency checks, Prometheus /metrics, and token usage on RAG responses.

# pip install "docpipe-sdk[server,observability]"
export DOCPIPE_OTEL_ENABLED=true
export DOCPIPE_OTEL_SERVICE_NAME=docpipe
export DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318/v1/traces
export DOCPIPE_LOG_FORMAT=json
export DOCPIPE_HEALTH_CHECK_DB=true

docpipe serve

curl http://localhost:8000/health    # plugins + DB status
curl http://localhost:8000/metrics  # Prometheus (no auth on /metrics)
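
For scripted checks, the text returned by /metrics can be read without a Prometheus server. The sketch below parses the Prometheus text exposition format into a dictionary. The function name and the metric name `example_requests_total` are hypothetical, and the parser assumes samples carry no trailing timestamp; the metric names docpipe actually exports are not listed on this page.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Map each sample line ('name{labels} value') to its float value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamp, so the value is the last space-separated field.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical payload; real metric names will differ.
sample = """\
# HELP example_requests_total Total HTTP requests
# TYPE example_requests_total counter
example_requests_total{path="/rag/query"} 42
example_requests_total{path="/health"} 7
"""

metrics = parse_prometheus_text(sample)
print(metrics)
```

In practice the `sample` string would be replaced by the body fetched from http://localhost:8000/metrics.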

6 Retrieval Strategies — Pick What Fits

Switch strategy with one config field. Stream answers, capture token usage, and monitor the API server with OTEL, /health, and /metrics.

Pick your retrieval strategy

POST /rag/query

{ "query": "What is the invoice total?", "strategy": "hyde" }
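
For callers that do not use the SDK, the same request can be assembled with Python's standard library. This is a minimal sketch: the helper name `build_rag_request`, the base URL (taken from the quickstart's port 8000), and the JSON `Content-Type` header are assumptions, not documented docpipe behavior.

```python
import json
import urllib.request


def build_rag_request(query: str, strategy: str,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST /rag/query request.

    Hypothetical helper: the payload fields mirror the example body above.
    """
    payload = json.dumps({"query": query, "strategy": strategy}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/rag/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_rag_request("What is the invoice total?", "hyde")
print(req.get_method(), req.full_url)

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```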

naive

Standard cosine similarity search. Fast, reliable baseline for well-formed queries.

Best for: well-formed queries, fast responses
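
To make the baseline concrete, here is a small pure-Python sketch of cosine-similarity ranking. It illustrates the general technique only: the function names and the toy three-dimensional vectors are invented for this example and are not docpipe internals.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunks: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda chunk_id: cosine_similarity(query_vec, chunks[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings; real ones come from the configured embedding model.
chunks = {
    "invoice-total": [0.9, 0.1, 0.0],
    "shipping-terms": [0.1, 0.9, 0.0],
    "cover-letter": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], chunks))
```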

HyDE

LLM generates a hypothetical answer first, embeds it, then retrieves real matching docs. Highest accuracy in benchmarks.

Best for: complex / technical queries

multi-query

Expands your query into N variants via LLM, retrieves for each, then deduplicates and ranks results.

Best for: vague or short queries
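
The deduplicate-and-rank step can be sketched as follows. This assumes each retrieval returns (chunk id, score) pairs and that a chunk found by several variants keeps its best score; how docpipe actually merges results is not specified here, and the canned index stands in for both the LLM-generated variants and the vector search.

```python
from typing import Callable

Hit = tuple[str, float]  # (chunk_id, similarity score)


def multi_query_retrieve(variants: list[str],
                         retrieve: Callable[[str], list[Hit]],
                         top_k: int = 3) -> list[Hit]:
    """Retrieve for every query variant, keep each chunk's best score, rank descending."""
    best: dict[str, float] = {}
    for variant in variants:
        for chunk_id, score in retrieve(variant):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Stand-in for LLM-generated variants and their search results (hypothetical data).
FAKE_INDEX = {
    "invoice total": [("c1", 0.82), ("c2", 0.40)],
    "amount due on the invoice": [("c1", 0.91), ("c3", 0.55)],
}

hits = multi_query_retrieve(list(FAKE_INDEX), lambda q: FAKE_INDEX[q])
print(hits)
```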

parent-document

Retrieves seed chunks, then expands context by fetching additional chunks from the same source documents.

Best for: long documents, context coherence

hybrid

Combines dense vector search with sparse BM25 keyword retrieval via EnsembleRetriever. Best of both worlds.

Best for: exact terms, proper nouns, IDs
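
One common way to merge a dense ranking with a sparse one is reciprocal rank fusion (RRF). The sketch below shows unweighted RRF with the conventional constant of 60. Treat it as an illustration of the idea rather than docpipe's exact configuration; the chunk ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse ranked id lists: each list contributes 1 / (c + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["c2", "c1", "c3"]   # hypothetical vector-search order
sparse = ["c1", "c4", "c2"]  # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Chunks that appear near the top of both lists rise to the front, which is why an exact keyword match can outrank a chunk that only the dense search favored.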

auto

LLM classifies your question and dispatches to the optimal strategy automatically. Best accuracy with zero tuning.

Best for: mixed workloads, unknown query types
Optional reranking
rag_cfg = docpipe.RAGConfig(
    ...,
    strategy="naive",
    reranker="flashrank",  # local, no API key
    rerank_top_n=5,
)
# Retrieve top-50, rerank, keep top-5
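
The retrieve-then-rerank pattern itself is simple: over-fetch candidates, score each one against the query with a stronger model, and keep the best few. A minimal sketch, using a toy word-overlap scorer in place of a real reranker such as FlashRank; all names and passages here are illustrative.

```python
from typing import Callable


def retrieve_then_rerank(candidates: list[str],
                         score: Callable[[str], float],
                         top_n: int = 5) -> list[str]:
    """Sort candidates by reranker score (highest first) and keep top_n."""
    return sorted(candidates, key=score, reverse=True)[:top_n]


query_terms = {"invoice", "total"}


def overlap_score(passage: str) -> float:
    """Toy scorer: number of query terms that appear as words in the passage."""
    return float(len(query_terms & set(passage.lower().split())))


candidates = [
    "Shipping terms and conditions",
    "Invoice total is 4250 USD",
    "Total pages: 3",
]
reranked = retrieve_then_rerank(candidates, overlap_score, top_n=2)
print(reranked)
```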
Structured RAG output
from pydantic import BaseModel  # assumed: Invoice is a Pydantic model

class Invoice(BaseModel):
    total: float
    currency: str

result = docpipe.query(
    "What is the total?",
    config=docpipe.RAGConfig(
        ..., output_model=Invoice
    ),
)
invoice = result.structured
# Invoice(total=4250.0, currency='USD')
Streaming (SSE)
# Stream tokens via SDK or POST /rag/stream (SSE)
for token in docpipe.stream_query(
    "What is the total?",
    config=rag_config,  # stream=True
):
    print(token, end="", flush=True)

# An optional metadata event may arrive before the final data: [DONE]:
# event: metadata
# data: {"type":"usage","usage":{"input_tokens":123,...}}
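
A client that reads POST /rag/stream directly has to split the SSE stream into events. The sketch below parses `event:` and `data:` lines, joins token events into the answer, and captures usage from the metadata event. It assumes token events carry plain text in the `data:` field, which this page does not confirm; the sample stream is hand-written.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs; a blank line terminates each event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, only one leading space after the colon is framing.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        yield event, "\n".join(data)


def collect_stream(lines: Iterable[str]) -> tuple[str, Optional[dict]]:
    """Join token events into the answer; capture usage from the metadata event."""
    tokens, usage = [], None
    for event, data in iter_sse_events(lines):
        if data == "[DONE]":
            break
        if event == "metadata":
            usage = json.loads(data).get("usage")
        else:
            tokens.append(data)  # assumption: token text is sent as-is
    return "".join(tokens), usage


sample = [
    "data: The total\n", "\n",
    "data:  is 4250 USD\n", "\n",  # second space belongs to the token
    "event: metadata\n",
    'data: {"type":"usage","usage":{"input_tokens":123}}\n', "\n",
    "data: [DONE]\n", "\n",
]
answer, usage = collect_stream(sample)
print(answer, usage)
```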
Token usage
result = docpipe.query("Summarize the invoice", config=rag_cfg)
print(result.answer)
if result.usage:
    print(result.usage)  # input/output/total when provider reports counts

# Same usage object on POST /rag/query JSON responses

Built for Production

Everything you need to go from raw documents to grounded answers at scale.

Plugin Architecture

Add custom parsers and extractors via Python entry points. Third-party packages auto-discovered on install.

CLI + API Server

Full CLI for scripting, FastAPI with /health and /metrics, Docker image for deployment. OTEL via [server,observability].

Observability

OTLP traces, JSON logs, /health dependency checks, Prometheus /metrics, and token usage on RAG responses. Install with [server,observability].

Fully Configurable

No magic defaults. Explicit LLM provider, embedding model, and DB connection. YAML + env vars.

LangChain Backbone

Built on LangChain for embeddings, text splitting, and vector stores. Supports OpenAI, Gemini, Ollama, HuggingFace.

Optional Turbovec Backend

Default pgvector in PostgreSQL, or install [turbovec] for compressed on-disk indices — local prototypes, air-gapped RAG, no pgvector required.

20+ Document Formats

PDF, DOCX, XLSX, PPTX, HTML, images — choose between IBM Docling or GLM-OCR (state-of-the-art multimodal OCR).

6 RAG Strategies

naive, HyDE, multi-query, parent-document, hybrid, auto — swap with one config field. Reranking and token usage when providers support it.

Built-in Evaluation

Measure hit rate, MRR, faithfulness, and answer similarity. Know if your RAG is actually working.
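
Two of these metrics are easy to state precisely. Hit rate is the share of questions whose relevant chunk appears in the top-k results, and MRR averages the reciprocal of the rank at which it first appears. The sketch below assumes one relevant chunk per question and uses made-up ids; it is not docpipe's evaluation API, which this page does not show.

```python
def hit_rate(results: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant chunk is within the top-k results."""
    hits = sum(1 for ranked, gold in zip(results, relevant) if gold in ranked[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(results: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the relevant chunk; a miss contributes 0."""
    total = 0.0
    for ranked, gold in zip(results, relevant):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(relevant)


# Three evaluation questions: ranked retrievals and the expected chunk for each.
results = [["c1", "c2", "c3"], ["c5", "c4", "c6"], ["c7", "c8", "c9"]]
relevant = ["c1", "c4", "c0"]

print(hit_rate(results, relevant, k=3))
print(mean_reciprocal_rank(results, relevant))
```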

Zero Vendor Lock-in

docpipe never stores your data. It connects to your DB, calls your LLM API, then gets out of the way.

Pipeline modes — one card, four shapes

Parse pipeline

Docling or GLM-OCR → markdown

PDF → answer

One document’s journey through the pipeline.

invoice.pdf

Line items, tables, headers preserved by Docling.