docpipe

Getting Started

Documentation

SDK reference for parse, extract, ingest, RAG, observability, and deployment — aligned with the docpipe README.

Getting Started

docpipe connects document parsing (Docling / GLM-OCR), structured extraction, vector ingestion, and RAG into composable pipelines. It never stores your data — it connects to your DB and LLM APIs.

Four pipelines you can use independently or together: Parse → Extract → Ingest → RAG. Install extras for only what you need, then use the Python SDK, CLI, or FastAPI server.

Parse — unstructured docs → markdown/text
Extract — text → structured entities (LangExtract / LangChain)
Ingest — chunks → embeddings → pgvector or turbovec
RAG — questions → grounded answers with citations

Minimal flow

import docpipe

doc = docpipe.parse("invoice.pdf")
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)
result = docpipe.query("What is the invoice total?", config=docpipe.RAGConfig(
    connection_string=config.connection_string,
    table_name=config.table_name,
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
))
print(result.answer)

Install

Install from PyPI with optional extras. Match extras to your pipeline (parser, embeddings, server, observability).

pip extras

pip install docpipe-sdk                   # Core only
pip install "docpipe-sdk[docling]"        # + Docling parser (PDF, DOCX, images, ...)
pip install "docpipe-sdk[glm-ocr]"        # + GLM-OCR parser (state-of-the-art OCR)
pip install "docpipe-sdk[langextract]"    # + Google LangExtract
pip install "docpipe-sdk[openai]"         # + OpenAI embeddings & LLM
pip install "docpipe-sdk[anthropic]"      # + Anthropic Claude
pip install "docpipe-sdk[google]"         # + Google Gemini
pip install "docpipe-sdk[ollama]"         # + Ollama (local models)
pip install "docpipe-sdk[huggingface]"    # + HuggingFace embeddings
pip install "docpipe-sdk[pgvector]"       # + PostgreSQL vector store (default)
pip install "docpipe-sdk[turbovec]"       # + Optional local turbovec file indices
pip install "docpipe-sdk[rag]"            # + Hybrid search (BM25 + langchain-classic)
pip install "docpipe-sdk[rerank]"         # + Local reranking (FlashRank)
pip install "docpipe-sdk[server]"         # + FastAPI server (/health, /metrics)
pip install "docpipe-sdk[observability]"  # + OpenTelemetry traces + JSON logs
pip install "docpipe-sdk[http]"           # + Python HTTP client for the API
pip install "docpipe-sdk[all]"            # Everything

For API server + OpenTelemetry traces and JSON logs, use pip install "docpipe-sdk[server,observability]".

Parse

Convert PDFs, Office files, HTML, and images to markdown or plain text. Default parser is Docling; use glm-ocr for scanned or image-heavy documents.

Python

import docpipe

# Default: Docling parser
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)
print(doc.text)

# GLM-OCR (scanned / image-heavy documents)
doc = docpipe.parse("scanned_report.pdf", parser="glm-ocr")
print(doc.markdown)

CLI: docpipe parse invoice.pdf --format markdown
API: POST /parse with source URL or path
Parsers: docling (broad format support), glm-ocr (multimodal OCR)

Extract

Pull structured entities from text using LangExtract or LangChain with_structured_output. Define a schema describing fields to extract.

Python

import docpipe

schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)
for r in results:
    print(r.entity_class, r.text, r.attributes)

# Full parse + extract
result = docpipe.run("invoice.pdf", schema)
print(result.parsed.markdown)
print(result.extractions)

CLI: docpipe extract "text" --schema schema.yaml --model gemini-2.5-flash
API: POST /extract, POST /run (parse + extract)

Ingest

Chunk documents, embed with your chosen provider, and store in PostgreSQL pgvector (default) or optional turbovec on-disk indices.

Python

import docpipe

config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    incremental=True,  # skip unchanged files by SHA-256 hash
)
docpipe.ingest("invoice.pdf", config=config)

Set incremental=True to skip files already ingested with the same SHA-256 hash. DELETE /ingest removes chunks by exact source or path fragment (match_mode: contains).

CLI: docpipe ingest report.pdf --db ... --table docs --incremental
API: POST /ingest, DELETE /ingest
Embeddings: OpenAI, Google Gemini, Ollama, HuggingFace

RAG

Ask questions against ingested documents. Six retrieval strategies, optional reranking, conversation history, metadata filters, structured output, and SSE streaming.

Python query

import docpipe

rag_config = docpipe.RAGConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
    reranker="flashrank",
)
result = docpipe.query("What is the total amount on the invoice?", config=rag_config)
print(result.answer)
print(result.sources)
print(result.chunks)

Strategy	Description
naive	Simple cosine similarity search. Fast and reliable for well-formed queries.
hyde	LLM generates a hypothetical answer, embeds it for retrieval. Highest accuracy on complex questions.
multi_query	Expands query into N variants, merges and deduplicates results. Best for vague or short queries.
parent_document	Retrieves seed chunks then expands context window per source. Best for long documents.
hybrid	Combines dense vector search with BM25 keyword matching. Best for exact terms, IDs, and proper nouns.
auto	LLM classifies the question and dispatches to the optimal strategy automatically.

CLI: docpipe rag query "..." --strategy hyde --reranker flashrank
API: POST /rag/query (JSON), POST /rag/stream (SSE)
Multi-turn: pass history: [{role, content}, ...] on query/stream
Filters: filters: {"source": "report.pdf"} on search/RAG

Observability

OpenTelemetry traces, JSON logs, health checks with dependency probes, and Prometheus metrics on GET /metrics (no auth).

Environment

# pip install "docpipe-sdk[server,observability]"
DOCPIPE_OTEL_ENABLED=true
DOCPIPE_OTEL_SERVICE_NAME=docpipe
DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318/v1/traces
DOCPIPE_OTEL_TRACES_SAMPLER_ARG=1.0
DOCPIPE_LOG_FORMAT=json
DOCPIPE_HEALTH_CHECK_DB=true

$ docpipe serve

# curl http://localhost:8000/health   # plugins + dependency status
# curl http://localhost:8000/metrics # Prometheus (no auth on /metrics)

Variable	Default	Purpose
DOCPIPE_OTEL_ENABLED	false	Export traces via OTLP/HTTP
DOCPIPE_OTEL_SERVICE_NAME	docpipe	service.name resource
DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT	—	e.g. http://localhost:4318/v1/traces
DOCPIPE_LOG_FORMAT	text	json for structured logs
DOCPIPE_HEALTH_CHECK_DB	true	SELECT 1 when DB URL set

Turbovec

Optional compressed on-disk vector indices when you do not want pgvector in Postgres. Good for local prototypes and air-gapped RAG; production Postgres deployments should use pgvector.

Setup

# pip install "docpipe-sdk[turbovec,openai]"   # + your embedding provider
export DOCPIPE_VECTOR_BACKEND=turbovec
export DOCPIPE_TURBVEC_INDEX_DIR=./.docpipe/indices   # default on-disk index root

# Per-request override on ingest / search / RAG API bodies:
# { "vector_backend": "turbovec", "table_name": "my_library", ... }

import docpipe

config = docpipe.IngestionConfig(
    connection_string="postgresql://unused",  # accepted; vectors use local files
    table_name="my_library",                  # index folder name under TURBVEC_INDEX_DIR
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    vector_backend="turbovec",
)
docpipe.ingest("invoice.pdf", config=config)
# → ./.docpipe/indices/my_library/index.tvim + docstore.json

# Default pgvector in PostgreSQL is recommended for production deployments.

API Reference

FastAPI server: docpipe serve --host 0.0.0.0 --port 8000. HTTP Basic Auth on all routes except GET /health and GET /metrics.

Method	Path	Description
GET	/health	Health check, plugins, dependency status
GET	/metrics	Prometheus metrics (no auth)
POST	/parse	Parse a document
POST	/extract	Extract structured data
POST	/run	Parse + extract
POST	/ingest	Ingest into vector DB
DELETE	/ingest	Remove chunks for a source
POST	/search	Vector similarity search (filters)
POST	/rag/query	RAG Q&A (history, filters, usage)
POST	/rag/stream	Streaming RAG (SSE)
POST	/generate	Plain LLM completion (no retrieval)
POST	/evaluate/run	Evaluate RAG quality
GET	/plugins	List registered plugins

Python HTTP client

from docpipe.http import DocpipeClient

with DocpipeClient("http://localhost:8000", username="admin", password="docpipe") as client:
    print(client.health())
    result = client.rag_query({...})
    print(result.get("usage"))

Docker / Production

Official GHCR image, standalone compose with pgvector, and production sidecar deployment notes.

Built from docpipe/Dockerfile on python:3.12-slim; default entrypoint runs docpipe serve on port 8000.

Pull & run API

# Pull official image (GHCR)
docker pull ghcr.io/thesunnysinha/docpipe:latest

# API server — .env: provider API keys + optional DOCPIPE_OTEL_* (see Observability)
docker run -p 8000:8000 --env-file .env ghcr.io/thesunnysinha/docpipe:latest

# curl http://localhost:8000/health
# curl http://localhost:8000/metrics

# One-off parse / ingest
docker run -v ./data:/data ghcr.io/thesunnysinha/docpipe:latest parse /data/invoice.pdf
docker run --env-file .env -v ./data:/data ghcr.io/thesunnysinha/docpipe:latest \
  ingest /data/report.pdf --db "postgresql://..." --table docs

Standalone Docker Compose

# docker-compose.yml — docpipe API + pgvector (standalone)
services:
  docpipe:
    image: ghcr.io/thesunnysinha/docpipe:latest
    ports:
      - "8000:8000"
    env_file: .env
    volumes:
      - ./data:/data
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped

  db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: docpipe
      POSTGRES_PASSWORD: docpipe
      POSTGRES_DB: docpipe
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U docpipe"]
      interval: 5s
      timeout: 5s
      retries: 5
    restart: unless-stopped

volumes:
  pgdata:

# cp .env.example .env && docker compose up -d

Variable	Example / default	Purpose
OPENAI_API_KEY	sk-...	Embedding / LLM when using OpenAI
DB_CONNECTION_STRING	postgresql://docpipe:docpipe@db:5432/docpipe	pgvector DB (compose db service)
DB_TABLE_NAME	documents	Default collection name
EMBEDDING_PROVIDER	openai	Embedding vendor
EMBEDDING_MODEL	text-embedding-3-small	Embedding model id
LLM_PROVIDER	openai	RAG LLM vendor
LLM_MODEL	gpt-4o	RAG LLM model id
DOCPIPE_OTEL_ENABLED	false	OpenTelemetry traces (optional)
DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT	http://localhost:4318/v1/traces	OTLP HTTP endpoint
DOCPIPE_LOG_FORMAT	text	text or json logs
DOCPIPE_HEALTH_CHECK_DB	true	Probe Postgres on /health
DOCPIPE_ALLOW_PRIVATE_URLS	false	Allow ingest sources on private IPs (sidecar / internal object storage)

Image: ghcr.io/thesunnysinha/docpipe — tags: latest, semver (e.g. 0.5.2), sha-<commit>.
Pin a semver tag or digest in production if you need reproducible deploys; use pull_policy: always with :latest in dev.
Sidecar pattern: run docpipe on your app Docker network with no host port; callers use http://docpipe:8000 (service name).
Share an existing pgvector Postgres by setting DATABASE_URL on the docpipe service to your DB connection string.
Presigned URLs on private compose networks: set DOCPIPE_ALLOW_PRIVATE_URLS=true on the docpipe container.
HTTP Basic Auth on API routes (except /health and /metrics); configure DOCPIPE_USERNAME / DOCPIPE_PASSWORD to match your client.
Full stack with Adminer: docker-compose.full.yml in the docpipe repo.
Optional OTEL: DOCPIPE_OTEL_* — see Observability section. Scrape GET /metrics (no auth) for Prometheus.