Documentation

SDK reference for parse, extract, ingest, RAG, observability, and deployment — aligned with the docpipe README.

Getting Started

docpipe connects document parsing (Docling / GLM-OCR), structured extraction, vector ingestion, and RAG into composable pipelines. It never stores your data — it connects to your DB and LLM APIs.

The four pipelines can be used independently or chained together: Parse → Extract → Ingest → RAG. Install only the extras you need, then drive them from the Python SDK, the CLI, or the FastAPI server.

  • Parse — unstructured docs → markdown/text
  • Extract — text → structured entities (LangExtract / LangChain)
  • Ingest — chunks → embeddings → pgvector or turbovec
  • RAG — questions → grounded answers with citations
Minimal flow
import docpipe

doc = docpipe.parse("invoice.pdf")
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)
result = docpipe.query("What is the invoice total?", config=docpipe.RAGConfig(
    connection_string=config.connection_string,
    table_name=config.table_name,
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
))
print(result.answer)

Install

Install from PyPI with optional extras. Match extras to your pipeline (parser, embeddings, server, observability).

pip extras
pip install docpipe-sdk                   # Core only
pip install "docpipe-sdk[docling]"        # + Docling parser (PDF, DOCX, images, ...)
pip install "docpipe-sdk[glm-ocr]"        # + GLM-OCR parser (state-of-the-art OCR)
pip install "docpipe-sdk[langextract]"    # + Google LangExtract
pip install "docpipe-sdk[openai]"         # + OpenAI embeddings & LLM
pip install "docpipe-sdk[anthropic]"      # + Anthropic Claude
pip install "docpipe-sdk[google]"         # + Google Gemini
pip install "docpipe-sdk[ollama]"         # + Ollama (local models)
pip install "docpipe-sdk[huggingface]"    # + HuggingFace embeddings
pip install "docpipe-sdk[pgvector]"       # + PostgreSQL vector store (default)
pip install "docpipe-sdk[turbovec]"       # + Optional local turbovec file indices
pip install "docpipe-sdk[rag]"            # + Hybrid search (BM25 + langchain-classic)
pip install "docpipe-sdk[rerank]"         # + Local reranking (FlashRank)
pip install "docpipe-sdk[server]"         # + FastAPI server (/health, /metrics)
pip install "docpipe-sdk[observability]"  # + OpenTelemetry traces + JSON logs
pip install "docpipe-sdk[http]"           # + Python HTTP client for the API
pip install "docpipe-sdk[all]"            # Everything

For API server + OpenTelemetry traces and JSON logs, use pip install "docpipe-sdk[server,observability]".
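
Because each extra pulls in separate third-party packages, it can be useful to check at startup which ones are importable. The sketch below uses only the standard library. The mapping from extra name to import name is an assumption made for illustration (this page does not list what each extra installs), so adjust it to your environment.

```python
from importlib.util import find_spec

# Hypothetical mapping: extra name -> top-level module it is assumed to provide.
ASSUMED_MODULES = {
    "docling": "docling",
    "openai": "openai",
    "anthropic": "anthropic",
    "server": "fastapi",
}


def installed_extras(modules: dict[str, str] = ASSUMED_MODULES) -> dict[str, bool]:
    """Return {extra: True/False} depending on whether its module can be found."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, ok in installed_extras().items():
        print(f"{extra:<10} {'installed' if ok else 'missing'}")
```

A check like this only reports whether a module can be located; it does not import it or verify its version.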

Parse

Convert PDFs, Office files, HTML, and images to markdown or plain text. Default parser is Docling; use glm-ocr for scanned or image-heavy documents.

Python
import docpipe

# Default: Docling parser
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)
print(doc.text)

# GLM-OCR (scanned / image-heavy documents)
doc = docpipe.parse("scanned_report.pdf", parser="glm-ocr")
print(doc.markdown)
  • CLI: docpipe parse invoice.pdf --format markdown
  • API: POST /parse with source URL or path
  • Parsers: docling (broad format support), glm-ocr (multimodal OCR)
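
The guidance above (Docling by default, GLM-OCR for scanned or image-heavy input) can be turned into a small selection helper. This is only a heuristic sketch: the suffix list and the `scanned` flag are assumptions, and docpipe itself may not pick a parser this way.

```python
from pathlib import Path

# Assumption: pure image files are treated as "scanned" input.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def choose_parser(path: str, scanned: bool = False) -> str:
    """Return "glm-ocr" for images or known-scanned files, else "docling"."""
    if scanned or Path(path).suffix.lower() in IMAGE_SUFFIXES:
        return "glm-ocr"
    return "docling"


print(choose_parser("scan.PNG"))      # image file
print(choose_parser("invoice.pdf"))   # regular document
```

The returned name could then be passed as `docpipe.parse(path, parser=choose_parser(path))`, assuming `"docling"` is accepted as an explicit `parser` value, as the parser list suggests.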

Extract

Pull structured entities from text using LangExtract or LangChain with_structured_output. Define a schema describing fields to extract.

Python
import docpipe

# Parse first so there is text to extract from
doc = docpipe.parse("invoice.pdf")

schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)
for r in results:
    print(r.entity_class, r.text, r.attributes)

# Full parse + extract
result = docpipe.run("invoice.pdf", schema)
print(result.parsed.markdown)
print(result.extractions)
  • CLI: docpipe extract "text" --schema schema.yaml --model gemini-2.5-flash
  • API: POST /extract, POST /run (parse + extract)
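
Extraction results are easier to consume once grouped by entity class. The sketch below uses a hypothetical `Extraction` dataclass as a stand-in for the items returned by `docpipe.extract()`; it mirrors only the three attributes printed in the example above (`entity_class`, `text`, `attributes`) and makes no claim about the real return type.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Extraction:
    """Hypothetical stand-in for one item returned by docpipe.extract()."""
    entity_class: str
    text: str
    attributes: dict = field(default_factory=dict)


def group_by_class(results):
    """Group extraction texts by their entity_class, preserving input order."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.entity_class].append(r.text)
    return dict(grouped)


# Made-up sample data for illustration only.
sample = [
    Extraction("line_item", "Widget A", {"amount": "12.50"}),
    Extraction("line_item", "Widget B", {"amount": "7.00"}),
    Extraction("total", "19.50"),
]
print(group_by_class(sample))
```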

Ingest

Chunk documents, embed with your chosen provider, and store in PostgreSQL pgvector (default) or optional turbovec on-disk indices.

Python
import docpipe

config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    incremental=True,  # skip unchanged files by SHA-256 hash
)
docpipe.ingest("invoice.pdf", config=config)

Set incremental=True to skip files already ingested with the same SHA-256 hash. DELETE /ingest removes chunks by exact source or path fragment (match_mode: contains).
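
The idea behind hash-based incremental ingestion can be sketched with the standard library. This is an illustration of the technique, not docpipe's implementation: where docpipe records hashes is not described here, so the `seen` dictionary is a stand-in for that store.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large documents need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_ingest(path: Path, seen: dict[str, str]) -> bool:
    """Return True if the file is new or changed, and record its hash in `seen`."""
    current = sha256_of(path)
    if seen.get(str(path)) == current:
        return False
    seen[str(path)] = current
    return True


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "invoice.txt"
    doc.write_text("total: 19.50")
    seen = {}
    first = needs_ingest(doc, seen)    # new file
    second = needs_ingest(doc, seen)   # unchanged content
    doc.write_text("total: 21.00")
    third = needs_ingest(doc, seen)    # content changed
print(first, second, third)
```

Hashing content rather than comparing modification times means a file that is re-saved without changes is still skipped.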

  • CLI: docpipe ingest report.pdf --db ... --table docs --incremental
  • API: POST /ingest, DELETE /ingest
  • Embeddings: OpenAI, Google Gemini, Ollama, HuggingFace

RAG

Ask questions against ingested documents. Six retrieval strategies, optional reranking, conversation history, metadata filters, structured output, and SSE streaming.

Python query
import docpipe

rag_config = docpipe.RAGConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
    reranker="flashrank",
)
result = docpipe.query("What is the total amount on the invoice?", config=rag_config)
print(result.answer)
print(result.sources)
print(result.chunks)
| Strategy | Description |
|---|---|
| naive | Simple cosine similarity search. Fast and reliable for well-formed queries. |
| hyde | LLM generates a hypothetical answer, embeds it for retrieval. Highest accuracy on complex questions. |
| multi_query | Expands query into N variants, merges and deduplicates results. Best for vague or short queries. |
| parent_document | Retrieves seed chunks then expands context window per source. Best for long documents. |
| hybrid | Combines dense vector search with BM25 keyword matching. Best for exact terms, IDs, and proper nouns. |
| auto | LLM classifies the question and dispatches to the optimal strategy automatically. |
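
To make the `naive` strategy concrete, here is a minimal sketch of cosine-similarity ranking in pure Python. The vectors are toy values, and the function names are illustrative; in docpipe the search runs inside the vector store rather than in application code like this.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Rank (text, vector) pairs by similarity to the query and keep the best k."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]


# Toy 3-dimensional "embeddings"; real models produce far more dimensions.
chunks = [
    ("Invoice total: $19.50", [0.9, 0.1, 0.0]),
    ("Shipping address", [0.0, 0.2, 0.9]),
    ("Amount due on receipt", [0.7, 0.6, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], chunks))
```

The other strategies change what gets embedded or how candidate lists are merged, but each still ends in a ranking step of this kind.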
  • CLI: docpipe rag query "..." --strategy hyde --reranker flashrank
  • API: POST /rag/query (JSON), POST /rag/stream (SSE)
  • Multi-turn: pass history: [{role, content}, ...] on query/stream
  • Filters: filters: {"source": "report.pdf"} on search/RAG
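
For clients that consume POST /rag/stream without a helper library, the Server-Sent Events framing can be parsed by hand. The sketch below follows the generic SSE wire format (`data:` lines, blank line as event terminator, `:` comment lines). The payload of each docpipe event is not documented here, so it is treated as an opaque string.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete Server-Sent Event.

    `lines` is any iterable of text lines. Fields other than `data:`
    (such as `event:` or `id:`) are ignored in this sketch.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                 # blank line ends the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):     # comment / keep-alive line
            continue
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event without a terminating blank line is incomplete and is dropped.


stream = ["data: The total", "", ": keep-alive", "data: is $19.50", "data: USD", ""]
print(list(iter_sse_data(stream)))
```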

Observability

OpenTelemetry traces, JSON logs, health checks with dependency probes, and Prometheus metrics on GET /metrics (no auth).

Environment
# pip install "docpipe-sdk[server,observability]"
DOCPIPE_OTEL_ENABLED=true
DOCPIPE_OTEL_SERVICE_NAME=docpipe
DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318/v1/traces
DOCPIPE_OTEL_TRACES_SAMPLER_ARG=1.0
DOCPIPE_LOG_FORMAT=json
DOCPIPE_HEALTH_CHECK_DB=true

$ docpipe serve

# curl http://localhost:8000/health   # plugins + dependency status
# curl http://localhost:8000/metrics # Prometheus (no auth on /metrics)
| Variable | Default | Purpose |
|---|---|---|
| DOCPIPE_OTEL_ENABLED | false | Export traces via OTLP/HTTP |
| DOCPIPE_OTEL_SERVICE_NAME | docpipe | service.name resource |
| DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT | | e.g. http://localhost:4318/v1/traces |
| DOCPIPE_LOG_FORMAT | text | json for structured logs |
| DOCPIPE_HEALTH_CHECK_DB | true | SELECT 1 when DB URL set |
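
Application code that wants to mirror these settings can read the same variables with the defaults listed above. This is a sketch, not docpipe's own loader: the `ObservabilitySettings` class, the `load_settings` function, and the set of accepted boolean spellings are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumed truthy spellings; docpipe's own parsing may differ.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ObservabilitySettings:
    otel_enabled: bool
    service_name: str
    otlp_endpoint: Optional[str]
    log_format: str
    health_check_db: bool


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Read the DOCPIPE_* variables, falling back to the documented defaults."""
    return ObservabilitySettings(
        otel_enabled=_as_bool(env.get("DOCPIPE_OTEL_ENABLED", "false")),
        service_name=env.get("DOCPIPE_OTEL_SERVICE_NAME", "docpipe"),
        otlp_endpoint=env.get("DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT"),
        log_format=env.get("DOCPIPE_LOG_FORMAT", "text"),
        health_check_db=_as_bool(env.get("DOCPIPE_HEALTH_CHECK_DB", "true")),
    )


print(load_settings({"DOCPIPE_OTEL_ENABLED": "true", "DOCPIPE_LOG_FORMAT": "json"}))
```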

Turbovec

Optional compressed on-disk vector indices when you do not want pgvector in Postgres. Good for local prototypes and air-gapped RAG; production Postgres deployments should use pgvector.

Setup
# pip install "docpipe-sdk[turbovec,openai]"   # + your embedding provider
export DOCPIPE_VECTOR_BACKEND=turbovec
export DOCPIPE_TURBVEC_INDEX_DIR=./.docpipe/indices   # default on-disk index root

# Per-request override on ingest / search / RAG API bodies:
# { "vector_backend": "turbovec", "table_name": "my_library", ... }

import docpipe

config = docpipe.IngestionConfig(
    connection_string="postgresql://unused",  # accepted; vectors use local files
    table_name="my_library",                  # index folder name under TURBVEC_INDEX_DIR
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    vector_backend="turbovec",
)
docpipe.ingest("invoice.pdf", config=config)
# → ./.docpipe/indices/my_library/index.tvim + docstore.json

# Default pgvector in PostgreSQL is recommended for production deployments.
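
The on-disk layout noted in the comment above (one folder per `table_name` under the index root) can be computed ahead of time, for example to check whether an index exists before querying. The helper name `index_files` is hypothetical; the file names and default root are taken from the setup notes above.

```python
import os
from pathlib import Path
from typing import Mapping

DEFAULT_INDEX_DIR = "./.docpipe/indices"  # default index root from the setup notes


def index_files(table_name: str, env: Mapping[str, str] = os.environ) -> tuple[Path, Path]:
    """Return the expected (index, docstore) paths for a turbovec table."""
    root = Path(env.get("DOCPIPE_TURBVEC_INDEX_DIR", DEFAULT_INDEX_DIR))
    folder = root / table_name
    return folder / "index.tvim", folder / "docstore.json"


index_path, docstore_path = index_files("my_library", env={})
print(index_path, docstore_path)
print("index present:", index_path.exists())
```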

API Reference

FastAPI server: docpipe serve --host 0.0.0.0 --port 8000. HTTP Basic Auth on all routes except GET /health and GET /metrics.

| Method | Path | Description |
|---|---|---|
| GET | /health | Health check, plugins, dependency status |
| GET | /metrics | Prometheus metrics (no auth) |
| POST | /parse | Parse a document |
| POST | /extract | Extract structured data |
| POST | /run | Parse + extract |
| POST | /ingest | Ingest into vector DB |
| DELETE | /ingest | Remove chunks for a source |
| POST | /search | Vector similarity search (filters) |
| POST | /rag/query | RAG Q&A (history, filters, usage) |
| POST | /rag/stream | Streaming RAG (SSE) |
| POST | /generate | Plain LLM completion (no retrieval) |
| POST | /evaluate/run | Evaluate RAG quality |
| GET | /plugins | List registered plugins |
Python HTTP client
from docpipe.http import DocpipeClient

with DocpipeClient("http://localhost:8000", username="admin", password="docpipe") as client:
    print(client.health())
    result = client.rag_query({...})
    print(result.get("usage"))

Docker / Production

Official GHCR image, standalone compose with pgvector, and production sidecar deployment notes.

Built from docpipe/Dockerfile on python:3.12-slim; default entrypoint runs docpipe serve on port 8000.

Pull & run API
# Pull official image (GHCR)
docker pull ghcr.io/thesunnysinha/docpipe:latest

# API server — .env: provider API keys + optional DOCPIPE_OTEL_* (see Observability)
docker run -p 8000:8000 --env-file .env ghcr.io/thesunnysinha/docpipe:latest

# curl http://localhost:8000/health
# curl http://localhost:8000/metrics

# One-off parse / ingest
docker run -v ./data:/data ghcr.io/thesunnysinha/docpipe:latest parse /data/invoice.pdf
docker run --env-file .env -v ./data:/data ghcr.io/thesunnysinha/docpipe:latest \
  ingest /data/report.pdf --db "postgresql://..." --table docs
Standalone Docker Compose
# docker-compose.yml — docpipe API + pgvector (standalone)
services:
  docpipe:
    image: ghcr.io/thesunnysinha/docpipe:latest
    ports:
      - "8000:8000"
    env_file: .env
    volumes:
      - ./data:/data
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped

  db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: docpipe
      POSTGRES_PASSWORD: docpipe
      POSTGRES_DB: docpipe
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U docpipe"]
      interval: 5s
      timeout: 5s
      retries: 5
    restart: unless-stopped

volumes:
  pgdata:

# cp .env.example .env && docker compose up -d
| Variable | Example / default | Purpose |
|---|---|---|
| OPENAI_API_KEY | sk-... | Embedding / LLM when using OpenAI |
| DB_CONNECTION_STRING | postgresql://docpipe:docpipe@db:5432/docpipe | pgvector DB (compose db service) |
| DB_TABLE_NAME | documents | Default collection name |
| EMBEDDING_PROVIDER | openai | Embedding vendor |
| EMBEDDING_MODEL | text-embedding-3-small | Embedding model id |
| LLM_PROVIDER | openai | RAG LLM vendor |
| LLM_MODEL | gpt-4o | RAG LLM model id |
| DOCPIPE_OTEL_ENABLED | false | OpenTelemetry traces (optional) |
| DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4318/v1/traces | OTLP HTTP endpoint |
| DOCPIPE_LOG_FORMAT | text | text or json logs |
| DOCPIPE_HEALTH_CHECK_DB | true | Probe Postgres on /health |
| DOCPIPE_ALLOW_PRIVATE_URLS | false | Allow ingest sources on private IPs (sidecar / internal object storage) |
  • Image: ghcr.io/thesunnysinha/docpipe — tags: latest, semver (e.g. 0.5.2), sha-<commit>.
  • Pin a semver tag or digest in production if you need reproducible deploys; use pull_policy: always with :latest in dev.
  • Sidecar pattern: run docpipe on your app Docker network with no host port; callers use http://docpipe:8000 (service name).
  • Share an existing pgvector Postgres by setting DATABASE_URL on the docpipe service to your DB connection string.
  • Presigned URLs on private compose networks: set DOCPIPE_ALLOW_PRIVATE_URLS=true on the docpipe container.
  • HTTP Basic Auth on API routes (except /health and /metrics); configure DOCPIPE_USERNAME / DOCPIPE_PASSWORD to match your client.
  • Full stack with Adminer: docker-compose.full.yml in the docpipe repo.
  • Optional OTEL: DOCPIPE_OTEL_* — see Observability section. Scrape GET /metrics (no auth) for Prometheus.
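
In the sidecar pattern, the calling application may start before docpipe is ready. A small readiness poll against GET /health (which needs no auth) is one way to handle that. This sketch assumes a healthy server answers with HTTP 200, which this page does not state explicitly; the function names are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200 (assumed healthy status)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url, attempts=10, delay=1.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until the probe succeeds; return False after `attempts` failures."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:   # no need to sleep after the final attempt
            sleep(delay)
    return False


if __name__ == "__main__":
    # Service name and port follow the sidecar bullet above.
    print(wait_until_healthy("http://docpipe:8000/health"))
```

The `probe` and `sleep` parameters are injectable so the loop can be exercised in tests without a running server.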