Getting Started
Documentation
SDK reference for parse, extract, ingest, RAG, observability, and deployment — aligned with the docpipe README.
Getting Started
docpipe connects document parsing (Docling / GLM-OCR), structured extraction, vector ingestion, and RAG into composable pipelines. It never stores your data — it connects to your DB and LLM APIs.
Four pipelines you can use independently or together: Parse → Extract → Ingest → RAG. Install extras for only what you need, then use the Python SDK, CLI, or FastAPI server.
- Parse — unstructured docs → markdown/text
- Extract — text → structured entities (LangExtract / LangChain)
- Ingest — chunks → embeddings → pgvector or turbovec
- RAG — questions → grounded answers with citations
import docpipe
doc = docpipe.parse("invoice.pdf")
config = docpipe.IngestionConfig(
connection_string="postgresql://user:pass@localhost:5432/mydb",
table_name="docs",
embedding_provider="openai",
embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)
result = docpipe.query("What is the invoice total?", config=docpipe.RAGConfig(
connection_string=config.connection_string,
table_name=config.table_name,
embedding_provider="openai",
embedding_model="text-embedding-3-small",
llm_provider="openai",
llm_model="gpt-4o",
))
print(result.answer)Install
Install from PyPI with optional extras. Match extras to your pipeline (parser, embeddings, server, observability).
pip install docpipe-sdk # Core only
pip install "docpipe-sdk[docling]" # + Docling parser (PDF, DOCX, images, ...)
pip install "docpipe-sdk[glm-ocr]" # + GLM-OCR parser (state-of-the-art OCR)
pip install "docpipe-sdk[langextract]" # + Google LangExtract
pip install "docpipe-sdk[openai]" # + OpenAI embeddings & LLM
pip install "docpipe-sdk[anthropic]" # + Anthropic Claude
pip install "docpipe-sdk[google]" # + Google Gemini
pip install "docpipe-sdk[ollama]" # + Ollama (local models)
pip install "docpipe-sdk[huggingface]" # + HuggingFace embeddings
pip install "docpipe-sdk[pgvector]" # + PostgreSQL vector store (default)
pip install "docpipe-sdk[turbovec]" # + Optional local turbovec file indices
pip install "docpipe-sdk[rag]" # + Hybrid search (BM25 + langchain-classic)
pip install "docpipe-sdk[rerank]" # + Local reranking (FlashRank)
pip install "docpipe-sdk[server]" # + FastAPI server (/health, /metrics)
pip install "docpipe-sdk[observability]" # + OpenTelemetry traces + JSON logs
pip install "docpipe-sdk[http]" # + Python HTTP client for the API
pip install "docpipe-sdk[all]" # EverythingFor API server + OpenTelemetry traces and JSON logs, use pip install "docpipe-sdk[server,observability]".
Parse
Convert PDFs, Office files, HTML, and images to markdown or plain text. Default parser is Docling; use glm-ocr for scanned or image-heavy documents.
import docpipe
# Default: Docling parser
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)
print(doc.text)
# GLM-OCR (scanned / image-heavy documents)
doc = docpipe.parse("scanned_report.pdf", parser="glm-ocr")
print(doc.markdown)- CLI: docpipe parse invoice.pdf --format markdown
- API: POST /parse with source URL or path
- Parsers: docling (broad format support), glm-ocr (multimodal OCR)
Extract
Pull structured entities from text using LangExtract or LangChain with_structured_output. Define a schema describing fields to extract.
import docpipe
schema = docpipe.ExtractionSchema(
description="Extract invoice line items with amounts",
model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)
for r in results:
print(r.entity_class, r.text, r.attributes)
# Full parse + extract
result = docpipe.run("invoice.pdf", schema)
print(result.parsed.markdown)
print(result.extractions)- CLI: docpipe extract "text" --schema schema.yaml --model gemini-2.5-flash
- API: POST /extract, POST /run (parse + extract)
Ingest
Chunk documents, embed with your chosen provider, and store in PostgreSQL pgvector (default) or optional turbovec on-disk indices.
import docpipe
config = docpipe.IngestionConfig(
connection_string="postgresql://user:pass@localhost:5432/mydb",
table_name="invoices",
embedding_provider="openai",
embedding_model="text-embedding-3-small",
incremental=True, # skip unchanged files by SHA-256 hash
)
docpipe.ingest("invoice.pdf", config=config)Set incremental=True to skip files already ingested with the same SHA-256 hash. DELETE /ingest removes chunks by exact source or path fragment (match_mode: contains).
- CLI: docpipe ingest report.pdf --db ... --table docs --incremental
- API: POST /ingest, DELETE /ingest
- Embeddings: OpenAI, Google Gemini, Ollama, HuggingFace
RAG
Ask questions against ingested documents. Six retrieval strategies, optional reranking, conversation history, metadata filters, structured output, and SSE streaming.
import docpipe
rag_config = docpipe.RAGConfig(
connection_string="postgresql://user:pass@localhost:5432/mydb",
table_name="invoices",
embedding_provider="openai",
embedding_model="text-embedding-3-small",
llm_provider="openai",
llm_model="gpt-4o",
strategy="hyde",
reranker="flashrank",
)
result = docpipe.query("What is the total amount on the invoice?", config=rag_config)
print(result.answer)
print(result.sources)
print(result.chunks)| Strategy | Description |
|---|---|
| naive | Simple cosine similarity search. Fast and reliable for well-formed queries. |
| hyde | LLM generates a hypothetical answer, embeds it for retrieval. Highest accuracy on complex questions. |
| multi_query | Expands query into N variants, merges and deduplicates results. Best for vague or short queries. |
| parent_document | Retrieves seed chunks then expands context window per source. Best for long documents. |
| hybrid | Combines dense vector search with BM25 keyword matching. Best for exact terms, IDs, and proper nouns. |
| auto | LLM classifies the question and dispatches to the optimal strategy automatically. |
- CLI: docpipe rag query "..." --strategy hyde --reranker flashrank
- API: POST /rag/query (JSON), POST /rag/stream (SSE)
- Multi-turn: pass history: [{role, content}, ...] on query/stream
- Filters: filters: {"source": "report.pdf"} on search/RAG
Observability
OpenTelemetry traces, JSON logs, health checks with dependency probes, and Prometheus metrics on GET /metrics (no auth).
# pip install "docpipe-sdk[server,observability]"
DOCPIPE_OTEL_ENABLED=true
DOCPIPE_OTEL_SERVICE_NAME=docpipe
DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318/v1/traces
DOCPIPE_OTEL_TRACES_SAMPLER_ARG=1.0
DOCPIPE_LOG_FORMAT=json
DOCPIPE_HEALTH_CHECK_DB=true
$ docpipe serve
# curl http://localhost:8000/health # plugins + dependency status
# curl http://localhost:8000/metrics # Prometheus (no auth on /metrics)| Variable | Default | Purpose |
|---|---|---|
| DOCPIPE_OTEL_ENABLED | false | Export traces via OTLP/HTTP |
| DOCPIPE_OTEL_SERVICE_NAME | docpipe | service.name resource |
| DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT | — | e.g. http://localhost:4318/v1/traces |
| DOCPIPE_LOG_FORMAT | text | json for structured logs |
| DOCPIPE_HEALTH_CHECK_DB | true | SELECT 1 when DB URL set |
Turbovec
Optional compressed on-disk vector indices when you do not want pgvector in Postgres. Good for local prototypes and air-gapped RAG; production Postgres deployments should use pgvector.
# pip install "docpipe-sdk[turbovec,openai]" # + your embedding provider
export DOCPIPE_VECTOR_BACKEND=turbovec
export DOCPIPE_TURBVEC_INDEX_DIR=./.docpipe/indices # default on-disk index root
# Per-request override on ingest / search / RAG API bodies:
# { "vector_backend": "turbovec", "table_name": "my_library", ... }
import docpipe
config = docpipe.IngestionConfig(
connection_string="postgresql://unused", # accepted; vectors use local files
table_name="my_library", # index folder name under TURBVEC_INDEX_DIR
embedding_provider="openai",
embedding_model="text-embedding-3-small",
vector_backend="turbovec",
)
docpipe.ingest("invoice.pdf", config=config)
# → ./.docpipe/indices/my_library/index.tvim + docstore.json
# Default pgvector in PostgreSQL is recommended for production deployments.API Reference
FastAPI server: docpipe serve --host 0.0.0.0 --port 8000. HTTP Basic Auth on all routes except GET /health and GET /metrics.
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check, plugins, dependency status |
| GET | /metrics | Prometheus metrics (no auth) |
| POST | /parse | Parse a document |
| POST | /extract | Extract structured data |
| POST | /run | Parse + extract |
| POST | /ingest | Ingest into vector DB |
| DELETE | /ingest | Remove chunks for a source |
| POST | /search | Vector similarity search (filters) |
| POST | /rag/query | RAG Q&A (history, filters, usage) |
| POST | /rag/stream | Streaming RAG (SSE) |
| POST | /generate | Plain LLM completion (no retrieval) |
| POST | /evaluate/run | Evaluate RAG quality |
| GET | /plugins | List registered plugins |
from docpipe.http import DocpipeClient
with DocpipeClient("http://localhost:8000", username="admin", password="docpipe") as client:
print(client.health())
result = client.rag_query({...})
print(result.get("usage"))Docker / Production
Official GHCR image, standalone compose with pgvector, and production sidecar deployment notes.
Built from docpipe/Dockerfile on python:3.12-slim; default entrypoint runs docpipe serve on port 8000.
# Pull official image (GHCR)
docker pull ghcr.io/thesunnysinha/docpipe:latest
# API server — .env: provider API keys + optional DOCPIPE_OTEL_* (see Observability)
docker run -p 8000:8000 --env-file .env ghcr.io/thesunnysinha/docpipe:latest
# curl http://localhost:8000/health
# curl http://localhost:8000/metrics
# One-off parse / ingest
docker run -v ./data:/data ghcr.io/thesunnysinha/docpipe:latest parse /data/invoice.pdf
docker run --env-file .env -v ./data:/data ghcr.io/thesunnysinha/docpipe:latest \
ingest /data/report.pdf --db "postgresql://..." --table docs# docker-compose.yml — docpipe API + pgvector (standalone)
services:
docpipe:
image: ghcr.io/thesunnysinha/docpipe:latest
ports:
- "8000:8000"
env_file: .env
volumes:
- ./data:/data
depends_on:
db:
condition: service_healthy
restart: unless-stopped
db:
image: pgvector/pgvector:pg16
environment:
POSTGRES_USER: docpipe
POSTGRES_PASSWORD: docpipe
POSTGRES_DB: docpipe
ports:
- "5432:5432"
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U docpipe"]
interval: 5s
timeout: 5s
retries: 5
restart: unless-stopped
volumes:
pgdata:
# cp .env.example .env && docker compose up -d| Variable | Example / default | Purpose |
|---|---|---|
| OPENAI_API_KEY | sk-... | Embedding / LLM when using OpenAI |
| DB_CONNECTION_STRING | postgresql://docpipe:docpipe@db:5432/docpipe | pgvector DB (compose db service) |
| DB_TABLE_NAME | documents | Default collection name |
| EMBEDDING_PROVIDER | openai | Embedding vendor |
| EMBEDDING_MODEL | text-embedding-3-small | Embedding model id |
| LLM_PROVIDER | openai | RAG LLM vendor |
| LLM_MODEL | gpt-4o | RAG LLM model id |
| DOCPIPE_OTEL_ENABLED | false | OpenTelemetry traces (optional) |
| DOCPIPE_OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4318/v1/traces | OTLP HTTP endpoint |
| DOCPIPE_LOG_FORMAT | text | text or json logs |
| DOCPIPE_HEALTH_CHECK_DB | true | Probe Postgres on /health |
| DOCPIPE_ALLOW_PRIVATE_URLS | false | Allow ingest sources on private IPs (sidecar / internal object storage) |
- Image: ghcr.io/thesunnysinha/docpipe — tags: latest, semver (e.g. 0.5.2), sha-<commit>.
- Pin a semver tag or digest in production if you need reproducible deploys; use pull_policy: always with :latest in dev.
- Sidecar pattern: run docpipe on your app Docker network with no host port; callers use http://docpipe:8000 (service name).
- Share an existing pgvector Postgres by setting DATABASE_URL on the docpipe service to your DB connection string.
- Presigned URLs on private compose networks: set DOCPIPE_ALLOW_PRIVATE_URLS=true on the docpipe container.
- HTTP Basic Auth on API routes (except /health and /metrics); configure DOCPIPE_USERNAME / DOCPIPE_PASSWORD to match your client.
- Full stack with Adminer: docker-compose.full.yml in the docpipe repo.
- Optional OTEL: DOCPIPE_OTEL_* — see Observability section. Scrape GET /metrics (no auth) for Prometheus.