Unstructured docs to answers in one pipeline
Parse documents, extract structured data with LLMs, ingest into your vector DB, and ask questions with 5 RAG strategies. Composable pipelines, no vendor lock-in.
$ pip install docpipe-sdk[all]

Four Pipelines, Fully Composable
Use each independently or chain them together. Your data, your DB, your LLM.
Documents (PDF, DOCX, images...) → Parse (Docling) → Extract (LangExtract / LangChain) → Ingest (pgvector + your DB) → RAG Query (5 strategies)

1. Parse Only (Docling)
Convert any document to clean text or markdown.
import docpipe
doc = docpipe.parse("report.pdf")
print(doc.markdown)

2. Extract Only (LangExtract)
Extract structured entities from any text with LLMs.
schema = docpipe.ExtractionSchema(
    description="Extract people and ages",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(text, schema)

3. Parse + Extract
Full pipeline: document to structured data in one call.
result = docpipe.run(
    "invoice.pdf", schema
)
print(result.extractions)

4. Parse + Ingest
Parse a document and ingest vectors into your DB.
config = docpipe.IngestionConfig(
    connection_string="postgresql://...",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("report.pdf", config=config)

5. Full Pipeline
Parse, extract, and ingest - all in one call.
result = docpipe.run(
    "contract.pdf", schema,
    ingestion_config=config,
)

6. RAG Query
Ask questions against your ingested documents with grounded answers and source citations.
rag_cfg = docpipe.RAGConfig(
    connection_string="postgresql://...",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
)
result = docpipe.rag(
    "What is the invoice total?",
    config=rag_cfg,
)
print(result.answer)   # grounded answer with citations
print(result.sources)  # ["invoice.pdf"]

5 Retrieval Strategies — Pick What Fits
Switch strategy with one config field. Mix with reranking and structured output.
Naive: standard cosine similarity search. Fast, reliable baseline.
Best for: well-formed queries, fast responses

Multi-Query: expands your query into N variants via LLM, retrieves for each, then deduplicates and ranks results.
Best for: vague or short queries

HyDE: has the LLM draft a hypothetical answer, then retrieves by that answer's embedding instead of the raw query's.
Best for: queries phrased very differently from the documents

Parent-Document: retrieves seed chunks, then expands context by fetching additional chunks from the same source documents.
Best for: long documents, context coherence

Hybrid: combines dense vector search with sparse BM25 keyword retrieval via EnsembleRetriever. Best of both worlds.
Best for: exact terms, proper nouns, IDs

Reranking:

rag_cfg = docpipe.RAGConfig(
    ...,
    strategy="naive",
    reranker="flashrank",  # local, no API key
    rerank_top_n=5,
)
# Retrieve top-50, rerank, keep top-5

Structured output:

class Invoice(BaseModel):
    total: float
    currency: str

result = docpipe.rag(
    "What is the total?",
    config=docpipe.RAGConfig(
        ..., output_model=Invoice
    ),
)
invoice = result.structured
# Invoice(total=4250.0, currency='USD')

Built for Production
Everything you need to go from raw documents to grounded answers at scale.
Plugin Architecture
Add custom parsers and extractors via Python entry points. Third-party packages auto-discovered on install.
CLI + API Server
Full CLI for scripting, FastAPI server for microservices, Docker image for deployment.
Fully Configurable
No magic defaults. Explicit LLM provider, embedding model, and DB connection. YAML + env vars.
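A YAML config might look like the fragment below; the file layout and env-var interpolation syntax are assumptions for illustration, only the field names are taken from the Python examples above:

```yaml
# Hypothetical layout; field names mirror RAGConfig above.
rag:
  connection_string: ${DATABASE_URL}
  table_name: docs
  embedding_provider: openai
  embedding_model: text-embedding-3-small
  llm_provider: openai
  llm_model: gpt-4o
  strategy: hyde
```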
LangChain Backbone
Built on LangChain for embeddings, text splitting, and vector stores. Supports OpenAI, Gemini, Ollama, HuggingFace.
20+ Document Formats
PDF, DOCX, XLSX, PPTX, HTML, images, audio, video - powered by IBM Docling's advanced parsing.
5 RAG Strategies
naive, HyDE, multi-query, parent-document, hybrid — swap with one config field. Add reranking optionally.
Built-in Evaluation
Measure hit rate, MRR, faithfulness, and answer similarity. Know if your RAG is actually working.
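The retrieval-side metrics are standard and easy to hand-roll; this sketch shows what hit rate and MRR measure (docpipe's own evaluation API may expose them differently):

```python
# Hand-rolled versions of two retrieval metrics: hit rate and MRR.
# Illustrative only; docpipe's evaluator may differ in interface.

def hit_rate(ranked_ids: list, relevant_id: str, k: int = 5) -> float:
    """1.0 if the relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mrr(ranked_ids: list, relevant_id: str) -> float:
    """Reciprocal rank of the first relevant result (0.0 if absent)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


# Two queries: relevant chunk at rank 2 (MRR 0.5), then missing (MRR 0.0).
queries = [(["c2", "c7", "c1"], "c7"), (["c4", "c3"], "c9")]
print(sum(mrr(r, t) for r, t in queries) / len(queries))  # 0.25
```

Averaged over an evaluation set, these tell you whether the retriever is even surfacing the right chunks before you worry about LLM-side faithfulness.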
Zero Vendor Lock-in
docpipe never stores your data. It connects to your DB, calls your LLM API, then gets out of the way.
Use It Your Way
CLI for quick tasks, Python API for integration, Docker for deployment.
# Parse a document
$ docpipe parse invoice.pdf --format markdown
# Ingest into your vector DB
$ docpipe ingest report.pdf \
--db "postgresql://..." \
--table docs \
--embedding-provider openai \
--embedding-model text-embedding-3-small \
--incremental
# Start API server
$ docpipe serve --port 8000

pip install docpipe-sdk              # Core only
pip install docpipe-sdk[docling] # + Document parsing (20+ formats)
pip install docpipe-sdk[langextract] # + Google LangExtract
pip install docpipe-sdk[openai] # + OpenAI embeddings & LLM
pip install docpipe-sdk[google] # + Google Gemini
pip install docpipe-sdk[pgvector] # + PostgreSQL vector store
pip install docpipe-sdk[rag] # + Hybrid search (BM25)
pip install docpipe-sdk[rerank] # + Local reranking (FlashRank)
pip install docpipe-sdk[server] # + FastAPI server
pip install docpipe-sdk[all] # Everything