docpipe
Powered by Docling · LangExtract · LangChain · RAGPipeline

Unstructured docs to answers in one pipeline

Parse documents, extract structured data with LLMs, ingest into your vector DB, and ask questions with 5 RAG strategies. Composable pipelines, no vendor lock-in.
View on GitHub · PyPI Package
$ pip install docpipe-sdk[all]
MIT License
Python 3.10+
4 Pipelines
5 RAG Strategies
Zero Vendor Lock-in

Four Pipelines, Fully Composable

Use each independently or chain them together. Your data, your DB, your LLM.

📄

Documents

PDF, DOCX, images...
🔍

Parse

Docling

Extract

LangExtract / LangChain
🗃

Ingest

pgvector + your DB
🤖

RAG Query

5 strategies
1. Parse Only (Docling)

Convert any document to clean text or markdown.

import docpipe

doc = docpipe.parse("report.pdf")
print(doc.markdown)
2. Extract Only (LangExtract)

Extract structured entities from any text with LLMs.

schema = docpipe.ExtractionSchema(
    description="Extract people and ages",
    model_id="gemini-2.5-flash",
)
# text: any string, e.g. doc.markdown from a parse
results = docpipe.extract(text, schema)
3. Parse + Extract

Full pipeline: document to structured data in one call.

result = docpipe.run(
    "invoice.pdf", schema
)
print(result.extractions)
4. Parse + Ingest

Parse a document and ingest vectors into your DB.

config = docpipe.IngestionConfig(
    connection_string="postgresql://...",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("report.pdf", config=config)
5. Full Pipeline

Parse, extract, and ingest - all in one call.

result = docpipe.run(
    "contract.pdf", schema,
    ingestion_config=config,
)
6. RAG Query

Ask questions against your ingested documents with grounded answers and source citations.

rag_cfg = docpipe.RAGConfig(
    connection_string="postgresql://...",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
)
result = docpipe.rag(
    "What is the invoice total?",
    config=rag_cfg,
)
print(result.answer)   # grounded answer with citations
print(result.sources)  # ["invoice.pdf"]

5 Retrieval Strategies — Pick What Fits

Switch strategy with one config field. Mix with reranking and structured output.

Naive
Standard cosine similarity search. Fast, reliable baseline for well-formed queries.

Best for: well-formed queries, fast responses

HyDE
The LLM generates a hypothetical answer first, embeds it, then retrieves real matching docs. Highest accuracy in our benchmarks.

Best for: complex / technical queries

Multi-Query
Expands your query into N variants via LLM, retrieves for each, then deduplicates and ranks the results.

Best for: vague or short queries

Parent-Document
Retrieves seed chunks, then expands context by fetching additional chunks from the same source documents.

Best for: long documents, context coherence

Hybrid
Combines dense vector search with sparse BM25 keyword retrieval via EnsembleRetriever. Best of both worlds.

Best for: exact terms, proper nouns, IDs
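To make the hybrid idea concrete, here is a toy sketch of score fusion, independent of docpipe's internals (which use BM25 and real embeddings via EnsembleRetriever): a dense similarity score and a sparse keyword-overlap score are blended with a weight.

```python
# Toy illustration of hybrid score fusion. Real hybrid search uses BM25
# and an embedding model; this only shows the weighted-blend idea.

def sparse_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def hybrid_score(dense: float, sparse: float, alpha: float = 0.5) -> float:
    """Weighted fusion: alpha * dense + (1 - alpha) * sparse."""
    return alpha * dense + (1 - alpha) * sparse

docs = {"a": "invoice total USD 4250", "b": "quarterly revenue report"}
dense_scores = {"a": 0.62, "b": 0.80}  # pretend cosine similarities
query = "invoice total"

# Doc "a" wins despite a lower dense score: it matches the exact terms.
ranked = sorted(
    docs,
    key=lambda d: hybrid_score(dense_scores[d], sparse_score(query, docs[d])),
    reverse=True,
)
```

This is why hybrid retrieval shines on exact terms, proper nouns, and IDs: the sparse component rewards literal matches that dense embeddings can miss.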
📈 Optional Reranking
rag_cfg = docpipe.RAGConfig(
    ...,
    strategy="naive",
    reranker="flashrank",  # local, no API key
    rerank_top_n=5,
)
# Retrieve top-50, rerank, keep top-5
🎯 Structured RAG Output
class Invoice(BaseModel):
    total: float
    currency: str

result = docpipe.rag(
    "What is the total?",
    config=docpipe.RAGConfig(
        ..., output_model=Invoice
    ),
)
invoice = result.structured
# Invoice(total=4250.0, currency='USD')

Built for Production

Everything you need to go from raw documents to grounded answers at scale.

Plugin Architecture

Add custom parsers and extractors via Python entry points. Third-party packages auto-discovered on install.

CLI + API Server

Full CLI for scripting, FastAPI server for microservices, Docker image for deployment.

Fully Configurable

No magic defaults. Explicit LLM provider, embedding model, and DB connection. YAML + env vars.

LangChain Backbone

Built on LangChain for embeddings, text splitting, and vector stores. Supports OpenAI, Gemini, Ollama, HuggingFace.

20+ Document Formats

PDF, DOCX, XLSX, PPTX, HTML, images, audio, video - powered by IBM Docling's advanced parsing.

5 RAG Strategies

naive, HyDE, multi-query, parent-document, hybrid — swap with one config field. Add reranking optionally.

Built-in Evaluation

Measure hit rate, MRR, faithfulness, and answer similarity. Know if your RAG is actually working.
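As an illustration of one of these metrics (a generic sketch, not docpipe's evaluation API): MRR averages the reciprocal rank of the first relevant document across queries.

```python
def mean_reciprocal_rank(results: list[list[str]], relevant: list[str]) -> float:
    """MRR: average of 1/rank of the first relevant document per query.

    results[i] is the ranked retrieval list for query i; relevant[i] is
    the ground-truth document id for that query.
    """
    total = 0.0
    for ranked, target in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == target:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results)

# Two queries: relevant doc at rank 1 and rank 2 -> (1 + 0.5) / 2
score = mean_reciprocal_rank(
    [["a", "b"], ["c", "a"]],
    ["a", "a"],
)
```

A score near 1.0 means relevant documents consistently surface at the top; a low score is usually the first sign a retrieval strategy needs changing.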

Zero Vendor Lock-in

docpipe never stores your data. It connects to your DB, calls your LLM API, then gets out of the way.

Use It Your Way

CLI for quick tasks, Python API for integration, Docker for deployment.

# Parse a document
$ docpipe parse invoice.pdf --format markdown

# Ingest into your vector DB
$ docpipe ingest report.pdf \
    --db "postgresql://..." \
    --table docs \
    --embedding-provider openai \
    --embedding-model text-embedding-3-small \
    --incremental

# Start API server
$ docpipe serve --port 8000
📦 Install Options
pip install docpipe-sdk                # Core only
pip install docpipe-sdk[docling]       # + Document parsing (20+ formats)
pip install docpipe-sdk[langextract]   # + Google LangExtract
pip install docpipe-sdk[openai]        # + OpenAI embeddings & LLM
pip install docpipe-sdk[google]        # + Google Gemini
pip install docpipe-sdk[pgvector]      # + PostgreSQL vector store
pip install docpipe-sdk[rag]           # + Hybrid search (BM25)
pip install docpipe-sdk[rerank]        # + Local reranking (FlashRank)
pip install docpipe-sdk[server]        # + FastAPI server
pip install docpipe-sdk[all]           # Everything