PRD 05: AI Tools, Research, Ingestion, and Knowledge

Problem Statement

The legacy Zweistein Python services contain a strong AI tool catalog, deep research provider routing, document/media ingestion, OCR, TTS, video analysis, and query engine behavior. The old implementation is powerful but tightly coupled to LangGraph/LangChain and cloud-specific storage, and it lacks cost controls and adequate observability.

The new platform must preserve the AI capabilities while making them provider-aware, observable, authenticated, and deployable in the Hetzner architecture.

Solution

Create an AI services domain that supports:

agent execution loop;
typed tool registry;
deep research;
web search and scraping;
OCR and document parsing;
image/video/audio analysis;
TTS and STT;
ingestion workers;
knowledge spaces and retrieval;
streaming events;
cost tracking, caching, and audit logs.

Legacy Source References

zweistein-reference/python_server/query_engine/agents/
zweistein-reference/python_server/query_engine/agents/tools/
zweistein-reference/python_server/query_engine/agents/deep_research/
zweistein-reference/python_server/query_engine/api/
zweistein-reference/python_server/query_engine/zweistein/
zweistein-reference/python_server/ingestion_worker/
zweistein-reference/server/src/query/
zweistein-reference/server/src/crawler/
zweistein-reference/server/src/files/
zweistein-reference/server/src/data-processing/

User Stories

As a user, I want agents to use tools, so that they can search, inspect files, analyze media, and produce richer outputs.
As a user, I want deep research, so that I can get long-form researched answers with sources.
As a user, I want OCR for documents and images, so that scanned material becomes usable.
As a user, I want video analysis, so that uploaded videos can be summarized or inspected.
As a user, I want image analysis and image generation, so that visual workflows remain possible.
As a user, I want text-to-speech, so that written output can become audio.
As a user, I want speech-to-text, so that voice input becomes text.
As a workspace admin, I want knowledge spaces, so that files, websites, videos, and conversations can be searched.
As a workspace admin, I want ingestion progress, so that I know whether files are indexed.
As an agent builder, I want to choose which tools an agent may use, so that behavior is controlled.
As an operator, I want tool costs recorded, so that AI spend can be billed and controlled.
As an operator, I want provider failures visible, so that outages can be diagnosed.
As a security owner, I want all query engine endpoints authenticated, so that public abuse is prevented.
As a product designer, I want AI progress streamed clearly, so that users understand what the agent is doing.

Functional Requirements

Agent Execution

Support a custom execution loop with planning, tool selection, tool execution, reflection, and final response.
Support provider-specific model routing.
Support typed events for stream updates.
Support cancellation and timeouts.
Support per-tool timeout configuration.
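
The loop requirements above can be illustrated with a minimal asyncio sketch. It is not the intended implementation: `Tool`, `Event`, and `run_agent` are hypothetical names, the plan is a fixed list standing in for a model-driven planner, and reflection is reduced to dropping steps that timed out.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Tool:
    name: str
    run: Callable[[str], Awaitable[str]]
    timeout_s: float  # per-tool timeout, in seconds


@dataclass
class Event:
    kind: str    # "plan", "tool_start", "tool_result", "tool_timeout", "final"
    detail: str


async def run_agent(plan, tools, emit, overall_timeout_s=30.0):
    """Run a fixed plan of (tool_name, tool_input) steps.

    `plan` stands in for a model-driven planner (an assumption of this
    sketch); reflection is reduced to skipping steps that timed out.
    """

    async def loop():
        emit(Event("plan", f"{len(plan)} step(s)"))
        notes = []
        for tool_name, tool_input in plan:
            tool = tools[tool_name]
            emit(Event("tool_start", tool.name))
            try:
                # Enforce the timeout configured on this specific tool.
                result = await asyncio.wait_for(tool.run(tool_input), tool.timeout_s)
            except asyncio.TimeoutError:
                emit(Event("tool_timeout", tool.name))
                continue
            emit(Event("tool_result", tool.name))
            notes.append(result)
        answer = " | ".join(notes) if notes else "no tool results"
        emit(Event("final", answer))
        return answer

    # The outer wait_for bounds the whole run. Cancelling the caller's task
    # propagates asyncio.CancelledError into whichever tool is being awaited.
    return await asyncio.wait_for(loop(), overall_timeout_s)
```

Passing `emit` as a plain callable keeps the loop independent of the transport, so the same events could feed SSE, WebSocket, or a log file.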

Tool Registry

Tools must have name, description, input schema, output schema, timeout, cost category, and permissions.
Preserve legacy tools where useful: web search, URL loader, deep research, OCR, image explainer, image generation, video analyzer, PDF filler, email sender, TTS, internal knowledge retrieval.
Drop framework-specific decorators from the product contract.
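
As a rough illustration of a framework-free tool contract, the sketch below models the listed fields as a plain dataclass and checks permissions and schemas at call time. All names (`ToolSpec`, `ToolRegistry`, the `net:read` permission, the `search` cost category) are hypothetical, and schemas are simplified to a mapping of field name to Python type; a real contract would more likely use JSON Schema or a similar format.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_schema: dict[str, type]    # simplified: field name -> expected type
    output_schema: dict[str, type]
    timeout_s: float
    cost_category: str
    permissions: frozenset[str]      # permissions the caller must hold
    handler: Callable[[dict[str, Any]], dict[str, Any]]


def _validate(payload: dict[str, Any], schema: dict[str, type], label: str) -> None:
    if set(payload) != set(schema):
        raise ValueError(f"{label} fields {sorted(payload)} do not match {sorted(schema)}")
    for key, expected in schema.items():
        if not isinstance(payload[key], expected):
            raise TypeError(f"{label} field {key!r} must be {expected.__name__}")


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"tool {spec.name!r} is already registered")
        self._tools[spec.name] = spec

    def call(self, name: str, args: dict[str, Any], granted: frozenset[str]) -> dict[str, Any]:
        spec = self._tools[name]
        missing = spec.permissions - granted
        if missing:
            raise PermissionError(f"missing permissions: {sorted(missing)}")
        _validate(args, spec.input_schema, "input")
        result = spec.handler(args)
        _validate(result, spec.output_schema, "output")
        return result
```

Because the spec is plain data, an agent builder's tool allowlist can be expressed as a set of tool names or permissions without importing any agent framework.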

Deep Research

Support provider routing across available vendors.
Use phases: planning, searching, analyzing, final report.
Stream research progress with typed events.
Track queries, sources, learnings, model/provider, duration, and cost.
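
One possible shape for the phase flow and tracked fields is sketched below, driven by a fake provider of the kind the testing section calls for. `Phase`, `ResearchEvent`, `ResearchRun`, `FakeProvider`, and the provider methods `plan`, `search`, and `analyze` are all hypothetical; the real provider interface is not defined in this PRD.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from enum import Enum


class Phase(str, Enum):
    PLANNING = "planning"
    SEARCHING = "searching"
    ANALYZING = "analyzing"
    FINAL_REPORT = "final_report"


@dataclass
class ResearchEvent:
    phase: Phase
    message: str


@dataclass
class ResearchRun:
    provider: str
    model: str
    queries: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)
    duration_s: float = 0.0
    cost_usd: float = 0.0


class FakeProvider:
    """Deterministic stand-in for a research vendor (assumed interface)."""
    name = "fake"
    model = "fake-model"
    spent_usd = 0.0

    def plan(self, topic: str) -> list[str]:
        return [f"{topic} overview", f"{topic} risks"]

    def search(self, query: str) -> list[str]:
        return ["https://example.invalid/" + query.replace(" ", "-")]

    def analyze(self, query: str, sources: list[str]) -> str:
        return f"{query}: {len(sources)} source(s) reviewed"


def run_research(topic, provider, emit) -> ResearchRun:
    started = time.monotonic()
    run = ResearchRun(provider=provider.name, model=provider.model)

    emit(ResearchEvent(Phase.PLANNING, topic))
    run.queries = provider.plan(topic)

    found: dict[str, list[str]] = {}
    for query in run.queries:
        emit(ResearchEvent(Phase.SEARCHING, query))
        found[query] = provider.search(query)
        run.sources.extend(found[query])

    for query in run.queries:
        emit(ResearchEvent(Phase.ANALYZING, query))
        run.learnings.append(provider.analyze(query, found[query]))

    emit(ResearchEvent(Phase.FINAL_REPORT, f"{len(run.learnings)} learning(s)"))
    run.duration_s = time.monotonic() - started
    run.cost_usd = provider.spent_usd
    return run
```

With a fake provider the event order is fully deterministic, which makes the event flow straightforward to assert in unit tests.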

Ingestion

Support files, documents, images, audio, video, URLs, websites, YouTube, and conversations.
Process ingestion jobs via a background worker.
Extract text and metadata, compute embeddings, and generate thumbnails/previews where possible.
Deduplicate by content hash.
Version indexed documents when content changes.
Emit progress and errors to the UI.
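
The deduplication and versioning rules could interact roughly as follows. This is an in-memory sketch under stated assumptions: SHA-256 is chosen here as the content hash (the PRD does not name one), and `DocumentIndex`, `IndexedDocument`, and the four status strings are hypothetical.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass
class IndexedDocument:
    source_id: str
    content_hash: str
    version: int


class DocumentIndex:
    def __init__(self) -> None:
        self._by_source: dict[str, IndexedDocument] = {}
        self._hash_owner: dict[str, str] = {}  # content hash -> first source seen

    def ingest(self, source_id: str, content: bytes) -> str:
        """Return "indexed", "new_version", "unchanged", or "duplicate"."""
        digest = hashlib.sha256(content).hexdigest()
        current = self._by_source.get(source_id)

        # Same source, same bytes: nothing to reprocess.
        if current is not None and current.content_hash == digest:
            return "unchanged"

        # Identical bytes already indexed elsewhere: skip the expensive work.
        if digest in self._hash_owner:
            return "duplicate"

        # New content: first version, or a bumped version for a known source.
        version = current.version + 1 if current is not None else 1
        self._by_source[source_id] = IndexedDocument(source_id, digest, version)
        self._hash_owner[digest] = source_id
        return "indexed" if version == 1 else "new_version"

    def version_of(self, source_id: str) -> int:
        return self._by_source[source_id].version
```

The returned status is the kind of value a worker could forward to the UI as an ingestion progress event.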

Knowledge and Retrieval

Support spaces/collections of knowledge.
Support semantic and keyword retrieval.
Store source references for citations.
Support chunk browser and source preview where practical.
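
To make the retrieval requirements concrete, the toy sketch below stores a source reference on every chunk and offers both a keyword path and a semantic path. Everything here is an assumption for illustration: the `Chunk` fields, term-overlap scoring, and brute-force cosine similarity over tiny hand-written vectors stand in for whatever index and embedding model are eventually chosen.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source_uri: str               # where the chunk came from, for citations
    locator: str                  # e.g. a page number or timestamp
    embedding: tuple[float, ...]  # toy vector; real ones come from a model


def keyword_search(chunks: list[Chunk], query: str, k: int = 3) -> list[Chunk]:
    """Rank chunks by how many query terms they contain."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c.text.lower().split())), c) for c in chunks]
    hits = [(score, c) for score, c in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in hits[:k]]


def _cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(chunks: list[Chunk], query_embedding: tuple[float, ...], k: int = 3) -> list[Chunk]:
    """Rank chunks by cosine similarity to the query embedding."""
    ranked = sorted(chunks, key=lambda c: _cosine(c.embedding, query_embedding), reverse=True)
    return ranked[:k]


def cite(chunk: Chunk) -> str:
    """Build a citation string from the stored source reference."""
    return f"{chunk.source_uri}#{chunk.locator}"
```

Keeping `source_uri` and `locator` on the chunk itself means either retrieval path can produce citations without a second lookup.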

Non-Functional Requirements

No hardcoded API keys.
No unauthenticated query engine endpoints.
No provider-specific storage assumption in core ingestion logic.
Every provider call must log tokens, cost estimate, latency, and request ID.
Cache embeddings and the results of repeated research queries where safe.
Apply rate limits and quota checks before expensive operations.
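
The logging and quota requirements can be combined in one wrapper around provider calls, sketched below. `CostTracker`, `UsageRecord`, and `QuotaExceeded` are hypothetical names, the flat per-1k-token price is a simplification, and the provider function is assumed to return `(text, input_tokens, output_tokens)`.

```python
import logging
import time
import uuid
from dataclasses import dataclass

logger = logging.getLogger("provider_calls")


class QuotaExceeded(Exception):
    pass


@dataclass
class UsageRecord:
    request_id: str
    provider: str
    input_tokens: int
    output_tokens: int
    cost_estimate_usd: float
    latency_ms: float


class CostTracker:
    def __init__(self, budget_usd: float, price_per_1k_tokens_usd: float) -> None:
        self.budget_usd = budget_usd
        self.price = price_per_1k_tokens_usd  # flat price: a simplification
        self.spent_usd = 0.0
        self.records = []

    def call(self, provider: str, fn, prompt: str, estimated_tokens: int) -> str:
        # Quota check happens BEFORE the expensive operation runs.
        estimate = estimated_tokens / 1000 * self.price
        if self.spent_usd + estimate > self.budget_usd:
            raise QuotaExceeded(f"estimated ${estimate:.4f} exceeds remaining budget")

        request_id = uuid.uuid4().hex
        started = time.perf_counter()
        text, input_tokens, output_tokens = fn(prompt)  # assumed return shape
        latency_ms = (time.perf_counter() - started) * 1000

        cost = (input_tokens + output_tokens) / 1000 * self.price
        record = UsageRecord(request_id, provider, input_tokens, output_tokens, cost, latency_ms)
        self.records.append(record)
        self.spent_usd += cost
        logger.info("provider call %s", record)
        return text
```

Routing every provider call through one wrapper makes it harder to add a call path that skips logging or quota enforcement.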

Implementation Decisions

Preserve the old tool catalog as product capability, not framework code.
Replace LangGraph/LangChain with a direct provider/tool loop unless a later explicit decision says otherwise.
Add an EventWriter abstraction for SSE, WebSocket, and file logging.
Add a storage adapter before porting ingestion.
Add a cost tracker and usage-record integration from day one.
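
A minimal sketch of what the EventWriter abstraction might look like follows. The `EventWriter` protocol and the three writer classes are hypothetical; only the SSE frame layout (`event:` and `data:` fields terminated by a blank line) follows the Server-Sent Events format. A WebSocket writer is omitted because it would depend on the chosen server library.

```python
import json
from typing import Protocol, TextIO


class EventWriter(Protocol):
    def write(self, event_type: str, payload: dict) -> None: ...


class SSEWriter:
    """Writes each event as a Server-Sent Events frame to a text stream."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def write(self, event_type: str, payload: dict) -> None:
        # json.dumps emits a single line, so one data: field is enough.
        self._stream.write(f"event: {event_type}\ndata: {json.dumps(payload)}\n\n")


class JSONLinesWriter:
    """Appends each event as one JSON object per line, e.g. to a log file."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def write(self, event_type: str, payload: dict) -> None:
        self._stream.write(json.dumps({"type": event_type, **payload}) + "\n")


class FanOutWriter:
    """Sends every event to several writers, e.g. the client and an audit log."""

    def __init__(self, *writers: EventWriter) -> None:
        self._writers = writers

    def write(self, event_type: str, payload: dict) -> None:
        for writer in self._writers:
            writer.write(event_type, payload)
```

Agent and research code would depend only on `EventWriter`, so swapping or combining transports would not touch the execution loop.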

Testing Decisions

Unit-test tool schema validation.
Unit-test deep research event flow with fake providers.
Integration-test ingestion for PDF, image, audio, URL, and conversation inputs.
API-test auth and tenant boundaries on query endpoints.
Load-test queue backpressure and long-running tools.

Out of Scope

Choosing the final vector database before the deployment architecture is agreed.
Reusing old hardcoded cloud credentials or provider config.