PRD 05: AI Tools, Research, Ingestion, and Knowledge
Problem Statement
The legacy Zweistein Python services contain a strong AI tool catalog, deep research provider routing, document/media ingestion, OCR, TTS, video analysis, and query engine behavior. The old implementation is powerful but entangled with LangGraph/LangChain, cloud-specific storage, missing cost controls, and limited observability.
The new platform must preserve the AI capabilities while making them provider-aware, observable, authenticated, and deployable in the Hetzner architecture.
Solution
Create an AI services domain that supports:
- agent execution loop;
- typed tool registry;
- deep research;
- web search and scraping;
- OCR and document parsing;
- image/video/audio analysis;
- TTS and STT;
- ingestion workers;
- knowledge spaces and retrieval;
- streaming events;
- cost tracking, caching, and audit logs.
Legacy Source References
- zweistein-reference/python_server/query_engine/agents/
- zweistein-reference/python_server/query_engine/agents/tools/
- zweistein-reference/python_server/query_engine/agents/deep_research/
- zweistein-reference/python_server/query_engine/api/
- zweistein-reference/python_server/query_engine/zweistein/
- zweistein-reference/python_server/ingestion_worker/
- zweistein-reference/server/src/query/
- zweistein-reference/server/src/crawler/
- zweistein-reference/server/src/files/
- zweistein-reference/server/src/data-processing/
User Stories
- As a user, I want agents to use tools, so that they can search, inspect files, analyze media, and produce richer outputs.
- As a user, I want deep research, so that I can get long-form researched answers with sources.
- As a user, I want OCR for documents and images, so that scanned material becomes usable.
- As a user, I want video analysis, so that uploaded videos can be summarized or inspected.
- As a user, I want image analysis and image generation, so that visual workflows remain possible.
- As a user, I want text-to-speech, so that written output can become audio.
- As a user, I want speech-to-text, so that voice input becomes text.
- As a workspace admin, I want knowledge spaces, so that files, websites, videos, and conversations can be searched.
- As a workspace admin, I want ingestion progress, so that I know whether files are indexed.
- As an agent builder, I want to choose which tools an agent may use, so that behavior is controlled.
- As an operator, I want tool costs recorded, so that AI spend can be billed and controlled.
- As an operator, I want provider failures visible, so that outages can be diagnosed.
- As a security owner, I want all query engine endpoints authenticated, so that public abuse is prevented.
- As a product designer, I want AI progress streamed clearly, so that users understand what the agent is doing.
Functional Requirements
Agent Execution
- Support a custom execution loop with planning, tool selection, tool execution, reflection, and final response.
- Support provider-specific model routing.
- Support typed events for stream updates.
- Support cancellation and timeouts.
- Support per-tool timeout configuration.
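The loop requirements above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `ToolSpec`, `StepEvent`, `run_agent`, and the `is_cancelled` callback are hypothetical names, the plan is passed in as a fixed list instead of coming from a model, and reflection is reduced to an event marker.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


@dataclass
class ToolSpec:
    name: str
    run: Callable[[dict[str, Any]], Awaitable[Any]]
    timeout_s: float  # per-tool timeout (hypothetical field name)


@dataclass
class StepEvent:
    phase: str  # planning, tool_selection, tool_execution, reflection, final, cancelled
    detail: dict[str, Any] = field(default_factory=dict)


async def run_agent(
    steps: list[tuple[str, dict[str, Any]]],
    tools: dict[str, ToolSpec],
    is_cancelled: Callable[[], bool],
) -> list[StepEvent]:
    """Run planned steps in order, honoring cancellation and per-tool timeouts."""
    events = [StepEvent("planning", {"steps": len(steps)})]
    for name, args in steps:
        # Cancellation is checked between steps, never in the middle of a tool call.
        if is_cancelled():
            events.append(StepEvent("cancelled"))
            return events
        tool = tools[name]
        events.append(StepEvent("tool_selection", {"tool": name}))
        try:
            result = await asyncio.wait_for(tool.run(args), timeout=tool.timeout_s)
            events.append(StepEvent("tool_execution", {"tool": name, "result": result}))
        except asyncio.TimeoutError:
            events.append(StepEvent("tool_execution", {"tool": name, "error": "timeout"}))
        # Placeholder: a real loop would ask the model whether to continue or replan.
        events.append(StepEvent("reflection", {"tool": name}))
    events.append(StepEvent("final"))
    return events
```

A timed-out tool is recorded as an event rather than raised, so one slow tool does not abort the whole run; whether that is the desired policy is left open here.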
Tool Registry
- Tools must have name, description, input schema, output schema, timeout, cost category, and permissions.
- Preserve legacy tools where useful: web search, URL loader, deep research, OCR, image explainer, image generation, video analyzer, PDF filler, email sender, TTS, internal knowledge retrieval.
- Drop framework-specific decorators from the product contract.
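One possible shape for a framework-free tool definition is sketched below. All names (`ToolDefinition`, `ToolRegistry`, `validate`) are assumptions for illustration, and the schema is a simplified field-to-type map standing in for whatever schema format the platform eventually adopts.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Simplified stand-in for a real schema language: field name -> required Python type.
Schema = dict[str, type]


def validate(schema: Schema, payload: dict[str, Any]) -> None:
    """Raise if a required field is missing or has the wrong type."""
    missing = [key for key in schema if key not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for key, expected in schema.items():
        if not isinstance(payload[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")


@dataclass(frozen=True)
class ToolDefinition:
    name: str
    description: str
    input_schema: Schema
    output_schema: Schema
    timeout_s: float
    cost_category: str
    permissions: frozenset[str]
    handler: Callable[[dict[str, Any]], dict[str, Any]]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolDefinition] = {}

    def register(self, tool: ToolDefinition) -> None:
        if tool.name in self._tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self._tools[tool.name] = tool

    def call(self, name: str, payload: dict[str, Any], granted: set[str]) -> dict[str, Any]:
        tool = self._tools[name]
        # The caller must hold every permission the tool declares.
        if not tool.permissions <= granted:
            raise PermissionError(f"{name} requires {sorted(tool.permissions)}")
        validate(tool.input_schema, payload)
        result = tool.handler(payload)
        validate(tool.output_schema, result)
        return result
```

Because the definition is plain data plus a handler, the same record can drive agent tool selection, permission checks, and cost categorization without any decorator from a specific framework. Timeout enforcement is omitted from this sketch.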
Deep Research
- Support provider routing across available vendors.
- Use phases: planning, searching, analyzing, final report.
- Stream research progress with typed events.
- Track queries, sources, learnings, model/provider, duration, and cost.
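The phases and tracked fields above might be modeled as in the following sketch. `Phase`, `ResearchEvent`, `ResearchRun`, and `deep_research` are hypothetical names; the planner and analyzer are placeholders, the search function is injected so a fake provider can be used, and provider routing and cost accounting are left out.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterator


class Phase(str, Enum):
    PLANNING = "planning"
    SEARCHING = "searching"
    ANALYZING = "analyzing"
    FINAL_REPORT = "final_report"


@dataclass
class ResearchEvent:
    phase: Phase
    message: str


@dataclass
class ResearchRun:
    provider: str
    model: str
    queries: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)
    duration_s: float = 0.0
    cost_usd: float = 0.0  # not populated in this sketch


def deep_research(
    question: str,
    search: Callable[[str], list[str]],
    run: ResearchRun,
) -> Iterator[ResearchEvent]:
    """Yield one typed event per phase step while recording results on `run`."""
    started = time.monotonic()
    yield ResearchEvent(Phase.PLANNING, f"planning queries for: {question}")
    run.queries = [question]  # placeholder: a real planner would expand the question
    for query in run.queries:
        yield ResearchEvent(Phase.SEARCHING, query)
        run.sources.extend(search(query))
    yield ResearchEvent(Phase.ANALYZING, f"{len(run.sources)} sources")
    # Placeholder analysis: one learning per source.
    run.learnings = [f"summary of {source}" for source in run.sources]
    run.duration_s = time.monotonic() - started
    yield ResearchEvent(Phase.FINAL_REPORT, f"{len(run.learnings)} learnings")
```

Using a generator keeps the research logic independent of the transport: the caller decides whether events go to SSE, a WebSocket, or a log file.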
Ingestion
- Support files, documents, images, audio, video, URLs, websites, YouTube, and conversations.
- Process via background worker.
- Extract text, metadata, embeddings, thumbnails/previews where possible.
- Deduplicate by content hash.
- Version indexed documents when content changes.
- Emit progress and errors to the UI.
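Deduplication by content hash and versioning on change could work along these lines. This is an in-memory sketch under assumptions: `DocumentIndex`, `IndexedDocument`, the status strings, and the choice of SHA-256 are illustrative, and a real worker would persist this state and scope it per tenant.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class IndexedDocument:
    source_id: str
    content_hash: str
    version: int


class DocumentIndex:
    def __init__(self) -> None:
        self._by_source: dict[str, IndexedDocument] = {}
        self._known_hashes: set[str] = set()

    def ingest(self, source_id: str, content: bytes) -> str:
        """Return a status string describing what happened to the document."""
        digest = hashlib.sha256(content).hexdigest()
        current = self._by_source.get(source_id)
        if current is not None and current.content_hash == digest:
            return "unchanged"  # same source, same bytes: nothing to do
        if current is None and digest in self._known_hashes:
            return "duplicate"  # new source, but identical content is already indexed
        version = 1 if current is None else current.version + 1
        self._by_source[source_id] = IndexedDocument(source_id, digest, version)
        self._known_hashes.add(digest)
        return f"indexed:v{version}"
```

For example, ingesting the same bytes twice under one source yields `indexed:v1` and then `unchanged`, while changed bytes under the same source yield `indexed:v2`.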
Knowledge and Retrieval
- Support spaces/collections of knowledge.
- Support semantic and keyword retrieval.
- Store source references for citations.
- Support chunk browser and source preview where practical.
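To make the two retrieval modes and the citation requirement concrete, here is a toy in-memory sketch. It assumes a hypothetical `Chunk` record that carries a `source_ref`, uses cosine similarity over precomputed embeddings for the semantic path, and uses naive term overlap for the keyword path; a real deployment would delegate both to a search backend that has not been chosen yet.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    embedding: tuple[float, ...]
    source_ref: str  # kept with every chunk so answers can cite their source


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(
    chunks: list[Chunk], query_embedding: tuple[float, ...], k: int = 3
) -> list[Chunk]:
    """Rank chunks by cosine similarity to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(c.embedding, query_embedding), reverse=True)
    return ranked[:k]


def keyword_search(chunks: list[Chunk], query: str, k: int = 3) -> list[Chunk]:
    """Rank chunks by the number of query terms they contain; drop non-matches."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c.text.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored if score > 0][:k]
```

Both functions return whole `Chunk` records, so the caller always has the `source_ref` available when it builds citations or a source preview.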
Non-Functional Requirements
- No hardcoded API keys.
- No unauthenticated query engine endpoints.
- No provider-specific storage assumption in core ingestion logic.
- Every provider call must log tokens, cost estimate, latency, and request ID.
- Cache embeddings and the results of repeated research queries where safe.
- Apply rate limits and quota checks before expensive operations.
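The logging and quota requirements above could be enforced by a thin wrapper around every provider call, as in this sketch. `MeteredClient`, `UsageRecord`, `QuotaExceeded`, and the flat per-1k-token price are assumptions for illustration; real pricing and quota rules are not defined in this document.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable


@dataclass
class UsageRecord:
    request_id: str
    provider: str
    tokens: int
    cost_estimate_usd: float
    latency_ms: float


class QuotaExceeded(Exception):
    pass


class MeteredClient:
    def __init__(self, provider: str, price_per_1k_tokens: float, budget_usd: float) -> None:
        self.provider = provider
        self.price_per_1k_tokens = price_per_1k_tokens
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self.records: list[UsageRecord] = []

    def call(self, fn: Callable[[], tuple[str, int]], estimated_tokens: int) -> str:
        """Run `fn`, which returns (text, tokens_used), after a quota check."""
        estimate = estimated_tokens / 1000 * self.price_per_1k_tokens
        # The quota check happens before the expensive call, not after.
        if self.spent_usd + estimate > self.budget_usd:
            raise QuotaExceeded(f"{self.provider}: budget of {self.budget_usd} USD exceeded")
        request_id = str(uuid.uuid4())
        started = time.perf_counter()
        text, tokens = fn()
        latency_ms = (time.perf_counter() - started) * 1000
        cost = tokens / 1000 * self.price_per_1k_tokens
        self.spent_usd += cost
        self.records.append(UsageRecord(request_id, self.provider, tokens, cost, latency_ms))
        return text
```

In this sketch a rejected call produces no usage record; whether rejections should also be logged for audit purposes is a separate decision.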
Implementation Decisions
- Preserve the old tool catalog as product capability, not framework code.
- Replace LangGraph/LangChain with a direct provider/tool loop unless a later explicit decision says otherwise.
- Add an EventWriter abstraction for SSE/WebSocket/file logging.
- Add storage adapter before porting ingestion.
- Add cost tracker and usage record integration from day one.
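An EventWriter abstraction could be as small as a single-method interface with one implementation per transport. The sketch below is an assumption about its shape: the `write(event_type, payload)` signature, `SSEWriter`, `JsonLinesWriter`, and `report_progress` are hypothetical, and a WebSocket writer is omitted because it would depend on the chosen server framework.

```python
import json
from typing import Any, Protocol, TextIO


class EventWriter(Protocol):
    def write(self, event_type: str, payload: dict[str, Any]) -> None: ...


class SSEWriter:
    """Formats events in the Server-Sent Events wire format."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def write(self, event_type: str, payload: dict[str, Any]) -> None:
        # An SSE message is a block of "field: value" lines ended by a blank line.
        self._stream.write(f"event: {event_type}\ndata: {json.dumps(payload)}\n\n")


class JsonLinesWriter:
    """Appends one JSON object per line, suitable for file logging."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def write(self, event_type: str, payload: dict[str, Any]) -> None:
        self._stream.write(json.dumps({"type": event_type, **payload}) + "\n")


def report_progress(writer: EventWriter, done: int, total: int) -> None:
    """Example producer: it depends only on the EventWriter interface."""
    writer.write("ingestion.progress", {"done": done, "total": total})
```

Producers such as the agent loop or the ingestion worker would accept any `EventWriter`, which also makes them easy to test with an in-memory stream such as `io.StringIO`.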
Testing Decisions
- Unit-test tool schema validation.
- Unit-test deep research event flow with fake providers.
- Integration-test ingestion for PDF, image, audio, URL, and conversation inputs.
- API-test auth and tenant boundaries on query endpoints.
- Load-test queue backpressure and long-running tools.
Out of Scope
- Choosing final vector database before deployment architecture is agreed.
- Reusing old hardcoded cloud credentials or provider config.