PRD 05: AI Tools, Research, Ingestion, and Knowledge

Problem Statement

The legacy Zweistein Python services contain a strong AI tool catalog, deep research provider routing, document/media ingestion, OCR, TTS, video analysis, and query engine behavior. The old implementation is powerful, but it is entangled with LangGraph/LangChain and cloud-specific storage, lacks cost controls, and offers limited observability.

The new platform must preserve the AI capabilities while making them provider-aware, observable, authenticated, and deployable in the Hetzner architecture.

Solution

Create an AI services domain that supports:

  • agent execution loop;
  • typed tool registry;
  • deep research;
  • web search and scraping;
  • OCR and document parsing;
  • image/video/audio analysis;
  • TTS and STT;
  • ingestion workers;
  • knowledge spaces and retrieval;
  • streaming events;
  • cost tracking, caching, and audit logs.

Legacy Source References

  • zweistein-reference/python_server/query_engine/agents/
  • zweistein-reference/python_server/query_engine/agents/tools/
  • zweistein-reference/python_server/query_engine/agents/deep_research/
  • zweistein-reference/python_server/query_engine/api/
  • zweistein-reference/python_server/query_engine/zweistein/
  • zweistein-reference/python_server/ingestion_worker/
  • zweistein-reference/server/src/query/
  • zweistein-reference/server/src/crawler/
  • zweistein-reference/server/src/files/
  • zweistein-reference/server/src/data-processing/

User Stories

  1. As a user, I want agents to use tools, so that they can search, inspect files, analyze media, and produce richer outputs.
  2. As a user, I want deep research, so that I can get long-form researched answers with sources.
  3. As a user, I want OCR for documents and images, so that scanned material becomes usable.
  4. As a user, I want video analysis, so that uploaded videos can be summarized or inspected.
  5. As a user, I want image analysis and image generation, so that visual workflows remain possible.
  6. As a user, I want text-to-speech, so that written output can become audio.
  7. As a user, I want speech-to-text, so that voice input becomes text.
  8. As a workspace admin, I want knowledge spaces, so that files, websites, videos, and conversations can be searched.
  9. As a workspace admin, I want ingestion progress, so that I know whether files are indexed.
  10. As an agent builder, I want to choose which tools an agent may use, so that behavior is controlled.
  11. As an operator, I want tool costs recorded, so that AI spend can be billed and controlled.
  12. As an operator, I want provider failures visible, so that outages can be diagnosed.
  13. As a security owner, I want all query engine endpoints authenticated, so that public abuse is prevented.
  14. As a product designer, I want AI progress streamed clearly, so that users understand what the agent is doing.

Functional Requirements

Agent Execution

  • Support a custom execution loop with planning, tool selection, tool execution, reflection, and final response.
  • Support provider-specific model routing.
  • Support typed events for stream updates.
  • Support cancellation and timeouts.
  • Support per-tool timeout configuration.
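
The loop requirements above can be illustrated with a minimal sketch. It assumes an asyncio-based runtime and a precomputed plan; the names Tool, execute_tool, and run_agent are hypothetical and do not come from the legacy code. A real planner would call a model where this sketch takes a list of steps.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Tool:
    name: str
    run: Callable[[str], Awaitable[str]]
    timeout_s: float  # per-tool timeout configuration


async def execute_tool(tool: Tool, arg: str) -> dict:
    """Run one tool call; a timeout becomes a result instead of an exception."""
    try:
        output = await asyncio.wait_for(tool.run(arg), timeout=tool.timeout_s)
        return {"tool": tool.name, "status": "ok", "output": output}
    except asyncio.TimeoutError:
        return {"tool": tool.name, "status": "timeout", "output": None}


async def run_agent(plan, tools, run_timeout_s: float = 30.0) -> dict:
    """Hypothetical loop: plan is a list of (tool_name, argument) pairs."""
    observations = []

    async def loop() -> None:
        for tool_name, arg in plan:  # tool selection
            result = await execute_tool(tools[tool_name], arg)  # tool execution
            observations.append(result)  # kept for the reflection step

    try:
        # Overall run timeout, independent of the per-tool timeouts.
        await asyncio.wait_for(loop(), timeout=run_timeout_s)
        status = "completed"
    except asyncio.TimeoutError:
        status = "run_timeout"

    # Reflection and final response, reduced here to a success count.
    succeeded = sum(1 for o in observations if o["status"] == "ok")
    answer = f"{succeeded}/{len(observations)} tool calls succeeded"
    return {"status": status, "observations": observations, "answer": answer}
```

Cancelling the asyncio task that runs run_agent also cancels the in-flight tool call, because asyncio.wait_for propagates cancellation to the awaitable it wraps.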

Tool Registry

  • Tools must have name, description, input schema, output schema, timeout, cost category, and permissions.
  • Preserve legacy tools where useful: web search, URL loader, deep research, OCR, image explainer, image generation, video analyzer, PDF filler, email sender, TTS, internal knowledge retrieval.
  • Drop framework-specific decorators from the product contract.
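
One possible shape for a registry entry is sketched below. The field names follow the list above, but ToolSpec, ToolRegistry, and validate are hypothetical names, and the schema format (field name mapped to a Python type) is a simplification chosen for illustration; a production contract would more likely use JSON Schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_schema: dict[str, type]   # simplified: field name -> expected type
    output_schema: dict[str, type]
    timeout_s: float
    cost_category: str
    permissions: frozenset[str] = frozenset()


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"tool already registered: {spec.name}")
        self._tools[spec.name] = spec

    def get(self, name: str) -> ToolSpec:
        return self._tools[name]

    def allowed_for(self, granted: set[str]) -> list[str]:
        """Names of tools whose required permissions are all granted."""
        return sorted(
            name for name, spec in self._tools.items()
            if spec.permissions <= granted
        )


def validate(payload: dict, schema: dict[str, type]) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for key, expected in schema.items():
        if key not in payload:
            errors.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected):
            errors.append(f"wrong type for {key}: expected {expected.__name__}")
    for key in payload:
        if key not in schema:
            errors.append(f"unexpected field: {key}")
    return errors
```

Because the spec is plain data with no decorators, the same definition could back agent builder tool selection, schema validation, and cost attribution.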

Deep Research

  • Support provider routing across available vendors.
  • Use phases: planning, searching, analyzing, final report.
  • Stream research progress with typed events.
  • Track queries, sources, learnings, model/provider, duration, and cost.
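
As a rough illustration of the phases and tracked fields above, the following sketch drives a research run with an injected search function. Phase, ResearchEvent, ResearchRun, and run_research are hypothetical names; the query generation and learnings are placeholders for what a model would produce, and the search callable is assumed to return a list of source URLs plus a cost.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterator


class Phase(str, Enum):
    PLANNING = "planning"
    SEARCHING = "searching"
    ANALYZING = "analyzing"
    FINAL_REPORT = "final_report"


@dataclass
class ResearchEvent:
    phase: Phase
    message: str


@dataclass
class ResearchRun:
    provider: str
    model: str
    queries: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)
    cost_usd: float = 0.0
    duration_s: float = 0.0


def run_research(
    topic: str,
    search: Callable[[str], tuple[list[str], float]],
    run: ResearchRun,
) -> Iterator[ResearchEvent]:
    """Yield one typed event per step while filling in the run record."""
    started = time.monotonic()
    yield ResearchEvent(Phase.PLANNING, f"planning queries for {topic!r}")
    # Placeholder planning: a real implementation would ask a model.
    run.queries = [f"{topic} overview", f"{topic} recent developments"]
    for query in run.queries:
        yield ResearchEvent(Phase.SEARCHING, query)
        urls, cost = search(query)
        run.sources.extend(urls)
        run.cost_usd += cost
    yield ResearchEvent(Phase.ANALYZING, f"analyzing {len(run.sources)} sources")
    # Placeholder analysis: one learning per source.
    run.learnings = [f"learning from {url}" for url in run.sources]
    run.duration_s = time.monotonic() - started
    yield ResearchEvent(Phase.FINAL_REPORT, f"report ready, {len(run.sources)} sources")
```

Injecting the search function keeps the event flow testable with fake providers, which matches the unit-testing approach listed under Testing Decisions.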

Ingestion

  • Support files, documents, images, audio, video, URLs, websites, YouTube, and conversations.
  • Process ingestion jobs via a background worker.
  • Extract text, metadata, embeddings, and thumbnails/previews where possible.
  • Deduplicate by content hash.
  • Version indexed documents when content changes.
  • Emit progress and errors to the UI.
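
Deduplication and versioning could interact roughly as in the sketch below. It assumes SHA-256 over the raw bytes as the content hash and an in-memory index; DocumentIndex and the outcome strings are hypothetical, and a real worker would persist this state through the storage adapter.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexedDocument:
    source_id: str
    content_hash: str
    version: int


class DocumentIndex:
    def __init__(self) -> None:
        self._by_source: dict[str, IndexedDocument] = {}
        self._current_hashes: set[str] = set()  # hashes of current versions only

    def get(self, source_id: str) -> Optional[IndexedDocument]:
        return self._by_source.get(source_id)

    def ingest(self, source_id: str, content: bytes) -> str:
        """Return 'indexed', 'duplicate', or 'new_version'."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._current_hashes:
            # Same bytes already indexed, under this or another source.
            return "duplicate"
        existing = self._by_source.get(source_id)
        if existing is not None:
            # Content changed for a known source: bump the version.
            self._current_hashes.discard(existing.content_hash)
            existing.content_hash = digest
            existing.version += 1
            self._current_hashes.add(digest)
            return "new_version"
        self._by_source[source_id] = IndexedDocument(source_id, digest, 1)
        self._current_hashes.add(digest)
        return "indexed"
```

Only a "new_version" or "indexed" outcome would need to trigger text extraction and embedding, so duplicates can be rejected before any expensive processing.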

Knowledge and Retrieval

  • Support spaces/collections of knowledge.
  • Support semantic and keyword retrieval.
  • Store source references for citations.
  • Support a chunk browser and source preview where practical.
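
To make the combination of semantic and keyword retrieval concrete, here is a toy sketch that blends cosine similarity with keyword overlap and returns source references for citations. The Chunk type, the alpha weighting, and the scoring functions are assumptions for illustration only; embeddings are passed in precomputed, and the final vector database is explicitly out of scope for this PRD.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    source: str  # source reference kept for citations


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the chunk text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def retrieve(query, query_embedding, chunks, alpha=0.5, k=3):
    """Rank chunks by alpha * semantic + (1 - alpha) * keyword score."""
    scored = [
        (
            alpha * cosine(query_embedding, c.embedding)
            + (1 - alpha) * keyword_score(query, c.text),
            c,
        )
        for c in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(c.source, round(score, 3)) for score, c in scored[:k]]
```

Returning the source reference alongside the score is what lets the final answer cite the originating document or chunk.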

Non-Functional Requirements

  • No hardcoded API keys.
  • No unauthenticated query engine endpoints.
  • No provider-specific storage assumption in core ingestion logic.
  • Every provider call must log tokens, cost estimate, latency, and request ID.
  • Cache embeddings and the results of repeated research queries where safe.
  • Apply rate limits and quota checks before expensive operations.
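
The logging and quota requirements above might be combined in a single wrapper around provider calls, as in this sketch. UsageTracker, CallRecord, and QuotaExceeded are hypothetical names; the provider is assumed to be a callable returning the response text and a token count, and the flat per-token price is a simplification of real provider pricing.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable


@dataclass
class CallRecord:
    request_id: str
    provider: str
    tokens: int
    cost_estimate_usd: float
    latency_s: float


class QuotaExceeded(Exception):
    pass


class UsageTracker:
    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.records: list[CallRecord] = []

    @property
    def spent_usd(self) -> float:
        return sum(r.cost_estimate_usd for r in self.records)

    def call(
        self,
        provider: str,
        fn: Callable[[str], tuple[str, int]],
        prompt: str,
        usd_per_token: float,
        max_tokens: int,
    ) -> tuple[str, CallRecord]:
        # Quota check runs before the provider is contacted,
        # using the worst-case cost of this call.
        worst_case = max_tokens * usd_per_token
        if self.spent_usd + worst_case > self.budget_usd:
            raise QuotaExceeded(f"budget of {self.budget_usd} USD would be exceeded")
        started = time.perf_counter()
        text, tokens = fn(prompt)
        latency = time.perf_counter() - started
        record = CallRecord(
            request_id=str(uuid.uuid4()),
            provider=provider,
            tokens=tokens,
            cost_estimate_usd=tokens * usd_per_token,
            latency_s=latency,
        )
        self.records.append(record)
        return text, record
```

Checking the worst-case cost up front means a call is rejected before any spend occurs, at the price of occasionally refusing a call that would have fit the budget.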

Implementation Decisions

  • Preserve the old tool catalog as product capability, not framework code.
  • Replace LangGraph/LangChain with a direct provider/tool loop unless a later explicit decision says otherwise.
  • Add an EventWriter abstraction for SSE/WebSocket/file logging.
  • Add a storage adapter before porting ingestion.
  • Add a cost tracker and usage record integration from day one.
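
A minimal sketch of what the EventWriter abstraction could look like follows. Only the name EventWriter comes from this PRD; the write signature, the SSEWriter, JSONLWriter, and FanOutWriter classes, and the event payload shape are assumptions. The SSE writer emits the standard "event:" and "data:" fields terminated by a blank line; a WebSocket writer would implement the same interface.

```python
import io
import json
from typing import Protocol, TextIO


class EventWriter(Protocol):
    def write(self, event_type: str, payload: dict) -> None: ...


class SSEWriter:
    """Formats events as Server-Sent Events frames on a text stream."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def write(self, event_type: str, payload: dict) -> None:
        self._stream.write(f"event: {event_type}\ndata: {json.dumps(payload)}\n\n")


class JSONLWriter:
    """Appends one JSON object per line, suitable for file logging."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def write(self, event_type: str, payload: dict) -> None:
        self._stream.write(json.dumps({"type": event_type, **payload}) + "\n")


class FanOutWriter:
    """Sends every event to several writers, e.g. the client and a log."""

    def __init__(self, *writers: EventWriter) -> None:
        self._writers = writers

    def write(self, event_type: str, payload: dict) -> None:
        for writer in self._writers:
            writer.write(event_type, payload)
```

With this shape, the agent loop and deep research code depend only on EventWriter, so the transport can change without touching the code that emits events.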

Testing Decisions

  • Unit-test tool schema validation.
  • Unit-test deep research event flow with fake providers.
  • Integration-test ingestion for PDF, image, audio, URL, and conversation inputs.
  • API-test auth and tenant boundaries on query endpoints.
  • Load-test queue backpressure and long-running tools.

Out of Scope

  • Choosing the final vector database before the deployment architecture is agreed.
  • Reusing old hardcoded cloud credentials or provider config.