Architecture
============

The Dataset Generator is built as a configurable directed graph of
processing nodes. The high-level pipeline and core components are
illustrated below.

High-Level Pipeline
-------------------

.. mermaid::

   flowchart TD
       classDef expensiveModel stroke:#d55,stroke-width:2px
       classDef cheapModel stroke:#5d5,stroke-width:2px

       A0[Start] --> TopicGen
       TopicGen[Topic Generation] --> Email
       Email[Email Text] --> Answer
       Answer[Answer Text] --> Tags
       Tags[Add Tags] --> Rephrase
       Rephrase --> Translate
       Translate --> End

       class TopicGen,Email,Answer expensiveModel
       class Tags,Rephrase,Translate cheapModel

1. **TopicGen**, **Email**, and **Answer** nodes use “expensive” models
   (e.g. GPT-4) to ensure high quality.
2. **Tags**, **Rephrase**, and **Translate** use cheaper models
   (e.g. Qwen-7B) for cost efficiency.
3. Each node reads and enriches a shared context object, caching its
   output so downstream nodes can reuse results without recomputing.

Graph Runner and Nodes
----------------------

.. uml::

   @startuml
   package graph {
       class GraphRunner {
           + run(): List
       }
       class Graph {
           - endNodes: List
       }
       class Node {
           - parents: List
           - was_executed: bool
           + {async} execute(): Any
       }
       class _ExecutionNode {
           + execute(inputs): Any
       }
       class Cache {
           - stored_result: Any
       }
       GraphRunner --> Graph
       Graph --> Node
       Node --> _ExecutionNode
       Node --> Cache
   }
   package ai_graph {
       class AINode extends graph.Node {
           - chat_assistant: ChatAssistant
           + {async} execute(inputs)
       }
       class Storage {
           - data: Map
           + save(data)
           + load(): Object
       }
   }
   @enduml

- **GraphRunner**: The entry point that triggers execution of the graph
  from all terminal nodes. Calling `run()` executes each node in order,
  respecting dependencies.
- **Graph**: Maintains the parent→child relationships of all **Node**
  objects in the pipeline.
- **Node**: Orchestrates caching. If a node’s result is already computed
  (in its **Cache**), it loads it; otherwise it creates a private
  **_ExecutionNode** to compute the output. Nodes pass data via a shared
  context; a sketch of this contract follows the next section.
- **AINode**: A subclass of `Node` that wraps an AI assistant. It
  generates prompts and calls the assistant to produce text outputs.
- **Storage**: A global key-value store for persisting intermediate data
  across the pipeline.

Assistants & Data Models
------------------------

.. uml::

   @startuml
   package ai_graph {
       class AIAssistant {
           + get_response(prompt): string
       }
       class PromptBuilder {
           + add_input()
           + set_output()
       }
       class InputType {
           + get_description(): string
       }
       class OutputType {
           + get_description(): string
       }
   }
   @enduml

- **AIAssistant**: Abstracts any LLM API (e.g. OpenAI or a locally hosted
  model) behind a uniform interface. Developers can plug in different
  model endpoints here.
- **PromptBuilder**: Constructs prompts in a structured (JSON-schema)
  format. It takes an `InputType` definition, adds current context
  fields, and specifies the expected `OutputType`, so both inputs and
  outputs can be validated (see the sketch below).
- **InputType / OutputType**: Pydantic models (JSON-schema models) that
  declare the fields for prompts and responses. They include validation
  rules and documentation for each field.
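To make this contract concrete, the sketch below shows what an
`InputType`/`OutputType` pair and a schema-aware prompt might look like.
It is a minimal illustration assuming pydantic v2; the names
`EmailInput`, `EmailOutput`, and `build_prompt` are hypothetical
stand-ins for this example, not the project's actual classes.

.. code-block:: python

   # Minimal sketch, assuming pydantic v2. EmailInput, EmailOutput, and
   # build_prompt are hypothetical names, not the project's actual classes.
   from pydantic import BaseModel, Field


   class EmailInput(BaseModel):
       """InputType: context fields handed to the model."""

       topic: str = Field(description="Ticket topic produced by the TopicGen node")


   class EmailOutput(BaseModel):
       """OutputType: the JSON shape the assistant must return."""

       subject: str = Field(description="Subject line of the generated email")
       body: str = Field(description="Plain-text body of the customer email")


   def build_prompt(inputs: BaseModel, output_type: type[BaseModel]) -> str:
       # Serialize the inputs and attach the JSON schema of the expected
       # output so the response can be validated after the call.
       return (
           f"Inputs:\n{inputs.model_dump_json(indent=2)}\n\n"
           "Respond with JSON matching this schema:\n"
           f"{output_type.model_json_schema()}"
       )


   prompt = build_prompt(EmailInput(topic="Billing error"), EmailOutput)
   # reply = assistant.get_response(prompt)
   # result = EmailOutput.model_validate_json(reply)

Keeping the schema inside the prompt and validating the reply against the
same model is what lets downstream nodes trust the structure of each
node's output.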
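The node-level caching described under *Graph Runner and Nodes* can also
be sketched briefly. Class and attribute names follow the diagram; the
method bodies below are assumptions for illustration, not the project's
actual implementation.

.. code-block:: python

   # Sketch of the Node/Cache contract; method bodies are assumed.
   import asyncio
   from typing import Any


   class Cache:
       def __init__(self) -> None:
           self.stored_result: Any = None
           self.has_result = False  # assumed flag so None stays a valid result

       def save(self, result: Any) -> None:
           self.stored_result, self.has_result = result, True


   class Node:
       def __init__(self, parents: list["Node"] | None = None) -> None:
           self.parents = parents or []
           self.cache = Cache()
           self.was_executed = False

       async def execute(self) -> Any:
           # Resolve dependencies first, then reuse the cached result if
           # present; only on a cache miss is the execution step invoked.
           inputs = [await parent.execute() for parent in self.parents]
           if not self.cache.has_result:
               self.cache.save(await self._compute(inputs))  # _ExecutionNode's role
           self.was_executed = True
           return self.cache.stored_result

       async def _compute(self, inputs: list[Any]) -> Any:
           raise NotImplementedError  # an AINode would call its assistant here


   class GraphRunner:
       def __init__(self, end_nodes: list[Node]) -> None:
           self.end_nodes = end_nodes

       def run(self) -> list[Any]:
           # Executing the terminal nodes pulls in the whole graph transitively.
           async def _run() -> list[Any]:
               return [await node.execute() for node in self.end_nodes]

           return asyncio.run(_run())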
Randomized Inputs
-----------------

.. uml::

   @startuml
   package random {
       interface IRandom {
           + get_random(): V
       }
       class RandomCollection implements IRandom {
           - values: List
           - weights: List
           + get_random(): V
       }
       class RandomTable implements IRandom {
           - rows: Map<K, RandomCollection>
           + get_random(key: K): V
       }
   }
   @enduml

To introduce variability without calling an AI model:

- **IRandom**: An interface for classes that can generate random values
  on demand.
- **RandomCollection**: Holds a list of values with associated weights.
  It optionally applies small random perturbations to the weights on
  each draw to simulate noise, then samples a random element.
- **RandomTable**: Maps a key (often an enum, e.g. ticket type) to a
  `RandomCollection` of values (e.g. possible priorities for that ticket
  type). This allows conditional sampling (different priorities for bugs
  vs. feature requests, etc.).

Cost & Usage Analysis
---------------------

.. uml::

   @startuml
   package analysis {
       interface Analyzer {
           + get_cost_analysis(): AnalysisResult
       }
       class AssistantRun implements Analyzer {
           + get_cost_analysis(): AnalysisResult
       }
       class AssistantRunsAnalyzer implements Analyzer {
           + get_cost_analysis(): AnalysisResult
       }
       class DatasetUsageAnalyzer implements Analyzer {
           + generate_cost_summary(): List[FormattedAnalysis]
       }
       class AnalysisResult {
           + prompt_tokens: int
           + prompts_cost: Money
           + completion_tokens: int
           + completions_cost: Money
           + total_cost: Money
       }
   }
   @enduml

The system tracks token counts and costs throughout:

- **AssistantRun**: Records the token usage and cost of a single
  `run_openai` (or equivalent) call for one node.
- **AssistantRunsAnalyzer**: Aggregates multiple `AssistantRun` records
  (e.g. from one pipeline run) into an `AnalysisResult` summarizing
  total tokens and cost.
- **DatasetUsageAnalyzer**: Gathers cost information across all generated
  tickets and produces a formatted summary (a breakdown per assistant and
  totals in any currency). It can emit reports as CSV files or tables.

These analyzers help you audit the monetary cost of generation per
assistant. See the Usage section for examples of generating cost reports.
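As a rough illustration of the aggregation step, the sketch below rolls
single-call records up into a combined `AnalysisResult`. Field names
follow the diagram, but `Money` is simplified to a USD float and the
figures are made up for the example, so treat this as a sketch rather
than the analyzer's actual code.

.. code-block:: python

   # Sketch of AnalysisResult aggregation; Money is simplified to a USD
   # float and the figures below are illustrative, not measured values.
   from dataclasses import dataclass


   @dataclass
   class AnalysisResult:
       prompt_tokens: int = 0
       prompts_cost: float = 0.0
       completion_tokens: int = 0
       completions_cost: float = 0.0

       @property
       def total_cost(self) -> float:
           return self.prompts_cost + self.completions_cost

       def add(self, other: "AnalysisResult") -> None:
           # The AssistantRunsAnalyzer-style roll-up is a field-wise sum.
           self.prompt_tokens += other.prompt_tokens
           self.prompts_cost += other.prompts_cost
           self.completion_tokens += other.completion_tokens
           self.completions_cost += other.completions_cost


   # Two single-call records (what an AssistantRun would capture) ...
   runs = [
       AnalysisResult(812, 0.0243, 301, 0.0181),
       AnalysisResult(640, 0.0192, 220, 0.0132),
   ]
   # ... rolled up into one pipeline-level result.
   total = AnalysisResult()
   for run in runs:
       total.add(run)
   print(f"{total.prompt_tokens + total.completion_tokens} tokens, "
         f"${total.total_cost:.4f}")  # 1973 tokens, $0.0748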