Architecture
============

The Dataset Generator is built as a configurable directed graph of
processing nodes. The high-level pipeline and core components are
illustrated below.

High-Level Pipeline
-------------------

.. mermaid::

   flowchart TD
       classDef expensiveModel stroke:#d55,stroke-width:2px
       classDef cheapModel stroke:#5d5,stroke-width:2px

       A0[Start] --> TopicGen
       TopicGen[Topic Generation] --> Email
       Email[Email Text] --> Answer
       Answer[Answer Text] --> Tags
       Tags[Add Tags] --> Rephrase
       Rephrase --> Translate
       Translate --> End

       class TopicGen,Email,Answer expensiveModel
       class Tags,Rephrase,Translate cheapModel

1. **TopicGen**, **Email**, and **Answer** nodes use “expensive” models
   (e.g. GPT-4) to ensure high quality.
2. **Tags**, **Rephrase**, and **Translate** use cheaper models
   (e.g. Qwen-7B) for cost efficiency.
3. Each node reads and enriches a shared context object, caching its
   output so downstream nodes can reuse results without recomputing.

Graph Runner and Nodes
----------------------

.. uml::

   @startuml
   package graph {
       class GraphRunner {
           + run(): List
       }
       class Graph {
           - endNodes: List
       }
       class Node {
           - parents: List
           - was_executed: bool
           + {async} execute(): Any
       }
       class _ExecutionNode {
           + execute(inputs): Any
       }
       class Cache {
           - stored_result: Any
       }
       GraphRunner --> Graph
       Graph --> Node
       Node --> _ExecutionNode
       Node --> Cache
   }
   package ai_graph {
       class AINode extends graph.Node {
           - chat_assistant: ChatAssistant
           + {async} execute(inputs)
       }
       class Storage {
           - data: Map
           + save(data)
           + load(): Object
       }
   }
   @enduml

- **GraphRunner**: The entry point that triggers execution of the graph
  from all terminal nodes. Calling `run()` executes each node in order,
  respecting dependencies.
- **Graph**: Maintains the parent→child relationships of all **Node**
  objects in the pipeline.
- **Node**: Orchestrates caching. If a node’s result is already computed
  (in its **Cache**), it loads it; otherwise it creates a private
  **_ExecutionNode** to compute the output. Nodes pass data via a shared
  context; a sketch of this contract follows the next section.
- **AINode**: A subclass of `Node` that wraps an AI assistant. It
  generates prompts and calls the assistant to produce text outputs.
- **Storage**: A global key-value store for persisting intermediate data
  across the pipeline.

Assistants & Data Models
------------------------

.. uml::

   @startuml
   package ai_graph {
       class AIAssistant {
           + get_response(prompt): string
       }
       class PromptBuilder {
           + add_input()
           + set_output()
       }
       class InputType {
           + get_description(): string
       }
       class OutputType {
           + get_description(): string
       }
   }
   @enduml

- **AIAssistant**: Abstracts any LLM API (e.g. OpenAI or a locally hosted
  model) behind a uniform interface. Developers can plug in different
  model endpoints here.
- **PromptBuilder**: Constructs prompts in a structured (JSON-schema)
  format. It takes an `InputType` definition, adds current context
  fields, and specifies the expected `OutputType`, so both inputs and
  outputs can be validated (see the sketch below).
- **InputType / OutputType**: Pydantic models (JSON-schema models) that
  declare the fields for prompts and responses. They include validation
  rules and documentation for each field.
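To make this contract concrete, the sketch below shows what an
`InputType`/`OutputType` pair and a schema-aware prompt might look like.
It is a minimal illustration assuming pydantic v2; the names
`EmailInput`, `EmailOutput`, and `build_prompt` are hypothetical
stand-ins for this example, not the project's actual classes.

.. code-block:: python

   # Minimal sketch, assuming pydantic v2. EmailInput, EmailOutput, and
   # build_prompt are hypothetical names, not the project's actual classes.
   from pydantic import BaseModel, Field


   class EmailInput(BaseModel):
       """InputType: context fields handed to the model."""

       topic: str = Field(description="Ticket topic produced by the TopicGen node")


   class EmailOutput(BaseModel):
       """OutputType: the JSON shape the assistant must return."""

       subject: str = Field(description="Subject line of the generated email")
       body: str = Field(description="Plain-text body of the customer email")


   def build_prompt(inputs: BaseModel, output_type: type[BaseModel]) -> str:
       # Serialize the inputs and attach the JSON schema of the expected
       # output so the response can be validated after the call.
       return (
           f"Inputs:\n{inputs.model_dump_json(indent=2)}\n\n"
           "Respond with JSON matching this schema:\n"
           f"{output_type.model_json_schema()}"
       )


   prompt = build_prompt(EmailInput(topic="Billing error"), EmailOutput)
   # reply = assistant.get_response(prompt)
   # result = EmailOutput.model_validate_json(reply)

Keeping the schema inside the prompt and validating the reply against the
same model is what lets downstream nodes trust the structure of each
node's output.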
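The node-level caching described under *Graph Runner and Nodes* can also
be sketched briefly. Class and attribute names follow the diagram; the
method bodies below are assumptions for illustration, not the project's
actual implementation.

.. code-block:: python

   # Sketch of the Node/Cache contract; method bodies are assumed.
   import asyncio
   from typing import Any


   class Cache:
       def __init__(self) -> None:
           self.stored_result: Any = None
           self.has_result = False  # assumed flag so None stays a valid result

       def save(self, result: Any) -> None:
           self.stored_result, self.has_result = result, True


   class Node:
       def __init__(self, parents: list["Node"] | None = None) -> None:
           self.parents = parents or []
           self.cache = Cache()
           self.was_executed = False

       async def execute(self) -> Any:
           # Resolve dependencies first, then reuse the cached result if
           # present; only on a cache miss is the execution step invoked.
           inputs = [await parent.execute() for parent in self.parents]
           if not self.cache.has_result:
               self.cache.save(await self._compute(inputs))  # _ExecutionNode's role
           self.was_executed = True
           return self.cache.stored_result

       async def _compute(self, inputs: list[Any]) -> Any:
           raise NotImplementedError  # an AINode would call its assistant here


   class GraphRunner:
       def __init__(self, end_nodes: list[Node]) -> None:
           self.end_nodes = end_nodes

       def run(self) -> list[Any]:
           # Executing the terminal nodes pulls in the whole graph transitively.
           async def _run() -> list[Any]:
               return [await node.execute() for node in self.end_nodes]

           return asyncio.run(_run())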
Randomized Inputs
-----------------

.. uml::

   @startuml
   package random {
       interface IRandom {
           + get_random(): V
       }
       class RandomCollection implements IRandom {
           - values: List
           - weights: List
           + get_random(): V
       }
       class RandomTable implements IRandom {
           - rows: Map<K, RandomCollection>
           + get_random(key: K): V
       }
   }
   @enduml

To introduce variability without calling an AI model:

- **IRandom**: An interface for classes that can generate random values
  on demand.
- **RandomCollection**: Holds a list of values with associated weights.
  It optionally applies small random perturbations to the weights on
  each draw to simulate noise, then samples a random element.
- **RandomTable**: Maps a key (often an enum, e.g. ticket type) to a
  `RandomCollection` of values (e.g. possible priorities for that ticket
  type). This allows conditional sampling (different priorities for bugs
  vs. feature requests, etc.).

Cost & Usage Analysis
---------------------

.. uml::

   @startuml
   package analysis {
       interface Analyzer {
           + get_cost_analysis(): AnalysisResult
       }
       class AssistantRun implements Analyzer {
           + get_cost_analysis(): AnalysisResult
       }
       class AssistantRunsAnalyzer implements Analyzer {
           + get_cost_analysis(): AnalysisResult
       }
       class DatasetUsageAnalyzer implements Analyzer {
           + generate_cost_summary(): List[FormattedAnalysis]
       }
       class AnalysisResult {
           + prompt_tokens: int
           + prompts_cost: Money
           + completion_tokens: int
           + completions_cost: Money
           + total_cost: Money
       }
   }
   @enduml

The system tracks token counts and costs throughout:

- **AssistantRun**: Records the token usage and cost of a single
  `run_openai` (or equivalent) call for one node.
- **AssistantRunsAnalyzer**: Aggregates multiple `AssistantRun` records
  (e.g. from one pipeline run) into an `AnalysisResult` summarizing
  total tokens and cost.
- **DatasetUsageAnalyzer**: Gathers cost information across all generated
  tickets and produces a formatted summary (a breakdown per assistant and
  totals in any currency). It can emit reports as CSV files or tables.

These analyzers help you audit the monetary cost of generation per
assistant. See the Usage section for examples of generating cost reports.
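As a rough illustration of the aggregation step, the sketch below rolls
single-call records up into a combined `AnalysisResult`. Field names
follow the diagram, but `Money` is simplified to a USD float and the
figures are made up for the example, so treat this as a sketch rather
than the analyzer's actual code.

.. code-block:: python

   # Sketch of AnalysisResult aggregation; Money is simplified to a USD
   # float and the figures below are illustrative, not measured values.
   from dataclasses import dataclass


   @dataclass
   class AnalysisResult:
       prompt_tokens: int = 0
       prompts_cost: float = 0.0
       completion_tokens: int = 0
       completions_cost: float = 0.0

       @property
       def total_cost(self) -> float:
           return self.prompts_cost + self.completions_cost

       def add(self, other: "AnalysisResult") -> None:
           # The AssistantRunsAnalyzer-style roll-up is a field-wise sum.
           self.prompt_tokens += other.prompt_tokens
           self.prompts_cost += other.prompts_cost
           self.completion_tokens += other.completion_tokens
           self.completions_cost += other.completions_cost


   # Two single-call records (what an AssistantRun would capture) ...
   runs = [
       AnalysisResult(812, 0.0243, 301, 0.0181),
       AnalysisResult(640, 0.0192, 220, 0.0132),
   ]
   # ... rolled up into one pipeline-level result.
   total = AnalysisResult()
   for run in runs:
       total.add(run)
   print(f"{total.prompt_tokens + total.completion_tokens} tokens, "
         f"${total.total_cost:.4f}")  # 1973 tokens, $0.0748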