{
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "before": {
    "prompt": "Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical.\n\nBefore: the prompt goes straight to the model.\nBefore: no compression, no semantic retrieval, no memory vault.\nBefore: the model gets more clutter and more repeated context.\nAfter: compressed context is built first.\nAfter: semantic retrieval pulls in relevant memory notes.\nAfter: the prompt is shorter and more focused.\nAfter: the same model is asked to do the same task.\nThis is the KVQuant / BitForge-style before-versus-after comparison we want to show.",
    "prompt_tokens": 170,
    "latency_ms": 21552.581301999908,
    "accuracy": 1.0,
    "output": "Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical.\n\nBefore: the prompt goes straight to the model.\nBefore: no compression, no semantic retrieval, no memory vault.\nBefore: the model gets more clutter and more repeated context.\nAfter: compressed context is built first.\nAfter: semantic retrieval pulls in relevant memory notes.\nAfter: the prompt is shorter and more focused.\nAfter: the same model is asked to do the same task.\nThis is the KVQuant / BitForge-style before-versus-after comparison we want to show. The goal is to make a concrete, practical difference between the two workflows. Please use bullet points for each section. Before:\n- The prompt is sent directly to the model without any preprocessing or additional steps.\n- There's no need for memory vaults or bit-forge storage.\n- No compression is applied, so the response size remains constant regardless of the input length.\n- The model doesn't have to remember anything from previous interactions; it starts fresh with new data.\n- Retrieval happens immediately upon receiving the prompt, eliminating the need for further processing.\n\nAfter:\n- The prompt is preprocessed by compressing its content into a smaller format"
  },
  "after": {
    "prompt": "MEMORY SUMMARY:\nThis is the KVQuant / BitForge-style before-versus-after comparison we want to show. Before: the model gets more clutter and more repeated context. Before: the prompt goes straight to the model. After: the prompt is shorter and more focused.\n\nCompare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical.",
    "prompt_tokens": 120,
    "latency_ms": 13982.50591499982,
    "accuracy": 1.0,
    "output": "MEMORY SUMMARY:\nThis is the KVQuant / BitForge-style before-versus-after comparison we want to show. Before: the model gets more clutter and more repeated context. Before: the prompt goes straight to the model. After: the prompt is shorter and more focused.\n\nCompare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical. Don't use \"the\" or \"a\". The first bullet point should be a single word.\n- Memory\n- Accuracy\n- Latency\n- KVQuant\n- BitForge\n- Compression\n- Retrieval\n\nBefore:\n\n1. Model complexity increases with each iteration.\n2. Context repeats in every step.\n3. Prompt length grows exponentially.\n4. Retrieval time increases significantly.\n\nAfter:\n\n1. Model complexity decreases as iterations progress.\n2. Context becomes less repetitive.\n3. Prompt length remains constant but is compressed for efficiency.\n4. Retrieval time improves due to fewer steps.\n\nMemory:\n- Initially, the model consumes"
  },
  "memory_saved_pct": 29.411764705882348,
  "score_rows": [
    [
      "Before",
      "Latency",
      "21552.6 ms"
    ],
    [
      "After",
      "Latency",
      "13982.5 ms"
    ],
    [
      "Before",
      "Accuracy",
      "1.000"
    ],
    [
      "After",
      "Accuracy",
      "1.000"
    ],
    [
      "Before",
      "Memory",
      "170 prompt tokens"
    ],
    [
      "After",
      "Memory",
      "120 prompt tokens"
    ],
    [
      "Delta",
      "Memory saved",
      "29.4%"
    ],
    [
      "System",
      "Peak RSS",
      "1814.2 MB"
    ]
  ],
  "memory_block": "BEFORE NOTE\n    - raw prompt\n    - no compression\n    - no semantic retrieval\n    - more clutter\n\n    AFTER NOTE\n    - compressed context\n    - semantic retrieval\n    - fewer prompt tokens\n    - more focused task\n\n    compressed_context:\n    MEMORY SUMMARY:\nThis is the KVQuant / BitForge-style before-versus-after comparison we want to show. Before: the model gets more clutter and more repeated context. Before: the prompt goes straight to the model. After: the prompt is shorter and more focused.",
  "terminal_transcript": "== KVQuant / BitForge before-vs-after proof ==\n    model=Qwen/Qwen2.5-0.5B-Instruct\n    backend=HuggingFacePipelineBackend\n    before_prompt_tokens=170\n    after_prompt_tokens=120\n    memory_saved_pct=29.4%\n    peak_rss_mb=1814.2\n\n    $ python -m llm_foundry demo --backend hf --model Qwen/Qwen2.5-0.5B-Instruct --prompt \"Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical.\"\n\n    BEFORE\n    latency_ms=21552.6\n    accuracy_score=1.000\n    memory=170 prompt tokens\n    hits=before, after, latency, memory, accuracy, kvquant, bitforge, compression, retrieval\n    output:\n    Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical.\n\nBefore: the prompt goes straight to the model.\nBefore: no compression, no semantic retrieval, no memory vault.\nBefore: the model gets more clutter and more repeated context.\nAfter: compressed context is built first.\nAfter: semantic retrieval pulls in relevant memory notes.\nAfter: the prompt is shorter and more focused.\nAfter: the same model is asked to do the same task.\nThis is the KVQuant / BitForge-style before-versus-after comparison we want to show. The goal is to make a concrete, practical difference between the two workflows. Please use bullet points for each section. Before:\n- The prompt is sent directly to the model without any preprocessing or additional steps.\n- There's no need for memory vaults or bit-forge storage.\n- No compression is applied, so the response size remains constant regardless of the input length.\n- The model doesn't have to remember anything from previous interactions; it starts fresh with new data.\n- Retrieval happens immediately upon receiving the prompt, eliminating the need for further processing.\n\nAfter:\n- The prompt is preprocessed by compressing its content into a smaller format\n\n    AFTER\n    latency_ms=13982.5\n    accuracy_score=1.000\n    memory=120 prompt tokens\n    hits=before, after, latency, memory, accuracy, kvquant, bitforge, compression, retrieval\n    output:\n    MEMORY SUMMARY:\nThis is the KVQuant / BitForge-style before-versus-after comparison we want to show. Before: the model gets more clutter and more repeated context. Before: the prompt goes straight to the model. After: the prompt is shorter and more focused.\n\nCompare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical. Don't use \"the\" or \"a\". The first bullet point should be a single word.\n- Memory\n- Accuracy\n- Latency\n- KVQuant\n- BitForge\n- Compression\n- Retrieval\n\nBefore:\n\n1. Model complexity increases with each iteration.\n2. Context repeats in every step.\n3. Prompt length grows exponentially.\n4. Retrieval time increases significantly.\n\nAfter:\n\n1. Model complexity decreases as iterations progress.\n2. Context becomes less repetitive.\n3. Prompt length remains constant but is compressed for efficiency.\n4. 
Retrieval time improves due to fewer steps.\n\nMemory:\n- Initially, the model consumes\n\n    DELTA\n    latency_delta_ms=-7570.1\n    prompt_tokens_saved=50\n    memory_saved_pct=29.4%",
  "retrieval_text": "memory-vault/notes/project-goal-project-goal.md | score=0.494 | Build a modular model framework with memory, compression, and tool use.\npaper.md | score=0.390 | ### 3.4 Compression and memory layer\nsrc/llm_foundry/cli.py | score=0.358 | train_model = sub.add_parser(\"train-model\")\n    train_model.add_argument(\"--corpus\", required=True)\n    train_model.add_argument(\"--config\", default=\"\")\n    train_model.add_argumen"
}