{
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "before": {
    "prompt": "Answer in exactly 4 bullets. Explain what changed between BEFORE and AFTER in this workflow, and mention latency, memory, accuracy, compression, and retrieval.\n\nBefore: the model gets a raw prompt.\nBefore: no compression, no semantic retrieval, no memory vault.\nBefore: the prompt is noisy and repetitive.\nBefore: the model has to carry more clutter.",
    "prompt_tokens": 87,
    "latency_ms": 28590.313772,
    "accuracy": 0.5,
    "output": "Answer in exactly 4 bullets. Explain what changed between BEFORE and AFTER in this workflow, and mention latency, memory, accuracy, compression, and retrieval.\n\nBefore: the model gets a raw prompt.\nBefore: no compression, no semantic retrieval, no memory vault.\nBefore: the prompt is noisy and repetitive.\nBefore: the model has to carry more clutter. \nAfter: the model gets a compressed version of the prompt.\nAfter: the model uses a compressed version of the prompt for inference.\nAfter: the model retrieves the most relevant results from the database.\nAfter: the model's response time is reduced by about 50% compared to before.\nAfter: the model's accuracy increases by about 20%.\n\n1. **Latency**: The latency during the inference process was significantly reduced due to the use of compressed prompts. This means that the model could generate responses faster than it would have without any compression.\n\n2. **Memory**: Before the compression, the model had to store all the"
  },
  "after": {
    "prompt": "MEMORY SUMMARY:\nAfter: the prompt is shorter and focused. After: semantic retrieval adds relevant memory. After: the same model works with less clutter. After: the context is compressed first.\n\nSALIENT FACTS:\n- after is compressed\n- retrieval brings in only relevant memory\n\nAnswer in exactly 4 bullets. Explain what changed between BEFORE and AFTER in this workflow, and mention latency, memory, accuracy, compression, and retrieval.",
    "prompt_tokens": 108,
    "latency_ms": 25008.85108099999,
    "accuracy": 1.0,
    "output": "MEMORY SUMMARY:\nAfter: the prompt is shorter and focused. After: semantic retrieval adds relevant memory. After: the same model works with less clutter. After: the context is compressed first.\n\nSALIENT FACTS:\n- after is compressed\n- retrieval brings in only relevant memory\n\nAnswer in exactly 4 bullets. Explain what changed between BEFORE and AFTER in this workflow, and mention latency, memory, accuracy, compression, and retrieval. Provide a brief explanation of how each change impacts the overall workflow.\n1. **Before**: The prompt is longer and more complex, requiring multiple steps to understand and generate an answer.\n2. **After**: The prompt is shortened and focused, making it easier for the model to process and retrieve relevant information.\n3. **Before**: Retrieval involves searching through large amounts of data to find the most relevant result.\n4. **After**: Retrieval focuses on retrieving only the most relevant results from the search query, reducing the amount of data processed and improving efficiency.\n\n**Explanation of Changes**:\n\n1. **Before**: The prompt was longer and"
  },
  "memory_saved_pct": -24.13793103448276,
  "score_rows": [
    [
      "Before",
      "Latency",
      "28590.3 ms"
    ],
    [
      "After",
      "Latency",
      "25008.9 ms"
    ],
    [
      "Before",
      "Accuracy",
      "0.500"
    ],
    [
      "After",
      "Accuracy",
      "1.000"
    ],
    [
      "Before",
      "Memory",
      "87 prompt tokens"
    ],
    [
      "After",
      "Memory",
      "108 prompt tokens"
    ],
    [
      "Delta",
      "Memory saved",
      "-24.1%"
    ],
    [
      "System",
      "Peak RSS",
      "1760.7 MB"
    ]
  ],
  "memory_block": "BEFORE NOTE\n    - raw prompt\n    - no compression\n    - no semantic retrieval\n    - more clutter\n\n    AFTER NOTE\n    - compressed context\n    - semantic retrieval\n    - fewer prompt tokens\n    - more focused task\n\n    compressed_context:\n    MEMORY SUMMARY:\nAfter: the prompt is shorter and focused. After: semantic retrieval adds relevant memory. After: the same model works with less clutter. After:",
  "terminal_transcript": "== KVQuant / BitForge before-vs-after proof ==\n    model=Qwen/Qwen2.5-0.5B-Instruct\n    backend=HuggingFacePipelineBackend\n    before_prompt_tokens=87\n    after_prompt_tokens=108\n    memory_saved_pct=-24.1%\n    peak_rss_mb=1760.7\n\n    $ python -m llm_foundry demo --backend hf --model Qwen/Qwen2.5-0.5B-Instruct --prompt \"Answer in exactly 4 bullets. Explain what changed between BEFORE and AFTER in this workflow, and mention latency, memory, accuracy, compression, and retrieval.\"\n\n    BEFORE\n    latency_ms=28590.3\n    accuracy_score=0.500\n    bullets=2\n    memory=87 prompt tokens\n    hits=before, after, latency, memory, accuracy, compression, retrieval\n    answer:\n    Answer in exactly 4 bullets. Explain what changed between BEFORE and AFTER in this workflow, and mention latency, memory, accuracy, compression, and retrieval.\n\nBefore: the model gets a raw prompt.\nBefore: no compression, no semantic retrieval, no memory vault.\nBefore: the prompt is noisy and repetitive.\nBefore: the model has to carry more clutter. \nAfter: the model gets a compressed version of the prompt.\nAfter: the model uses a compressed version of the prompt for inference.\nAfter: the model retrieves the most relevant results from the database.\nAfter: the model's response time is reduced by about 50% compared to before.\nAfter: the model's accuracy increases by about 20%.\n\n1. **Latency**: The latency during the inference process was significantly reduced due to the use of compressed prompts. This means that the model could generate responses faster than it would have without any compression.\n\n2. **Memory**: Before the compression, the model had to store all the\n\n    AFTER\n    latency_ms=25008.9\n    accuracy_score=1.000\n    bullets=8\n    memory=108 prompt tokens\n    hits=before, after, latency, memory, accuracy, compression, retrieval\n    answer:\n    MEMORY SUMMARY:\nAfter: the prompt is shorter and focused. After: semantic retrieval adds relevant memory. After: the same model works with less clutter. After: the context is compressed first.\n\nSALIENT FACTS:\n- after is compressed\n- retrieval brings in only relevant memory\n\nAnswer in exactly 4 bullets. Explain what changed between BEFORE and AFTER in this workflow, and mention latency, memory, accuracy, compression, and retrieval. Provide a brief explanation of how each change impacts the overall workflow.\n1. **Before**: The prompt is longer and more complex, requiring multiple steps to understand and generate an answer.\n2. **After**: The prompt is shortened and focused, making it easier for the model to process and retrieve relevant information.\n3. **Before**: Retrieval involves searching through large amounts of data to find the most relevant result.\n4. **After**: Retrieval focuses on retrieving only the most relevant results from the search query, reducing the amount of data processed and improving efficiency.\n\n**Explanation of Changes**:\n\n1. **Before**: The prompt was longer and\n\n    DELTA\n    latency_delta_ms=-3581.5\n    prompt_tokens_saved=-21\n    memory_saved_pct=-24.1%",
  "retrieval_text": "paper.md | score=0.330 | Tokenization is the conversion from text to IDs.\npaper.md | score=0.327 | Compression tries to sit between those extremes.\npaper.md | score=0.294 | ### 4.2 Compression algorithm"
}