KVQuant / BitForge proof: the same local model, before vs after

I wanted to see the thing the user actually asked for: not a made-up screenshot, but a local model doing work twice — once with a raw prompt, then again after LLM Foundry compressed the context and pulled relevant memory back in.

model=Qwen/Qwen2.5-0.5B-Instruct before prompt tokens=170 after prompt tokens=120 memory saved=29.4%

21553 ms

before latency

13983 ms

after latency

1.000

before accuracy score

1.000

after accuracy score

Before prompt

Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical.

Before: the prompt goes straight to the model.
Before: no compression, no semantic retrieval, no memory vault.
Before: the model gets more clutter and more repeated context.
After: compressed context is built first.
After: semantic retrieval pulls in relevant memory notes.
After: the prompt is shorter and more focused.
After: the same model is asked to do the same task.
This is the KVQuant / BitForge-style before-versus-after comparison we want to show.

After prompt

MEMORY SUMMARY:
This is the KVQuant / BitForge-style before-versus-after comparison we want to show. Before: the model gets more clutter and more repeated context. Before: the prompt goes straight to the model. After: the prompt is shorter and more focused.

Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical.

Before output

Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical.

Before: the prompt goes straight to the model.
Before: no compression, no semantic retrieval, no memory vault.
Before: the model gets more clutter and more repeated context.
After: compressed context is built first.
After: semantic retrieval pulls in relevant memory notes.
After: the prompt is shorter and more focused.
After: the same model is asked to do the same task.
This is the KVQuant / BitForge-style before-versus-after comparison we want to show. The goal is to make a concrete, practical difference between the two workflows. Please use bullet points for each section. Before:
- The prompt is sent directly to the model without any preprocessing or additional steps.
- There's no need for memory vaults or bit-forge storage.
- No compression is applied, so the response size remains constant regardless of the input length.
- The model doesn't have to remember anything from previous interactions; it starts fresh with new data.
- Retrieval happens immediately upon receiving the prompt, eliminating the need for further processing.

After:
- The prompt is preprocessed by compressing its content into a smaller format

After output

MEMORY SUMMARY:
This is the KVQuant / BitForge-style before-versus-after comparison we want to show. Before: the model gets more clutter and more repeated context. Before: the prompt goes straight to the model. After: the prompt is shorter and more focused.

Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical. Don't use "the" or "a". The first bullet point should be a single word.
- Memory
- Accuracy
- Latency
- KVQuant
- BitForge
- Compression
- Retrieval

Before:

1. Model complexity increases with each iteration.
2. Context repeats in every step.
3. Prompt length grows exponentially.
4. Retrieval time increases significantly.

After:

1. Model complexity decreases as iterations progress.
2. Context becomes less repetitive.
3. Prompt length remains constant but is compressed for efficiency.
4. Retrieval time improves due to fewer steps.

Memory:
- Initially, the model consumes

Scores

Version	Metric	Value
Before	Latency	21552.6 ms
After	Latency	13982.5 ms
Before	Accuracy	1.000
After	Accuracy	1.000
Before	Memory	170 prompt tokens
After	Memory	120 prompt tokens
Delta	Memory saved	29.4%
System	Peak RSS	1814.2 MB

Memory and retrieval

BEFORE NOTE
    - raw prompt
    - no compression
    - no semantic retrieval
    - more clutter

    AFTER NOTE
    - compressed context
    - semantic retrieval
    - fewer prompt tokens
    - more focused task

    compressed_context:
    MEMORY SUMMARY:
This is the KVQuant / BitForge-style before-versus-after comparison we want to show. Before: the model gets more clutter and more repeated context. Before: the prompt goes straight to the model. After: the prompt is shorter and more focused.

Terminal transcript

== KVQuant / BitForge before-vs-after proof ==
model=Qwen/Qwen2.5-0.5B-Instruct
backend=HuggingFacePipelineBackend
before_prompt_tokens=170
after_prompt_tokens=120
memory_saved_pct=29.4%
peak_rss_mb=1814.2

$ python -m llm_foundry demo --backend hf --model Qwen/Qwen2.5-0.5B-Instruct --prompt "Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical."

BEFORE
latency_ms=21552.6
accuracy_score=1.000
memory=170 prompt tokens
hits=before, after, latency, memory, accuracy, kvquant, bitforge, compression, retrieval
output:
Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical.

Before: the prompt goes straight to the model.
Before: no compression, no semantic retrieval, no memory vault.
Before: the model gets more clutter and more repeated context.
After: compressed context is built first.
After: semantic retrieval pulls in relevant memory notes.
After: the prompt is shorter and more focused.
After: the same model is asked to do the same task.
This is the KVQuant / BitForge-style before-versus-after comparison we want to show. The goal is to make a concrete, practical difference between the two workflows. Please use bullet points for each section. Before:
- The prompt is sent directly to the model without any preprocessing or additional steps.
- There's no need for memory vaults or bit-forge storage.
- No compression is applied, so the response size remains constant regardless of the input length.
- The model doesn't have to remember anything from previous interactions; it starts fresh with new data.
- Retrieval happens immediately upon receiving the prompt, eliminating the need for further processing.

After:
- The prompt is preprocessed by compressing its content into a smaller format

AFTER
latency_ms=13982.5
accuracy_score=1.000
memory=120 prompt tokens
hits=before, after, latency, memory, accuracy, kvquant, bitforge, compression, retrieval
output:
MEMORY SUMMARY:
This is the KVQuant / BitForge-style before-versus-after comparison we want to show. Before: the model gets more clutter and more repeated context. Before: the prompt goes straight to the model. After: the prompt is shorter and more focused.

Compare the BEFORE and AFTER versions of this workflow. Write exactly 4 bullets. Use these words somewhere: before, after, latency, memory, accuracy, KVQuant, BitForge, compression, retrieval. Make it concrete and practical. Don't use "the" or "a". The first bullet point should be a single word.
- Memory
- Accuracy
- Latency
- KVQuant
- BitForge
- Compression
- Retrieval

Before:

1. Model complexity increases with each iteration.
2. Context repeats in every step.
3. Prompt length grows exponentially.
4. Retrieval time increases significantly.

After:

1. Model complexity decreases as iterations progress.
2. Context becomes less repetitive.
3. Prompt length remains constant but is compressed for efficiency.
4. Retrieval time improves due to fewer steps.

Memory:
- Initially, the model consumes

DELTA
latency_delta_ms=-7570.1
prompt_tokens_saved=50
memory_saved_pct=29.4%

Repo retrieval hits

memory-vault/notes/project-goal-project-goal.md | score=0.494 | Build a modular model framework with memory, compression, and tool use.
paper.md | score=0.390 | ### 3.4 Compression and memory layer
src/llm_foundry/cli.py | score=0.358 | train_model = sub.add_parser("train-model")
    train_model.add_argument("--corpus", required=True)
    train_model.add_argument("--config", default="")
    train_model.add_argumen

Links

GitHub: https://github.com/AmSach/llm-foundry
GitHub profile: https://github.com/AmSach
Instagram: https://www.instagram.com/i.amsach
LinkedIn: https://www.linkedin.com/in/theamansachan