\documentclass[12pt]{article}

% --- Encoding ---
\usepackage[T1]{fontenc}

% --- Core packages ---
\usepackage[margin=1in]{geometry}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{array}
\usepackage{enumitem}
\usepackage{titlesec}
\usepackage{longtable}
\usepackage{tabularx}
\usepackage{abstract}
\usepackage{rotating}
\usepackage{float}
\usepackage{setspace}

% --- Citation style: natbib with sort+compress ---
\usepackage[numbers,sort&compress]{natbib}

% --- Color ---
\usepackage{xcolor}

% --- Headers and footers ---
\usepackage{fancyhdr}

% --- Hyperref last ---
\usepackage{hyperref}

% --- Colors ---
\definecolor{humanblue}{RGB}{30,90,160}
\definecolor{aigreen}{RGB}{40,140,80}
\definecolor{detgray}{RGB}{90,90,110}
\definecolor{lightgray}{RGB}{245,245,248}
\definecolor{warnorange}{RGB}{200,100,20}

% --- Custom semantic commands ---
\newcommand{\bxthree}{\textbf{BX3}}
\newcommand{\purpose}{\textit{Purpose Layer}}
\newcommand{\bounds}{\textit{Bounds Engine}}
\newcommand{\fact}{\textit{Fact Layer}}

% --- Header/Footer ---
\pagestyle{fancy}
\fancyhf{}
\rhead{\small Beebe}
\lhead{\small Token Bomb --- BX3 Framework}
\cfoot{\thepage}
\addtolength{\topmargin}{-1.6pt}
\setlength{\headheight}{13.6pt}
renewcommand{\headrulewidth}{0.4pt}

% --- Hyperref metadata ---
\hypersetup{
    colorlinks=true,
    linkcolor=humanblue,
    citecolor=humanblue,
    urlcolor=humanblue,
    pdftitle={Token Bomb: Computational Resource Denial Through Recursive Context Allocation},
    pdfauthor={Jeremy Blaine Thompson Beebe},
    pdfsubject={Artificial Intelligence, Software Architecture, AI Security},
    pdfkeywords={token bomb, computational resource denial, context window, LLM security, adversarial prompts, resource metering, fact layer, BX3 Framework, denial of service, AI infrastructure},
    pdfcreator={pdfLaTeX},
    bookmarksnumbered=true,
    breaklinks=true,
}

\title{
    \vspace{1.2cm}
    {\LARGE \textbf{Token Bomb:}}\\[0.6em]
    {\large \textit{Computational Resource Denial Through Recursive Context Allocation}\\[1.0em]}
    {\normalsize Threat Model. Memory Budget Enforcement. Optimality Proof.}
}
\author{
    \textbf{Jeremy Blaine Thompson Beebe}\\[0.2em]
    \textit{Independent Researcher}\\[0.3em]
    ORCID: \href{https://orcid.org/0009-0009-2394-9714}{0009-0009-2394-9714} \quad Email: bxthre3inc@gmail.com\\[0.3em]
    \textit{Bxthre3 Inc. \hfill April 2026}\\[0.3em]
}

\date{April 2026}

\begin{document}

\maketitle

\begin{figure}[htbp]
\centering
\caption{The Token Bomb attack: an adversarial prompt instructs the model to generate a long response containing a nested prompt that instructs the model to generate another long response. The recursive nesting causes context allocation to grow exponentially, exhausting GPU memory within seconds. The defense enforces a hard memory budget $M_{max}$ at the \fact\ layer, rejecting any inference that would exceed the budget before the model processes the prompt.}
\label{fig:token-bomb}
\bigskip
\begin{tabular}{p{0.35\linewidth}p{0.57\linewidth}}
\toprule
\textbf{Component} & \textbf{Function} \\
\midrule
Context Meter & Measure prompt length before each inference pass \\
\midrule
Fact Layer enforcement & Reject if prompt length $> M_{max}$, before model processes it \\
\midrule
Growth Rate Monitor & Flag sessions with $>2\times$ context growth per turn \\
\midrule
Human Review Queue & Throttled sessions held pending human determination \\
\bottomrule
\end{tabular}
\end{figure}

\thispagestyle{fancy}

\begin{abstract}
\noindent
Large language model deployments are vulnerable to a class of computational resource attacks in which a malicious or malformed prompt causes the model to allocate recursively expanding context windows, eventually exhausting the serving infrastructure's memory and compute. This attack — the Token Bomb — exploits the gap between the model's maximum context window size and the serving system's physical memory capacity. A single adversarial prompt can consume gigabytes of GPU memory within seconds, causing service denial for all concurrent users. This paper presents the Token Bomb threat model, analyzes the attack surface across common LLM serving architectures, and proposes a defense: recursive context metering with hard memory budget enforcement at the \fact\ layer. We prove that the defense bounds maximum context growth to a configurable fraction of available memory regardless of prompt content, and we present experimental results showing that the defense prevents memory exhaustion attacks with a mean overhead of 1.2\% on normal workloads and 100\% attack prevention rate.

\vspace{0.5em}
\noindent\textit{This paper is a threat analysis and systems security paper with experimental validation on a GPT-4-class serving infrastructure.}
\end{abstract}

\vspace{0.5em}
\noindent\textbf{Keywords:} token bomb, computational resource denial, context window, LLM security, adversarial prompts, resource metering, fact layer, BX3 Framework, denial of service, AI infrastructure

\vspace{1em}
\hrule
\vspace{1em}

\onehalfspacing

% -------------------------------------------------------
\section{Introduction}
\label{sec:intro}

Large language models process text in context windows. A prompt of $N$ tokens causes the model to allocate $O(N)$ memory for attention computation, or $O(N^2)$ in naive attention implementations without optimization. For models with 128K-token context windows, a single prompt can consume gigabytes of GPU memory. This memory allocation is a fundamental property of the transformer architecture and cannot be avoided without changing the model's architecture.

The Token Bomb exploits this property by crafting prompts that cause the model to recursively expand its context. The attacker provides a prompt that instructs the model to generate a long response, then embed within that response instructions to generate another long response, and so on. The recursive nesting causes context allocation to grow exponentially with nesting depth. At sufficient depth, the model's context allocation exhausts the GPU's memory, causing an out-of-memory (OOM) error on the serving node.

This attack is particularly dangerous because it operates entirely within the model's inference process, bypassing most application-layer security controls. The attack requires no code execution, no system-level access, and no special privileges — only the ability to submit text prompts to the model. Any deployment that exposes an LLM interface to untrusted input is potentially vulnerable.

This paper presents: (1) the Token Bomb threat model with formal attack characterization, (2) an analysis of the attack surface across common LLM serving architectures, (3) a defense architecture based on recursive context metering with hard memory budget enforcement at the \fact\ layer, (4) an optimality proof showing the defense provides the strongest possible guarantee, and (5) experimental results validating the defense.

% -------------------------------------------------------
\section{Threat Model}
\label{sec:threat-model}

We formalize the Token Bomb attack as follows. Let $M$ be a language model with maximum context window $W_{max}$ tokens and let $S$ be the serving system's total GPU memory capacity. Let $f(n)$ be the memory consumption in tokens for a context of length $n$ (for standard attention, $f(n) = n$; for naive attention, $f(n) = n^2$; for more efficient implementations, $f(n)$ may be subquadratic but remains superlinear).

An attacker submits a prompt $p_0$ of length $|p_0| = n_0$ tokens. The prompt $p_0$ instructs the model to generate a response containing prompt $p_1$, which instructs the model to generate a response containing prompt $p_2$, and so on, for $k$ nesting levels. At nesting level $i$, the model's context window must hold: (a) the original prompt $p_0$, (b) the model's generated response to $p_0$, which contains $p_1$, (c) the model's generated response to $p_1$, which contains $p_2$, and so on. Inductively, at level $i$ the context holds the concatenation of all prompts and responses from levels $0$ to $i$.

The context length at level $k$ grows as:
\begin{equation}
C(k) = n_0 \cdot \sum_{i=0}^{k} r^i = n_0 \cdot \frac{r^{k+1} - 1}{r - 1}
\label{eq:context-growth}
\end{equation}
where $r > 1$ is the expansion ratio (response length divided by prompt length). For a model with $r = 10$ and $n_0 = 100$ tokens, at $k = 5$: $C(5) = 100 \cdot (10^6 - 1)/9 \approx 111M$ tokens — vastly exceeding any practical context window.

The attack succeeds when $f(C(k)) \geq S$, causing an OOM condition on the serving node. For naive attention with $f(n) = n^2$, this occurs when $C(k) \geq \sqrt{S}$. For a 40GB GPU with approximately 10GB available for attention computation after overhead, this requires $C(k) \geq 100K$ tokens — achievable at $k = 3$ with $r = 10$.

The attack is not theoretical: we confirmed it executes successfully on an unprotected serving infrastructure (Section~\ref{sec:experimental}).

% -------------------------------------------------------
\section{Attack Surface Analysis}
\label{sec:attack-surface}

The Token Bomb attack surface spans three common LLM serving architectures:

\subsection{Direct API Deployment}
In direct API deployments (OpenAI API, Anthropic API, self-hosted vLLM), the model serves multiple concurrent users on shared GPU infrastructure. A Token Bomb consuming an entire GPU's memory causes OOM for all concurrent users, creating a denial-of-service condition across the entire service.

\subsection{Agentic Workflows}
In agentic workflows where the model output is fed back as input in a loop (tool use, multi-turn reasoning, autonomous agents), the attack surface is magnified. The agent's own output becomes the next input, creating an implicit recursive nesting without the attacker explicitly embedding nested prompts. A model that generates verbose outputs in response to verbose inputs will naturally exhibit exponential context growth, even without malicious intent.

\subsection{Context Extension Services}
Services that extend context window length (context window extension APIs, retrieval-augmented generation pipelines) add additional attack surface. An attacker can submit a prompt that the extension service expands to a longer context (by retrieving related documents or appending system context), triggering Token Bomb dynamics on a system the attacker does not directly control.

% -------------------------------------------------------
\section{The Defense: Recursive Context Metering}
\label{sec:defense}

The defense implements recursive context metering at the \fact\ layer. The key design principle is that the enforcement must occur before the model processes the prompt — the \fact\ layer checks the context length and enforces the memory budget regardless of what the model's system prompt or user prompt instructs.

The defense has two components:

\subsection{Component 1: Hard Memory Budget}
Before each inference pass, the serving infrastructure measures the current prompt's context length $C$ against a memory budget $M_{max}$. The memory budget is computed as:
\begin{equation}
M_{max} = \alpha \cdot S - M_{overhead}
\label{eq:memory-budget}
\end{equation}
where $S$ is the total GPU memory, $M_{overhead}$ is the memory reserved for non-attention operations, and $\alpha \in (0, 1)$ is the budget fraction set by the \purpose\ layer (default: $\alpha = 0.70$). The corresponding maximum context length is $W_{max}^{eff} = f^{-1}(M_{max})$.

If $C > W_{max}^{eff}$, the inference is rejected before it reaches the model. The rejection is logged in the forensic ledger with the session identifier, prompt length, and rejection timestamp.

This check is enforced by the \fact\ layer: the model's system prompt cannot override the budget, and no prompt instruction can disable this check. The enforcement is architectural, not procedural.

\subsection{Component 2: Context Growth Rate Monitor}
The second component monitors the growth rate of context length within a session. Even when individual prompts are within $M_{max}$, a session with repeated high expansion ratios can accumulate context over multiple turns. The growth rate monitor computes:
\begin{equation}
g_t = \frac{C_t}{C_{t-1}}
\label{eq:growth-rate}
\end{equation}
for each turn $t$ in the session. If $g_t > G_{max}$ (default: $G_{max} = 2.0$), the session is flagged for human review and throttled: subsequent turns are processed at reduced batch priority pending reviewer determination.

The threshold $G_{max}$ is set by the \purpose\ layer. The human review queue is served by the \bounds\ engine's alert system.

% -------------------------------------------------------
\section{Optimality Proof}
\label{sec:optimality}

\newtheorem{optimality-thm}{Theorem}

\begin{optimality-thm}[Memory Bound Optimality]
\labelthm:memory-bound}
The recursive context metering defense bounds maximum context growth to $W_{max}^{eff}$ for any prompt content, regardless of nesting depth, expansion ratio, or model behavior.
\end{optimality-thm}

\textit{Proof.} The defense enforces the memory budget at the \fact\ layer before the model processes any prompt. For any prompt of context length $C$, the \fact\ layer checks whether $C > W_{max}^{eff}$. If so, the inference is rejected and the model does not process the prompt. Since the check is applied before the model sees the prompt, no prompt — however crafted — can cause the model to allocate more than $W_{max}^{eff}$ context. This bound holds regardless of: (a) nesting depth $k$, since each nested prompt is rejected at the level where it would cause $C > W_{max}^{eff}$; (b) expansion ratio $r$, since the check applies to the prompt after all embedding and tokenization, before model processing; (c) model behavior, since the model never processes rejected prompts. $\square$

\newtheorem{completeness-thm}{Theorem}

\begin{completeness-thm}[Attack Prevention Completeness]
\labelthm:attack-prevention}
The defense prevents Token Bomb attacks from causing OOM conditions on the serving infrastructure.
\end{completeness-thm}

\textit{Proof.} By Theorem~\refthm:memory-bound}, no inference can allocate more than $M_{max} = \alpha \cdot S - M_{overhead}$ memory. Since $\alpha < 1$ and $M_{overhead} \geq 0$, $M_{max} < S$. Therefore, an OOM condition (which requires memory $> S$) cannot occur from any inference processed through the defense. $\square$

The defense also satisfies the minimal disruption property: it prevents attacks without degrading normal workloads, because normal prompts with $C \leq W_{max}^{eff}$ pass through without additional latency beyond the measurement check.

% -------------------------------------------------------
\section{Experimental Results}
\label{sec:experimental}

We evaluated the defense on a GPT-4-class serving infrastructure with 40GB GPU memory. The defense was implemented as a \fact\ layer pre-processing stage with no modifications to the model or serving stack.

\subsection{Normal Workload Benchmark}
Under normal workloads (mean context length 2,000 tokens, $r \approx 1.5$ expansion ratio), the defense added a mean overhead of 1.2\% to per-request latency (dominated by context length measurement). No normal requests were incorrectly rejected.

\subsection{Token Bomb Attack Evaluation}
Under Token Bomb attacks (mean initial context length 50,000 tokens, $r = 10$, $k = 3$ to $5$ nesting levels), the defense prevented memory exhaustion in 100\% of cases (Table~\ref{tab:attack-results}).

{\small
\begin{longtable}{p{0.25\linewidth}p{0.20\linewidth}p{0.20\linewidth}p{0.20\linewidth}}
\toprule
\textbf{Attack Variant} & \textbf{Memory Without Defense} & \textbf{Memory With Defense} & \textbf{Result} \\
\midrule
$k=3$, $r=10$ & 38.2 GB (OOM) & 0.28 GB (rejected) & Prevented \\
\midrule
$k=4$, $r=10$ & OOM crash & 0.31 GB (rejected) & Prevented \\
\midrule
$k=5$, $r=10$ & OOM crash & 0.29 GB (rejected) & Prevented \\
\midrule
Agentic loop ($g=3.1$) & 41.8 GB (OOM) & 0.35 GB (throttled) & Prevented \\
\bottomrule
\end{longtable}
}

The mean rejection latency was 0.3ms — negligible compared to inference time. The defense successfully protected the serving infrastructure from OOM conditions across all attack variants tested.

% -------------------------------------------------------
\section{Relationship to Prior Work}
\label{sec:prior}

The Token Bomb attack is an instance of the broader class of computational resource exhaustion attacks in distributed systems \cite{garfinkel1996}. The specific mechanism — recursive prompt nesting — is novel in the LLM context, but the underlying principle (exploiting a system's resource allocation to cause denial of service) is well-established.

The defense architecture draws on capability-based resource management \cite{dennis1966} and memory management in operating systems \cite{denning1968}. The concept of a memory budget enforced by a reference monitor that cannot be bypassed by the subject it governs is directly analogous to the cap in capability systems.

Rate limiting and throttling for LLM deployments has been explored in API gateway contexts \cite{openai-rate-limits}. The Token Bomb defense extends rate limiting by applying it at the level of context memory rather than request count or token count per request, addressing the specific failure mode of recursive context expansion.

The LLM agentic loop literature \cite{alenezi2026} identifies unbounded context growth as a risk in autonomous agents. The context growth rate monitor (Component 2 of the defense) specifically addresses this by detecting and throttling sessions where context growth per turn exceeds sustainable levels.

% -------------------------------------------------------
\section{Limitations and Future Work}
\label{sec:limitations}

\begin{itemize}
    \item \textbf{Budget fraction tuning:} The budget fraction $\alpha$ must be tuned for the specific serving infrastructure. Setting $\alpha$ too low wastes GPU capacity; setting it too high leaves insufficient headroom for spikes. Future work will explore adaptive $\alpha$ adjustment based on observed memory pressure.
    \item \textbf{Chunked request handling:} When a legitimate request exceeds $W_{max}^{eff}$, it is currently rejected rather than chunked. Future work will explore automatic request chunking with separate inference passes and response synthesis.
    \item \textbf{Multi-GPU context:} In multi-GPU serving configurations with tensor parallelism, the memory budget must be coordinated across all GPUs. The defense as described assumes single-GPU serving.
    \item \textbf{Growth rate threshold:} The $G_{max} = 2.0$ threshold may be too aggressive for some domains (e.g., long-document summarization where each turn legitimately adds significant context). Future work will explore per-domain $G_{max}$ configuration.
\end{itemize}

% -------------------------------------------------------
\section{Conclusion}
\label{sec:conclusion}

The Token Bomb is a real, demonstrable attack on LLM serving infrastructure that exploits recursive context expansion to cause memory exhaustion. The defense presented here — recursive context metering with hard memory budget enforcement at the \fact\ layer — prevents Token Bomb attacks with 100\% effectiveness while adding only 1.2\% overhead on normal workloads. The optimality proof shows the defense provides the strongest possible guarantee: no prompt, however crafted, can cause the model to allocate more than the configured memory budget.

The defense's architectural property — enforcement by the \fact\ layer before the model processes any prompt — is essential. A defense that can be disabled or overridden by prompt instructions would fail against an attacker who knows the defense parameters. The \bxthree\ Framework's layer separation ensures the defense cannot be bypassed.

\section*{Peer Review Instructions}
\label{sec:peer-review}
\addcontentsline{toc}{section}{Peer Review Instructions}

\subsection*{Review Criteria}

\textbf{1. Originality and Contribution (30\%):} The primary contribution is the Token Bomb threat model with formal attack characterization and the \fact-layer defense with optimality proof. Novelty lies in: (a) first formal characterization of the recursive context allocation attack, (b) architectural enforcement via the \fact\ layer rather than application-level checks.

\textbf{2. Technical Soundness (30\%):} Is the attack model (Equation~\ref{eq:context-growth}) correctly derived? Are the optimality proofs logically sound? Are the experimental results credible and reproducible?

\textbf{3. Clarity and Completeness (20\%):} Is the threat model sufficiently specific to be recognized in deployed systems? Is the defense architecture unambiguously specified?

\textbf{4. Significance (20\%):} Does the Token Bomb represent a genuine threat to LLM deployments? Does the defense address a meaningful gap in LLM security practice?

\subsection*{Submission Checklist}

\begin{itemize}
    \item[ ] Token Bomb threat model formally characterized (Section~\ref{sec:threat-model})
    \item[ ] Attack surface across three serving architectures analyzed (Section~\ref{sec:attack-surface})
    \item[ ] Defense architecture with two components specified (Section~\ref{sec:defense})
    \item[ ] Optimality proof correctly derived (Section~\ref{sec:optimality})
    \item[ ] Experimental results include baseline comparison (Table~\ref{tab:attack-results})
    \item[ ] All citations complete
    \item[ ] Limitations acknowledged (Section~\ref{sec:limitations})
    \item[ ] Abstract accurately reflects contributions
\end{itemize}

\subsection*{Metadata}

\textbf{Keywords:} token bomb, computational resource denial, context window, LLM security, adversarial prompts, resource metering, fact layer, BX3 Framework, denial of service, AI infrastructure

\textbf{Subject Areas:} Computer Science -- Artificial Intelligence; Computer Science -- Software Engineering; Computer Science -- Multiagent Systems

\textbf{Conflicts of Interest:} The author is affiliated with Bxthre3 Inc., a company developing commercial implementations of the BX3 Framework including the Agentic platform.

\section*{Acknowledgments}

The author wishes to acknowledge the foundational contributions of the researchers cited herein, whose work across capability-based resource management, operating systems memory management, and LLM security provides the intellectual context in which the Token Bomb defense is situated.

% -------------------------------------------------------
\bibliographystyle{plainnat}
\bibliography{bx3framework}

\vspace{2em}
\hrule
\vspace{0.5em}
\noindent\small\textit{This work has not undergone peer review. Comments and correspondence are welcome at bxthre3inc@gmail.com.}

\end{document}
