LLM Billing Exposed: How Tokenization Obscures True Costs and What to Do About It

A technical review of recent research revealing how LLM providers might overcharge users through tokenization, why transparency alone isn't enough, and how per-character billing could close the loophole.

kekePower · 7 min read

The Hidden Costs of LLM-as-a-Service

Most cloud-based large language model (LLM) APIs bill users by the number of tokens processed or generated. At first glance, this seems reasonable: tokens are a natural unit of work for LLMs, and tokenization is well-understood—at least for those who control the model. But the underlying process is not transparent to users. As detailed in the recent paper "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives", this creates an opportunity for providers to overcharge, whether accidentally or deliberately, by exploiting the gap between what gets generated and what gets reported.

How Tokenization Enables Informational Asymmetry

Tokenization—the process of splitting input and output strings into discrete units—differs between models, and the granularity of tokens can vary wildly. Providers, who own both the model and its tokenizer, are the only ones who know exactly how a prompt or response was segmented internally. This puts users at a disadvantage: they must trust the provider’s meter, with no way to independently audit the reported token count. The paper frames this as a classic moral hazard: the party with more information (the provider) has the incentive and means to manipulate the metric that determines billing.

Principal–Agent Model: Incentives for Overcharging

The authors formalize the problem as a principal–agent scenario:

  • Principal (user): Wants useful text, pays per reported token count.
  • Agent (provider): Sees the full tokenization and output, and decides what token sequence (and thus token count) to report.

The provider’s profit grows with the number of tokens billed, offset only by a small cost for generating longer sequences. The incentive is clear: if the provider can split outputs into more (shorter) tokens, they can increase revenue, even if the actual semantic content remains unchanged.

A concrete example from the paper illustrates the potential scale: the string "San Diego" can be tokenized as either two tokens (["San", "Diego"]) or nine (["S", "a", "n", " ", "D", "i", "e", "g", "o"]), depending on the chosen vocabulary and segmentation. That’s a 4.5x difference in billable tokens for the exact same output.
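To make the billing gap concrete, here is a minimal Python sketch of the example above; the per-token price is a made-up placeholder, not any provider’s real rate.

    # Minimal sketch: the same nine-character output, billed two different ways.
    # The per-token price below is a placeholder, not a real provider rate.
    honest = ["San", "Diego"]          # 2 tokens, as in the paper's example
    inflated = list("San Diego")       # 9 single-character "tokens"
    price_per_token = 1e-5             # hypothetical $ per token

    print(len(honest) * price_per_token)    # honest bill
    print(len(inflated) * price_per_token)  # 4.5x more tokens, 4.5x the charge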

Real-World Tokenization Discrepancies (e.g. 'San Diego' Example)

To move beyond anecdote, the authors systematically tested how badly a user could be overcharged if a provider maximally abused tokenization. If every character in an output were reported as its own token, users would be on the hook for a token count far exceeding what honest tokenization would yield.

Table 1 of the paper summarizes results across five popular open-weight models (Llama-3 1B and 3B, Gemma-3 1B and 4B, and Ministral-8B-Instruct-2410): the “character-per-token” strategy would overbill users by a factor of roughly 3.1× to 3.5× compared to normal operation. This isn’t a corner case; it’s a systematic vulnerability, given the right (or wrong) incentives.
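A rough way to estimate this factor on your own workload is to compare honest token counts with a one-token-per-character report, as in the sketch below. GPT-2’s tokenizer is used here only because it is small and publicly available; the paper’s models would yield their own ratios, and real outputs should replace the placeholder strings.

    # Rough estimate of the worst-case overbilling factor: honest token count
    # vs. a "one token per character" report. GPT-2's tokenizer is a stand-in;
    # the paper's models (Llama-3, Gemma-3, Ministral) give their own ratios.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    outputs = [
        "San Diego is a city in Southern California.",
        "Tokenization choices directly change the bill.",
    ]  # placeholders for real model outputs

    honest_tokens = sum(len(tok.encode(text)) for text in outputs)
    char_tokens = sum(len(text) for text in outputs)  # every character billed as its own token
    print(char_tokens / honest_tokens)  # overbilling factor on this toy sample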

Experimental Setup and Model Selection

The authors’ experiments were conducted on a dataset of 400 English prompts from the LMSYS Chatbot Arena. They tested five models—Llama-3 (1B and 3B), Gemma-3 (1B and 4B), and Ministral-8B-Instruct-2410—using dual A100 80 GB GPUs (one GPU per run). Two main experimental conditions were explored:

  • Token-splitting baseline: Model runs at temperature 1.0, with every character split into its own token—a worst-case for overbilling.
  • Heuristic attack (Algorithm 1): Model runs at higher temperature (1.3), applying a practical algorithm to maximize plausible token splitting without detection. Algorithm 1 always splits the token with the highest vocabulary index, because BPE indices correlate with token length.

This setup provides a realistic, reproducible benchmark for quantifying the overbilling risk under both naive and sophisticated attack strategies.

The proof-of-concept implementation is open-source: https://github.com/Networks-Learning/token-pricing.

Next-Token Probabilities and Longest Plausible Tokenization

A natural countermeasure is transparency: if the provider is required to publish the next-token probabilities at each generation step, users (or auditors) could, in principle, verify the plausibility of the reported token sequence. But the paper demonstrates that this is not a panacea.

The new problem for a dishonest provider is to find the longest plausible tokenization—that is, the longest possible token sequence that is still consistent with the model’s output probabilities and could plausibly have been generated. This is not just a bookkeeping challenge; the paper proves it is NP-Hard (by reduction from the Hamiltonian Path problem). So, while transparency raises the bar for overbilling, it doesn’t close the loophole entirely.
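To make the audit idea concrete, here is a minimal sketch of a top-p plausibility check; next_token_probs is a hypothetical stand-in for the per-step distributions a transparent provider would publish, and the paper’s actual verification procedure may differ.

    # Minimal sketch of a top-p plausibility audit, assuming the provider
    # publishes next-token probabilities at every generation step.
    # next_token_probs(prefix) is a hypothetical stand-in for that data.
    def in_top_p(probs, token, top_p=0.99):
        # Add tokens in descending probability until the mass reaches top_p,
        # then check whether the reported token made it into that nucleus.
        nucleus, mass = set(), 0.0
        for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
            nucleus.add(tok)
            mass += p
            if mass >= top_p:
                break
        return token in nucleus

    def sequence_is_plausible(reported_tokens, next_token_probs, top_p=0.99):
        prefix = ""
        for token in reported_tokens:
            if not in_top_p(next_token_probs(prefix), token, top_p):
                return False  # this token could not plausibly have been sampled
            prefix += token
        return True

Passing such a check only means the reported sequence could have been generated; a dishonest provider can still search for the longest sequence that passes, which is exactly the NP-hard problem above.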

Algorithmic Cheating: Practical Heuristics and Empirical Overbilling

Despite the theoretical hardness, the authors introduce a practical heuristic, Algorithm 1, that works surprisingly well. The algorithm iteratively splits high-index tokens (i.e., those that are likely to be splittable without raising suspicion) until the resulting sequence fails a plausibility check; a simplified sketch of this loop follows the results below.

  • Complexity: The algorithm runs in O(m (log m + σₘₐₓ)), where m is the number of splits and σₘₐₓ is the maximum token length.
  • Empirical results: Even under the scrutiny of next-token probability checks, this heuristic enabled overbilling by up to 13% on Ministral-8B-Instruct-2410 and 9.5% on Llama-3 (for top-p = 0.99, temperature = 1.3). The NP-hardness proof extends verbatim to top-k sampling and to any fixed probability threshold.
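Here is a heavily simplified sketch of that greedy loop. The paper’s Algorithm 1 differs in its details; vocab_index, split_in_two, and sequence_is_plausible are hypothetical helpers standing in for the tokenizer’s vocabulary lookup, a single BPE-merge reversal, and the top-p audit sketched earlier.

    # Heavily simplified sketch of the greedy splitting loop; the paper's
    # Algorithm 1 differs in detail. vocab_index, split_in_two, and
    # sequence_is_plausible are hypothetical helpers.
    def inflate(tokens, vocab_index, split_in_two, sequence_is_plausible):
        while True:
            # Consider only tokens that can still be split into two sub-tokens.
            candidates = [t for t in tokens if split_in_two(t) is not None]
            if not candidates:
                return tokens
            # Split the candidate with the highest vocabulary index, since
            # high BPE indices correlate with longer, more splittable tokens.
            target = max(candidates, key=vocab_index)
            i = tokens.index(target)
            proposal = tokens[:i] + list(split_in_two(target)) + tokens[i + 1:]
            if not sequence_is_plausible(proposal):
                return tokens   # stop just before the audit would flag the report
            tokens = proposal   # accept the longer, more expensive report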

So, while fully auditable billing reduces the maximum theoretical exploit, it does not eliminate the provider’s incentive—or ability—to cheat.

Theoretical Proof and Simple Implementation

To close the exploit, the authors propose a fundamental change: bill users based on number of characters, not tokens. They prove an incentive-compatibility theorem:

Any additive pricing scheme that removes the incentive to lie must charge linearly by characters.

With this approach, the provider has no reason to manipulate tokenization, since every character is worth the same regardless of how it’s split. In fact, the provider is now incentivized to use shorter token sequences (i.e., better compression), which cuts GPU time, energy consumption, and latency per request.
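For intuition, paraphrasing the paper’s setup rather than quoting it: an additive scheme charges

price(t₁, …, tₘ) = p(t₁) + p(t₂) + … + p(tₘ)

for the reported token sequence. The paper shows that removing any incentive to misreport forces p(t) = r · |t| for every token t, where |t| is the token’s length in characters, so the total collapses to r × (total output characters) no matter how the string is segmented.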

Practical Guidance: Setting Per-Character Rates

Transitioning from per-token to per-character billing is straightforward in practice. The paper suggests setting the per-character price as:

per_character_price = per_token_price / avg_characters_per_token

Where:

  • per_token_price (r₀): The current price per token.
  • avg_characters_per_token (cₚₜ): The empirical average number of characters per token, measured across real workloads (≈ 4.2–4.5 in the authors’ experiments). In practice you’d count UTF-8 bytes or code points, and multi-byte languages need careful handling.

This preserves the expected cost for most users, while eliminating the incentive (and technical means) for providers to overcharge via token splitting.
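As a worked example (all prices below are placeholders, not any provider’s real rates):

    # Worked example of the conversion; all prices are placeholders,
    # not any real provider's rates.
    per_token_price = 2e-6        # r0: hypothetical price per output token
    avg_chars_per_token = 4.3     # c_pt: measured on your own workload

    per_character_price = per_token_price / avg_chars_per_token

    output_text = "San Diego is a city in Southern California."
    bill = per_character_price * len(output_text)
    # The bill depends only on len(output_text), so no tokenization trick changes it.
    print(round(bill, 10))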

Scope of Study and Potential for Broader Exploits

The analysis is comprehensive within its scope, but several limitations remain:

  • The study focuses exclusively on additive pricing (i.e., each unit costs the same); it does not address contracts based on output quality, latency, or performance.
  • The method assumes that checking the plausibility of a token sequence against the model’s output probabilities is computationally cheap. More sophisticated attacks could require multi-pass or adversarial evaluation.
  • The paper only investigates misreporting of token counts; real-world providers might also misreport other metadata, such as model version or even next-token distributions.
  • The use of prompts from LMSYS Chatbot Arena is a reasonable baseline, but may not capture the diversity of real-world workloads across different languages, domains, or usage patterns.

All of these open questions highlight the need for further research into not only billing metrics but also the broader space of “hidden-action” exploits in LLM-as-a-service models.

Implications for Developers, Providers, and the Ecosystem

For practitioners and decision-makers, the findings have immediate implications:

  1. Developers should demand auditable usage meters, or—better yet—insist on per-character billing for LLM services. Relying on the provider’s opaque tokenization is now a known risk.
  2. Providers have a reputational and legal incentive to adopt simpler, fairer billing schemes before regulatory scrutiny increases. The per-character approach is provably incentive-compatible and easy to implement.
  3. Researchers now have a formal framework for analyzing hidden-action vulnerabilities in LLM economics, opening the door to further work on trustworthy metering, contract design, and adversarial billing.

Conclusion

Per-token billing for LLM APIs is not just a technical oddity—it’s a structural bug that creates real opportunities for overcharging. As shown in "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives", even transparency and probabilistic auditing can’t fully eliminate the exploit. The fix is simple and robust: bill per character, and take tokenization off the billing critical path. This better aligns incentives for both providers and users, and moves the ecosystem toward fairer, more auditable LLM-as-a-service contracts.

Have you used, or are you using, a cloud provider for your LLM needs? Comment down below!

Tags: LLM · Tokenization · Cloud Billing · Incentive Design · Arxiv
