AI Cost Glossary

Plain-English definitions of the terms used across our calculators.

Token

The unit that models read and bill by: roughly ¾ of an English word, or about 4 characters. Prices are usually quoted per million tokens.
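
A minimal sketch of the arithmetic, assuming the ~4-characters-per-token rule of thumb (real tokenizers vary by model and language) and a hypothetical price of $3.00 per million tokens. The function names are illustrative and not part of any library.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return round(len(text) / 4)


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Convert a token count to dollars, given a price quoted per million tokens."""
    return tokens * price_per_million / 1_000_000


# Usage with a hypothetical $3.00 per million tokens price.
text = "x" * 2_000                   # a 2,000-character document
tokens = estimate_tokens(text)       # about 500 tokens
print(tokens, cost_usd(tokens, 3.00))
```

At that price, a 2,000-character document (about 500 tokens) costs roughly $0.0015 to send.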

Input vs output tokens

Input (prompt) tokens are what you send; output (completion) tokens are what the model generates. Per-token output prices are usually 2–5× higher than input prices.
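
Because the two directions are priced separately, a request's cost is a weighted sum. This sketch uses hypothetical prices of $1 per million input tokens and $4 per million output tokens (a 4× ratio); `request_cost` is an illustrative name.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, billing input and output tokens separately."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices: $1 per million input tokens, $4 per million output (4x).
print(request_cost(1_000, 500, 1.00, 4.00))
```

Here the 500 output tokens are a third of the tokens but two thirds of the $0.003 total, which is why trimming response length often saves more than trimming prompts.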

Context window

The maximum number of tokens (input + output) a model can handle in one request. Larger windows let you pass longer documents, but every token you fill them with is billed, so long prompts raise cost.
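
Since input and output share one window, a quick pre-flight check is to add the prompt size to the output budget you reserve. The 128,000-token window below is a hypothetical value and the function names are illustrative. Many models also cap output length separately, so treat the remaining budget as an upper bound.

```python
def fits_in_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in one request."""
    return input_tokens + max_output_tokens <= context_window


def max_output_budget(input_tokens: int, context_window: int) -> int:
    """Tokens left for the model's reply after the prompt (never negative)."""
    return max(0, context_window - input_tokens)


WINDOW = 128_000  # hypothetical context window size
print(fits_in_context(120_000, 4_000, WINDOW))  # True: 124,000 tokens fit
print(fits_in_context(126_000, 4_000, WINDOW))  # False: 130,000 tokens do not
print(max_output_budget(126_000, WINDOW))
```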

Prompt caching

Reusing a stored prompt prefix so repeated context is billed at a discounted cached rate. See prompt caching explained.
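
A simplified cost model for many requests that share one long prefix. It assumes the first request pays the normal input rate and every later request hits the cache; real providers may add a cache-write surcharge and expire caches after a while, and both are ignored here. The prices ($3 per million input, $0.30 per million cached) and the function name are hypothetical.

```python
def cached_prompt_cost(prefix_tokens: int, new_tokens: int, requests: int,
                       input_price_per_m: float, cached_price_per_m: float) -> float:
    """Total input cost for `requests` calls that share the same prompt prefix.

    Simplifying assumptions: call 1 pays full price for the prefix, later calls
    pay the cached rate for it; no cache-write surcharge, no cache expiry.
    """
    if requests < 1:
        return 0.0
    first = (prefix_tokens + new_tokens) * input_price_per_m
    later = (requests - 1) * (prefix_tokens * cached_price_per_m
                              + new_tokens * input_price_per_m)
    return (first + later) / 1_000_000


# 10,000-token shared prefix, 500 new tokens per call, 100 calls.
with_cache = cached_prompt_cost(10_000, 500, 100, 3.00, 0.30)
without_cache = cached_prompt_cost(10_000, 500, 100, 3.00, 3.00)
print(f"${with_cache:.3f} with caching vs ${without_cache:.2f} without")
```

Under these assumptions caching cuts the input bill from $3.15 to about $0.48.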

Batch API

An asynchronous mode that trades latency for roughly a 50% discount, making it a good fit for non-urgent bulk jobs.
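
A sketch of the comparison for a bulk job, assuming a flat 50% batch discount applied to the whole job; actual discounts vary by provider. The per-request cost of $0.003 and the function name are hypothetical.

```python
def bulk_job_cost(num_requests: int, cost_per_request: float,
                  use_batch: bool, batch_discount: float = 0.5) -> float:
    """Total job cost, optionally applying a flat batch discount (assumed 50%)."""
    total = num_requests * cost_per_request
    return total * (1 - batch_discount) if use_batch else total


standard = bulk_job_cost(10_000, 0.003, use_batch=False)
batched = bulk_job_cost(10_000, 0.003, use_batch=True)
print(f"standard ${standard:.2f}, batch ${batched:.2f}")
```

For 10,000 requests this halves a $30 job to $15, at the price of waiting for results.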

Embedding

A numeric vector representing text, used for semantic search and retrieval. Embedding models are far cheaper than generation models.
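
Semantic search typically compares vectors by cosine similarity and returns the closest documents. This sketch uses toy 3-dimensional vectors as a stand-in for real embeddings (which come from an embedding model and have far more dimensions); the function names are illustrative.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], docs: List[Sequence[float]], k: int) -> List[int]:
    """Indices of the k documents most similar to the query, best first."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine_similarity(query, docs[i]),
                    reverse=True)
    return ranked[:k]


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.0, 0.7]]
print(top_k([1.0, 0.0, 0.0], docs, k=2))
```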

RAG (retrieval-augmented generation)

Searching a knowledge base for relevant chunks and passing them to the model as context before it answers. Every retrieved chunk adds input tokens to each query. See the RAG cost calculator.
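
A per-query cost sketch showing how retrieved context inflates the input side. It assumes fixed-size chunks and ignores the (usually small) cost of embedding the query; all sizes and prices ($1 per million input, $4 per million output) are hypothetical.

```python
def rag_query_cost(question_tokens: int, chunks: int, tokens_per_chunk: int,
                   answer_tokens: int, input_price_per_m: float,
                   output_price_per_m: float) -> float:
    """Generation cost of one RAG query, assuming equal-size retrieved chunks."""
    context_tokens = chunks * tokens_per_chunk
    input_tokens = question_tokens + context_tokens
    return (input_tokens * input_price_per_m
            + answer_tokens * output_price_per_m) / 1_000_000


# 50-token question, 5 chunks of 400 tokens, 300-token answer.
print(rag_query_cost(50, 5, 400, 300, 1.00, 4.00))
```

In this example the retrieved context is 2,000 of the 2,050 input tokens, so the number and size of chunks, not the question, drive the input bill.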

Reranker

A model that re-orders retrieved chunks by relevance, letting you pass fewer chunks to the model without losing answer quality.
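
The pattern is: retrieve a wide set of candidates cheaply, score each against the query, and keep only the best few for the prompt. In this sketch, `score` is a hypothetical stand-in for a reranker model; the toy word-overlap scorer is only there so the example runs.

```python
from typing import Callable, List


def rerank_and_trim(query: str, chunks: List[str],
                    score: Callable[[str, str], float], keep: int) -> List[str]:
    """Sort chunks by relevance score (highest first) and keep the top `keep`."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:keep]


def word_overlap(query: str, chunk: str) -> float:
    """Toy scorer: number of shared words. A real reranker is a trained model."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


candidates = ["cats purr", "dogs bark loudly", "cat food prices"]
print(rerank_and_trim("dogs bark", candidates, word_overlap, keep=1))
```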

Agent

An LLM system that completes multi-step tasks using planning, tool calls and retries, which typically means many model calls per task.
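
A rough per-task estimate. It assumes each step re-sends the accumulated transcript, so input grows by a fixed amount per step; real agents vary, and all sizes and prices ($1 per million input, $4 per million output) are hypothetical.

```python
def agent_task_cost(steps: int, base_input_tokens: int, tokens_added_per_step: int,
                    output_tokens_per_step: int, input_price_per_m: float,
                    output_price_per_m: float) -> float:
    """Cost of one agent task where each model call re-sends a growing transcript."""
    total = 0.0
    for step in range(steps):
        # Assumption: the transcript grows by a fixed number of tokens each step.
        input_tokens = base_input_tokens + step * tokens_added_per_step
        total += (input_tokens * input_price_per_m
                  + output_tokens_per_step * output_price_per_m)
    return total / 1_000_000


print(agent_task_cost(3, 1_000, 500, 200, 1.00, 4.00))
```

Because input grows with every step, cost rises faster than linearly with the number of steps, which is why long agent runs get expensive.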

Failure-adjusted cost

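Nominal cost per attempt divided by the success rate, reflecting that you also pay for failed attempts. At a 50% success rate, each successful result effectively costs twice the nominal price.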
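
If you retry until an attempt succeeds and attempts succeed independently with probability p, the expected number of attempts is 1/p, which gives the formula above. A minimal sketch (the function name is illustrative):

```python
def failure_adjusted_cost(nominal_cost: float, success_rate: float) -> float:
    """Expected cost per successful result, assuming independent retries until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return nominal_cost / success_rate


# A hypothetical $0.02 attempt that succeeds 80% of the time.
print(failure_adjusted_cost(0.02, 0.8))
```

Here the effective cost is $0.025 per success, 25% above the nominal price.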

Self-hosting

Running an open-weight model on your own GPUs instead of paying per token for an API. See API vs self-hosted.
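
A first-pass break-even check, assuming a fixed monthly GPU cost, a single blended per-token API price, and hardware that can actually serve the volume. It ignores engineering time, utilization and scaling, all of which matter in practice. The $2,000 per month and $2 per million tokens figures are hypothetical.

```python
def break_even_tokens_per_month(gpu_cost_per_month: float, api_price_per_m: float) -> float:
    """Monthly token volume at which API spend equals the fixed GPU cost."""
    return gpu_cost_per_month / api_price_per_m * 1_000_000


def cheaper_to_self_host(tokens_per_month: float, gpu_cost_per_month: float,
                         api_price_per_m: float) -> bool:
    """True if the API bill for this volume would exceed the fixed GPU cost."""
    api_cost = tokens_per_month * api_price_per_m / 1_000_000
    return api_cost > gpu_cost_per_month


print(break_even_tokens_per_month(2_000, 2.00))        # tokens per month
print(cheaper_to_self_host(2_000_000_000, 2_000, 2.00))
```

Under these assumptions the break-even point is one billion tokens per month; below that volume, the API is cheaper.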