Plain meaning
Start with the shortest useful explanation before going deeper.
The process of running a trained model on new inputs to generate predictions or outputs. Inference is the 'using' phase (vs. training). Inference cost depends on model size, input/output token count, and hardware (GPUs/TPUs). API providers (Anthropic, OpenAI) charge per token for inference. On-device inference (llama.cpp, GGUF) runs locally without API calls.