AI / ML

DPO (Direct Preference Optimization)

A simplified alternative to RLHF that aligns LLM outputs with human preferences without training a separate reward model or using reinforcement learning. DPO directly optimizes a policy using pairs of preferred and dispreferred outputs, making it computationally cheaper and more stable than RLHF's multi-stage pipeline. Widely adopted in 2024-2025 for fine-tuning open-source models.

ID: dpo
Alias: Direct Preference Optimization

Plain meaning

Start with the shortest useful explanation before going deeper.

In plain terms: RLHF needs a multi-stage pipeline, first training a reward model from human comparisons and then running reinforcement learning against it. DPO collapses this into a single supervised step: show the model pairs of answers where humans marked one as better, and train it to raise the probability of the preferred answer and lower the probability of the rejected one. The result is cheaper, simpler, and more stable alignment.

Mental model

Use the quick analogy first so the term is easier to reason about when you meet it in code, docs, or prompts.

Think of RLHF as hiring a judge (the reward model) and then coaching the model to please that judge with reinforcement learning. DPO fires the judge: given a prompt and two answers, one preferred and one rejected, it nudges the model to make the preferred answer more likely and the rejected one less likely, measured relative to a frozen reference copy of the model. In effect, the language model becomes its own implicit reward model.
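To make that concrete, here is a minimal sketch of the DPO loss in PyTorch, assuming summed per-response log-probabilities have already been computed for both the trained policy and the frozen reference model (the function and argument names are illustrative, not from any specific library):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities, shape (batch,).
    beta controls how far the policy may drift from the reference model.
    """
    # Implicit "reward" of each response: log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry preference likelihood: push chosen above rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Libraries such as Hugging Face TRL ship this loss, plus the log-probability bookkeeping, as a ready-made DPOTrainer.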

Technical context

Place the term inside its Solana layer so the definition is easier to reason about.

This term lives in the AI layer of the stack: LLMs, RAG, embeddings, inference, and agent-facing primitives. Within that layer, DPO belongs to the training and alignment stage that shapes a model's behavior before it ever serves inference.

Why builders care

Turn the term from vocabulary into something operational for product and engineering work.

Builders care because DPO puts preference alignment within reach of small teams: all it requires is a dataset of prompts, each with a chosen and a rejected response, plus a standard fine-tuning loop, rather than RLHF's reward-model and PPO infrastructure.
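A hypothetical preference record in the prompt/chosen/rejected shape that common DPO implementations (for example TRL's DPOTrainer) expect; the rows here are invented for illustration:

```python
preference_pairs = [
    {
        "prompt": "Explain what a Solana PDA is in one sentence.",
        "chosen": "A PDA is a deterministic, program-controlled address "
                  "derived from seeds and a program ID, with no private key.",
        "rejected": "A PDA is just a normal wallet address.",
    },
    # ...more human-labeled pairs
]
```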

AI handoff

Use this compact block when you want to give an agent or assistant grounded context without dumping the entire page.

DPO (Direct Preference Optimization) (dpo)
Category: AI / ML
Definition: A simplified alternative to RLHF that aligns LLM outputs with human preferences without training a separate reward model or using reinforcement learning. DPO directly optimizes a policy using pairs of preferred and dispreferred outputs, making it computationally cheaper and more stable than RLHF's multi-stage pipeline. Widely adopted in 2024-2025 for fine-tuning open-source models.
Aliases: Direct Preference Optimization
Related: RLHF (Reinforcement Learning from Human Feedback), Fine-Tuning, Training (ML)

Concept graph

See the term as part of a network, not a dead-end definition.

These branches show which concepts this term touches directly and what sits one layer beyond them.

Branch

RLHF (Reinforcement Learning from Human Feedback)

A training technique that aligns LLM outputs with human preferences. Process: (1) train a reward model from human comparisons of outputs, (2) use reinforcement learning (PPO) to optimize the LLM against the reward model. RLHF makes models more helpful, harmless, and honest. Used by Claude, ChatGPT, and other assistants. Alternatives include DPO (Direct Preference Optimization) and Constitutional AI.
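For contrast with DPO, a sketch of stage (1) of RLHF, the pairwise loss used to train the separate reward model (a hypothetical helper; real pipelines add a learned scoring head on top of a base model):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry) loss for RLHF's reward model.

    reward_chosen / reward_rejected: scalar scores per pair, shape (batch,).
    The reward model learns to score human-preferred responses higher;
    a separate PPO stage then optimizes the LLM against these scores.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```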

Branch

Fine-Tuning

The process of further training a pre-trained model on a specialized dataset to improve performance on specific tasks. Fine-tuning adapts a foundation model's weights using domain-specific data (e.g., Solana documentation, smart contract code). Techniques include full fine-tuning, LoRA (Low-Rank Adaptation), and QLoRA. Fine-tuned models can outperform general models on narrow tasks.
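A minimal LoRA setup, assuming the Hugging Face transformers and peft libraries; the base model and target modules are illustrative choices that vary by architecture:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# LoRA trains small low-rank adapter matrices instead of all weights.
config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```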

Branch

Training (ML)

The process of optimizing a model's parameters by exposing it to data and adjusting weights to minimize a loss function. Pre-training on large datasets creates foundation models. Training LLMs requires massive compute (thousands of GPUs, weeks/months). Training data quality, diversity, and size directly impact model capabilities. Distinguished from fine-tuning (smaller scale, specific domain).
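The loop the definition describes, reduced to a PyTorch sketch (model, dataloader, and hyperparameters are placeholders):

```python
import torch

def train(model, dataloader, epochs=1, lr=1e-4):
    """Minimal training loop: repeatedly adjust weights to reduce the loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            loss = model(**batch).loss  # assumes an HF-style model that returns its loss
            loss.backward()             # gradients of the loss w.r.t. all weights
            optimizer.step()            # nudge weights downhill along the gradient
            optimizer.zero_grad()
    return model
```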

Next concepts to explore

Keep the learning chain moving instead of stopping at one definition.

These are the next concepts worth opening if you want this term to make more sense inside a real Solana workflow.

AI / ML

RLHF (Reinforcement Learning from Human Feedback)

The multi-stage alignment pipeline that DPO simplifies: train a reward model from human comparisons, then optimize the LLM against it with PPO (full definition in the concept graph above).

AI / ML

Fine-Tuning

Further training of a pre-trained model on specialized data to adapt it to narrow tasks, via full fine-tuning, LoRA, or QLoRA (full definition above).

AI / ML

Training (ML)

Optimizing a model's parameters against a loss function; at pre-training scale this creates foundation models, as distinct from fine-tuning (full definition above).

AI / ML

Embedding

A dense vector representation of text (or other data) in a continuous high-dimensional space where semantically similar items are closer together. Embedding models (OpenAI ada-002, Cohere, sentence-transformers) convert text to vectors of 256-3072 dimensions. Used in RAG for semantic search, in recommendation systems, and for clustering. Stored and queried via vector databases.
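A small usage sketch, assuming the sentence-transformers library; the model name is one common public checkpoint, not a requirement:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional vectors
texts = ["What is a Solana PDA?", "program derived address", "token airdrop"]
vectors = model.encode(texts)  # shape: (3, 384)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related texts score higher than unrelated ones.
print(cosine(vectors[0], vectors[1]))  # related: PDA question vs. definition
print(cosine(vectors[0], vectors[2]))  # unrelated: PDA vs. airdrop
```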


More in category

Stay in the same layer and keep building context.

These entries live beside the current term and help the page feel like part of a larger knowledge graph instead of a dead end.

AI / ML

LLM (Large Language Model)

A neural network trained on vast text corpora to understand and generate human language. LLMs (GPT-4, Claude, Llama, Gemini) use transformer architectures with billions of parameters. They power chatbots, code generation, summarization, and reasoning tasks. In blockchain development, LLMs assist with smart contract writing, audit review, documentation, and code explanation.

AI / ML

Transformer

The neural network architecture underlying modern LLMs, introduced in 'Attention Is All You Need' (2017). Transformers use self-attention mechanisms to process input sequences in parallel (unlike recurrent networks). Key components: multi-head attention, positional encoding, feedforward layers, and layer normalization. Variants include encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5).
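A skeletal decoder-style block in PyTorch wiring together the components the definition names (dimensions are arbitrary; real models stack many such blocks and add positional encoding at the input):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention with a residual connection.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # Position-wise feedforward with a residual connection.
        return self.norm2(x + self.ff(x))
```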

AI / ML

Attention Mechanism

A neural network component that allows models to weigh the relevance of different parts of the input when producing output. Self-attention computes query-key-value dot products across all positions, enabling each token to 'attend' to every other token. Multi-head attention runs multiple attention functions in parallel. Attention is O(n²) in sequence length, driving context window research.
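The same idea written out in NumPy for a single attention head; the (n, n) score matrix is where the O(n²) cost in sequence length comes from:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.

    X: (n, d) token representations; Wq/Wk/Wv: (d, d_k) projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n): every token vs. every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                        # weighted mix of value vectors
```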

AI / ML

Foundation Model

A large AI model trained on broad data that can be adapted for many downstream tasks. Foundation models (GPT-4, Claude, Llama 3, Gemini) are pre-trained on internet-scale text/code and can be fine-tuned, prompted, or used via APIs for specific applications. The term emphasizes that one base model serves as the foundation for diverse use cases rather than training task-specific models.