AI / ML

AI Alignment

The practice of ensuring AI systems behave according to human intentions and values—being helpful, harmless, and honest. Alignment encompasses training-time techniques (RLHF, Constitutional AI, DPO), inference-time guardrails, and evaluation through red teaming. As models become more capable, alignment becomes critical to prevent harmful content generation or manipulation by bad actors.

ID: ai-alignment · Alias: AI Safety

Quick read

Start with the shortest, most useful explanation before going deeper.

The practice of ensuring AI systems behave according to human intentions and values—being helpful, harmless, and honest. Alignment encompasses training-time techniques (RLHF, Constitutional AI, DPO), inference-time guardrails, and evaluation through red teaming. As models become more capable, alignment becomes critical to prevent harmful content generation or manipulation by bad actors.
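
As a concrete illustration of the inference-time guardrails mentioned above, here is a hypothetical minimal sketch in Python; moderation_score stands in for a real safety classifier and is not an actual API.

def moderation_score(text: str) -> float:
    # Placeholder: a real system would call a trained safety classifier
    # (for example a Constitutional Classifier) and return a violation probability.
    blocked_markers = ["build a weapon", "steal credentials"]
    return 1.0 if any(m in text.lower() for m in blocked_markers) else 0.0

def guarded_reply(model_response: str, threshold: float = 0.5) -> str:
    # Block the response before it reaches the user if the classifier flags it.
    if moderation_score(model_response) >= threshold:
        return "I can't help with that request."
    return model_response

print(guarded_reply("Here is a summary of the document you shared."))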

Mental model

Use the short analogy first to reason better about the term when it shows up in code, docs, or prompts.

Think of this as one piece of the context or inference stack used in products built with agents or LLMs.

Technical context

Place the term within the layer of the Solana stack where it lives to reason about it better.

LLMs, RAG, embeddings, inference, and agent-oriented primitives.

Why it matters to a builder

It turns the term from vocabulary into something operational for product and engineering.

This term unlocks adjacent concepts quickly, so it works best when you treat it as a connection point rather than an isolated definition.

AI handoff

Use this compact block when you want to give an agent or assistant solid context without dumping the entire page.

AI Alignment (ai-alignment)
Category: AI / ML
Definition: The practice of ensuring AI systems behave according to human intentions and values—being helpful, harmless, and honest. Alignment encompasses training-time techniques (RLHF, Constitutional AI, DPO), inference-time guardrails, and evaluation through red teaming. As models become more capable, alignment becomes critical to prevent harmful content generation or manipulation by bad actors.
Aliases: AI Safety
Related: RLHF (Reinforcement Learning from Human Feedback), Constitutional AI, DPO (Direct Preference Optimization)

Concept graph

See the term as part of a network, not as an isolated definition.

These branches show which concepts this term touches directly and what sits one layer beyond them.

Branch

RLHF (Reinforcement Learning from Human Feedback)

A training technique that aligns LLM outputs with human preferences. Process: (1) train a reward model from human comparisons of outputs, (2) use reinforcement learning (PPO) to optimize the LLM against the reward model. RLHF makes models more helpful, harmless, and honest. Used by Claude, ChatGPT, and other assistants. Alternatives include DPO (Direct Preference Optimization) and Constitutional AI.
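
A minimal sketch of the two stages described above, assuming PyTorch and toy tensors in place of a real LLM; the module sizes, the random data, and the REINFORCE-style update are illustrative stand-ins for the PPO pipeline used in practice.

import torch
import torch.nn.functional as F

# Stage 1: train a reward model from human comparisons.
# Preferred responses should score higher than rejected ones (pairwise Bradley-Terry loss).
reward_model = torch.nn.Linear(16, 1)
opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

preferred = torch.randn(8, 16)   # toy embeddings of human-preferred responses
rejected = torch.randn(8, 16)    # toy embeddings of dispreferred responses

loss_rm = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
opt_rm.zero_grad()
loss_rm.backward()
opt_rm.step()

# Stage 2: optimize the policy against the learned reward, with a KL penalty
# toward the frozen reference model (real RLHF uses PPO's clipped objective).
vocab = 100
policy_logits = torch.randn(8, vocab, requires_grad=True)   # trainable policy (toy)
reference_logits = torch.randn(8, vocab)                    # frozen pre-RLHF model (toy)
response_repr = torch.randn(8, 16)                          # toy response embeddings

reward = reward_model(response_repr).squeeze(-1).detach()   # score the sampled responses

log_p = F.log_softmax(policy_logits, dim=-1)
log_ref = F.log_softmax(reference_logits, dim=-1)
kl = (log_p.exp() * (log_p - log_ref)).sum(dim=-1)          # KL(policy || reference)

actions = torch.multinomial(log_p.exp().detach(), 1).squeeze(-1)  # sample one token per row
chosen_log_p = log_p.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

beta = 0.1                                                  # KL penalty strength
policy_loss = -(reward * chosen_log_p).mean() + beta * kl.mean()
policy_loss.backward()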

Branch

Constitutional AI

An alignment technique developed by Anthropic where an AI model is guided by a 'constitution'—a set of explicit principles defining allowed and disallowed behavior—rather than relying solely on human feedback. The model critiques and revises its own outputs against these principles. Constitutional Classifiers extend this by training input/output classifiers that detect policy violations at low compute cost.
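
A hypothetical sketch of the critique-and-revise loop described above; generate stands in for any LLM completion call (it is not a real API) and the constitution is reduced to two illustrative principles.

CONSTITUTION = [
    "Do not provide instructions that enable serious harm.",
    "Be honest about uncertainty instead of fabricating facts.",
]

def generate(prompt: str) -> str:
    # Placeholder for a real model call (an LLM API or a local model).
    return "draft response to: " + prompt

def constitutional_revision(user_prompt: str) -> str:
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        # The model critiques its own output against each principle...
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        # ...and then rewrites it to address the critique.
        response = generate(
            f"Rewrite the response to address this critique.\n"
            f"Critique: {critique}\nOriginal: {response}"
        )
    return response

# The revised pairs become supervised fine-tuning data, and AI-generated
# preference labels can replace human labels in the RL stage (RLAIF).
print(constitutional_revision("How should I store my API keys?"))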

Branch

DPO (Direct Preference Optimization)

A simplified alternative to RLHF that aligns LLM outputs with human preferences without training a separate reward model or using reinforcement learning. DPO directly optimizes a policy using pairs of preferred and dispreferred outputs, making it computationally cheaper and more stable than RLHF's multi-stage pipeline. Widely adopted in 2024-2025 for fine-tuning open-source models.
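
The core of DPO is a single classification-style loss over preference pairs; a minimal sketch with toy tensors, assuming PyTorch and per-response log-probabilities as placeholder scalars, looks like this.

import torch
import torch.nn.functional as F

beta = 0.1  # strength of the implicit KL constraint toward the reference model

# Summed token log-probabilities of the chosen (preferred) and rejected
# (dispreferred) responses under the trainable policy and the frozen reference.
policy_chosen = torch.randn(8, requires_grad=True)
policy_rejected = torch.randn(8, requires_grad=True)
ref_chosen = torch.randn(8)
ref_rejected = torch.randn(8)

# DPO loss: make the policy prefer the chosen response, relative to the rejected
# one, by more than the reference model does. No reward model, no sampling,
# no RL loop, only a logistic loss over pairs.
logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
loss = -F.logsigmoid(logits).mean()
loss.backward()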

Next concepts to explore

Keep the learning chain moving instead of stopping at a single definition.

These are the next concepts worth opening if you want this term to make more sense within a real Solana workflow.

AI / ML

RLHF (Reinforcement Learning from Human Feedback)

A training technique that aligns LLM outputs with human preferences. Process: (1) train a reward model from human comparisons of outputs, (2) use reinforcement learning (PPO) to optimize the LLM against the reward model. RLHF makes models more helpful, harmless, and honest. Used by Claude, ChatGPT, and other assistants. Alternatives include DPO (Direct Preference Optimization) and Constitutional AI.

AI / ML

Constitutional AI

An alignment technique developed by Anthropic where an AI model is guided by a 'constitution'—a set of explicit principles defining allowed and disallowed behavior—rather than relying solely on human feedback. The model critiques and revises its own outputs against these principles. Constitutional Classifiers extend this by training input/output classifiers that detect policy violations at low compute cost.

AI / ML

DPO (Direct Preference Optimization)

A simplified alternative to RLHF that aligns LLM outputs with human preferences without training a separate reward model or using reinforcement learning. DPO directly optimizes a policy using pairs of preferred and dispreferred outputs, making it computationally cheaper and more stable than RLHF's multi-stage pipeline. Widely adopted in 2024-2025 for fine-tuning open-source models.

AI / ML

AI Coding Assistant

An AI tool that helps developers write, debug, review, and explain code. Examples: GitHub Copilot (inline suggestions), Claude Code (agentic CLI), Cursor (AI-native editor), Cody (Sourcegraph). These tools use LLMs to understand codebases, generate implementations, fix bugs, and write tests. Particularly valuable for Solana development where boilerplate is significant.

More in this category

Stay in the same layer and keep building context.

These entries sit alongside the current term and help the page feel like part of a broader knowledge graph rather than a dead end.

AI / ML

LLM (Large Language Model)

A neural network trained on vast text corpora to understand and generate human language. LLMs (GPT-4, Claude, Llama, Gemini) use transformer architectures with billions of parameters. They power chatbots, code generation, summarization, and reasoning tasks. In blockchain development, LLMs assist with smart contract writing, audit review, documentation, and code explanation.

AI / ML

Transformer

The neural network architecture underlying modern LLMs, introduced in 'Attention Is All You Need' (2017). Transformers use self-attention mechanisms to process input sequences in parallel (unlike recurrent networks). Key components: multi-head attention, positional encoding, feedforward layers, and layer normalization. Variants include encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5).
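
A minimal sketch of one decoder-style block built from the components named above, assuming PyTorch built-ins and illustrative sizes.

import torch

d_model, n_heads, seq_len = 32, 4, 10

attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ffn = torch.nn.Sequential(
    torch.nn.Linear(d_model, 4 * d_model),
    torch.nn.GELU(),
    torch.nn.Linear(4 * d_model, d_model),
)
norm1 = torch.nn.LayerNorm(d_model)
norm2 = torch.nn.LayerNorm(d_model)

x = torch.randn(1, seq_len, d_model)  # token embeddings plus positional encoding

# Causal mask so each position only attends to earlier positions (decoder-only, GPT-style).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

attn_out, _ = attn(x, x, x, attn_mask=causal_mask)
x = norm1(x + attn_out)   # multi-head attention sublayer + residual + layer norm
x = norm2(x + ffn(x))     # feedforward sublayer + residual + layer norm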

AI / ML

Attention Mechanism

A neural network component that allows models to weigh the relevance of different parts of the input when producing output. Self-attention computes query-key-value dot products across all positions, enabling each token to 'attend' to every other token. Multi-head attention runs multiple attention functions in parallel. Attention is O(n²) in sequence length, driving context window research.
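
A minimal sketch of scaled dot-product self-attention with toy shapes, assuming PyTorch; the seq_len by seq_len score matrix is the source of the O(n²) cost.

import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 8)                 # batch=1, seq_len=4, d_model=8
w_q = torch.nn.Linear(8, 8, bias=False)  # query projection
w_k = torch.nn.Linear(8, 8, bias=False)  # key projection
w_v = torch.nn.Linear(8, 8, bias=False)  # value projection

q, k, v = w_q(x), w_k(x), w_v(x)

# Every token attends to every other token: the (seq_len, seq_len) score matrix
# is what makes attention quadratic in sequence length.
scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
weights = F.softmax(scores, dim=-1)      # attention weights per token
output = weights @ v                     # weighted sum of value vectors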

AI / ML

Foundation Model

A large AI model trained on broad data that can be adapted for many downstream tasks. Foundation models (GPT-4, Claude, Llama 3, Gemini) are pre-trained on internet-scale text/code and can be fine-tuned, prompted, or used via APIs for specific applications. The term emphasizes that one base model serves as the foundation for diverse use cases rather than training task-specific models.