Constitutional AI (CAI) is Anthropic’s approach for training AI systems to be harmless and helpful without drowning in human feedback. Published in December 2022, it gives models a set of principles (the “constitution”) against which they evaluate their own outputs.

The key insight: let the AI critique itself rather than hiring armies of human labelers.

Two-Phase Training

Phase 1: Supervised Learning

  1. Model generates responses to prompts
  2. Model self-critiques responses against constitutional principles
  3. Model revises responses based on critique
  4. Model is fine-tuned on these revised responses
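
A minimal sketch of this critique-and-revise loop, assuming a hypothetical `generate(prompt)` helper that returns one completion from the model being trained (the principle wording below is illustrative, not Anthropic's actual constitution):

```python
import random

# Illustrative principles; the real constitution's wording differs.
PRINCIPLES = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Choose the response least likely to encourage illegal or dangerous activity.",
]

def generate(prompt: str) -> str:
    """Hypothetical helper: one completion from the model being trained."""
    raise NotImplementedError("wire this to your model or API of choice")

def critique_and_revise(prompt: str, rounds: int = 2) -> dict:
    """Draft a response, then repeatedly self-critique and revise it."""
    response = generate(prompt)
    for _ in range(rounds):
        principle = random.choice(PRINCIPLES)
        critique = generate(
            "Critique the response below according to this principle.\n"
            f"Principle: {principle}\nHuman: {prompt}\nAssistant: {response}\nCritique:"
        )
        response = generate(
            "Rewrite the response so it addresses the critique.\n"
            f"Human: {prompt}\nAssistant: {response}\nCritique: {critique}\nRevision:"
        )
    # The (prompt, final revision) pairs become the supervised fine-tuning set.
    return {"prompt": prompt, "completion": response}
```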

Phase 2: Reinforcement Learning from AI Feedback (RLAIF)

  1. Model generates multiple response candidates
  2. A separate AI compares responses for constitutional compliance
  3. This AI feedback trains a preference model
  4. The main model is fine-tuned against this preference model

The second phase replaces RLHF’s human annotators with AI annotators.
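
A rough sketch of the AI-feedback labeling step, reusing the hypothetical `generate` helper from the Phase 1 sketch; the resulting chosen/rejected pairs are what the preference model is trained on:

```python
def ai_preference_label(prompt: str, response_a: str, response_b: str, principle: str) -> int:
    """Ask the feedback model which candidate better satisfies a principle.

    Returns 0 if response A is preferred, 1 if response B is preferred.
    """
    verdict = generate(
        "Consider a conversation and two candidate responses.\n"
        f"Human: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        f"Which response better follows this principle: {principle}\n"
        "Answer with the single letter A or B."
    )
    return 0 if verdict.strip().upper().startswith("A") else 1

# Each labeled pair is appended to a preference dataset of
# {"prompt": ..., "chosen": ..., "rejected": ...} records, which trains the
# reward model that the policy is then optimized against with RL.
```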

Why Constitutional AI?

Scalability: AI supervision instead of human supervision. You can generate millions of training examples without hiring millions of people.

Transparency: The principles are explicit, inspectable, and understandable, rather than buried implicitly in an opaque reward model.

Less traumatic: Human labelers don’t need to review endless streams of harmful content to mark it as unsafe.

Pareto improvement: In Anthropic’s experiments, constitutional RL produced models that were both more helpful and more harmless than their RLHF-trained counterparts.

vs Traditional RLHF

| Aspect | RLHF | Constitutional AI |
| --- | --- | --- |
| Feedback source | Human annotators | AI self-critique |
| Scale | Resource-intensive | Highly scalable |
| Principles | Implicit in human judgments | Explicit in constitution |
| Worker impact | Psychologically taxing | No human exposure to harmful content |
| Consistency | Varies by annotator | Uniform principle application |

The Constitution Evolves

Anthropic’s original constitution included 75 principles, drawing on sources such as the UN Universal Declaration of Human Rights. The 2026 constitution expanded to ~23,000 words with narrative explanations.

The constitution is neither finalized nor optimal. It’s a perpetual work in progress, with Anthropic iterating based on research and feedback.

Population-Level Convergence

Anthropic has experimented with “collective constitutional AI” where populations of users converge on principles. Instead of top-down value assignment, you let groups discover shared values through deliberation.

Democracy for AI alignment.

Critiques and Limitations

The completeness problem: CAI only works as well as its constitution. If the principle set is incomplete or poorly chosen, the model produces undesirable outputs not covered by any rule. RLHF models may handle unforeseen situations better because human annotators can judge edge cases.

Human-in-the-loop tension: Minimizing direct human intervention conflicts with EU AI Act requirements for human oversight in automated decision-making. The efficiency gain comes at a democratic legitimacy cost.

Normative thinness: Critics argue that high-level consensus on “helpfulness” and “harmlessness” doesn’t translate cleanly to specific implementation choices. The devil lives in the mid-level norms.

Critique accuracy: The self-critiques generated during training often contain inaccurate criticism, though the first revision typically removes most harmful content anyway.

Active Research Directions

Inverse Constitutional AI (ICLR 2025)

Inverse Constitutional AI flips the problem: given preference data, extract the implicit principles as a constitution. The algorithm generates candidate principles, clusters them, tests which ones reconstruct annotations, then filters to a final set.
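
A compact sketch of that pipeline, with an `llm` callable standing in for the judge model and made-up prompt wording (the published algorithm clusters principles with embeddings and scores them more carefully):

```python
def inverse_constitutional_ai(preference_data, llm, max_principles=20):
    """Sketch of the ICAI idea: propose, deduplicate, score, and filter principles.

    preference_data: list of (prompt, chosen, rejected) triples.
    llm: any callable mapping a text prompt to a text completion.
    """
    # 1. Propose a candidate principle explaining each individual annotation.
    candidates = [
        llm("Propose a short rule explaining why the first response was preferred.\n"
            f"Prompt: {p}\nPreferred: {c}\nRejected: {r}")
        for p, c, r in preference_data
    ]

    # 2. Collapse duplicates (a real implementation clusters semantically similar rules).
    unique = list(dict.fromkeys(candidates))

    # 3. Score each rule by how many annotations an LLM judge reconstructs
    #    when given only that rule.
    def reconstruction_accuracy(rule):
        hits = sum(
            llm(f"Rule: {rule}\nPrompt: {p}\n(A) {c}\n(B) {r}\n"
                "Which response better follows the rule? Answer A or B.")
            .strip().upper().startswith("A")
            for p, c, r in preference_data
        )
        return hits / max(len(preference_data), 1)

    # 4. Keep the top-scoring rules as the extracted constitution.
    return sorted(unique, key=reconstruction_accuracy, reverse=True)[:max_principles]
```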

Use cases: identifying annotator biases, understanding model performance, scaling feedback, and personalizing AI to group preferences.

C3AI Framework (ACM Web Conference 2025)

The C3AI Framework addresses which principles actually work:

  • Positively framed, behavior-based principles outperform negatively framed or trait-based ones
  • Positive wording boosts alignment with human preferences by 27%
  • LLMs struggle with abstract goals (“promote humanity’s well-being”) but handle concrete rules (“avoid harmful content”)
  • Key insight: human-aligned principles differ from model-aligned principles
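
As a made-up illustration of the framing distinction above (these strings are not taken from C3AI or from Anthropic’s constitution):

```python
# Positively framed, behavior-based: names the concrete behavior to exhibit.
positive_behavioral = "Choose the response that cites sources when making factual claims."

# Negatively framed, trait-based: names an abstract quality to avoid.
negative_trait = "Do not be an unreliable assistant."

# Abstract goal of the kind models reportedly struggle to operationalize.
abstract_goal = "Promote humanity's well-being."
```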

Collective Constitutional AI

Collective Constitutional AI sources principles from populations rather than companies. Anthropic partnered with the Collective Intelligence Project to run public deliberation with ~1,000 Americans using the Polis platform.

Early experiments found areas where public preferences aligned with internal constitutions and areas where they diverged. This may be the first instance of public collective direction of language model behavior.
