Implementing GRPO: Scaling Reasoning Without the Critic Overhead

The Bottleneck: PPO’s Memory Tax and DPO’s Limitations

For years, Proximal Policy Optimization (PPO) was the gold standard for RLHF. However, it imposes a massive infrastructure tax: you effectively need to load four models into VRAM (Actor, Critic, Reference, and Reward Model). For 70B+ parameter models, this makes fine-tuning prohibitive for all but the largest labs.

Direct Preference Optimization (DPO) solved the memory issue by removing the reinforcement learning loop entirely, treating alignment as a supervised classification problem on preference pairs. However, DPO struggles with chain-of-thought (CoT) optimization. It relies on static datasets and doesn’t inherently encourage the model to explore reasoning paths.

Enter Group Relative Policy Optimization (GRPO). Introduced in DeepSeek’s DeepSeekMath paper and popularized by DeepSeek-R1, GRPO removes the Critic model entirely. Instead of estimating a value function, it samples a group of outputs for a single prompt, scores each one, and uses the group’s mean reward as the baseline. This cuts VRAM usage substantially while preserving the exploration benefits of RL, making it a strong choice for optimizing mathematical and coding reasoning.

The Architecture: How GRPO Eliminates the Value Function

In standard PPO, the advantage $A_t$ is computed from a learned value network (the Critic). GRPO replaces that learned baseline with the mean reward of a group of outputs sampled for the same prompt.

The Algorithm

Sampling: For each prompt $q$, sample a group of $G$ outputs $\{o_1, o_2, …, o_G\}$ from the old policy $\pi_{\theta_{old}}$.
Scoring: Apply a reward model (or rule-based verifier) to get rewards $r_1, …, r_G$.
Advantage Calculation: Compute the advantage for each output based on its relative performance within the group.

A_i = \frac{r_i - \text{mean}(r)}{\text{std}(r) + \epsilon}

Optimization: Maximize the GRPO objective, which includes the KL-divergence penalty to keep the model close to the reference policy.

\mathcal{L}_{GRPO} = \mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}} \left[ \frac{1}{G} \sum_{i=1}^G \left( \min \left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip}(...) A_i \right) - \beta \mathbb{D}_{KL} \right) \right]
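
To make the advantage and objective above concrete, here is a minimal pure-Python sketch for a single group. It is an illustration under stated assumptions, not DeepSeek's or TRL's implementation: it treats each output as one sequence-level log-probability, assumes the usual PPO clip range (clip_eps=0.2, which the formula above leaves unspecified), uses the population standard deviation, and estimates the KL term with the pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 form described in the DeepSeekMath paper. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; sample std is equally plausible
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one output (clip_eps is an assumed value)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_new, logp_ref):
    """Non-negative KL estimate: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_objective(logp_new, logp_old, logp_ref, rewards, beta=0.04, clip_eps=0.2):
    """Average over the group of (clipped surrogate - beta * KL). Higher is better."""
    advantages = group_advantages(rewards)
    terms = [
        clipped_term(new, old, adv, clip_eps) - beta * kl_estimate(new, ref)
        for new, old, ref, adv in zip(logp_new, logp_old, logp_ref, advantages)
    ]
    return mean(terms)


# Toy group of G = 4 outputs: two correct (reward 1.0), two incorrect (reward 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_advantages(rewards))  # approximately [1.0, -1.0, -1.0, 1.0]

# Made-up sequence log-probabilities, purely for illustration.
logp_old = [-12.0, -15.0, -14.0, -11.0]
logp_new = [-11.8, -15.3, -14.1, -10.9]
logp_ref = [-12.1, -15.0, -14.2, -11.2]
print(grpo_objective(logp_new, logp_old, logp_ref, rewards))
```

Note that no value network appears anywhere: the only learned quantities are the policy log-probabilities, and the baseline comes entirely from the other samples in the group.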

Visualizing the Workflow

graph TD
    subgraph "GRPO Workflow"
    A["Input Prompt (q)"] --> B{"Policy Model"}
    B --"Sample G outputs"--> C["Outputs {o1, o2... oG}"]
    C --> D["Reward Function / Verifier"]
    D --"Calculate Rewards"--> E["Compute Group Mean & Std"]
    E --> F["Calculate Advantage (Ai)"]
    F --> G["Update Policy (No Critic Network)"]
    end
    
    subgraph "Legacy PPO"
    X["Input"] --> Y["Actor"]
    X --> Z["Critic (Value Net)"]
    Y --> R["Reward"]
    R & Z --> CALC["GAE Estimation"]
    end

The Implementation

We will implement GRPO using the Hugging Face TRL (Transformer Reinforcement Learning) library, which recently integrated support for group-relative strategies.

Prerequisites

Hardware: A single A100 (80GB) or equivalent for 7B-14B models. GRPO is memory efficient.
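Libraries: trl>=0.14.0 (the first release that ships GRPOTrainer), transformers, datasets, torch, plus flash-attn for the flash_attention_2 setting in the script.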

Python Implementation

This script demonstrates aligning a model on math problems using a deterministic reward function (the “verifier”) rather than a neural reward model. This mirrors the rule-based rewards used for DeepSeek-R1.

import torch
from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
# Note: We do NOT load a Critic model.
MODEL_ID = "deepseek-ai/deepseek-coder-6.7b-instruct"
OUTPUT_DIR = "./grpo-reasoning-adapter"

training_args = GRPOConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=8,  # Keep the global batch divisible by num_generations; TRL rejects mismatched sizes
    gradient_accumulation_steps=2,
    learning_rate=5e-6,
    beta=0.04,          # KL penalty coefficient
    max_grad_norm=1.0,
    logging_steps=10,
    save_strategy="steps",
    # GRPO Specifics
    num_generations=8,  # The group size (G). Higher = better baseline estimation but more VRAM.
    max_completion_length=512,
    bf16=True
)

# 2. Reward Function (The Verifier)
# In this scenario, we use a simple arithmetic checker. 
# In production, this would parse the model's <answer> tag and compare to ground truth.
def arithmetic_reward_func(prompts, completions, answer, **kwargs):
    rewards = []
    for comp, true_ans in zip(completions, answer):
        # Heuristic: Check if the ground truth number appears in the output
        # DeepSeek uses rigorous parsing logic here.
        score = 1.0 if str(true_ans) in comp else 0.0
        
        # Shaping: Penalize extreme verbosity if incorrect
        if score == 0.0 and len(comp) > 200:
            score -= 0.1
            
        rewards.append(score)
    return rewards

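# 3. Load Data & Model
# GRPOTrainer reads prompts from a "prompt" column, and GSM8K stores the final
# answer after "####" at the end of a worked solution, so reshape both fields.
dataset = load_dataset("openai/gsm8k", "main", split="train[:1%]") # Tiny subset for demo

def to_grpo_example(example):
    return {
        "prompt": example["question"],
        "answer": example["answer"].split("####")[-1].strip().replace(",", ""),
    }

dataset = dataset.map(to_grpo_example, remove_columns=["question"])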

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# Load the model in bfloat16 (no quantization is applied here)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

# 4. Initialize GRPO Trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer, # Tokenizer for prompts and completions
    args=training_args,
    train_dataset=dataset,
    reward_funcs=arithmetic_reward_func, # Can accept multiple reward functions
)

# 5. Execute Training
if __name__ == "__main__":
    print(f"Starting GRPO with Group Size: {training_args.num_generations}")
    trainer.train()
    trainer.save_model(OUTPUT_DIR)
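
Before committing GPU hours, it can be worth exercising the reward function on a few hand-written completions. The sketch below repeats the function from the script so it runs on its own, and calls it the way the script expects: batched prompts and completions, with the dataset's answer column passed as a keyword argument. The example strings are made up for illustration.

```python
def arithmetic_reward_func(prompts, completions, answer, **kwargs):
    """Same heuristic verifier as in the training script above."""
    rewards = []
    for comp, true_ans in zip(completions, answer):
        score = 1.0 if str(true_ans) in comp else 0.0
        if score == 0.0 and len(comp) > 200:
            score -= 0.1  # mild penalty for long, incorrect outputs
        rewards.append(score)
    return rewards


# One prompt repeated four times, mimicking a group of G = 4 completions.
prompts = ["What is 6 * 7?"] * 4
answer = ["42"] * 4
completions = [
    "6 * 7 = 42",                                # correct
    "The result is 48.",                         # wrong and short
    "Let me think. " * 20 + "I am not sure.",    # wrong and verbose
    "It is 142.",                                # wrong, but contains "42"
]

rewards = arithmetic_reward_func(prompts=prompts, completions=completions, answer=answer)
print(rewards)  # [1.0, 0.0, -0.1, 1.0]
```

The last completion is rewarded even though it is wrong, because the substring check cannot tell 142 from 42. That false positive is exactly the kind of loophole a policy can learn to exploit, which is why a stricter parser is worth the effort for real runs.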

Implementation Steps

Define the Verifier: Unlike DPO, which needs a preference dataset (chosen vs. rejected), GRPO requires only prompts and a way to score completions, here a ground-truth answer. The function passed as reward_funcs is critical: for coding, it would run unit tests; for math, it checks numerical equivalence.
Set Group Size ($G$): In GRPOConfig, the num_generations parameter controls $G$.
- Low $G$ (e.g., 4): High variance in advantage estimation.
- High $G$ (e.g., 16+): Better baseline, but linearly increases inference cost during training.
Tune Beta ($\beta$): In GRPO, $\beta$ weights an explicit KL penalty in the objective, whereas in DPO it scales the implicit reward inside the loss, so values do not transfer between the two. Start with 0.04 (as per DeepSeek-Math) rather than the 0.1 commonly used for DPO.
Format Handling: Ensure your model is prompted to output strictly formatted answers (e.g., “Put the final answer in \boxed{}”) so the reward function can extract them reliably.
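
As a rough illustration of the verifier and format-handling steps, the sketch below extracts the last \boxed{...} answer from each completion and compares it numerically with the ground truth. The names boxed_reward_func, extract_boxed and to_number are hypothetical, and the parsing is deliberately simple: it does not handle nested braces such as \boxed{\frac{1}{2}}, units, or symbolic answers.

```python
import re
from fractions import Fraction

# Matches \boxed{...} with no nested braces inside (a simplifying assumption).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in the text, or None."""
    matches = BOXED.findall(text)
    return matches[-1].strip() if matches else None


def to_number(raw):
    """Parse '42', '42.0', '1,000' or '3/4' into an exact Fraction, else None."""
    cleaned = raw.replace(",", "").replace("$", "").strip()
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def boxed_reward_func(prompts, completions, answer, **kwargs):
    """Reward 1.0 only when the boxed answer equals the ground truth numerically."""
    rewards = []
    for comp, true_ans in zip(completions, answer):
        boxed = extract_boxed(comp)
        predicted = to_number(boxed) if boxed is not None else None
        expected = to_number(str(true_ans))
        correct = predicted is not None and expected is not None and predicted == expected
        rewards.append(1.0 if correct else 0.0)
    return rewards


completions = [
    "So the answer is \\boxed{42}.",   # correct and well formatted
    "I think it is 42.",               # correct value, but not boxed
    "Therefore \\boxed{42.0}",         # numerically equivalent
    "Therefore \\boxed{142}",          # wrong value
]
print(boxed_reward_func(prompts=[""] * 4, completions=completions, answer=["42"] * 4))
# [1.0, 0.0, 1.0, 0.0]
```

Requiring the box makes the reward stricter about format than a substring check, so an unboxed correct answer scores zero; whether that trade-off is right depends on how reliably the prompt elicits the format.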

Comparison: GRPO vs. The Rest

Feature	PPO	DPO	GRPO
Models in VRAM	4 (Actor, Critic, Ref, RM)	2 (Policy, Ref)	2 (Policy, Ref)
Data Requirement	Preference Pairs or Reward Signal	Preference Pairs (Offline)	Prompts + Evaluator
Reasoning Capability	High (Exploration allowed)	Low (Offline cloning)	High (Exploration via Group)
Stability	Low (Sensitive to hyperparameters)	High	Medium-High
Best Use Case	General Chat	Chat / Style Transfer	Math, Logic, Code

GRPO represents a shift from proxy-based optimization (training a Reward Model to guess what humans like) to outcome-based optimization (checking if the answer is actually correct). By removing the Critic, you drop a full-size trainable network along with its gradients and optimizer state, which by rough estimate frees 30-40% of your VRAM, allowing you to train larger models or increase batch sizes.

For 2026’s reasoning-heavy workloads, DPO is for style; GRPO is for correctness.
