AlignADV: Learnability-Guided Adversarial Training

Abstract

Closed-loop adversarial training has emerged as a vital paradigm for enhancing the safety of autonomous driving policies by enabling them to learn from rare safety-critical scenarios. Standard training pipelines typically generate adversarial scenarios first, then sample them for policy optimization. However, most existing frameworks remain attack-oriented. Driven primarily by collision maximization, current generators often synthesize practically unsolvable extreme situations, thereby degrading the learning process. Furthermore, conventional heuristic and simplified sampling strategies ignore the continuously evolving capability of the driving policy, leading to sample inefficiency and delayed convergence. To overcome these limitations, we propose AlignADV, a learnability-guided closed-loop adversarial training framework designed to convert adversarial scenarios into resolvable and capability-aligned curricula. First, we reformulate adversarial scenario generation as a preference alignment problem and employ direct preference optimization to guide the generator toward critical yet resolvable scenarios. Second, we introduce the concept of behavioral fingerprint to extract the intrinsic characteristics of the evolving policy and construct a multi-modal capability prediction model that accurately evaluates policy performance without expensive simulations. By combining the resolvability-aligned scenario set with these capability predictions, we develop a dynamic curriculum sampling mechanism that prioritizes scenarios targeting the exact vulnerabilities of the current policy. Comprehensive experiments using the Waymo Open Motion Dataset demonstrate that AlignADV significantly improves both convergence efficiency and final performance, reducing training steps by up to 40.6% compared to baseline methods, while reducing collision rate and improving route completion rate in both normal and adversarial traffic conditions. These results highlight a shift from attack-oriented scenario generation to learnability-guided policy improvement, offering a principled direction for safer and more efficient autonomous driving training.

Problem Background

The efficacy of adversarial training depends on two key factors: what scenarios to generate and how to use these scenarios to maximize learning efficiency and policy performance.

Research motivation and AlignADV framework — Overview of the research motivation and AlignADV framework.

Two bottlenecks in attack-oriented adversarial training

Practically unsolvable adversarial scenarios. Collision-maximizing generators can synthesize extreme situations where no reasonable collision-avoidance response exists, which weakens learning feedback and may induce overly conservative behavior.
Capability-mismatched scenario sampling. Heuristic or simplified sampling strategies ignore the continuously evolving capability of the driving policy, making sampled scenarios too easy for the current policy and reducing training efficiency.

Method Overview

The primary objective of AlignADV is to shift from the conventional unidirectional attack paradigm to learnability-guided curriculum loop. In this work, learnability serves as an operational principle that guides AlignADV to jointly pursue resolvability-aligned scenario generation and capability-aligned scenario sampling.

Resolvable Adversarial Scenario Generation fine-tunes a pretrained adversarial generator with expert-evaluated preference pairs and Direct Preference Optimization, guiding generated scenarios toward critical yet resolvable interactions.
Dynamic Policy Capability Prediction represents the evolving policy with behavioral fingerprints and predicts scenario-specific success probabilities before expensive simulation.
Capability-Aligned Curriculum Sampling combines the resolvable scenario library with predicted policy capabilities to construct a dynamic curriculum distribution targeting the current policy's predicted vulnerabilities.

Results

Experiments are conducted in MetaDrive by reconstructing real-world traffic logs from the Waymo Open Motion Dataset. The evaluation verifies scenario solvability, behavioral fingerprints and capability prediction, and closed-loop adversarial training performance.

40.6%

training steps saved compared to vanilla adversarial training

99.87%

solvability rate after preference optimization

0.885

precision for the top 10% of negative predictions

26%

relative reduction in crash rate compared with vanilla adversarial training

Scenario solvability

The fine-tuned generator sharply reduces unsolvable cases while preserving criticality and behavioral diversity.

Model	Unsolvable Scenarios ↓	Solvability Rate ↑	Average TTC ↓	APD ↑	FPD ↑	Variance ↑
Pre-trained	606	96.05%	0.582 s	2.771 m	8.369 m	2.696
Fine-tuned	20	99.87%	0.577 s	2.709 m	8.284 m	2.779

Behavioral fingerprints and capability prediction

Evolution of behavioral fingerprints form a continuous trajectory across rapid learning, exploration, and convergence, reflecting the evolving competence of the policy. The capability predictor further combines these fingerprints with local scenario context to estimate success probabilities before simulation, reaching 0.885 precision when identifying the top 10% most challenging cases.

Correlation between predicted and ground truth success rates — Correlation between Predicted and Ground Truth Success Rates

Feature ablation variants — Feature Ablation Variants

Comparison with baseline methods — Comparison with Baseline Methods

Closed-loop adversarial training

RL training comparison between generators — Training Curves under Different Scenario Generation Methods

RL training comparison between sampling strategies — Training Curves under Different Curriculum Sampling Strategies

Normal scenarios

Method	Crash Rate ↓	Route Comp. ↑	Cost ↓	Reward ↑	Steps Saved ↑
Replay	18.65% ± 2.70%	73.69% ± 1.99%	0.482 ± 0.026	49.01 ± 1.99	--
Vanilla Adv	15.95% ± 1.30%	75.73% ± 1.53%	0.461 ± 0.003	50.68 ± 1.84	Baseline
Solvable Adv	14.79% ± 2.57%	77.84% ± 1.14%	0.432 ± 0.021	53.43 ± 1.53	27.1%
Static	14.70% ± 2.32%	74.47% ± 2.44%	0.468 ± 0.043	50.45 ± 2.73	7.7%
History	14.85% ± 1.07%	77.13% ± 0.69%	0.434 ± 0.024	51.90 ± 1.07	3.9%
Ours	11.80% ± 1.78%	79.30% ± 1.54%	0.404 ± 0.036	54.15 ± 1.36	40.6%

Adversarial scenarios

Method	Crash Rate ↓	Route Comp. ↑	Cost ↓	Reward ↑	Steps Saved ↑
Replay	40.31% ± 1.49%	65.77% ± 1.05%	0.653 ± 0.008	42.23 ± 1.30	--
Vanilla Adv	35.77% ± 1.95%	67.68% ± 2.39%	0.634 ± 0.023	43.16 ± 1.96	Baseline
Solvable Adv	33.27% ± 1.34%	70.10% ± 1.54%	0.601 ± 0.017	45.39 ± 1.57	5.8%
Static	35.50% ± 4.37%	66.21% ± 3.69%	0.639 ± 0.030	42.70 ± 2.58	0.0%
History	34.97% ± 1.85%	69.22% ± 1.31%	0.609 ± 0.027	44.42 ± 1.30	17.4%
Ours	31.37% ± 3.51%	71.66% ± 1.93%	0.588 ± 0.026	46.24 ± 1.91	30.9%

Overall, compared with the adversarial training baseline, AlignADV uses only 59% of the training steps to reach the established performance thresholds and delivers a 26% relative reduction in crash rate.

BibTeX

@article{mei2026alignadv,
  title={From Attacks to Curricula: Learnability-Guided Adversarial Training for Safe Autonomous Driving},
  author={Mei, Yuewen and Nie, Tong and Sun, Jie and Shi, Haotian and Ma, Wei and Sun, Jian},
  year={2026}
}