Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

[CVPR2026]

¹Tsinghua University ²University of New South Wales

Abstract

Generating human motion that satisfies customized zero-shot goal functions is a critical capability, enabling applications such as controllable character animation and behavior synthesis for virtual agents. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or a specified number of walking steps.

To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by a retrieved reference. A better initialization for the diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing the diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging an LLM for relational task parsing, the framework can further reason automatically about what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme. Code will be released upon publication.
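The initialization step described above can be illustrated with a minimal sketch. It is not the released implementation: the function names, the per-frame reward, the hard threshold used to build the mask, and the plain gradient-descent loop are all assumptions made for illustration. In the actual framework the gradient would come from backpropagating a goal function through the diffusion sampler, which is replaced here by a caller-supplied `grad_fn`.

```python
import numpy as np

def reward_guided_mask(reward_per_frame, threshold):
    """Hypothetical mask: 1.0 for frames where the retrieved motion
    scores at least `threshold`, 0.0 elsewhere."""
    return (np.asarray(reward_per_frame) >= threshold).astype(np.float64)

def init_noise(retrieved_noise, mask, rng):
    """Blend retrieved noise with fresh Gaussian noise, frame by frame.

    retrieved_noise: array of shape (frames, features)
    mask:            array of shape (frames,), values in {0, 1}
    """
    random_noise = rng.standard_normal(retrieved_noise.shape)
    m = mask[:, None]  # broadcast the per-frame mask over the feature axis
    return m * retrieved_noise + (1.0 - m) * random_noise

def optimize_noise(z0, grad_fn, lr=0.1, steps=100):
    """Plain gradient descent on the noise (a stand-in for the optimizer
    used in diffusion noise optimization; assumed, not from the source)."""
    z = z0.copy()
    for _ in range(steps):
        z -= lr * grad_fn(z)
    return z
```

A toy usage, assuming a quadratic loss toward a fixed target in place of the real goal function: build the mask from made-up rewards, call `init_noise`, then pass `lambda z: z - target` as `grad_fn`. Frames selected by the mask start from the retrieved noise, so the optimizer begins closer to a motion that already satisfies the difficult constraints.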

Overall Pipeline

Overview

Qualitative Results

(1) Tasks with challenging spatiotemporal constraints

"walk to pass a very narrow gap"

gap_width=0.4m, walk_distance=5m

"walk to avoid a very low barrier"

barrier_height=0.5m, walk_distance=5m

"walk and foot reaches a very high position"

height=1.8m, walk_distance=5m
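As an illustration of how a spatial constraint like the narrow gap might be written as a goal function, the sketch below penalizes joints that leave a corridor of the given width. The function name, the joint array layout, the choice of axis 0 as the lateral direction, and the assumption that the gap spans the whole walking path are all hypothetical; the goal functions actually used for these tasks are not specified here.

```python
import numpy as np

def narrow_gap_error(joints, gap_width=0.4, gap_center_x=0.0):
    """Mean hinge penalty (in metres) on joints outside the gap.

    joints: array of shape (frames, num_joints, 3); the first coordinate
            is assumed to be the lateral axis, in metres.
    """
    half_width = gap_width / 2.0
    lateral_offset = np.abs(joints[..., 0] - gap_center_x)
    # Zero inside the gap, growing linearly with the penetration depth.
    violation = np.maximum(lateral_offset - half_width, 0.0)
    return float(violation.mean())
```

A motion whose joints all stay within 0.2 m of the gap center yields an error of zero, and the error grows as joints penetrate the walls, so it can serve as a differentiable-style target for noise optimization.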

(2) Tasks with challenging numerical constraints

"walk and raise hands"

for four meters with six steps

"walk and clap hands wide"

for four times

"walk to avoid overhead barrier"

in four meters with five steps

Comparison

Comparison with ProgMoGen+DNO

Task: very narrow gap

ProgMoGen+DNO
(high constraint error, causing scene penetration)

Ours

Task: very low barrier

ProgMoGen+DNO
(large joint jitter)

Ours

Task: walk and raise hands for five steps

ProgMoGen+DNO
(wrong number of steps)

Ours

Task: walk and foot reaches a very high position

ProgMoGen+DNO
(physically implausible motion: stepping in mid-air)

Ours

Ablation Studies

Effect of each module

Task: very low barrier

w/o relational task parsing C_R
(large constraint error and frame inconsistency)

retrieval branch only
(poor local quality near end frames)

Ours (full)

Diversity

Task: very narrow gap

generation 1

generation 2

generation 3

Visualization of retrieved samples

Task: walk and clap hands wide for four times

retrieved sample x_R

Ours (full)

Different types of constraints

Task: hand reaches a very high position

with spatiotemporal constraint

target position z=2.5m

with numerical constraint

reach for three times