Course Outline — Spring 2026


Course Information


Course Description

This advanced research seminar treats video generation not merely as media synthesis, but as the foundation for General World Models and Embodied Agents. The course begins with the thesis that video models are world models, then surveys frontier visual models (Kimi K2.5, Z-Image, ERNIE 5.0) and the latest advances in Video-LLM unification. From there, we explore Vision-Language-Action (VLA) models, generative agents that act within video worlds, and controlled generation problems including talking heads and drivable 3D avatars. The curriculum also covers 4D dynamic scenes, long-horizon consistency, and the critical topic of safety and provenance for generated media. Throughout the course, students engage with the broader societal implications of AI through curated readings and gain hands-on experience with AI-assisted (vibe) coding tools.


Course Format & Tools


Weekly Schedule

Note: This schedule is subject to change to reflect the rapid pace of recent advances in AI.

Week | Date | Topics | Slides | Student Presenters
1 | Feb 6 | The New Thesis — Video Models as World Models | Slides |
2 | Feb 13 | Video Representation Learning Beyond Reconstruction (VideoGPT, Video Diffusion Models, Multimodal Models, Agents) | Slides |
3 | Feb 20 (CNY, take-home reading) | Video-LLM Unification — From Encoders to Instruction Tuning (Video-LLaMA, Video-LLaVA, LLaVA-Video) | Slides |
4 | Feb 27 | Frontier Visual Models — Kimi K2.5, Z-Image, ERNIE 5.0 | Slides | YANG Zhiqin (embodied dreamer), Jincheng Fang (controllable video gen and business), Fan YANG (HyperDiffusion), Sida Lin (Seedance 1.5)
5 | Mar 6 | VLA — How Vision-Language Meets Control (PaLM-E, RT-2, OpenVLA) | Slides | Yakun Cui (world simulator), Hanquan Yang, Shiyuan Song (UniPi, MDP, robot control), Jianxin Huang (sparse videogen2)
6 | Mar 13 | Video World Models | Slides | Jialiang CHEN (FlashWorld), Haokai Pang (3d gen), PENG Yi (AC Talker), Pengcheng WEN (DreamDojo), Haoze Zheng (LingBot-VA)
7 | Mar 20 | Guest Lecture — Dr. Xu Xian | | Zihao WANG (shotverse), Yihang JIANG, Mingzhe ZHENG, Xuran MA
8 | Mar 27 | Talking Heads (Audio → Video) as a Controlled Generation Problem | | Haoze ZHENG, Wenyuan Mi, Boyu Li, Wuyou Zhou
9 | Apr 3 | Drivable Avatars — From 2D Faces to 3D Gaussian/NeRF Heads | |
10 | Apr 10 | 4D Dynamic Scenes — Text-to-4D, Dynamic NeRFs, and Gaussian Splatting in Motion | | WANG Zhe, Jiapeng Sun, Yuean Lin, Yunfan Zhang
11 | Apr 17 | Long-Horizon Consistency — Memory, Tokens, and Anti-Drift Methods | | Chen Long, Haoyang Zhang, Yinfei Jiang, Hongbo Zhu
12 | Apr 24 | Safety, Provenance, and Reality Defense (for Video + Avatars) | |
13 | May 1 | Final Project | |