Course Outline — Spring 2026
This advanced research seminar treats video generation not merely as media synthesis, but as the foundation for General World Models and Embodied Agents. The course begins with the thesis that video models are world models, then surveys frontier visual models (Kimi K2.5, Z-Image, ERNIE 5.0) and the latest advances in Video-LLM unification. From there, we explore Vision-Language-Action (VLA) models, generative agents that act within video worlds, and controlled generation problems including talking heads and drivable 3D avatars. The curriculum also covers 4D dynamic scenes, long-horizon consistency, and the critical topic of safety and provenance for generated media. Throughout the course, students engage with the broader societal implications of AI through curated readings and gain hands-on experience with AI-assisted (vibe) coding tools.
Note: This schedule is subject to change to reflect the rapid pace of advances in AI.
| Week | Date | Topics | Slides | Student Presenters |
|---|---|---|---|---|
| 1 | Feb 6 | The New Thesis — Video Models as World Models | Slides | |
| 2 | Feb 13 | Video Representation Learning Beyond Reconstruction (VideoGPT, Video Diffusion Models, Multimodal Models, Agents) | Slides | |
| 3 | Feb 20 (CNY, take-home reading) | Video-LLM Unification — From Encoders to Instruction Tuning (Video-LLaMA, Video-LLaVA, LLaVA-Video) | Slides | |
| 4 | Feb 27 | Frontier Visual Models — Kimi K2.5, Z-Image, ERNIE 5.0 | Slides | YANG Zhiqin (embodied dreamer), Jincheng Fang (controllable video gen and business), Fan YANG (HyperDiffusion), Sida Lin (Seedance 1.5) |
| 5 | Mar 6 | VLA — How Vision-Language Meets Control (PaLM-E, RT-2, OpenVLA) | Slides | Yakun Cui (world simulator) |
| 6 | Mar 13 | Video World Models | Slides | Jialiang CHEN (FlashWorld), Haokai Pang (3D gen), PENG Yi (AC Talker), Pengcheng WEN (DreamDojo), Haoze Zheng (LingBot-VA) |
| 7 | Mar 20 | Guest Lecture — Dr. Xu Xian | | Zihao WANG (shotverse), Yihang JIANG |
| 8 | Mar 27 | Talking Heads (Audio → Video) as a Controlled Generation Problem | | Haoze ZHENG, Wenyuan Mi, Boyu Li, Wuyou Zhou |
| 9 | Apr 3 | Drivable Avatars — From 2D Faces to 3D Gaussian/NeRF Heads | | |
| 10 | Apr 10 | 4D Dynamic Scenes — Text-to-4D, Dynamic NeRFs, and Gaussian Splatting in Motion | | WANG Zhe, Jiapeng Sun, Yuean Lin, Yunfan Zhang |
| 11 | Apr 17 | Long-Horizon Consistency — Memory, Tokens, and Anti-Drift Methods | | Chen Long, Haoyang Zhang, Yinfei Jiang, Hongbo Zhu |
| 12 | Apr 24 | Safety, Provenance, and Reality Defense (for Video + Avatars) | | |
| 13 | May 1 | Final Project | | |