🌘 [2502.10248] Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
➤ Improving video generation quality and charting future directions
https://arxiv.org/abs/2502.10248
This report introduces Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters that can generate videos up to 204 frames long. A deep-compression Variational Autoencoder designed for video generation achieves 16x16 spatial and 8x temporal compression ratios while maintaining excellent video reconstruction quality. User prompts are encoded with two bilingual text encoders that handle both English and Chinese. A DiT with 3D full attention is trained with Flow Matching to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. The report evaluates Step-Video-T2V's performance and outlines future directions for video foundation models.
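As a back-of-the-envelope illustration of what those compression ratios imply, the sketch below computes the latent tensor shape for a sample clip. The 16x16 spatial and 8x temporal ratios come from the report; the input resolution and the latent channel count are assumptions for illustration only.

```python
# Minimal sketch: latent shape implied by Video-VAE's 16x16 spatial and
# 8x temporal compression. The ratios are from the report; the input
# clip size and latent channel count below are illustrative assumptions.

def latent_shape(frames: int, height: int, width: int,
                 t_ratio: int = 8, s_ratio: int = 16,
                 latent_channels: int = 16) -> tuple[int, int, int, int]:
    """Return (T', C, H', W') after compression, rounding up partial blocks."""
    t = -(-frames // t_ratio)   # ceil division
    h = -(-height // s_ratio)
    w = -(-width // s_ratio)
    return (t, latent_channels, h, w)

# A 204-frame clip (the model's maximum length) at an assumed 544x992:
print(latent_shape(204, 544, 992))  # -> (26, 16, 34, 62)
```

The roughly 2,000x reduction in spatio-temporal elements is what makes full 3D attention over a 204-frame video tractable.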
+ The report highlights recent advances in text-to-video technology, making video generation more creative and efficient.
+ After reading it, the future prospects of video generation technology feel clearer and more promising.
#VideoFoundationModels #AITechnology #TechProgress
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
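To make the Flow Matching step concrete, here is a minimal, self-contained training-step sketch in the rectified-flow style commonly paired with DiT backbones. `ToyVelocityNet`, the tensor shapes, and the optimizer settings are illustrative assumptions, not the paper's actual 30B-parameter architecture or conditioning setup.

```python
import torch
import torch.nn as nn

# Minimal Flow Matching sketch (rectified-flow style): the model learns a
# velocity field that transports Gaussian noise to clean latents. The toy
# network stands in for the 30B-parameter DiT; shapes are illustrative.

class ToyVelocityNet(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on time by concatenating t to the noisy sample.
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """x1: clean latents (flattened to vectors here for simplicity)."""
    x0 = torch.randn_like(x1)              # pure noise sample
    t = torch.rand(x1.size(0), 1)          # uniform time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1          # linear interpolation path
    target_v = x1 - x0                     # constant target velocity
    return ((model(x_t, t) - target_v) ** 2).mean()

model = ToyVelocityNet(dim=64)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
latents = torch.randn(8, 64)               # stand-in batch of VAE latents
loss = flow_matching_loss(model, latents)
loss.backward()
opt.step()
```

At inference, integrating the learned velocity field from t=0 to t=1 (e.g., with a few Euler steps) carries noise to latent frames, which the Video-VAE decoder then turns into pixels.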
