Diffusion transformers are the key behind OpenAI’s Sora — and they’re set to upend GenAI

Key Points:

  • The diffusion transformer is a significant advancement in GenAI technology.
  • The introduction of transformers in diffusion models improves scalability and effectiveness.
  • Transformers offer advantages over traditional U-Net backbones in AI-powered media generators.


OpenAI’s latest model, Sora, has captured attention by generating videos and interactive 3D environments in real time, showcasing the cutting edge of GenAI technology. This achievement is underpinned by the diffusion transformer, an AI architecture that has existed for years but is only now gaining prominence for its transformative capabilities.


Saining Xie, a computer science professor, initiated the diffusion transformer project in 2022 by combining the concepts of diffusion and transformers. This innovation enables GenAI models to scale efficiently beyond previous limits. The diffusion process, used by modern AI media generators like OpenAI’s DALL-E 3, works by gradually adding noise to training media and teaching a model to reverse that corruption step by step; at generation time, the model starts from pure noise and denoises its way toward the desired output.
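The forward (noising) half of that process can be written in closed form. The sketch below is purely illustrative — the function and variable names are hypothetical, and the linear variance schedule is just one common choice — but it shows the key idea: a clean sample is mixed with Gaussian noise according to a schedule, and the model is trained to predict the noise that was added.

```python
import numpy as np

# Illustrative sketch of the forward diffusion (noising) process.
# `x0` is a clean sample; after t steps it is mixed with Gaussian noise
# according to a variance schedule `betas`. All names are hypothetical.

def forward_diffuse(x0, t, betas, rng):
    """Return a noised version of x0 after t steps, in closed form."""
    alphas = 1.0 - betas
    alpha_bar = np.prod(alphas[:t])           # cumulative signal retained
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise                          # model learns to predict `noise`

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))              # stand-in for an image patch
betas = np.linspace(1e-4, 0.02, 1000)         # linear schedule (a common choice)
xt, eps = forward_diffuse(x0, 500, betas, rng)
```

At small `t` the output is nearly the original sample; at large `t` it is nearly pure noise — which is why generation can start from noise and run the process in reverse.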

Traditionally, diffusion models have relied on U-Net backbones for the denoising step, but transformers offer a more efficient, higher-performing alternative. Renowned for the attention mechanism behind complex reasoning tasks, transformers simplify the architecture and parallelize well, making it practical to train larger models with more compute.
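The key move that lets a transformer replace the U-Net is "patchifying": the (latent) image is cut into non-overlapping patches, each flattened into a token, so a standard transformer can attend over the resulting sequence. The sketch below is a minimal illustration of that step with made-up names, not code from any real diffusion-transformer implementation.

```python
import numpy as np

# Minimal sketch of the patchify step used by diffusion transformers:
# split an (H, W, C) latent into non-overlapping patches, flatten each
# patch into one token. Names and shapes here are illustrative only.

def patchify(x, patch):
    """Turn an (H, W, C) array into a (num_tokens, patch*patch*C) sequence."""
    h, w, c = x.shape
    assert h % patch == 0 and w % patch == 0
    x = x.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # group each patch's pixels together
    return x.reshape(-1, patch * patch * c)   # one row per token

latent = np.zeros((32, 32, 4))                # e.g. a VAE latent, 4 channels
tokens = patchify(latent, patch=2)
print(tokens.shape)                           # (256, 16): 16x16 patches, 16 dims each
```

Once the image is a token sequence, the same self-attention machinery that scales language models applies unchanged, which is the source of the scalability gains described above.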
Xie emphasized that the scalability and efficacy improvements seen in models like Sora highlight the transformative impact of transformers on diffusion models. While the adoption of diffusion transformers took time, recent projects like Sora and Stable Diffusion 3.0 underscore the importance of this architectural shift.


Looking ahead, Xie envisions an integrated approach in content understanding and creation within the diffusion transformer framework, aiming to synchronize these typically separate processes. As transformers offer enhanced speed, performance, and scalability compared to U-Nets, Xie advocates for a standardization of underlying architectures, with transformers leading the charge in this evolution.


In conclusion, the emergence of Sora and Stable Diffusion 3.0, both powered by diffusion transformers, hints at a promising future for GenAI models, with significant gains in the efficiency and effectiveness of media generation. The integration of transformers signals a paradigm shift in the field, setting the stage for further innovation and growth in AI technologies.






©2024 The Horizon