Step into the frontier of artificial intelligence with this advanced course designed to explore the latest models powering visual and multimodal intelligence. From foundational mathematical tools to state-of-the-art architectures, you'll gain the skills to understand and build systems that interpret images, text, and more—just like today’s leading AI models.



What you'll learn
Apply Nonlinear Support Vector Machines (NSVMs) and Fourier transforms to analyze and process visual data.
Use probabilistic reasoning and implement Recurrent Neural Networks (RNNs) to model temporal sequences and contextual dependencies in visual data.
Explain the principles of transformer architectures and how Vision Transformers (ViT) perform image classification and visual understanding tasks.
Implement CLIP for multimodal learning, and utilize diffusion models to generate high-fidelity images.
Details to know

August 2025
18 assignments

There are 4 modules in this course
Welcome to Modern AI Models for Vision and Multimodal Understanding, the third course in the Computer Vision specialization. In this first module, you’ll explore foundational mathematical tools used in modern AI models for vision and multimodal understanding. You’ll begin with Support Vector Machines (SVMs), learning how linear and radial basis function (RBF) kernels define decision boundaries and how support vectors influence classification. Then, you’ll dive into the Fourier Transform, starting with 1D signals and progressing to 2D applications. You’ll learn how to move between time/spatial and frequency domains using the Discrete Fourier Transform (DFT) and its inverse, and how these transformations reveal patterns and structures in data. By the end of this module, you’ll understand how SVMs and Fourier analysis contribute to feature extraction, signal decomposition, and model interpretability in AI systems.
What's included
14 videos, 5 readings, 4 assignments
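As a rough illustration of the move between the time/spatial and frequency domains described in this module, here is a minimal sketch using NumPy's FFT routines. The signal, its 5 Hz frequency, the 64 Hz sample rate, and the toy 8x8 "image" are all assumptions chosen for illustration; they are not taken from the course materials.

```python
import numpy as np

# Hypothetical 1D signal: a 5 Hz sine wave sampled at 64 Hz for one second.
sample_rate = 64
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 5 * t)

# Forward DFT: time domain -> frequency domain.
spectrum = np.fft.fft(signal)
freqs = np.fft.fftfreq(signal.size, d=1 / sample_rate)

# Look only at the non-negative frequencies and find the strongest one.
half = signal.size // 2
dominant_freq = freqs[np.argmax(np.abs(spectrum[:half]))]

# Inverse DFT: frequency domain -> time domain (imaginary part is round-off).
recovered = np.fft.ifft(spectrum).real

# The same idea in 2D: a toy 8x8 "image" with a horizontal intensity pattern.
image = np.tile(np.cos(2 * np.pi * np.arange(8) / 8), (8, 1))
image_spectrum = np.fft.fft2(image)
image_recovered = np.fft.ifft2(image_spectrum).real
```

The dominant frequency recovered from the spectrum matches the frequency used to build the signal, and the inverse transform reproduces the original samples up to floating-point error.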
This module invites you to explore how probability theory and sequential modeling power modern AI systems. You’ll begin by examining how conditional and joint probabilities shape predictions in language and image models, and how the chain rule enables structured generative processes. Then, you’ll transition to recurrent neural networks (RNNs), learning how they handle sequential data through hidden states and feedback loops. You’ll compare RNNs to feedforward models, explore architectures like one-to-many and sequence-to-sequence, and address challenges like vanishing gradients. By the end, you’ll understand how probabilistic reasoning and temporal modeling combine to support tasks ranging from text generation to autoregressive image synthesis.
What's included
15 videos, 2 readings, 5 assignments
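The hidden-state feedback loop mentioned in this module can be sketched in a few lines of NumPy. This is a plain tanh recurrence with randomly initialized weights, not the course's own implementation; the layer sizes, weight names (W_xh, W_hh, b_h), and the rnn_forward helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5  # illustrative sizes only

# Randomly initialized parameters (assumed names, untrained).
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_forward(inputs, h0):
    """Run a simple tanh RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x_t in inputs:
        # The new state depends on the current input and the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

inputs = rng.normal(size=(seq_len, input_size))
states = rnn_forward(inputs, np.zeros(hidden_size))
```

Unlike a feedforward layer, each step reuses W_hh on the previous hidden state, which is what lets information from earlier inputs reach later ones. Repeated multiplication by W_hh during backpropagation is also where vanishing gradients come from.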
This module explores how attention-based architectures have reshaped the landscape of deep learning for both language and vision. You’ll begin by unpacking the mechanics of the Transformer, including self-attention, multi-head attention, and the encoder-decoder structure that enables parallel sequence modeling. Then, you’ll transition to Vision Transformers (ViTs), where images are tokenized and processed using the same principles that revolutionized NLP. Along the way, you’ll examine how normalization, positional encoding, and projection layers contribute to model performance. By the end, you’ll understand how Transformers and ViTs unify sequence and spatial reasoning in modern AI systems.
What's included
15 videos, 2 readings, 5 assignments
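To make the self-attention mechanism in this module concrete, here is a minimal single-head sketch of scaled dot-product attention in NumPy. The token count, dimensions, random projection matrices, and helper names are assumptions for illustration; a real Transformer or ViT adds multiple heads, positional encodings, normalization, and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over token embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Similarity of every query with every key, scaled by sqrt(d_head).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output token is a weighted average of the value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
num_tokens, d_model, d_head = 6, 8, 4  # e.g. 6 word tokens or 6 image patches
X = rng.normal(size=(num_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
```

Every token attends to every other token in one matrix product, which is why the computation parallelizes across the sequence. In a ViT the rows of X would be projected image patches rather than word embeddings.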
In this module, you’ll explore two transformative approaches in multimodal and generative AI. First, you’ll dive into CLIP, a model that learns a shared embedding space for images and text using contrastive pre-training. You’ll see how CLIP enables zero-shot classification by comparing image embeddings to textual descriptions of the candidate classes, without needing labeled training data for the target task. Then, you’ll shift to diffusion models, which generate images through a gradual denoising process. You’ll learn how noise prediction, time conditioning, and reverse diffusion combine to produce high-quality samples. This module highlights how foundation models can bridge modalities and synthesize data with remarkable flexibility.
What's included
11 videos, 2 readings, 4 assignments
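The zero-shot classification step described for CLIP comes down to comparing one image embedding against several text embeddings in the shared space. The sketch below shows only that comparison and uses made-up vectors in place of real encoder outputs. The class prompts, the 16-dimensional embeddings, the logit_scale value, and the zero_shot_probs helper are all assumptions, and no actual CLIP model is loaded.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    sims = l2_normalize(text_embs) @ l2_normalize(image_emb)
    logits = logit_scale * sims
    logits = logits - logits.max()  # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Stand-ins for encoder outputs: random text embeddings, and an image
# embedding deliberately placed close to the second prompt.
text_embs = rng.normal(size=(len(prompts), 16))
image_emb = text_embs[1] + 0.05 * rng.normal(size=16)

probs = zero_shot_probs(image_emb, text_embs)
predicted = int(np.argmax(probs))
```

No classifier is trained here: adding a new class only requires writing a new prompt and embedding it, which is what makes the approach zero-shot.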