Modern AI Models for Vision and Multimodal Understanding
University of Colorado Boulder

Instructor: Tom Yeh

Included with Coursera Plus

Gain insight into a topic and learn the fundamentals.
Advanced level

Recommended experience

1 week to complete
at 10 hours a week
Flexible schedule
Learn at your own pace

What you'll learn

  • Apply nonlinear Support Vector Machines (SVMs) and Fourier transforms to analyze and process visual data.

  • Use probabilistic reasoning and implement Recurrent Neural Networks (RNNs) to model temporal sequences and contextual dependencies in visual data.

  • Explain the principles of transformer architectures and how Vision Transformers (ViT) perform image classification and visual understanding tasks.

  • Implement CLIP for multimodal learning, and utilize diffusion models to generate high-fidelity images.

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

August 2025

Assessments

18 assignments

Taught in English


There are 4 modules in this course

Welcome to Modern AI Models for Vision and Multimodal Understanding, the third course in the Computer Vision specialization. In this first module, you’ll explore foundational mathematical tools used in modern AI models for vision and multimodal understanding. You’ll begin with Support Vector Machines (SVMs), learning how linear and radial basis function (RBF) kernels define decision boundaries and how support vectors influence classification. Then, you’ll dive into the Fourier Transform, starting with 1D signals and progressing to 2D applications. You’ll learn how to move between time/spatial and frequency domains using the Discrete Fourier Transform (DFT) and its inverse, and how these transformations reveal patterns and structures in data. By the end of this module, you’ll understand how SVMs and Fourier analysis contribute to feature extraction, signal decomposition, and model interpretability in AI systems.
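
As a concrete illustration of these two ideas, here is a minimal sketch (not taken from the course materials) that fits linear and RBF-kernel SVMs on a toy dataset and round-trips an array through the 2D Discrete Fourier Transform. The data, parameter values, and library choices (scikit-learn, NumPy) are illustrative assumptions.

```python
# Minimal sketch (not course code): RBF-kernel SVMs and the 2D DFT.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# --- SVMs: linear vs. RBF kernels on a toy, ring-shaped dataset ---
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # class depends on radius

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print("linear accuracy:", linear_svm.score(X, y))   # low: no separating line exists
print("RBF accuracy:   ", rbf_svm.score(X, y))      # high: kernel bends the boundary
print("support vectors:", rbf_svm.support_vectors_.shape[0])

# --- Fourier analysis: 2D DFT of an image-like array and its inverse ---
image = rng.random((64, 64))
spectrum = np.fft.fft2(image)                   # spatial -> frequency domain
magnitude = np.abs(np.fft.fftshift(spectrum))   # center low frequencies for viewing
reconstructed = np.fft.ifft2(spectrum).real     # inverse DFT recovers the image
print("reconstruction error:", np.max(np.abs(image - reconstructed)))
```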

What's included

14 videos, 5 readings, 4 assignments

This module invites you to explore how probability theory and sequential modeling power modern AI systems. You’ll begin by examining how conditional and joint probabilities shape predictions in language and image models, and how the chain rule enables structured generative processes. Then, you’ll transition to recurrent neural networks (RNNs), learning how they handle sequential data through hidden states and feedback loops. You’ll compare RNNs to feedforward models, explore architectures like one-to-many and sequence-to-sequence, and address challenges like vanishing gradients. By the end, you’ll understand how probabilistic reasoning and temporal modeling combine to support tasks ranging from text generation to autoregressive image synthesis.
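
For a sense of how these pieces fit together, the sketch below (illustrative, not course code) unrolls a single-layer RNN over a toy token sequence and accumulates the chain-rule factorization p(x_1, ..., x_T) = prod_t p(x_t | x_<t). All weights, sizes, and tokens are made-up assumptions.

```python
# Minimal sketch (not course code): an RNN cell unrolled over a sequence,
# scoring it with the chain rule p(x_1..x_T) = prod_t p(x_t | x_<t).
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8                                # vocabulary size, hidden size
Wxh = rng.normal(scale=0.1, size=(H, V))   # input -> hidden
Whh = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden (the feedback loop)
Why = rng.normal(scale=0.1, size=(V, H))   # hidden -> output logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

tokens = [0, 3, 1, 4]        # a toy sequence of token ids
h = np.zeros(H)              # hidden state carries context across time steps
log_prob = 0.0
for t in range(1, len(tokens)):
    x = np.eye(V)[tokens[t - 1]]        # one-hot encoding of the previous token
    h = np.tanh(Wxh @ x + Whh @ h)      # recurrent state update
    p = softmax(Why @ h)                # p(x_t | x_<t)
    log_prob += np.log(p[tokens[t]])    # accumulate one chain-rule factor
print("log p(sequence):", log_prob)
```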

What's included

15 videos, 2 readings, 5 assignments

This module explores how attention-based architectures have reshaped the landscape of deep learning for both language and vision. You’ll begin by unpacking the mechanics of the Transformer, including self-attention, multi-head attention, and the encoder-decoder structure that enables parallel sequence modeling. Then, you’ll transition to Vision Transformers (ViTs), where images are tokenized and processed using the same principles that revolutionized NLP. Along the way, you’ll examine how normalization, positional encoding, and projection layers contribute to model performance. By the end, you’ll understand how Transformers and ViTs unify sequence and spatial reasoning in modern AI systems.
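
The sketch below (illustrative, not course code) implements one scaled dot-product self-attention head and ViT-style patch tokenization in NumPy. The image size, patch size, and embedding dimension are arbitrary assumptions.

```python
# Minimal sketch (not course code): self-attention and ViT patch tokens.
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, Wq, Wk, Wv):
    """One attention head: every token attends to every other token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of values

# --- ViT-style tokenization: split an image into flattened patches ---
img = rng.random((32, 32))                    # toy grayscale image
P = 8                                         # patch size -> (32/8)^2 = 16 tokens
patches = img.reshape(4, P, 4, P).transpose(0, 2, 1, 3).reshape(16, P * P)

D = 64                                        # embedding dimension
E = rng.normal(scale=0.02, size=(P * P, D))   # linear projection of patches
pos = rng.normal(scale=0.02, size=(16, D))    # positional encodings
tokens = patches @ E + pos

Wq, Wk, Wv = (rng.normal(scale=0.02, size=(D, D)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print("token embeddings:", tokens.shape, "-> attention output:", out.shape)
```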

What's included

15 videos, 2 readings, 5 assignments

In this module, you’ll explore two transformative approaches in multimodal and generative AI. First, you’ll dive into CLIP, a model that learns a shared embedding space for images and text using contrastive pre-training. You’ll see how CLIP enables zero-shot classification by comparing image embeddings to textual descriptions, without needing labeled training data. Then, you’ll shift to diffusion models, which generate images through a gradual denoising process. You’ll learn how noise prediction, time conditioning, and reverse diffusion combine to produce high-quality samples. This module highlights how foundational models can bridge modalities and synthesize data with remarkable flexibility.
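
To ground both ideas, here is a minimal sketch (illustrative, not course code) of CLIP-style zero-shot scoring via cosine similarity, followed by one forward-diffusion noising step. The embeddings, prompts, and schedule values are made-up assumptions; the linear beta schedule is a common choice from the DDPM literature.

```python
# Minimal sketch (not course code): CLIP-style zero-shot scoring and the
# forward (noising) step of a diffusion process.
import numpy as np

rng = np.random.default_rng(0)

# --- CLIP-style zero-shot classification ---
# Assume image and text encoders already produced these embeddings.
image_emb = rng.normal(size=(1, 512))     # embedding of one image
text_embs = rng.normal(size=(3, 512))     # e.g. "a photo of a {cat, dog, car}"

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

logits = normalize(image_emb) @ normalize(text_embs).T   # cosine similarities
probs = np.exp(logits) / np.exp(logits).sum()            # softmax over prompts
print("class probabilities:", probs.round(3))            # no labeled data needed

# --- Forward diffusion: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) * eps ---
x0 = rng.random((8, 8))                   # toy "image"
betas = np.linspace(1e-4, 0.02, 1000)     # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)

t = 500
eps = rng.normal(size=x0.shape)           # the noise a model learns to predict
xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
print("signal kept at t=500:", round(float(np.sqrt(alphas_bar[t])), 3))
```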

What's included

11 videos, 2 readings, 4 assignments

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Tom Yeh
University of Colorado Boulder
4 Courses, 9,853 learners

Offered by

University of Colorado Boulder

Why people choose Coursera for their career

Felipe M.
Learner since 2018
"To be able to take courses at my own pace and rhythm has been an amazing experience. I can learn whenever it fits my schedule and mood."
Jennifer J.
Learner since 2020
"I directly applied the concepts and skills I learned from my courses to an exciting new project at work."
Larry W.
Learner since 2021
"When I need courses on topics that my university doesn't offer, Coursera is one of the best places to go."
Chaitanya A.
"Learning isn't just about being better at your job: it's so much more than that. Coursera allows me to learn without limits."

Open new doors with Coursera Plus

Unlimited access to 10,000+ world-class courses, hands-on projects, and job-ready certificate programs - all included in your subscription

Advance your career with an online degree

Earn a degree from world-class universities - 100% online

Join over 3,400 global companies that choose Coursera for Business

Upskill your employees to excel in the digital economy

Frequently asked questions