2025-07-12 What am I Reading?

Vision Transformers

Continuing from the last WAIR, I wanted to take a look at how ViTs can be extended to dense prediction.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  • Linear in image size (regular ViT is quadratic)
  • Swin = Shifted Window
    • Localized self-attention within each window partition
    • Successive layers shift the window partition, which connects features across neighboring windows
  • Advantages in computational complexity vs. sliding-window attention, and also vs. traditional ViT (quadratic, as noted above).
  • Enables dense prediction by adding an FPN (Feature Pyramid Network) on top of the hierarchical features from each stage of the Swin network.
  • Has some neat tricks to keep training/inference efficient on the shifted windows, whose shapes vary at the partition boundaries.
    • Uses a cyclic shift plus attention masking (see the sketch after this list).
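
To make the shifted-window idea concrete, here is a minimal PyTorch sketch of the window partition and the cyclic shift (the attention masking itself is omitted). The function name and tensor shapes are my own illustration, not taken from the official Swin implementation.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows.

    Self-attention is then computed independently inside each window, so the
    cost grows linearly with the number of windows (i.e. with image size).
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

# Shifted windows: cyclically roll the feature map by half a window before
# partitioning, so this layer's windows straddle the previous layer's window
# boundaries and connect features across them. (Attention masking, not shown,
# keeps tokens that wrapped around the border from attending to each other.)
B, H, W, C, ws = 1, 8, 8, 96, 4
x = torch.randn(B, H, W, C)
regular_windows = window_partition(x, ws)
shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)
print(regular_windows.shape, shifted_windows.shape)  # torch.Size([4, 4, 4, 96]) twice
```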

Diffusion models & SDE

Denoising Diffusion Probabilistic Models (DDPM)

  • Although diffusion models go back to at least 2015, this paper kicked off the current popularity of diffusion in generative models.
  • Diffusion models pair a fixed forward discrete-time Markov process (which gradually adds noise) with a learned reverse Markov process (which denoises).
  • The forward process conditional distributions (denoted $q$) are chosen so that $x_T$ converges to a standard Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
  • Instead of optimizing the intractable negative log likelihood $- \log p_\theta(x_0)$, optimize the variational upper bound $\mathbb{E}_q\left[\log \frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T})}\right] \geq - \log p_\theta(x_0)$ (the negative ELBO).
  • By making certain choices of how to parametrize $p_\theta$, and using known properties of the forward process $q$, we eventually arrive at a loss to minimize (see the training sketch after this list): \(\mathbb{E}_{t, x_0, \epsilon} \Vert \mathbf{\epsilon} - \mathbf{\epsilon}_\theta (\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \mathbf{\epsilon}, t)\Vert^2\)
    • $\mathbf{\epsilon}_\theta$ is a function approximator (i.e. our neural net) that predicts the noise $\mathbf{\epsilon}$
    • $\bar{\alpha}_t$ depends only on $t$ and the noise schedule of the forward process, which is fixed.
    • $t$ is sampled uniformly, and there is normally a $t$-dependent weighting term (dropped in this simplified objective)
    • training resembles denoising score matching over multiple noise scales, and sampling resembles annealed Langevin dynamics
  • Inference steps through all $T$ iterations of the reverse process, adding a small amount of noise $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ at each step except the last (see the sampling sketch after this list).
    • \[\mathbf{x}_{t - 1} = \frac{1}{\sqrt{\alpha_t}}\left( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \mathbf{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}\]
    • $\sigma_t^2 = \beta_t$ is the chosen parametrization for the variance of the reverse process distribution.
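
To pin down the training objective, here is a minimal PyTorch sketch of the simplified loss above. The linear $\beta_t$ schedule ($10^{-4}$ to $0.02$ over $T = 1000$ steps) is the one from the paper; the function name `ddpm_loss` and the `model(x_t, t)` interface standing in for $\mathbf{\epsilon}_\theta$ (a U-Net in the paper) are my own assumptions.

```python
import torch
import torch.nn.functional as F

T = 1000
beta = torch.linspace(1e-4, 0.02, T)       # the paper's linear noise schedule
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)    # \bar{alpha}_t = prod_{s<=t} alpha_s

def ddpm_loss(model, x0):
    """E_{t, x0, eps} || eps - eps_theta(sqrt(abar_t) x0 + sqrt(1 - abar_t) eps, t) ||^2"""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                           # t sampled uniformly
    eps = torch.randn_like(x0)                              # the target noise
    ab = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))      # broadcast \bar{alpha}_t
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps          # noised sample in one shot
    return F.mse_loss(model(x_t, t), eps)                   # predict the added noise
```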
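And a corresponding sketch of the sampling loop, implementing the update above with $\sigma_t^2 = \beta_t$; same assumed `model` interface and noise schedule as the training sketch.

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

@torch.no_grad()
def ddpm_sample(model, shape):
    """Iterate x_{t-1} = (x_t - (1 - alpha_t)/sqrt(1 - abar_t) * eps_theta(x_t, t)) / sqrt(alpha_t) + sigma_t z."""
    x = torch.randn(shape)                                           # x_T ~ N(0, I)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)    # no noise on the final step
        eps = model(x, torch.full((shape[0],), t))                   # predicted noise
        x = (x - (1.0 - alpha[t]) / (1.0 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
        x = x + beta[t].sqrt() * z                                    # sigma_t^2 = beta_t
    return x
```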