Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

CVPR 2024 (Oral, Best Paper Award Candidate)

Photogrammetry and Remote Sensing, ETH Zürich
Teaser image demonstrating Marigold depth estimation.

Marigold is the new state-of-the-art depth estimator for images in the wild.


We present Marigold, a diffusion model and associated fine-tuning protocol for monocular depth estimation. Its core principle is to leverage the rich visual knowledge stored in modern generative image models. Our model, derived from Stable Diffusion and fine-tuned with synthetic data, can zero-shot transfer to unseen data, offering state-of-the-art monocular depth estimation results.

The gallery below presents several images from the internet and a comparison of Marigold with the previous state-of-the-art method LeRes. Use the slider and gestures to reveal details on both sides.


How it works

Fine-tuning protocol

Starting from a pretrained Stable Diffusion, we encode the image $x$ and depth $d$ into the latent space using the original Stable Diffusion VAE. We fine-tune just the U-Net by optimizing the standard diffusion objective relative to the depth latent code. Image conditioning is achieved by concatenating the two latent codes before feeding them into the U-Net. The first layer of the U-Net is modified to accept concatenated latent codes.

Marigold training scheme

Inference scheme

Given an input image $x$, we encode it with the original Stable Diffusion VAE into the latent code $z^{(x)}$, and concatenate with the depth latent $z^{(d)}_t$ before giving it to the modified fine-tuned U-Net on every denoising iteration. After executing the schedule of $T$ steps, the resulting depth latent $z^{(d)}_0$ is decoded into an image, whose 3 channels are averaged to get the final estimation $\hat d$.

Marigold inference scheme

Comparison with other methods

Quantitative comparison of Marigold with SOTA affine-invariant depth estimators on several zero-shot benchmarks. All metrics are presented in percentage terms; bold numbers are the best, underscored second best. Our method outperforms other methods on both indoor and outdoor scenes in most cases, without ever seeing a real depth sample.

Comparison with other methods

Refer to the pdf paper linked above for more details on qualitative, quantitative, and ablation studies.



        title={Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation},
        author={Bingxin Ke and Anton Obukhov and Shengyu Huang and Nando Metzger and Rodrigo Caye Daudt and Konrad Schindler},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},