Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding

1 College of Intelligent Robotics and Advanced Manufacturing, Fudan University
2 Department of Computer Science and Technology, Tsinghua University
3 School of Artificial Intelligence, Beijing Normal University
4 College of Electronic Information Engineering, Tongji University
* Corresponding author

Abstract

Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation based on Score Distillation Sampling (SDS) is making remarkable progress. However, most existing approaches rely heavily on reference-view image supervision, which often disrupts the inductive priors of diffusion models and leads to distorted geometry or overly smooth back regions. To overcome these limitations, we propose Isotropic3D, a novel image-to-3D framework that takes only a single image CLIP embedding as input. Our method ensures azimuth-angle isotropy by relying exclusively on the SDS loss, avoiding overfitting to the reference image. Isotropic3D consists of two main components: an EMA-conditioned multi-view diffusion model (EMA-MVD) and a Neural Radiance Field (NeRF). The core of EMA-MVD is a two-stage fine-tuning procedure. First, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, through which the model preliminarily acquires image-to-image capabilities. Second, we fine-tune with our Explicit Multi-view Attention (EMA), which combines noisy multi-view images with the noise-free reference image as an explicit condition. After fine-tuning, Isotropic3D, built upon SDS with NeRF, can generate multi-view consistent images from a single CLIP embedding and reconstruct a 3D model with improved symmetry, well-proportioned geometry, richly colored textures, and reduced distortion.
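To make the role of the SDS loss concrete, the following is a minimal NumPy sketch of one SDS gradient step, not the authors' implementation. The function names (`sds_gradient`, `toy_denoiser`), the weighting w(t) = 1 - alpha_bar_t, and the toy denoiser that stands in for the CLIP-conditioned diffusion model are all assumptions made for illustration.

```python
import numpy as np


def sds_gradient(rendered, denoiser, cond, alpha_bar_t, rng):
    """Sketch of an SDS gradient with respect to a rendered image.

    rendered    : image rendered from the 3D representation (any array shape)
    denoiser    : hypothetical callable (noisy, cond, alpha_bar_t) -> predicted noise
    cond        : conditioning signal (in the paper's setting, a CLIP embedding)
    alpha_bar_t : cumulative noise-schedule coefficient in (0, 1)
    """
    # Forward-diffuse the rendering to the chosen noise level.
    eps = rng.standard_normal(rendered.shape)
    noisy = np.sqrt(alpha_bar_t) * rendered + np.sqrt(1.0 - alpha_bar_t) * eps
    # Ask the (frozen) diffusion model which noise it believes was added.
    eps_pred = denoiser(noisy, cond, alpha_bar_t)
    # Assumed weighting w(t) = 1 - alpha_bar_t; other choices are possible.
    weight = 1.0 - alpha_bar_t
    # The residual is used directly as the gradient (no backprop through the denoiser).
    return weight * (eps_pred - eps)


def toy_denoiser(noisy, cond, alpha_bar_t):
    """Toy stand-in: behaves as if `cond` were the clean image the model expects."""
    return (noisy - np.sqrt(alpha_bar_t) * cond) / np.sqrt(1.0 - alpha_bar_t)


rng = np.random.default_rng(0)
target = rng.standard_normal((4, 4, 3))
grad = sds_gradient(target + 1.0, toy_denoiser, target, 0.5, rng)
```

With this toy denoiser the gradient is proportional to the difference between the rendering and the image the model expects, so it vanishes once the two agree. In a real pipeline the gradient would be propagated back into the NeRF parameters through the renderer.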

Isotropic3D generates mutually consistent multi-view images and produces 3D models with symmetrical, clean content, regular geometry, richly colored textures, and less distortion.


Method

Overview

Isotropic3D Pipeline



Explicit Multi-view Attention (EMA)
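One way to read "combines noisy multi-view images with the noise-free reference image as an explicit condition" is as joint self-attention over the tokens of all views plus the reference. The NumPy sketch below illustrates that reading only. The single-head formulation, the token-axis concatenation, and the names `joint_view_attention`, `w_q`, `w_k`, `w_v` are assumptions, not details confirmed by the text.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_view_attention(view_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head self-attention over reference tokens and all view tokens.

    view_tokens : (V, N, D) tokens of V noisy views
    ref_tokens  : (N, D) tokens of the noise-free reference image
    w_q/w_k/w_v : (D, D) hypothetical projection matrices
    Returns the updated view tokens (V, N, D) and the attention matrix.
    """
    num_views, n, d = view_tokens.shape
    # Assumption: the reference is placed in the same token sequence as the views,
    # so every view token can attend to it (and to every other view).
    seq = np.concatenate([ref_tokens, view_tokens.reshape(num_views * n, d)], axis=0)
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    # Only the noisy views are denoised, so the reference rows are dropped.
    return out[n:].reshape(num_views, n, -1), attn


rng = np.random.default_rng(0)
views = rng.standard_normal((4, 6, 8))   # 4 views, 6 tokens each, width 8
ref = rng.standard_normal((6, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
updated_views, attention = joint_view_attention(views, ref, w_q, w_k, w_v)
```

Because the reference tokens share the attention sequence with the view tokens, changing the reference changes every view's output, which is the sense in which it acts as an explicit condition in this sketch.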


Qualitative comparisons with baseline models

Qualitative comparisons in different settings

CCR denotes the channel-concatenated reference image. "NOTHING" means that the model does not generate anything. Removing CCR and L2 supervision together is equivalent to using only a single CLIP embedding.

The two groups of results come from two separate runs of Isotropic3D.


More results