D3DM: High-Fidelity Image-to-3D Generation with Direct 3D Diffusion Model

Diffusion models have shown significant promise in generating high-fidelity 3D shapes from a single image. However, prior methods often suffer from a lack of fine-grained features, directional ambiguities, and inconsistencies between the generated 3D shapes and the input images, which collectively undermine the controllability of the model. To address these challenges, we propose a Direct 3D Diffusion Model (D3DM), which consists of a Cross-modal Variational Auto-Encoder (CVAE) and an Image-conditioned 3D Latent Diffusion Model (I3D-LDM). The CVAE encodes each shape into a latent space and decodes the latent to predict the occupancy values of query points. It features three techniques: Normal Space Sampling (NSS), Rotation-Invariant (RI) and Shape-Image Cross-Modal Attention (SICA). NSS improves the CVAE's ability to capture fine details, particularly edges and sharp features. RI ensures directional consistency between the mesh and the image, while SICA dynamically fuses 3D features with 2D image information to improve fine-grained geometry. The I3D-LDM generates the latent from a single image, using a Pixel-Semantic Co-Guidance (PSC) mechanism to integrate global semantics from CLIP with pixel-level details from DINOv2, so that the generated 3D shapes are consistent with the input image in both global semantics and pixel-level detail. Extensive experiments demonstrate that D3DM generates high-fidelity 3D shapes with accurate directional consistency from a single-view image.
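The abstract names NSS without defining it. As a rough illustration only, the sketch below follows the usual meaning of normal-space sampling in the point-cloud literature: surface points are bucketed by the direction of their normals, and samples are drawn evenly across buckets so that rarely occurring orientations (edges, sharp features) are not drowned out by large flat regions. The function name `normal_space_sample`, the `bins` parameter, and the angular bucketing scheme are assumptions made for this example and are not taken from D3DM; the input normals are assumed to be non-zero.

```python
import numpy as np

def normal_space_sample(normals, n_samples, bins=8, rng=None):
    """Return point indices chosen so that normal directions are covered evenly.

    Hypothetical helper for illustration; not the D3DM implementation.
    normals: (N, 3) array of non-zero surface normals.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Normalise, then convert each normal to spherical angles.
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    theta = np.arccos(np.clip(n[:, 2], -1.0, 1.0))   # polar angle in [0, pi]
    phi = np.arctan2(n[:, 1], n[:, 0]) + np.pi        # azimuth in [0, 2*pi]

    # Quantise both angles into a bins x bins grid of direction buckets.
    ti = np.minimum((theta / np.pi * bins).astype(int), bins - 1)
    pj = np.minimum((phi / (2.0 * np.pi) * bins).astype(int), bins - 1)
    bucket = ti * bins + pj

    # Shuffle the members of every non-empty bucket.
    groups = [rng.permutation(np.flatnonzero(bucket == b))
              for b in np.unique(bucket)]

    # Round-robin over buckets: one point per bucket per pass.
    chosen = []
    depth = 0
    while len(chosen) < n_samples:
        took_any = False
        for g in groups:
            if depth < len(g):
                chosen.append(g[depth])
                took_any = True
                if len(chosen) == n_samples:
                    break
        if not took_any:          # every bucket is exhausted
            break
        depth += 1
    return np.array(chosen, dtype=int)
```

With uniform random sampling, a small sharp region contributes samples only in proportion to its point count; with the round-robin above, each occupied direction bucket contributes equally until it runs out of points. If `n_samples` exceeds the number of points, the function simply returns every index once.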

Abstract

Method Overview

Qualitative comparisons

More image-to-3D results

Application: Text to 3D