Connectivity-Stable 3D Voxel Diffusion via Sampling-Time Guidance
In this work, topology primarily refers to connectivity and skeleton continuity.


Conclusion
- Connectivity failure in 3D structure generation is a sampling-dynamics problem, not a model-capacity problem. We find that successful and failed samples start to diverge between t = 800 and t = 600 (T = 1000). Applying guidance over the window from t = 800 to t = 300 raises the final connectivity success rate from 6.5% to 30%, showing that structural failure in 3D generation can be substantially reduced by intervening on the sampling trajectory.
- Guidance in a mid-sampling window passes both the connectivity and naturalness sanity checks. Full-time guidance with reduced strength reaches a 36% success rate but loses to window-restricted guidance on naturalness. For window-restricted guidance, intervening too early pushes the overall structure toward simplified outputs; intervening too late doesn't give the diffusion prior enough time to pull samples back to the natural distribution.
- A lightweight, on-demand scorer is a practical alternative to conditioning-based control when data is scarce. CAST's scorer has 346K parameters, trains in only 145 seconds, and needs no human annotation. Instead of relying on a single general scorer (e.g., CLIP in DreamFusion) to carry all control objectives, control can be split into multiple narrow, domain-specific scorers, each lightweight and aligned directly with its target metric. Multi-objective control then becomes a composition problem.
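To make the window idea concrete, here is a minimal sketch of window-restricted guidance in plain Python. It is not the CAST implementation. The names `guidance_weight`, `sample_with_window`, `denoise_step`, and `score_grad` are hypothetical, the window endpoints are treated as inclusive, and the scorer gradient is simply added to the sample after each denoising step; all of these are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

T = 1000               # total diffusion steps, as stated in the text
WINDOW = (800, 300)    # guidance window from the text: t = 800 down to t = 300


def guidance_weight(t: int, strength: float = 1.0,
                    window: Tuple[int, int] = WINDOW) -> float:
    """Return `strength` inside the window (inclusive, an assumption), else 0."""
    t_high, t_low = window
    return strength if t_low <= t <= t_high else 0.0


def sample_with_window(x: List[float],
                       denoise_step: Callable[[List[float], int], List[float]],
                       score_grad: Callable[[List[float], int], List[float]],
                       strength: float = 1.0) -> List[float]:
    """Run t = T-1 ... 0 and nudge x along the scorer gradient only inside the window.

    `denoise_step` and `score_grad` are placeholders for the diffusion model's
    reverse step and the scorer's gradient; both are hypothetical callables.
    """
    for t in range(T - 1, -1, -1):
        x = denoise_step(x, t)
        w = guidance_weight(t, strength)
        if w > 0.0:
            g = score_grad(x, t)
            x = [xi + w * gi for xi, gi in zip(x, g)]
    return x
```

Setting `window=(T - 1, 0)` with a smaller `strength` would correspond to the full-time, reduced-strength variant, so the two settings compared above differ only in the schedule, not in the sampler.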
The Problem
Current 3D voxel diffusion models tend to produce fragmented structures when generating thin, topologically structured objects (furniture, buildings, etc.): broken parts, floating islands, missing voxels. I don't think this is something more training steps or larger model capacity will fully fix. The reason, as I see it, is that diffusion denoising updates are essentially local refinement; they lack any explicit notion of global connectivity or geometric inertia. In this experiment I use Minecraft tree data to characterize the topology problem and validate a fix.
Below are projected tri-view samples. Green = wood voxels, yellow = leaves (no wood), dark blue = air. Diffusion-generated 3D samples frequently exhibit structural breakage. Baseline diffusion trunks are rarely continuous (green blocks do not form a single connected component under 26-neighbor connectivity):


Ground truth, in contrast, has fully connected trunks (no green-block discontinuity):


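For reference, the connectivity criterion used above (wood voxels forming a single component under 26-neighbor connectivity) can be checked with a short breadth-first search. This is a minimal sketch, not the evaluation code behind the reported numbers. The names `count_components` and `trunk_is_connected` are hypothetical, voxels are assumed to be given as a set of integer (x, y, z) coordinates, and an empty trunk is counted as a failure.

```python
from collections import deque
from itertools import product

# All 26 neighbor offsets: every (dx, dy, dz) in {-1, 0, 1}^3 except the origin,
# so voxels touching by a face, an edge, or a corner count as adjacent.
OFFSETS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def count_components(voxels):
    """Count 26-connected components in a collection of (x, y, z) coordinates."""
    remaining = set(voxels)
    components = 0
    while remaining:
        components += 1
        queue = deque([remaining.pop()])  # seed a new component
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in OFFSETS:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    queue.append(neighbor)
    return components


def trunk_is_connected(wood_voxels):
    """True if the wood voxels form exactly one component (empty counts as failure)."""
    return count_components(wood_voxels) == 1
```

Under this definition, a trunk that steps diagonally, such as `{(0, 0, 0), (0, 1, 0), (1, 2, 1)}`, still counts as connected, while a one-voxel gap, such as `{(0, 0, 0), (0, 2, 0)}`, splits it into two components.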
Research Goal