3D-DLP: Self-supervised 3D Object-centric Scene Representation Learning

¹Carnegie Mellon University    ²Lambda AI
ICML 2026

Abstract

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes — 3D keypoint position, bounding box dimensions, and appearance features — and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. On both simulated and real-world datasets, we demonstrate that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure.

Object-Centric 3D Decomposition

3D-DLP decomposes each RLBench scene into a set of interpretable 3D latent particles. The top row shows our reconstructed RGB voxel scene; the bottom row shows the corresponding per-particle foreground/background decomposition that emerges purely from self-supervised training.

RGB Reconstruction

RGB reconstructions are shown for eight tasks: Close Jar, Open Drawer, Push Buttons, Put Item in Drawer, Reach and Drag, Slide Block to Target, Turn Tap, and Sweep to Dustpan.

Foreground / Background Decomposition

Per-particle decompositions are shown for the same eight tasks: Close Jar, Open Drawer, Push Buttons, Put Item in Drawer, Reach and Drag, Slide Block to Target, Turn Tap, and Sweep to Dustpan.

Architecture

3D-DLP encodes RGB voxel observations into a set of latent particles via a K-means prior, attribute encoder, and 3D decoder with volumetric compositing.
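
To make the compositing step concrete, here is a minimal sketch in plain Python. It assumes each particle decodes to a color volume plus an opacity (alpha) volume, and that a separate background volume fills whatever the particles leave unclaimed. The function names, the flattened voxel layout, and the rescaling rule for overlapping particles are illustrative assumptions, not the released 3D-DLP implementation.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]


def composite_voxels(
    particle_rgb: Sequence[Sequence[RGB]],
    particle_alpha: Sequence[Sequence[float]],
    background_rgb: Sequence[RGB],
) -> Tuple[List[RGB], List[List[float]]]:
    """Blend K per-particle volumes over a background, one voxel at a time.

    Volumes are flattened to length V. Returns the blended colors and a
    [K + 1][V] table of blending weights whose last row is the background.
    """
    num_particles = len(particle_alpha)
    num_voxels = len(background_rgb)
    blended: List[RGB] = []
    weights = [[0.0] * num_voxels for _ in range(num_particles + 1)]

    for v in range(num_voxels):
        alphas = [particle_alpha[k][v] for k in range(num_particles)]
        total = sum(alphas)
        # Assumed rule: when particles jointly claim more than full opacity,
        # rescale them so they share the voxel; otherwise the background
        # receives the remaining weight.
        scale = 1.0 / total if total > 1.0 else 1.0
        fg = [a * scale for a in alphas]
        bg = max(0.0, 1.0 - sum(fg))

        color = [bg * c for c in background_rgb[v]]
        for k in range(num_particles):
            weights[k][v] = fg[k]
            for ch in range(3):
                color[ch] += fg[k] * particle_rgb[k][v][ch]
        weights[num_particles][v] = bg
        blended.append((color[0], color[1], color[2]))

    return blended, weights


def segment(weights: Sequence[Sequence[float]]) -> List[int]:
    """Label each voxel with the index of its heaviest layer (K = background)."""
    num_voxels = len(weights[0])
    return [
        max(range(len(weights)), key=lambda k: weights[k][v])
        for v in range(num_voxels)
    ]


# Toy scene: two particles (red, blue) over a gray background, three voxels.
red, blue, gray = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.5)
colors, weights = composite_voxels(
    particle_rgb=[[red] * 3, [blue] * 3],
    particle_alpha=[[1.0, 0.0, 0.8], [0.0, 0.0, 0.8]],
    background_rgb=[gray] * 3,
)
labels = segment(weights)
```

The weight table doubles as a soft segmentation: taking the heaviest layer per voxel gives a foreground/background labeling of the kind shown in the decomposition figures, under the assumptions stated above.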

Latent Space Controllability

By directly editing particle attributes (position, scale) and decoding the result, we can generate novel scene configurations. Two interactive examples (Scene 1 and Scene 2) compare the original scene with the edited result.
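
The kind of edit described above can be sketched as a pure operation on a particle set. The `Particle` container and the `move_particle` and `scale_particle` helpers below are hypothetical names chosen for illustration. The three fields mirror the attributes named in the abstract, and the decoder that would render the edited set back to voxels is not shown.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Particle:
    position: Vec3               # 3D keypoint position
    size: Vec3                   # bounding box dimensions
    features: Tuple[float, ...]  # appearance features (left untouched by edits)


def move_particle(
    particles: Sequence[Particle], index: int, offset: Vec3
) -> List[Particle]:
    """Return a copy of the set with one particle translated by `offset`."""
    p = particles[index]
    moved = (
        p.position[0] + offset[0],
        p.position[1] + offset[1],
        p.position[2] + offset[2],
    )
    edited = list(particles)
    edited[index] = replace(p, position=moved)
    return edited


def scale_particle(
    particles: Sequence[Particle], index: int, factor: float
) -> List[Particle]:
    """Return a copy of the set with one particle's bounding box rescaled."""
    p = particles[index]
    resized = (p.size[0] * factor, p.size[1] * factor, p.size[2] * factor)
    edited = list(particles)
    edited[index] = replace(p, size=resized)
    return edited


# Toy scene with two particles; values are arbitrary.
scene = [
    Particle(position=(0.0, 0.0, 0.0), size=(0.5, 0.5, 0.5), features=(0.1, 0.2)),
    Particle(position=(0.25, 0.5, 0.0), size=(0.25, 0.25, 0.5), features=(0.3, 0.4)),
]
edited = move_particle(scene, index=1, offset=(0.5, 0.0, -0.25))
edited = scale_particle(edited, index=1, factor=2.0)
# `edited` would then be passed to the (not shown) decoder to render the new scene.
```

Keeping position, size, and appearance in separate fields is what makes such an edit local: moving or resizing one entity leaves its appearance features and every other particle unchanged.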

Manipulation Performance

We isolate the value of 3D structure with a controlled comparison: 3D-DLP and the 2D-DLP baselines run on the identical entity-centric diffusion policy backbone (EC-Diffuser), and only the scene representation changes, from 2D particles to 3D object-centric particles. The gain therefore reflects the effect of lifting the representation to 3D. EquiDiff and PerAct use entirely different policy architectures and are included only as external references.

MimicGen

12 tasks, mean success rate. With the same EC-Diffuser policy, swapping the representation from 2D to 3D gains +14.0 pts over 2D-DLP multi-view.

Method                Policy backbone          Mean success rate
3D-DLP (Ours)         EC-Diffuser              48.1%
2D-DLP multi-view     EC-Diffuser              34.1%
2D-DLP single-view    EC-Diffuser              30.8%
EquiDiff              Different (reference)    47.3%

RLBench

10 tasks, mean success rate. With the same EC-Diffuser policy, 3D-DLP is best on 9/10 tasks (+7.3 pts over 2D-DLP multi-view).

Method                Policy backbone          Mean success rate
3D-DLP (Ours)         EC-Diffuser              74.5%
2D-DLP multi-view     EC-Diffuser              67.2%
2D-DLP single-view    EC-Diffuser              66.7%
PerAct                Different (reference)    68.8%

Only 3D-DLP vs. 2D-DLP is a controlled representation swap (identical policy). EquiDiff and PerAct differ in policy architecture and are external references.
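
As a quick check on the reported gains, the snippet below recomputes the point differences from the mean success rates listed above. Only the EC-Diffuser rows are included, since those are the controlled comparison. The dictionary layout and the `representation_gains` helper are illustrative.

```python
from typing import Dict

# Mean success rates (%) for methods sharing the EC-Diffuser policy backbone.
SUCCESS: Dict[str, Dict[str, float]] = {
    "MimicGen": {
        "3D-DLP": 48.1,
        "2D-DLP multi-view": 34.1,
        "2D-DLP single-view": 30.8,
    },
    "RLBench": {
        "3D-DLP": 74.5,
        "2D-DLP multi-view": 67.2,
        "2D-DLP single-view": 66.7,
    },
}


def representation_gains(
    scores: Dict[str, float], ours: str = "3D-DLP"
) -> Dict[str, float]:
    """Percentage-point gain of `ours` over every other method in `scores`."""
    return {
        name: round(scores[ours] - rate, 1)
        for name, rate in scores.items()
        if name != ours
    }


gains = {bench: representation_gains(scores) for bench, scores in SUCCESS.items()}
for bench, per_baseline in gains.items():
    for baseline, delta in per_baseline.items():
        print(f"{bench}: 3D-DLP vs. {baseline}: {delta:+.1f} pts")
```

The headline figures (+14.0 pts on MimicGen, +7.3 pts on RLBench) correspond to the multi-view 2D-DLP baseline; against the single-view baseline the same arithmetic gives larger margins.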

EC-Diffuser with 3D-DLP — Policy Rollouts

3D-DLP particles serve as compact state for an entity-centric diffusion policy across diverse manipulation tasks.

Rollouts are shown for four tasks: Hammer, Three Piece Assembly, Mug Cleanup, and Square.

Ground Truth vs. Reconstruction

Interactive sliders compare ground-truth voxels with the 3D-DLP reconstruction.

BibTeX

@inproceedings{zhang20263ddlp,
  title     = {3D-DLP: Self-supervised 3D Object-centric Scene Representation Learning},
  author    = {Zhang, Ellina and Iyengar, Madhavan and Zadeh, Amir and Li, Chuan and Held, David and Pathak, Deepak and Daniel, Tal},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}

Acknowledgements

This work builds on Latent Particle World Models (ICLR 2026 Oral) by Tal Daniel et al., which in turn extends the Deep Latent Particles (DLPv2 / DDLP) framework by Tal Daniel and Aviv Tamar. We thank the authors for releasing their implementations, on which our 3D extensions are based.