3D-DLP: Self-supervised 3D Object-centric Scene Representation Learning


Ellina Zhang¹, Madhavan Iyengar¹, Amir Zadeh², Chuan Li², David Held¹, Deepak Pathak¹, Tal Daniel¹

¹Carnegie Mellon University ²Lambda AI
ICML 2026

Abstract

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes — 3D keypoint position, bounding box dimensions, and appearance features — and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. On both simulated and real-world datasets, we demonstrate that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure.

Object-Centric 3D Decomposition

3D-DLP decomposes each RLBench scene into a set of interpretable 3D latent particles. The top row shows our reconstructed RGB voxel scene; the bottom row shows the corresponding per-particle foreground/background decomposition that emerges purely from self-supervised training.

[Figure: two rows of panels for eight RLBench tasks (Close Jar, Open Drawer, Push Buttons, Put Item in Drawer, Reach and Drag, Slide Block to Target, Turn Tap, Sweep to Dustpan). Top row: RGB Reconstruction. Bottom row: Foreground / Background Decomposition.]
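
The page does not spell out how the per-particle masks are combined into a reconstruction. As a minimal sketch, assuming ordinary alpha compositing of soft per-particle masks over a background volume, the reconstruction and a foreground/background label map could be produced together as follows. The function name `composite_volume`, the array shapes, and the mask normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def composite_volume(fg_rgb, fg_alpha, bg_rgb):
    """Alpha-composite K particle volumes over a background volume.

    Assumed shapes (hypothetical, for illustration only):
      fg_rgb:   (K, D, H, W, 3) colors decoded for each particle
      fg_alpha: (K, D, H, W)    soft masks in [0, 1] for each particle
      bg_rgb:   (D, H, W, 3)    background colors
    """
    # Where particles overlap, rescale masks so total opacity is at most 1.
    total = fg_alpha.sum(axis=0)
    scale = np.where(total > 1.0, 1.0 / np.maximum(total, 1e-8), 1.0)
    weights = fg_alpha * scale

    # Foreground is the mask-weighted sum of particle colors.
    foreground = (weights[..., None] * fg_rgb).sum(axis=0)

    # Whatever opacity is left over belongs to the background.
    bg_weight = np.clip(1.0 - weights.sum(axis=0), 0.0, 1.0)
    recon = foreground + bg_weight[..., None] * bg_rgb

    # Label each voxel with its dominant particle; index K means background.
    all_weights = np.concatenate([weights, bg_weight[None]], axis=0)
    segmentation = all_weights.argmax(axis=0)
    return recon, segmentation
```

Under this assumption the segmentation is a by-product of reconstruction: no mask labels are needed, because the masks only have to explain which particle accounts for each voxel.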

Architecture

3D-DLP encodes RGB voxel observations into a set of latent particles using a K-means prior and an attribute encoder. A 3D decoder with volumetric compositing then reconstructs the scene from the particles.
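
How the K-means prior is used is not detailed on this page. One plausible reading, shown here purely as an assumption, is that clustering the coordinates of occupied voxels proposes initial 3D keypoint positions for the particles. The sketch below runs plain Lloyd's K-means with a deterministic farthest-point initialization; `kmeans_keypoints` and its parameters are hypothetical names.

```python
import numpy as np


def farthest_point_init(points, k):
    """Pick k well-spread starting centers: first point, then farthest-first."""
    centers = [points[0]]
    for _ in range(k - 1):
        dists = np.linalg.norm(
            points[:, None, :] - np.array(centers)[None, :, :], axis=-1
        ).min(axis=1)
        centers.append(points[dists.argmax()])
    return np.array(centers, dtype=np.float64)


def kmeans_keypoints(occupancy, k, iters=20):
    """Cluster occupied voxel coordinates into k candidate keypoints.

    occupancy: boolean (D, H, W) grid; True marks an occupied voxel.
    Returns (centers, assignment) in voxel coordinates.
    """
    points = np.argwhere(occupancy).astype(np.float64)  # (N, 3)
    centers = farthest_point_init(points, k)
    assignment = np.zeros(len(points), dtype=np.int64)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        assignment = dists.argmin(axis=1)
        for j in range(k):
            members = points[assignment == j]
            if len(members):  # keep the old center if a cluster empties out
                centers[j] = members.mean(axis=0)
    return centers, assignment
```

For a grid containing two separated blobs, `kmeans_keypoints(occupancy, k=2)` places one center at the centroid of each blob, which could then seed the attribute encoder.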

Latent Space Controllability

By directly editing particle attributes (position, scale), we can generate novel scene configurations.

[Figure: original scene versus edited result for Scene 1 and Scene 2.]
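
The editing workflow described above amounts to changing one particle attribute and decoding again. The sketch below illustrates this with a toy stand-in: `Particle`, `toy_decode`, and `move` are hypothetical names, and `toy_decode` simply paints a colored box into a voxel grid rather than calling the learned 3D decoder.

```python
from dataclasses import dataclass, replace

import numpy as np


@dataclass(frozen=True)
class Particle:
    center: tuple  # (z, y, x) position in voxel coordinates
    size: tuple    # bounding box extent in voxels
    color: tuple   # RGB stand-in for learned appearance features


def toy_decode(particles, grid=(8, 8, 8)):
    """Toy decoder (assumption): paint each particle as a solid colored box."""
    volume = np.zeros(grid + (3,), dtype=np.float32)
    for p in particles:
        lo = [min(max(c - s // 2, 0), g) for c, s, g in zip(p.center, p.size, grid)]
        hi = [min(max(c - s // 2 + s, 0), g) for c, s, g in zip(p.center, p.size, grid)]
        volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = p.color
    return volume


def move(particle, delta):
    """Return a copy of the particle translated by delta; other attributes unchanged."""
    new_center = tuple(c + d for c, d in zip(particle.center, delta))
    return replace(particle, center=new_center)


scene = [Particle(center=(2, 2, 2), size=(2, 2, 2), color=(1.0, 0.0, 0.0))]
original = toy_decode(scene)
edited = toy_decode([move(scene[0], (0, 0, 4))])  # shift 4 voxels along x
```

Because position is stored separately from size and appearance, the edit touches a single field and the object keeps its shape and color in the decoded result.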

Manipulation Performance

We isolate the value of 3D structure by swapping only the scene representation — 2D-DLP → 3D-DLP — inside the same entity-centric diffusion policy (EC-Diffuser). EquiDiff and PerAct use entirely different policy backbones and are shown as external references.

MimicGen (12 tasks, mean success rate)

Method                Policy backbone          Mean success
3D-DLP (Ours)         EC-Diffuser              48.1%
2D-DLP multi-view     EC-Diffuser              34.1%
2D-DLP single-view    EC-Diffuser              30.8%
EquiDiff              different (reference)    47.3%

With the same EC-Diffuser policy, swapping the representation from 2D to 3D gains +14.0 pts over 2D-DLP multi-view.

RLBench (10 tasks, mean success rate)

Method                Policy backbone          Mean success
3D-DLP (Ours)         EC-Diffuser              74.5%
2D-DLP multi-view     EC-Diffuser              67.2%
2D-DLP single-view    EC-Diffuser              66.7%
PerAct                different (reference)    68.8%

With the same EC-Diffuser policy, 3D-DLP is best on 9/10 tasks (+7.3 pts over 2D-DLP multi-view).

Only 3D-DLP vs. 2D-DLP is a controlled representation swap (identical policy). EquiDiff and PerAct differ in policy architecture and are external references.
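
The reported gains follow directly from the mean success rates listed above. The short check below recomputes them; the dictionary layout and the helper `gain_over` are illustrative only, while the numbers are copied from this page.

```python
# Mean success rates (%) with the shared EC-Diffuser policy, as reported above.
results = {
    "MimicGen": {
        "3D-DLP": 48.1,
        "2D-DLP multi-view": 34.1,
        "2D-DLP single-view": 30.8,
    },
    "RLBench": {
        "3D-DLP": 74.5,
        "2D-DLP multi-view": 67.2,
        "2D-DLP single-view": 66.7,
    },
}


def gain_over(benchmark, baseline):
    """Percentage-point gain of 3D-DLP over a 2D-DLP baseline on one benchmark."""
    scores = results[benchmark]
    return round(scores["3D-DLP"] - scores[baseline], 1)


mimicgen_gain = gain_over("MimicGen", "2D-DLP multi-view")  # 14.0
rlbench_gain = gain_over("RLBench", "2D-DLP multi-view")    # 7.3
```

Both headline deltas are measured against the stronger multi-view 2D-DLP baseline; against the single-view baseline the gaps are larger.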

EC-Diffuser with 3D-DLP — Policy Rollouts

3D-DLP particles serve as compact state for an entity-centric diffusion policy across diverse manipulation tasks.

[Videos: policy rollouts on Hammer, Three Piece Assembly, Mug Cleanup, and Square.]
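
To make the "compact state" point concrete, the sketch below packs a particle set into a small token matrix and compares its size with a dense voxel grid. All sizes here (16 particles, 8 appearance dimensions, a 64-voxel grid side) are hypothetical values chosen for illustration, not figures from the paper, and `pack_particles` is an assumed helper name.

```python
import numpy as np


def pack_particles(positions, box_sizes, features):
    """Concatenate per-particle attributes into a (K, D) token matrix."""
    tokens = np.concatenate([positions, box_sizes, features], axis=-1)
    return tokens.astype(np.float32)


# Hypothetical sizes, for illustration only.
K, FEAT_DIM, GRID = 16, 8, 64

rng = np.random.default_rng(0)
tokens = pack_particles(
    rng.uniform(-1.0, 1.0, size=(K, 3)),   # 3D keypoint positions
    rng.uniform(0.0, 1.0, size=(K, 3)),    # bounding box dimensions
    rng.normal(size=(K, FEAT_DIM)),        # appearance features
)

dense_voxels = GRID ** 3 * 4               # RGB + occupancy for every voxel
compression = dense_voxels / tokens.size   # how many times smaller the tokens are
```

Each row of `tokens` is one entity, so a policy that consumes the matrix as a set of tokens sees objects directly instead of having to find them in a dense grid.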

Ground Truth vs. Reconstruction

[Figure: side-by-side comparison of ground-truth voxels and our 3D-DLP reconstruction.]

BibTeX

@inproceedings{zhang20263ddlp,
  title     = {3D-DLP: Self-supervised 3D Object-centric Scene Representation Learning},
  author    = {Zhang, Ellina and Iyengar, Madhavan and Zadeh, Amir and Li, Chuan and Held, David and Pathak, Deepak and Daniel, Tal},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}

Acknowledgements

This work is built on top of Latent Particle World Models (ICLR 2026 Oral) by Tal Daniel et al., which itself extends the Deep Latent Particles (DLPv2 / DDLP) framework by Tal Daniel and Aviv Tamar. We thank the authors for releasing their implementations, on which our 3D extensions are based.
