Chapter 1: Architectures

Core Questions

  • How do we tokenize continuous physical space?
  • What makes a good latent space for robotic control?
  • How do we align vision, language, and action modalities?

Topics

1.1 Scene Encoders

  • Vision transformers for spatial understanding
  • Depth integration and 3D representations
  • Temporal encoding for dynamic scenes
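
As a rough illustration of how a vision transformer turns an image into tokens, the sketch below cuts a 2D grid into non-overlapping square patches and flattens each one. It assumes a single-channel image stored as a list of rows; the function name patchify and the toy 4 x 4 input are hypothetical, and a real encoder would follow this step with a learned linear projection and positional embeddings.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tokens.

    Tokens are returned in row-major order of their top-left corner.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch row by row into a single token vector.
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# Toy 4 x 4 "image" whose pixel values are 0..15 in reading order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)  # four tokens, each of length 4
```

Patch size sets the sequence length: halving it quadruples the number of tokens, which is one place the compression-fidelity question from 1.3 already shows up.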

1.2 Multi-Modal Alignment

  • Contrastive learning for vision-language alignment
  • Action space grounding
  • The embedding geometry problem
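
One common form of contrastive vision-language training is a symmetric InfoNCE objective over a batch of matched pairs. The sketch below is a minimal pure-Python version, assuming that image_embs[i] and text_embs[i] describe the same scene; the function names and the temperature value are illustrative assumptions, not a prescribed recipe.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; image_embs[i] is assumed to pair with text_embs[i]."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, sharpened by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Matched pairs give a loss near zero; swapped pairs are penalized heavily.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss constrains only relative similarities within a batch, which is part of why the resulting embedding geometry needs separate attention.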

1.3 Tokenization Strategies

  • Discrete vs. continuous representations
  • Spatial vs. semantic tokens
  • Compression-fidelity trade-offs
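
To make the discrete-versus-continuous and compression-fidelity points concrete, the sketch below quantizes a continuous value (for example, one normalized action dimension) into uniform bins and measures the worst-case reconstruction error. Uniform binning, the [-1, 1] range, and all function names are assumptions chosen for illustration; learned codebooks or non-uniform bins are equally valid choices.

```python
def encode(value, low, high, bins):
    """Map a continuous value in [low, high] to an integer token in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # value == high would otherwise index one past the end


def decode(token, low, high, bins):
    """Return the centre of the bin that the token indexes."""
    width = (high - low) / bins
    return low + (token + 0.5) * width


def max_error(bins, low=-1.0, high=1.0, samples=1001):
    """Worst absolute round-trip error over evenly spaced sample values."""
    worst = 0.0
    for i in range(samples):
        value = low + (high - low) * i / (samples - 1)
        recon = decode(encode(value, low, high, bins), low, high, bins)
        worst = max(worst, abs(value - recon))
    return worst


coarse = max_error(8)    # 3 bits per value
fine = max_error(256)    # 8 bits per value
```

Under these assumptions the error is bounded by half a bin width, (high - low) / (2 * bins), so each extra bit of vocabulary halves the worst-case error while growing the token space.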