Chapter 1: Architectures
Core Questions
- How do we tokenize continuous physical space?
- What makes a good latent space for robotic control?
- How do we align vision, language, and action modalities?
Topics
1.1 Scene Encoders
- Vision transformers for spatial understanding
- Depth integration and 3D representations
- Temporal encoding for dynamic scenes
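As a concrete reference point for the vision-transformer topic, the patch-splitting step that turns an image into a token sequence can be sketched as follows. This is a minimal illustration, assuming a single-channel image stored as nested lists; the name `patchify` is hypothetical, and real encoders also apply a learned linear projection and position embeddings, which are omitted here.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tokens.

    Tokens are emitted in row-major order over the patch grid.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch into a single token vector.
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# Illustrative 4 x 4 "image" whose pixel values are just their flat indices.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)  # 4 tokens, each of length 4
```

A smaller patch size yields more tokens and finer spatial detail at a higher sequence-length cost.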
1.2 Multi-Modal Alignment
- Contrastive learning for vision-language
- Action space grounding
- The embedding geometry problem
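For the contrastive-learning topic, a symmetric InfoNCE-style loss of the kind used in CLIP-like vision-language training can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the function names and the temperature value of 0.07 are assumptions, and the embeddings are toy 2-D vectors.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric loss: image-to-text and text-to-image terms are averaged."""
    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


# Matched pairs on the diagonal give a low loss; swapped pairs give a high one.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss only constrains relative similarities within a batch, which is one reason the geometry of the resulting embedding space is a separate question.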
1.3 Tokenization Strategies
- Discrete vs. continuous representations
- Spatial vs. semantic tokens
- Compression-fidelity trade-offs
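The discrete-versus-continuous and compression-fidelity topics can be illustrated with uniform binning of a continuous action value into integer tokens. This is a minimal sketch under stated assumptions: the action range [-1, 1], the bin counts, and the names `encode`, `decode`, and `max_error` are all hypothetical choices for illustration.

```python
def encode(value, low, high, bins):
    """Map a continuous value in [low, high] to an integer token in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * bins)
    return min(idx, bins - 1)  # the upper edge falls into the last bin


def decode(token, low, high, bins):
    """Map a token back to the center of its bin."""
    width = (high - low) / bins
    return low + (token + 0.5) * width


def max_error(bins, low=-1.0, high=1.0, samples=1001):
    """Worst round-trip reconstruction error over an evenly spaced sweep."""
    worst = 0.0
    for k in range(samples):
        x = low + (high - low) * k / (samples - 1)
        x_hat = decode(encode(x, low, high, bins), low, high, bins)
        worst = max(worst, abs(x_hat - x))
    return worst


coarse = max_error(16)   # fewer tokens in the vocabulary, larger error
fine = max_error(256)    # larger vocabulary, smaller error
```

The round-trip error is bounded by half a bin width, so each doubling of the vocabulary halves the worst-case error.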