⚠️ DRAFT: NOT YET ASSIGNED
This assignment is still under review and subject to change. Do not begin work until this notice is removed.
Focus: Implementing the engine of the VLA stack.
In this assignment, you will implement a decoder-only Transformer from scratch in PyTorch. You may not use nn.Transformer or pre-built backbones from the transformers library. You will build the attention mechanism, the feed-forward blocks, and the autoregressive training loop yourself.
The goal is to train this model to perform Next-Token Prediction on a synthetic robotic trajectory dataset.
This assignment confronts you with the reality of the Unified Sequence Hypothesis: that actions, pixels, and words are all just tokens in an autoregressive chain. The same architecture that powers GPT-4 can control a robot arm.
Your implementation must include:
- Causal self-attention with a correct autoregressive mask
- RMSNorm
- Rotary position embeddings (RoPE)
- A training loop that converges on the synthetic trajectory data
Your model will minimize the negative log-likelihood of the sequence:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Where each $x_t$ represents a tokenized robotic state (position and orientation).
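This objective is exactly what PyTorch's cross-entropy computes over shifted sequences. A minimal sketch, assuming token ids of shape (batch, seq_len) and model logits of shape (batch, seq_len, vocab_size); the function name and shapes are illustrative, not part of the provided template:

```python
import torch
import torch.nn.functional as F

def next_token_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Mean negative log-likelihood of predicting token t+1 from tokens <= t."""
    # Drop the last logit (nothing to predict after it) and the first target
    # (it has no preceding context), so position t predicts token t+1.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))   # (B*(T-1), vocab)
    target = tokens[:, 1:].reshape(-1)                       # (B*(T-1),)
    return F.cross_entropy(pred, target)
```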
Claude has scaffolded a template in the course repository under src/assignments/scratch-1/backbone.py. You must complete the following blocks:
Implement the scaled dot-product attention with a causal mask.
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$

Where $M$ is the mask matrix: $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ otherwise.
Key Points:
- Use torch.tril to create the causal mask
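A minimal sketch of this computation, assuming q, k, v tensors of shape (batch, heads, seq_len, head_dim); the function name and shapes are illustrative assumptions rather than the template's API:

```python
import math
import torch

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (batch, heads, T, T)
    T = scores.size(-1)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=scores.device))
    scores = scores.masked_fill(~causal, float("-inf"))      # M_ij = -inf for j > i
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                        # (batch, heads, T, head_dim)
```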
Implement the normalization layer (RMSNorm) defined by:

$$\text{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}}\, g_i$$

Where:
- $x \in \mathbb{R}^{d}$ is the input vector and $d$ is the hidden dimension
- $\epsilon$ is a small constant added for numerical stability
- $g \in \mathbb{R}^{d}$ is a learnable gain parameter
Key Points:
- Unlike LayerNorm, RMSNorm does not subtract the mean and has no bias term
- Normalize over the feature (last) dimension
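A minimal RMSNorm sketch following the formula above; the class name and constructor arguments are illustrative, not necessarily those used in backbone.py:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(dim))  # learnable per-feature scale g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Root mean square over the feature dimension; no mean subtraction, no bias.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.gain
```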
Write a training loop that processes a batch of synthetic trajectories, calculates the cross-entropy loss, and performs backpropagation.
Requirements:
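A minimal sketch of the loop described above, assuming a model that maps token ids (batch, seq_len) to logits (batch, seq_len, vocab_size) and a DataLoader that yields LongTensor batches; the optimizer choice and learning rate are illustrative assumptions, and the gradient clipping mirrors the hint in the FAQ below:

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device="cpu"):
    model.train()
    for tokens in loader:                       # tokens: (batch, seq_len) LongTensor
        tokens = tokens.to(device)
        logits = model(tokens)                  # (batch, seq_len, vocab_size)
        loss = F.cross_entropy(                 # next-token cross-entropy (shift by one)
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # stabilize updates
        optimizer.step()

# Example usage (hyperparameters are assumptions, not requirements):
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# for epoch in range(10):
#     train_one_epoch(model, loader, optimizer)
```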
We provide a script src/assignments/scratch-1/generate_data.py that creates 10,000 synthetic robot trajectories. Each trajectory records a 7-DOF robot arm moving toward a target.
Dataset Specifications:
- 10,000 trajectories
- 50 timesteps per trajectory
- 10-dimensional continuous state per timestep
- A discretized action token per timestep for next-token prediction
Data Format:
{
'trajectories': torch.Tensor, # Shape: (10000, 50, 10)
'tokenized': torch.LongTensor # Shape: (10000, 50) - discretized actions
}
cd src/assignments/scratch-1
python generate_data.py --output data/trajectories.pkl
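To sanity-check the generated file against the shapes above, something like the following should work, assuming the script writes the dictionary with Python's pickle module (if it uses torch.save instead, load with torch.load):

```python
import pickle

# Assumption: data/trajectories.pkl is a pickled dict as documented above.
with open("data/trajectories.pkl", "rb") as f:
    data = pickle.load(f)

print(data["trajectories"].shape)   # expected: torch.Size([10000, 50, 10])
print(data["tokenized"].shape)      # expected: torch.Size([10000, 50])
print(data["tokenized"].dtype)      # expected: torch.int64 (LongTensor)
```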
Commit your implementation to your student branch in the arpg/vla-foundations repo:
git checkout -b scratch-1-yourname
git add src/assignments/scratch-1/backbone.py
git commit -m "Complete Scratch-1: Transformer Backbone"
git push origin scratch-1-yourname
Open a pull request to the staging branch with:
- Title: Scratch-1: [Your Name]

Create an MDX page at content/course/submissions/scratch-1/[your-name].mdx with the following sections:
---
title: "Scratch-1 Submission: Your Name"
student: "Your Name"
date: "2026-01-27"
---
# Scratch-1: The Transformer Backbone
## Loss Curve

The model converged after X iterations with final loss of Y.
## Attention Visualization

The attention patterns show...
## The Audit: Removing the Causal Mask
When I removed the causal mask, the following happened:
[Your analysis here]
### Why the Model "Cheats"
[Your explanation here]
## Code Highlights
[Optional: Highlight interesting implementation details]
## Challenges and Solutions
[Optional: Discuss difficulties you encountered]
All Pass Level requirements, plus:
| Component | Points |
|---|---|
| Implementation | 50 |
| - Causal Self-Attention (correct masking) | 15 |
| - RMSNorm (correct implementation) | 10 |
| - RoPE (correct rotation application) | 15 |
| - Training loop (convergence) | 10 |
| Report | 30 |
| - Loss curve (clear visualization) | 10 |
| - Attention maps (informative) | 10 |
| - Causal mask audit (insightful analysis) | 10 |
| Code Quality | 20 |
| - Clean, readable code | 10 |
| - Proper documentation | 5 |
| - Git workflow (clear commits) | 5 |
| Mastery Bonus | +10 |
| - KV-Caching implementation | 5 |
| - RoPE derivation and ablation | 5 |
cd src/assignments/scratch-1
python generate_data.py --num_trajectories 10000 --seq_length 50
Open src/assignments/scratch-1/backbone.py and locate the # TODO markers.
Start with RMSNorm (simplest), then move to Attention, then the full model.
python train.py --data data/trajectories.pkl --epochs 10
python visualize.py --checkpoint checkpoints/best_model.pt
Hints:
- Apply the causal mask by adding it to the attention logits before the softmax (softmax of logits + mask)
- Clip gradients with torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

Q: Can I use einops for cleaner tensor operations? A: Yes, but document what each einops operation does.
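For example, a documented einops call for splitting attention heads (shape names and sizes here are illustrative) might look like:

```python
import torch
from einops import rearrange

batch, seq_len, n_heads, head_dim = 2, 50, 4, 16
x = torch.randn(batch, seq_len, n_heads * head_dim)

# Split the model dimension into (heads, head_dim) and move heads next to batch:
# (batch, seq_len, n_heads*head_dim) -> (batch, n_heads, seq_len, head_dim)
q = rearrange(x, "b t (h d) -> b h t d", h=n_heads)
print(q.shape)  # torch.Size([2, 4, 50, 16])
```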
Q: Can I use Flash Attention? A: Not for the base implementation. Flash Attention can be used for the Mastery bonus.
Q: How long should training take? A: On a single GPU: ~10-15 minutes for 10k trajectories. On CPU: ~1 hour.
Q: What batch size should I use? A: Start with 32. Increase if you have more VRAM.
Q: My loss isn't converging. Help? A: Check:
Due: Tuesday, January 27, 9:00 AM MST
Late submissions: 10% penalty per day (max 3 days late)
If you're stuck, check the FAQ above, attend office hours, or post in the discussion forum.
Good luck! This assignment will give you a deep understanding of how modern VLA models work under the hood.
"The Transformer is not magic. It's just matrix multiplications and softmax." — Andrej Karpathy