@gm8xx8
The π₀ release introduces a VLA generalist model for dexterous tasks like laundry folding and table bussing. π₀ uses a transformer with flow matching, combining VLM pre-training benefits and continuous action chunks at 50Hz, and is pre-trained on a broad dataset. With distinct pre-training and post-training stages, it supports zero-shot and fine-tuned task adaptation, demonstrating robustness to external interventions, as seen in an uncut video of π₀ folding laundry with a single model.
π₀ and its smaller, non-VLM version are evaluated against:
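For intuition on how flow matching can produce a continuous action chunk, here is a minimal toy sketch, not the π₀ implementation. It assumes a linear noise-to-action path and replaces the learned, observation-conditioned network with a hypothetical closed-form velocity field (`toy_velocity`) that points at a known target chunk. The horizon of 50 steps, the 7-dimensional action, and the 10 Euler steps are illustrative assumptions, as are all function names.

```python
import random

def toy_velocity(actions, tau, target):
    """Hypothetical stand-in for a learned velocity network.

    For a linear path from noise (tau=0) to the target chunk (tau=1),
    the velocity that reaches the target is (target - actions) / (1 - tau).
    A real model would predict this from images, language, and robot state.
    """
    scale = 1.0 / (1.0 - tau)
    return [[(t - a) * scale for a, t in zip(row_a, row_t)]
            for row_a, row_t in zip(actions, target)]

def sample_action_chunk(target, num_steps=10, seed=0):
    """Integrate the velocity field from Gaussian noise with Euler steps."""
    rng = random.Random(seed)
    # Start from pure noise with the same shape as the chunk.
    actions = [[rng.gauss(0.0, 1.0) for _ in row] for row in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt  # flow time in [0, 1); never reaches 1 inside the loop
        velocity = toy_velocity(actions, tau, target)
        actions = [[a + dt * v for a, v in zip(row_a, row_v)]
                   for row_a, row_v in zip(actions, velocity)]
    return actions

HORIZON = 50     # assumed chunk length (one second of actions at 50Hz)
ACTION_DIM = 7   # assumed action dimension, e.g. one 7-DoF arm

# Hypothetical target chunk: a smooth ramp per action dimension.
target_chunk = [[0.01 * t * (d + 1) for d in range(ACTION_DIM)]
                for t in range(HORIZON)]
chunk = sample_action_chunk(target_chunk)
```

The whole chunk is denoised jointly in a handful of integration steps, rather than decoding discrete action tokens one at a time, which is one reason this style of head suits high-frequency control.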
- Octo and OpenVLA for zero-shot VLA tasks
- ACT and Diffusion Policy for single tasks
π₀ outperforms these baselines in zero-shot accuracy, fine-tuning for new tasks, and language following. Compute-parity ablations highlight trade-offs between VLA backbone gains and pre-training costs. Hierarchical methods like RT-H aid complex tasks needing low-level control and high-level planning, though π₀'s robust architecture largely drives its performance.
(link below)