breakdown of the V-JEPA 2 world model by @k7agar, diving into the architecture that let it grasp cups with a 65% success rate

it mentions the 'language goal problem': getting the robot to understand what it needs to achieve without being shown a goal image (or several)

would be interesting to explore a decentralized approach for that (sketched below):
1. the world model generates candidate 'goal' states
2. a decentralized verifier network votes on which candidate counts as an accurate 'goal', e.g. identifying a BLT sandwich

link below
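a minimal sketch of that two-step loop, purely as an assumption of how it could fit together — `propose_goals`, `Verifier`, and `select_goal` are hypothetical stand-ins, not real V-JEPA 2 APIs, and the scoring is mocked:

```python
import random
from collections import Counter
from dataclasses import dataclass

@dataclass
class GoalCandidate:
    description: str        # e.g. "assembled BLT sandwich on a plate"
    embedding: list[float]  # stand-in for a learned goal embedding

def propose_goals(instruction: str, n: int = 5) -> list[GoalCandidate]:
    """Step 1: the world model samples n candidate 'goal' states for a
    language instruction (mocked here with random embeddings)."""
    rng = random.Random(0)
    return [
        GoalCandidate(f"{instruction} / candidate {i}",
                      [rng.random() for _ in range(8)])
        for i in range(n)
    ]

class Verifier:
    """One node in the decentralized verifier network."""
    def __init__(self, node_id: int):
        self.rng = random.Random(node_id)  # each node judges independently

    def vote(self, candidates: list[GoalCandidate]) -> int:
        # a real verifier would score each candidate against the
        # instruction (e.g. with a vision-language model); this mock
        # just returns the index of its highest random score
        scores = [self.rng.random() for _ in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)

def select_goal(instruction: str, verifiers: list[Verifier]) -> GoalCandidate:
    """Step 2: one vote per node, simple majority wins."""
    candidates = propose_goals(instruction)
    tally = Counter(v.vote(candidates) for v in verifiers)
    winner, _ = tally.most_common(1)[0]
    return candidates[winner]

if __name__ == "__main__":
    network = [Verifier(node_id=i) for i in range(7)]
    goal = select_goal("make a BLT sandwich", network)
    print("agreed goal:", goal.description)
```

majority vote is just the simplest aggregation — a real network would presumably weight votes by stake or each verifier's track record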