Quick predictions.
2025: 100B+ parameter model trained on d-networks.
2026: First 50B+ *multi-modal* model trained on volunteer GPUs.
2027-28: All else equal, a model competitive with dense GPT-3 175B reproduced.
2030+: A true "frontier" decentralized run, 1T+ params.
Jake Brukhman 🚀 deAI Summer 2025 · 10.6.2025
Thoughts/predictions on decentralized AI training, 2025.

1. One thing to say is we are definitely in a different world with decentralized AI training than we were 18 months ago. Back then, decentralized training was impossible; now it's in-market and is a field.

2. Make no mistake, the end goal of d-training is to train competitive, frontier models on d-networks. As such, we are just at the start of our competitive journey, but we're moving fast.

3. It's now consensus that we can pre-train and post-train multi-billion parameter models (mostly LLMs, mostly transformer architectures) on d-networks. The current state of the art is up to ~100B parameters; the top end of that range is in sight but has not yet been demonstrated.

4. It's now consensus that we can train <10B parameter models on d-networks quite feasibly. There have also been specific case studies (primarily from @gensynai, @PrimeIntellect, and @NousResearch) where 10B, 32B, and 40B parameter models have been or are being trained. @gensynai's post-training swarm operates on models of up to 72B parameters.

5. The @PluralisHQ innovation has now invalidated the "impossibility" of scalable pre-training on d-networks by removing the communication-inefficiency bottleneck. However, raw FLOPs, reliability, and verifiability remain bottlenecks for these types of networks -- problems that are very solvable but will take some time to resolve technically. With Protocol Learning from Pluralis as it stands, I think we get to ~100B models on a 6-12 month timeframe.

6. How do we get from 100B to 300B parameter models? I think we need to find ways to shard parameters effectively and fluidly while keeping individual device memory relatively low (e.g. <32GB per device). I also think we need to get to 20 EFLOPS in a network; that means something like 10-20K consumer devices running for 4-6 weeks on a single training run (see the rough arithmetic sketch below).

Overall, d-training is poised to be a very exciting space. Some of its innovations are already being considered for broad AI applications.
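A minimal back-of-the-envelope sketch of the arithmetic in point 6. The 20 EFLOPS target, the <32 GB per-device budget, the 10-20K device count, and the 4-6 week duration come from the post; the per-device effective throughput (1-2 PFLOPS at low precision), the fp16 weight format, and the 300B parameter count are illustrative assumptions, not measurements.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 3600


def devices_needed(target_eflops: float, per_device_pflops: float) -> int:
    """Devices required to hit an aggregate throughput target,
    ignoring utilization losses, stragglers, and node churn."""
    target_flops = target_eflops * 1e18          # EFLOPS -> FLOP/s
    per_device_flops = per_device_pflops * 1e15  # PFLOPS -> FLOP/s
    return round(target_flops / per_device_flops)


def min_shards_for_memory(n_params: float, bytes_per_param: float, mem_budget_gb: float) -> int:
    """Minimum number of parameter shards so each device holds less than
    mem_budget_gb of weights (optimizer state and activations ignored)."""
    total_gb = n_params * bytes_per_param / 1e9
    return math.ceil(total_gb / mem_budget_gb)


if __name__ == "__main__":
    # Throughput: 20 EFLOPS aggregate (from the post); 1-2 PFLOPS effective per
    # consumer GPU at low precision is an assumption, not a measured figure.
    for per_dev in (1.0, 2.0):
        print(f"{devices_needed(20, per_dev):,} devices at {per_dev:.0f} PFLOPS each")

    # Memory: a hypothetical 300B-parameter model in fp16 (2 bytes/param)
    # against the <32 GB per-device budget mentioned in point 6.
    print(f"weights alone need >= {min_shards_for_memory(300e9, 2, 32)} shards to stay under 32 GB/device")

    # Total compute for a 4-6 week run at the 20 EFLOPS target.
    for weeks in (4, 6):
        print(f"{weeks} weeks at 20 EFLOPS ~ {20e18 * weeks * SECONDS_PER_WEEK:.1e} total FLOPs")
```

Under those assumptions this lands at roughly 10-20K devices and ~5-7e25 total FLOPs, consistent with the device count and duration in point 6.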