The $4,000 ‘Revolution’: Is Brumby’s Power Retention a True Breakthrough or Just a Clever Retraining Hack?

Introduction
In the eight years since “Attention Is All You Need,” the transformer architecture has defined AI’s trajectory. Now, a little-known startup, Manifest AI, claims to have sidestepped attention’s Achilles’ heel with a “Power Retention” mechanism in their Brumby-14B-Base model, boasting unprecedented efficiency. But before we declare the transformer era over, it’s crucial to peel back the layers of this ostensible breakthrough and scrutinize its true implications.
Key Points
- Power Retention offers a compelling theoretical solution to attention’s quadratic scaling problem, promising a constant per-token compute cost regardless of context length.
- The ability to retrain an existing transformer model for $4,000 suggests a potential accelerant for new architectural paradigms, significantly lowering the barrier to entry for adaptation.
- Brumby’s performance is currently “on par” with comparable transformers, not consistently superior, and its low training cost is contingent on leveraging pre-trained weights, masking the true cost of developing such a system from scratch.
In-Depth Analysis
The quadratic scaling of attention, where compute and memory demands explode with context length, has long been the uncomfortable truth lurking beneath the transformer’s dominance. It’s why our LLMs struggle to reason across entire books or sprawling codebases without prohibitive costs or clever, often lossy, workarounds. Manifest AI’s Power Retention technique directly confronts this bottleneck by abandoning global pairwise comparisons in favor of a recurrent, fixed-size latent state update. This shift, reminiscent of the much older Recurrent Neural Networks (RNNs) but purportedly with transformer-level expressiveness, theoretically promises a constant per-token computational cost regardless of context length—a paradigm-shifting claim for truly long-context AI.
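To make the scaling contrast concrete, here is a minimal sketch in plain NumPy. It compares one decoding step of cached attention with one step of a generic recurrent, fixed-size-state update in the spirit of linear-attention and retention methods; the function names, the decay factor, and the d × d state are illustrative assumptions, not Manifest AI’s actual Power Retention formulation.

```python
import numpy as np

d = 64  # head dimension (illustrative)

def attention_step(q, K_cache, V_cache):
    # One decoding step of cached attention: the new query is compared against
    # every cached key, so per-token cost is O(t * d) and grows with context length t.
    scores = K_cache @ q                      # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                  # (d,)

def retention_style_step(q, k, v, S, decay=0.99):
    # One decoding step of a recurrent, fixed-size-state update: the whole history
    # is folded into a d x d matrix S, so per-token cost is O(d^2), independent of t.
    S = decay * S + np.outer(k, v)            # fold the new token into the state
    return q @ S, S                           # readout and the updated state

# Toy usage: the retention-style step touches the same-size state whether the
# model has seen 100 tokens or 100,000.
t = 10_000
q, k, v = (np.random.randn(d) for _ in range(3))
out_attn = attention_step(q, np.random.randn(t, d), np.random.randn(t, d))
out_ret, S = retention_style_step(q, k, v, np.zeros((d, d)))
```

The contrast is the whole argument: in the attention step, every new token must revisit the entire cache, while in the retention-style step the “cache” is a constant-size matrix, which is where the constant per-token cost claim comes from.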
However, the headline-grabbing $4,000 training cost for the 14-billion-parameter Brumby-14B-Base demands a critical lens. This figure, while impressive, isn’t a testament to the inherent cheapness of building a Power Retention model from scratch. As even Manifest AI’s founder clarified, it’s entirely predicated on leveraging the extensive, and astronomically expensive, pre-training of the Qwen3-14B-Base transformer. Brumby didn’t rise from the ashes as a phoenix; it’s a transplant patient, inheriting a “brain” largely shaped by attention, then retaught new neural pathways. This “architectural swap” highlights an intriguing path for innovation—adapting existing knowledge to new mechanics—but it obscures the true R&D and training investment required to bring such a novel architecture to parity from square one.
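To illustrate what an “architectural swap” of this kind looks like in spirit, the hedged sketch below replaces a host model’s attention submodules with a hypothetical retention block while leaving every other pretrained weight in place. The RetentionBlock internals and the use of nn.MultiheadAttention as the swap target are assumptions for exposition, not Manifest AI’s actual procedure or code.

```python
import torch
import torch.nn as nn

class RetentionBlock(nn.Module):
    """Hypothetical stand-in for a retention layer; not Manifest AI's implementation."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.decay = nn.Parameter(torch.tensor(0.99))

    def forward(self, x):                           # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        state = x.new_zeros(x.size(0), x.size(-1), x.size(-1))
        out = []
        for t in range(x.size(1)):                  # one fixed-size state update per token
            state = self.decay * state + k[:, t, :, None] * v[:, t, None, :]
            out.append(torch.einsum("bd,bde->be", q[:, t], state))
        return torch.stack(out, dim=1)

def swap_attention_for_retention(model: nn.Module, dim: int) -> nn.Module:
    # Replace attention submodules in place; embeddings, MLPs, and norms keep their
    # pretrained weights. The isinstance check assumes the host model exposes attention
    # as nn.MultiheadAttention; real transformer codebases structure this differently.
    for name, child in model.named_children():
        if isinstance(child, nn.MultiheadAttention):
            setattr(model, name, RetentionBlock(dim))
        else:
            swap_attention_for_retention(child, dim)
    return model
```

The point of the sketch is the division of labor: the expensive knowledge lives in the inherited weights, while the comparatively cheap retraining run only has to teach the network to route information through the new mechanism.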
On the performance front, Brumby-14B-Base is lauded for being “on par” with established transformers. While it shows particular strength in mathematical and long-context reasoning, precisely where attention struggles, it lags on “knowledge-heavy” evaluations like MMLU-Pro. This isn’t a resounding victory across the board, but rather a targeted gain. It suggests Power Retention might indeed offer a structural advantage for certain types of tasks, but it hasn’t yet proven itself as an overall superior or even broadly equivalent replacement for the transformer across the diverse capabilities we expect from modern LLMs.
Contrasting Viewpoint
While the narrative of an attention-free LLM trained for $4,000 is undoubtedly catchy, a more cynical view reveals several crucial caveats. The primary one is the “retraining, not rebuilding” distinction. It’s disingenuous to present $4,000 as the cost of developing a 14B parameter model in a new paradigm when it’s merely the cost of retraining a model whose original pre-training likely cost millions, if not tens of millions. This is akin to claiming you built a high-performance race car for the cost of a new paint job, neglecting the immense engineering and manufacturing that went into the chassis and engine. The real question is: can Manifest AI, or anyone, build a truly new, attention-free model of this scale from scratch, achieving comparable or superior performance at a fraction of the cost of a transformer’s full lifecycle? The answer, at this stage, is a resounding no. Furthermore, “on par” performance, while commendable for a novel architecture, is not a compelling reason for the industry to abandon its deeply entrenched transformer ecosystem. Without a significant, across-the-board leap in capabilities or a demonstrable order-of-magnitude reduction in full-cycle training costs, widespread adoption will remain elusive.
Future Outlook
The Brumby-14B-Base offers a compelling proof-of-concept for the viability of attention-free architectures, particularly for addressing the looming long-context problem. Over the next 1-2 years, we can expect to see an acceleration in research into recurrent and retention-based models, likely focusing on pushing their performance beyond mere parity and exploring their true scaling limits from scratch. The biggest hurdles will be demonstrating a clear, unambiguous advantage over transformers across a broad range of tasks, not just specific niches. Manifest AI, or future innovators, must prove they can train these models efficiently without relying on transformer transfer learning, thereby validating the architectural shift itself. Additionally, the nascent ecosystem around Power Retention will need to mature rapidly, providing tooling, optimization techniques, and community support to compete with the vast resources dedicated to transformers. Until then, Brumby remains an exciting, but still largely unproven, challenger in the AI architecture arena.
For more context on the ongoing costs and complexities of training large language models, see our previous exposé on [[The Hidden Billions Behind ChatGPT]].
Further Reading
Original Source: Attention ISN’T all you need?! New Qwen3 variant Brumby-14B-Base leverages Power Retention technique (VentureBeat AI)