Attention’s Reign Challenged: New ‘Power Retention’ Model Promises Transformer-Level Performance at a Fraction of the Cost | AI Faces Capacity Crunch; Gemini Deep Research Integrates Personal Data

Key Takeaways
- Manifest AI introduced Brumby-14B-Base, a variant of Qwen3-14B-Base that replaces the attention mechanism with a novel “Power Retention” architecture and achieves performance comparable to state-of-the-art transformers at a fraction of the training cost.
- The Power Retention mechanism offers constant-time per-token computation, addressing the quadratic scaling bottleneck of attention for long contexts and enabling highly efficient retraining of existing transformer models.
- The AI industry is heading toward a “surge pricing” breakpoint driven by an escalating capacity crunch, rising latency, and unsustainable costs, underscoring an urgent need for greater efficiency in inference.
- Google Gemini’s Deep Research feature now offers enhanced capabilities by accessing users’ emails, Google Drive, and chat data to generate more comprehensive and personalized research reports.
- Elastic launched “Streams,” an AI-powered feature for observability that transforms raw, voluminous logs into structured insights, automating incident diagnosis and suggesting remediation steps for SREs.
Main Developments
A foundational shift may be underway in the world of artificial intelligence, as the long-held dominance of the transformer architecture faces its most credible challenge yet. Eight years after the “Attention Is All You Need” paper revolutionized AI, a small startup named Manifest AI has introduced Brumby-14B-Base, a new model that abandons attention entirely in favor of a novel mechanism called Power Retention.
Released on October 28, 2025, Brumby-14B-Base is a retrained variant of the open-source Qwen3-14B-Base. The core innovation lies in its Power Retention layer, which replaces attention’s global pairwise comparisons with a recurrent state update. This architecture processes information like an RNN, continuously compressing past data into a fixed-size latent state, so its per-token computational cost remains constant regardless of context length, a profound departure from the quadratically scaling expense of transformers. Manifest AI claims this allows Brumby to perform on par with established transformer models like Qwen3-14B and GLM-4.5-Air, particularly excelling in mathematical and long-context reasoning tasks where attention architectures typically falter.
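To make the contrast concrete, here is a minimal sketch of the recurrent idea in Python. It is a generic linear-recurrence stand-in rather than Manifest AI’s actual Power Retention layer, and the dimensions, decay factor, and function names are illustrative assumptions; the point is only that each token touches a fixed-size state instead of every previous token.

```python
import numpy as np

# Toy illustration of a recurrent, attention-free update: a fixed-size state S
# absorbs each new token, so per-token work is O(d^2) no matter how long the
# context grows. This is a generic linear-recurrence stand-in, NOT Manifest
# AI's actual Power Retention kernel; names and the decay term are assumptions.
d = 64                       # head dimension (illustrative)
S = np.zeros((d, d))         # fixed-size latent state; never grows with context

def retention_step(S, k, v, q, decay=0.99):
    """Process one token: fold (k, v) into the state, then read out with q."""
    S = decay * S + np.outer(v, k)   # compress the new key/value pair into S
    y = S @ q                        # constant-cost readout for this token
    return S, y

rng = np.random.default_rng(0)
for _ in range(100_000):             # longer context = more steps, not more work per step
    k, v, q = rng.standard_normal((3, d))
    S, y = retention_step(S, k, v, q)
```

By contrast, full attention compares each new token against every previous one, so per-token work grows with context length and total work grows quadratically.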
Perhaps the most striking aspect is the economics: Manifest AI trained the 14-billion-parameter Brumby model for just $4,000 in only 60 hours on 32 Nvidia H100 GPUs. While this impressive cost reduction relies on retraining existing transformer weights rather than training from scratch, Jacob Buckman, founder of Manifest AI, emphasized its significance, calling it a “critical accelerant for the adoption of a new modeling paradigm” that can democratize large-scale experimentation. He projects similar retraining costs for even much larger models, hinting at a future where attention-free systems can achieve transformer performance for orders of magnitude less investment. This efficiency extends to hardware utilization, with Power Retention kernels reportedly achieving higher utilization than FlashAttention2 and Mamba, and delivering several-hundred-fold speedups on very long contexts.
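As a rough sanity check on those figures, the implied per-GPU-hour rate works out as follows; the rate is a back-of-envelope inference, not a number Manifest AI published.

```python
# Back-of-envelope check on the reported training run (illustrative only; the
# implied hourly rate is an inference, not a figure from Manifest AI).
gpus = 32
hours = 60
reported_cost_usd = 4_000

gpu_hours = gpus * hours                        # 1,920 H100 GPU-hours
implied_rate = reported_cost_usd / gpu_hours    # roughly $2.08 per GPU-hour
print(f"{gpu_hours} GPU-hours -> ~${implied_rate:.2f}/H100-hour")
```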
The emergence of such efficient architectures couldn’t come at a more critical time for the AI industry, which is grappling with a looming “capacity crunch.” Val Bercovici, chief AI officer at WEKA, warns that AI is rapidly approaching an “economic reckoning” akin to Uber’s surge pricing, particularly for inference. He argues that current AI rates are subsidized and unsustainable, with real market rates set to appear by 2027. The demand for ever more tokens in pursuit of higher accuracy, especially in agent swarms requiring hundreds to thousands of prompts and responses, creates compounding latency delays that are becoming untenable. Bercovici stresses that efficiency, not just raw scale, will be paramount for AI profitability.
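The latency math behind that warning is easy to sketch; the numbers below are illustrative assumptions, not figures from Bercovici or WEKA.

```python
# Illustrative model of compounding latency in a sequential agent workflow.
# Both numbers are assumptions for illustration, not figures from the article.
per_call_latency_s = 2.0      # assumed round-trip time for one prompt/response
sequential_calls = 500        # "hundreds to thousands" of calls in an agent swarm

total_latency_s = per_call_latency_s * sequential_calls
print(f"~{total_latency_s / 60:.0f} minutes end-to-end for a single agent-swarm task")
```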
Amidst these architectural and economic discussions, AI’s capabilities continue to expand into practical applications. Google Gemini’s Deep Research feature, for example, has been enhanced to draw from users’ personal data, including emails, Google Drive documents, and chats. Touted as one of Gemini’s most requested features, this agentic capability allows the chatbot to create detailed research reports by analyzing a user’s own digital footprint, offering a glimpse into increasingly personalized and integrated AI experiences.
Similarly, in enterprise IT, AI is being leveraged to solve pervasive data overload. Elastic, the Search AI Company, launched “Streams” for its observability platform. This AI-powered feature is designed to transform the traditionally reactive and manual process of log analysis into a proactive and automated one. Streams automatically parses noisy logs, extracts relevant fields, surfaces critical errors and anomalies, and aims to offer remediation steps. This innovation promises to free SREs from manually sifting through gigabytes of logs, addressing skill shortages by augmenting practitioners with AI-driven expertise and automation.
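For readers unfamiliar with what “logs to structured insight” looks like in practice, here is a generic illustration in Python. It is not Elastic’s Streams implementation or API; the pattern, fields, and sample lines are assumptions chosen only to show the general idea of parsing raw lines and surfacing recurring error signatures.

```python
import re
from collections import Counter

# Generic "raw logs -> structured fields -> surfaced errors" illustration.
# This is NOT Elastic's Streams implementation or API; the pattern and sample
# lines are illustrative assumptions.
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+)\s+(?P<level>INFO|WARN|ERROR)\s+(?P<service>\S+)\s+(?P<msg>.*)"
)

raw_logs = [
    "2025-10-28T10:00:01Z ERROR checkout payment gateway timeout",
    "2025-10-28T10:00:02Z INFO checkout order placed",
    "2025-10-28T10:00:03Z ERROR checkout payment gateway timeout",
]

# Parse each line into named fields, keeping only lines that match the pattern.
parsed = [m.groupdict() for line in raw_logs if (m := LOG_PATTERN.match(line))]

# Count recurring error messages so the most frequent failure surfaces first,
# giving an SRE a starting point instead of a wall of raw text.
error_signatures = Counter(p["msg"] for p in parsed if p["level"] == "ERROR")
for msg, count in error_signatures.most_common(3):
    print(f"{count}x {msg}")
```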
These developments—from fundamental architectural breakthroughs to enhanced product features and enterprise solutions—collectively paint a picture of an AI landscape striving for greater efficiency, deeper integration, and more intelligent automation as it navigates both unprecedented growth and looming economic challenges.
Analyst’s View
Manifest AI’s Brumby-14B-Base represents a significant crack in the transformer’s armor, suggesting that the “Attention Is All You Need” mantra may finally be outdated. The ability to achieve performance parity with state-of-the-art transformers at a fraction of the cost, especially through efficient retraining, is a game-changer for AI accessibility and innovation. This architectural shift, prioritizing constant-time context processing, directly addresses the scalability and cost issues highlighted by the impending “capacity crunch” and “surge pricing” warnings.
The market should watch how quickly other players adopt or adapt similar recurrent, attention-free architectures. The caveat attached to Brumby’s headline training cost, namely that it reflects retraining existing transformer weights rather than a from-scratch run, underscores a critical shift: value increasingly lies in the efficient adaptation of existing knowledge rather than exclusively in training from scratch. This suggests a future where diverse, specialized architectures optimized for specific trade-offs (like long-context efficiency) will challenge the current homogeneous landscape, pushing towards more democratized and economically viable AI development and deployment. The interplay between architectural efficiency and operational cost will define the next phase of AI’s maturation.
Source Material
- Attention ISN’T all you need?! New Qwen3 variant Brumby-14B-Base leverages Power Retention technique (VentureBeat AI)
- AI’s capacity crunch: Latency risk, escalating costs, and the coming surge-pricing breakpoint (VentureBeat AI)
- From logs to insights: The AI breakthrough redefining observability (VentureBeat AI)
- Google Gemini’s Deep Research can look into your emails, drive, and chats (The Verge AI)
- Brazil’s AI moment is here (OpenAI Blog)