A research paper sent memory stocks plummeting

Intermediate
AI
Last Updated 2026-03-30 09:21:25
Reading Time: 7m
The article examines the limitations of benchmark comparisons, model scale, and engineering implementation. It introduces the DeepSeek efficiency shock and the Jevons paradox to explore how efficiency innovation both squeezes short-term hardware demand and creates greater long-term opportunities for application expansion.

On March 25, US tech stocks saw broad gains, with the Nasdaq 100 Index closing higher. However, one group of stocks bucked the trend and suffered losses:

SanDisk fell 3.50%, Micron dropped 3.4%, Seagate declined 2.59%, and Western Digital slipped 1.63%. The entire storage sector looked as if someone had cut the power in the middle of a party.

The cause was a research paper—or, more precisely, Google Research’s official spotlight on a new study.

What Did This Paper Actually Do?

To grasp the significance, it’s important to first understand a rarely discussed AI infrastructure concept: KV Cache.

When you interact with a large language model, it doesn’t start from scratch with every question. Instead, it stores the entire conversation context in memory as “key-value pairs”—this is the KV Cache, the model’s short-term working memory.

The issue is that the KV Cache grows proportionally with the context window length. When the context window reaches the million-token scale, GPU memory consumed by the KV Cache can even surpass the model’s own parameters. For inference clusters serving many users at once, this creates a real, daily infrastructure bottleneck and drives up costs.
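To see why this becomes a bottleneck, a back-of-the-envelope calculation helps. The sketch below uses hypothetical but typical model dimensions (none of these numbers come from the paper) to estimate KV Cache size as a function of context length:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV Cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical ~8B-class model: 32 layers, 8 KV heads, head dimension 128.
size = kv_cache_bytes(32, 8, 128, 1_000_000)
print(f"{size / 1e9:.0f} GB")  # ~131 GB for a single million-token context
```

At that size, one sequence's cache alone already dwarfs the roughly 16 GB of fp16 weights of an 8-billion-parameter model — exactly the inversion described above.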

The paper’s original version appeared on arXiv in April 2025 and will be officially published at ICLR 2026. Google Research named the algorithm TurboQuant—a near-lossless quantization method that compresses the KV Cache to 3 bits, cutting its memory footprint by at least sixfold. It requires no training or fine-tuning and works out of the box.

The technical approach has two main steps:

Step 1: PolarQuant. Instead of representing vectors in the standard Cartesian coordinate system, it converts them into polar coordinates—a “radius” plus a set of “angles.” Because the angles fall within a bounded range regardless of the vector’s magnitude, this representation can be quantized with lower distortion in the subsequent steps.
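The paper's exact construction isn't public, but the coordinate change itself is standard mathematics. The sketch below converts a vector to hyperspherical (polar) coordinates and back — purely to illustrate the representation PolarQuant starts from, not the paper's algorithm:

```python
import math

def to_polar(x):
    """Convert a vector to hyperspherical coordinates: a radius plus n-1 angles.

    The first n-2 angles lie in [0, pi]; the last lies in (-pi, pi].
    The angles are bounded regardless of the vector's scale, which is what
    makes them friendlier targets for low-bit quantization.
    """
    r = math.sqrt(sum(v * v for v in x))
    angles = []
    for i in range(len(x) - 2):
        tail = math.sqrt(sum(v * v for v in x[i + 1:]))
        angles.append(math.atan2(tail, x[i]))
    angles.append(math.atan2(x[-1], x[-2]))
    return r, angles

def from_polar(r, angles):
    """Invert to_polar: rebuild the Cartesian vector."""
    x, s = [], 1.0
    for a in angles:
        x.append(r * s * math.cos(a))
        s *= math.sin(a)
    x.append(r * s)
    return x
```

For example, `to_polar([3.0, -4.0, 12.0])` returns radius 13.0 plus two angles, and `from_polar` reconstructs the original vector to floating-point precision.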

Step 2: QJL (Quantized Johnson-Lindenstrauss). After PolarQuant handles the main compression, TurboQuant uses a one-bit QJL transformation to perform unbiased correction of the remaining error, ensuring accurate inner product estimation—crucial for the Transformer’s attention mechanism.
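QJL builds on a published 1-bit quantized Johnson-Lindenstrauss transform, and its core estimator is simple to sketch: project the key with a random Gaussian matrix, keep only the sign bit of each projected coordinate plus the key's norm, and rescale when estimating inner products against a full-precision projected query. The toy version below (pure Python, small dimensions — not the paper's optimized kernel) shows why one stored bit per coordinate still yields an unbiased inner-product estimate:

```python
import math, random

def qjl_estimate(q, k, m=5000, seed=0):
    """Estimate <q, k> from a 1-bit quantized random projection of k.

    Each projection row s is i.i.d. Gaussian. Only sign(<s, k>) is stored
    for the key (one bit per row) plus k's norm; the query is projected at
    full precision. Since E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k||,
    rescaling by sqrt(pi/2) * ||k|| makes the estimate unbiased.
    """
    rng = random.Random(seed)
    d = len(q)
    norm_k = math.sqrt(sum(v * v for v in k))
    acc = 0.0
    for _ in range(m):
        s = [rng.gauss(0.0, 1.0) for _ in range(d)]
        sk = sum(si * ki for si, ki in zip(s, k))
        sq = sum(si * qi for si, qi in zip(s, q))
        acc += (1.0 if sk >= 0 else -1.0) * sq
    return math.sqrt(math.pi / 2.0) * norm_k * acc / m
```

With `q = k = [1.0, 2.0, 3.0]`, the true inner product is 14, and the estimate lands close to it; accuracy improves as the number of projection rows `m` grows.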

The results: On the LongBench benchmark, which covers question answering, code generation, and summarization, TurboQuant matched or even outperformed the best existing baseline, KIVI. On “needle-in-a-haystack” retrieval tasks, it achieved perfect recall. On NVIDIA’s H100, 4-bit TurboQuant accelerated the attention computation by up to 8x.

Traditional quantization methods have a fundamental flaw: every compressed data block requires extra storage for “quantization constants” to record how to decompress, which adds 1–2 bits per value. While that may seem small, with million-token contexts these bits add up rapidly. TurboQuant completely eliminates this overhead through PolarQuant’s geometric rotation and QJL’s one-bit residual correction.
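The overhead described above is easy to quantify. In block-wise quantization, each block of values carries its own scale (and often a zero point) stored at higher precision, and that metadata is amortized across the block. A small sketch — block sizes and metadata widths here are illustrative, not drawn from any specific system:

```python
def effective_bits(payload_bits, block_size, metadata_bits):
    """Effective bits per value once per-block quantization metadata is amortized.

    payload_bits:  bits used to store each quantized value
    block_size:    number of values sharing one set of quantization constants
    metadata_bits: bits spent on the block's scale / zero point
    """
    return payload_bits + metadata_bits / block_size

# 4-bit values, blocks of 32, fp16 scale + fp16 zero point (32 bits total):
print(effective_bits(4, 32, 32))  # 5.0 -> the constants add a full extra bit
# 3-bit values, blocks of 32, fp32 scale + fp32 zero point (64 bits total):
print(effective_bits(3, 32, 64))  # 5.0 -> nominal 3-bit storage costs 5 bits
```

This is the tax TurboQuant claims to avoid: by encoding structure geometrically rather than in per-block constants, its nominal bit width is closer to its effective bit width.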

Why Did the Market Panic?

The implications are hard to ignore: a model that previously needed eight H100s to serve a million-token context could, in theory, do so with just two. Inference providers could handle more than six times as many concurrent long-context requests with the same hardware.

This directly undermines the core narrative for the storage sector.

Over the past two years, Seagate, Western Digital, and Micron have benefited from the AI investment boom for a single reason: as large models “remember” more and context windows stretch longer, memory demand looks effectively limitless, and storage demand is expected to explode. Seagate’s stock surged more than 210% in 2025, and its 2026 production capacity was already sold out.

TurboQuant’s arrival directly challenges this premise.

Wells Fargo technology analyst Andrew Rocha put it succinctly: “As context windows get larger, data stored in the KV Cache grows explosively, and memory demand rises. TurboQuant is attacking this cost curve directly… If it’s widely adopted, it fundamentally calls into question how much memory capacity is really necessary.”

But Rocha also emphasized a key condition: IF.

What’s Really Worth Debating?

Is the market overreacting? Most likely, yes—at least to some extent.

First, the “8x acceleration” headline is misleading. Several analysts have pointed out that the 8x speedup is measured against older 32-bit non-quantized systems, not the already optimized systems currently deployed. The actual performance gain is real, but not as dramatic as the headlines suggest.

Second, the paper only tested small models. All of TurboQuant’s evaluations used models with up to 8 billion parameters. The real concern for storage suppliers is with models at the 70 billion or even 400 billion parameter scale, where the KV Cache becomes truly massive. TurboQuant’s performance at these scales is still unknown.

Third, Google has not released any official code. As of now, TurboQuant isn’t available in vLLM, llama.cpp, Ollama, or any mainstream inference framework. Community developers have implemented early versions based on the paper’s math, and one early replicator noted that if QJL’s error correction isn’t done properly, the output can become unreadable.

Still, this doesn’t mean the market’s concerns are unfounded.

This reaction is the market’s collective muscle memory from the DeepSeek shock of 2025. That episode taught everyone a harsh lesson: algorithmic efficiency breakthroughs can instantly disrupt expensive hardware narratives. Since then, any efficiency breakthrough from a top AI lab triggers a reflex in hardware stocks.

Moreover, this signal comes from Google Research—not an obscure university lab. Google has the engineering muscle to turn papers into production tools, and is itself one of the world’s largest AI inference consumers. Once TurboQuant is deployed internally, it could quietly reshape server procurement strategies for Waymo, Gemini, and Google Search.

The Classic Pattern Repeats

There’s a classic debate here worth considering: Jevons Paradox.

Nineteenth-century economist William Jevons observed that improvements in steam engine efficiency didn’t reduce Britain’s coal consumption—instead, consumption increased dramatically, because the lower costs that efficiency delivered stimulated far broader adoption.

Supporters argue: If Google enables a model to run on 16GB of VRAM, developers won’t stop there—they’ll use the freed resources to run models six times more complex, process larger multimodal datasets, and support even longer contexts. Ultimately, software efficiency unlocks demand that was previously out of reach due to high costs.

However, this argument depends on the market having time to adapt and expand. During the period when TurboQuant evolves from paper to production tool to industry standard, can hardware demand grow fast enough to fill the “gap” created by greater efficiency?

No one knows the answer. The market is pricing in this uncertainty.

What This Means for the AI Industry

More important than storage stock volatility is the deeper trend revealed by TurboQuant.

The main battleground of the AI arms race is shifting from “scaling compute” to “maximizing efficiency.”

If TurboQuant proves its performance on large-scale models, it could drive a fundamental shift: long-context inference would move from a luxury only top labs can afford to the industry standard.

This efficiency race is where Google excels—developing mathematically near-optimal compression algorithms, pushing the limits of Shannon information theory, not just brute-force engineering. TurboQuant’s theoretical distortion rate is only about 2.7 times the information-theoretic lower bound.

This suggests similar breakthroughs are likely to follow. It marks the maturation of an entire research direction.

For the storage industry, the more sobering question isn’t “Will this affect demand this time?” but: As AI inference costs keep falling due to software, how wide can the hardware moat remain?

The answer for now: It’s still wide, but not so wide that these signals can be ignored.

Disclaimer:

  1. This article is reprinted from [TechFlow], with copyright belonging to the original author [TechFlow]. If you have any concerns about this reprint, please contact the Gate Learn team, who will address it promptly according to relevant procedures.

  2. Disclaimer: The views and opinions expressed in this article are those of the author alone and do not constitute investment advice.

  3. Other language versions of this article are translated by the Gate Learn team. Unless Gate is specifically referenced, translated articles may not be copied, distributed, or plagiarized.

