On March 25, US tech stocks saw broad gains, with the Nasdaq 100 Index closing higher. However, one group of stocks bucked the trend and suffered losses:
SanDisk fell 3.50%, Micron dropped 3.4%, Seagate declined 2.59%, and Western Digital slipped 1.63%. The entire storage sector looked as if someone had cut the power in the middle of a party.
The cause was a research paper—or, more precisely, Google Research’s official spotlight on a new study.
To grasp the significance, it’s important to first understand a rarely discussed AI infrastructure concept: KV Cache.
When you interact with a large language model, it doesn’t start from scratch with every new message. Instead, it caches the attention keys and values computed for every token of the conversation in GPU memory: these key-value pairs are the KV Cache, the model’s short-term working memory.
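To make the idea concrete, here is a minimal toy of single-head attention with a KV cache in NumPy. The dimensions and random weights are purely illustrative, not the architecture of any real model; the point is only that each decoding step appends one new key and one new value instead of re-encoding the whole history.

```python
import numpy as np

# Toy single-head attention with a KV cache (illustrative dimensions only).
d = 4
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """x: embedding of the newest token, shape (d,)."""
    q = x @ Wq
    k_cache.append(x @ Wk)  # cache the new key ...
    v_cache.append(x @ Wv)  # ... and the new value
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)       # attend over ALL cached keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # softmax weights
    return w @ V                      # attention output for this step

for _ in range(3):
    decode_step(rng.standard_normal(d))
print(len(k_cache))  # 3: one cached key vector per generated token
```

The cache trades memory for compute: without it, every step would recompute keys and values for the entire prefix from scratch.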
The issue is that the KV Cache grows linearly with the context length. At the million-token scale, the GPU memory it consumes can exceed the memory occupied by the model’s own weights. For inference clusters serving many users at once, this is a real, daily infrastructure bottleneck that drives up costs.
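Back-of-the-envelope arithmetic shows how this happens. Using an illustrative 70B-class configuration (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 values; these numbers are assumptions for the sketch, not taken from the paper), the KV Cache for a single million-token context already outweighs the roughly 140 GB that 70B fp16 weights themselves occupy:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_val):
    # 2 tensors (K and V) per layer, each of shape [n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

# Illustrative 70B-class config: 80 layers, 8 KV heads, head_dim 128
fp16_cache = kv_cache_bytes(1_000_000, 80, 8, 128, 2)      # 16-bit values
q3_cache   = kv_cache_bytes(1_000_000, 80, 8, 128, 3 / 8)  # 3-bit values

print(f"{fp16_cache / 2**30:.0f} GiB")  # → 305 GiB at fp16
print(f"{q3_cache / 2**30:.0f} GiB")    # → 57 GiB at 3 bits per value
```

At fp16, the cache alone would overflow several 80 GB H100s before a single weight is loaded; at 3 bits per value it shrinks by a factor of 16/3, and eliminating per-block metadata (which other methods must store on top) is what pushes the paper’s claimed saving to sixfold or more.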
The paper’s original version appeared on arXiv in April 2025 and will be officially published at ICLR 2026. Google Research named the algorithm TurboQuant: a near-lossless quantization method that compresses the KV Cache to 3 bits per value, cutting memory usage by at least sixfold. It requires no training or fine-tuning and works out of the box.
The technical approach has two main steps:
Step 1: PolarQuant. Instead of using the standard Cartesian coordinate system to represent vectors, it converts them into polar coordinates—comprising a “radius” and a set of “angles.” This fundamentally simplifies the geometry of high-dimensional space, enabling subsequent quantization with lower distortion.
Step 2: QJL (Quantized Johnson-Lindenstrauss). After PolarQuant handles the main compression, TurboQuant uses a one-bit QJL transformation to perform unbiased correction of the remaining error, ensuring accurate inner product estimation—crucial for the Transformer’s attention mechanism.
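The paper’s exact transforms aren’t available as public code yet, but the flavor of the second step can be illustrated with a generic 1-bit sign-projection (SimHash-style) inner-product estimator, the textbook ancestor of this kind of technique. Everything below, from the dimensions to the estimator itself, is an illustrative stand-in and not the paper’s QJL algorithm: it shows how keeping only one sign bit per random projection still yields an unbiased estimate of the angle between two vectors, from which their inner product can be recovered.

```python
import numpy as np

rng = np.random.default_rng(42)
d, m = 64, 4096                    # vector dim; number of 1-bit projections
G = rng.standard_normal((m, d))    # shared random projection matrix (JL-style)

def one_bit_sketch(v):
    # keep only the sign of each projection: 1 bit per sketch coordinate
    return np.sign(G @ v)

x, y = rng.standard_normal(d), rng.standard_normal(d)

# Fraction of agreeing sign bits gives an unbiased angle estimate:
# P(signs agree) = 1 - theta / pi for Gaussian projections.
agree = np.mean(one_bit_sketch(x) == one_bit_sketch(y))
theta_hat = np.pi * (1 - agree)
ip_hat = np.linalg.norm(x) * np.linalg.norm(y) * np.cos(theta_hat)

print(f"estimated inner product {ip_hat:.2f} vs true {x @ y:.2f}")
```

In TurboQuant’s pipeline, a 1-bit residual correction of this general kind sits on top of PolarQuant’s coarse polar-coordinate quantization, keeping inner products, and hence attention scores, accurate without storing per-block constants.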
The results: on the LongBench benchmark, which spans question answering, code generation, and summarization, TurboQuant matched or even outperformed the strongest existing baseline, KIVI. On “needle-in-a-haystack” retrieval tasks, it achieved perfect recall. And on NVIDIA’s H100, 4-bit TurboQuant sped up attention computation by up to 8x.
Traditional quantization methods have a fundamental flaw: every compressed data block requires extra storage for “quantization constants” to record how to decompress, which adds 1–2 bits per value. While that may seem small, with million-token contexts these bits add up rapidly. TurboQuant completely eliminates this overhead through PolarQuant’s geometric rotation and QJL’s one-bit residual correction.
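The metadata arithmetic behind the “1–2 bits per value” figure is easy to check. Assuming a conventional group-wise scheme that stores an fp16 scale and an fp16 zero-point for each group of values (common choices, though the exact layout varies by method), the overhead per value is simply 32 bits amortized over the group size:

```python
def overhead_bits_per_value(group_size, scale_bits=16, zero_bits=16):
    # per-group metadata (scale + zero-point), amortized over the group
    return (scale_bits + zero_bits) / group_size

for g in (16, 32, 64):
    print(g, overhead_bits_per_value(g))  # → 16 2.0 / 32 1.0 / 64 0.5
```

With the group sizes typical of KV Cache quantization (16–32), that is 1–2 extra bits on top of the nominal 3 or 4, so a “3-bit” method can effectively cost 4–5 bits per value; eliminating these constants is where much of TurboQuant’s claimed advantage comes from.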
The implications are hard to ignore: a model that previously needed eight H100s to serve a million-token context could, in theory, do so with just two. Inference providers could handle more than six times as many concurrent long-context requests with the same hardware.
This directly undermines the core narrative for the storage sector.
Over the past two years, Seagate, Western Digital, and Micron have ridden the AI investment boom for a single reason: as large models “remember” more, ever-longer context windows promise seemingly limitless demand for memory, and storage demand is expected to explode with it. Seagate’s stock surged more than 210% in 2025, and its 2026 production capacity was already sold out.
TurboQuant’s arrival directly challenges this premise.
Wells Fargo technology analyst Andrew Rocha put it succinctly: “As context windows get larger, data stored in the KV Cache grows explosively, and memory demand rises. TurboQuant is attacking this cost curve directly… If it’s widely adopted, it fundamentally calls into question how much memory capacity is really necessary.”
But Rocha also emphasized a key condition: IF.
Is the market overreacting? Most likely, yes—at least to some extent.
First, the “8x acceleration” headline is misleading. Several analysts have pointed out that the 8x speedup is measured against older 32-bit non-quantized systems, not the already optimized systems currently deployed. The actual performance gain is real, but not as dramatic as the headlines suggest.
Second, the paper only tested small models. All of TurboQuant’s evaluations used models with up to 8 billion parameters. The real concern for storage suppliers is with models at the 70 billion or even 400 billion parameter scale, where the KV Cache becomes truly massive. TurboQuant’s performance at these scales is still unknown.
Third, Google has not released any official code. As of now, TurboQuant isn’t available in vLLM, llama.cpp, Ollama, or any mainstream inference framework. Community developers have implemented early versions based on the paper’s math, and one early replicator noted that if QJL’s error correction isn’t done properly, the output can become unreadable.
Still, this doesn’t mean the market’s concerns are unfounded.
The reaction is the market’s collective muscle memory from the DeepSeek shock of early 2025. That episode taught everyone a harsh lesson: algorithmic efficiency breakthroughs can instantly disrupt expensive hardware narratives. Since then, any efficiency breakthrough from a top AI lab triggers a reflex sell-off in hardware stocks.
Moreover, this signal comes from Google Research—not an obscure university lab. Google has the engineering muscle to turn papers into production tools, and is itself one of the world’s largest AI inference consumers. Once TurboQuant is deployed internally, it could quietly reshape server procurement strategies for Waymo, Gemini, and Google Search.
There’s a classic debate here worth considering: Jevons Paradox.
Nineteenth-century economist William Stanley Jevons observed that improvements in steam-engine efficiency did not reduce Britain’s coal consumption; instead, consumption rose dramatically, because the lower costs that efficiency delivered stimulated much broader adoption.
Supporters argue: If Google enables a model to run on 16GB of VRAM, developers won’t stop there—they’ll use the freed resources to run models six times more complex, process larger multimodal datasets, and support even longer contexts. Ultimately, software efficiency unlocks demand that was previously out of reach due to high costs.
However, the optimists’ argument depends on the market having time to adapt and expand. During the period when TurboQuant evolves from paper to production tool to industry standard, can hardware demand grow fast enough to fill the “gap” created by greater efficiency?
No one knows the answer. The market is pricing in this uncertainty.
More important than storage stock volatility is the deeper trend revealed by TurboQuant.
The main battleground of the AI arms race is shifting from “scaling compute” to “maximizing efficiency.”
If TurboQuant proves its performance on large-scale models, it could drive a fundamental shift: long-context inference would move from a luxury only top labs can afford to the industry standard.
This efficiency race plays to Google’s strengths: designing mathematically near-optimal compression algorithms that push toward Shannon’s information-theoretic limits, rather than relying on brute-force engineering. TurboQuant’s distortion is only about 2.7 times the information-theoretic lower bound.
This suggests similar breakthroughs are likely to follow. It marks the maturation of an entire research direction.
For the storage industry, the more sobering question isn’t “Will this affect demand this time?” but: As AI inference costs keep falling due to software, how wide can the hardware moat remain?
The answer for now: It’s still wide, but not so wide that these signals can be ignored.
This article is reprinted from [TechFlow], with copyright belonging to the original author [TechFlow]. If you have any concerns about this reprint, please contact the Gate Learn team, who will address it promptly according to relevant procedures.
Disclaimer: The views and opinions expressed in this article are those of the author alone and do not constitute investment advice.
Other language versions of this article are translated by the Gate Learn team. Unless Gate is specifically referenced, translated articles may not be copied, distributed, or plagiarized.





