How hybrid AI architectures manage millions of product attributes consistently

The Hidden Problem of E-Commerce Scaling

Most discussions around scaling in e-commerce focus on distributed search and recommendation engines. But beneath the surface lies a more persistent, often overlooked problem: managing attribute values in product catalogs. With over 3 million SKUs, this quickly becomes a systemic issue.
Attribute values are the foundation of product discovery. They drive filters, comparisons, and search rankings. But in practice, they are fragmented: “XL”, “Small”, “12cm”, and “Large” mixed in one field. Or colors like “RAL 3020”, “Crimson”, “Red”, and “Dark Red” without a consistent structure. Multiply these inconsistencies across dozens of attributes per product, and the problem grows exponentially.
Filters behave unpredictably, search relevance drops, and customer navigation becomes frustrating. At the same time, merchants drown in manual data cleanup.
The Solution: Intelligent Hybrid Pipelines with Control Mechanisms
Instead of a black-box AI that arbitrarily sorts data, an architecture with three pillars was developed:
Explainability: Every decision is traceable
Predictability: The system behaves consistently
Human Control: Merchandisers can manually set critical attributes
The result was a hybrid pipeline combining LLM intelligence with clear rules and data persistence. It acts intelligently but remains controllable: AI with guardrails, not an uncontrolled black box.
Offline Processing Instead of Real-Time Pipelines
A critical design decision was to run the pipeline as background jobs rather than as a real-time service inside the live system. This may sound like a compromise, but it was strategically sound:
Real-time processing would mean:
Unpredictable latency
Fragile system dependencies
Costly peaks in computation
Operational complications
Offline jobs offered:
Massive throughput without affecting customer traffic
Resilience: failures never impacted live systems
Cost control through scheduled processing
Isolation from LLM latency
Atomic, predictable updates
Separating customer-facing systems from data processing pipelines is crucial at the millions-of-SKUs scale.
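The batching idea above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the function names, the in-memory catalog, and the placeholder sorting step are all assumptions standing in for the real MongoDB-backed jobs.

```python
def sort_attribute_values(values):
    # Placeholder for the hybrid LLM/rule-based sorting step
    # described later; here it just sorts alphabetically.
    return sorted(values)

def run_offline_job(catalog, batch_size=500):
    """Process attributes in fixed-size batches, isolated from live traffic."""
    attrs = list(catalog.items())
    for start in range(0, len(attrs), batch_size):
        batch = attrs[start:start + batch_size]
        # Each batch is an atomic unit: results are written back only
        # after the whole batch succeeds, so a mid-run failure never
        # leaves the catalog with partially sorted data.
        results = {name: sort_attribute_values(vals) for name, vals in batch}
        catalog.update(results)

catalog = {"size": ["XL", "S", "M"], "length": ["12cm", "2cm"]}
run_offline_job(catalog, batch_size=1)
```

Because the job runs against its own store and writes complete batches, a crash simply means re-running the job; the customer-facing systems never see an intermediate state.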
Architecture with Persistence and Consistency
All data persistence was handled via MongoDB as the central operational store:
Attribute extraction: The first job pulled raw values and category context
AI service: The LLM received cleaned data plus context info (category breadcrumbs, metadata)
Deterministic fallbacks: Numeric ranges and simple sets were automatically recognized and sorted by deterministic rules
Persistence: Sorted values, refined attribute names, and sort tags were stored in MongoDB
Search integration: Updated data flowed into Elasticsearch (keyword search) and Vespa (semantic search)
This persistence structure enabled easy verification, overwrites, and resynchronization with other systems.
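The persistence step can be pictured as one document per category/attribute pair. The document shape below is an assumption for illustration; field names like sort_tag are not the production schema, and a plain dict stands in for the MongoDB collection.

```python
def build_attribute_doc(category, attribute, sorted_values, sort_tag):
    return {
        "_id": f"{category}:{attribute}",  # stable key enables idempotent upserts
        "category": category,
        "attribute": attribute,
        "sorted_values": sorted_values,
        "sort_tag": sort_tag,              # e.g. "LLM_SORT" or "MANUAL_SORT"
    }

def upsert(store, doc):
    # With pymongo this would be roughly:
    #   collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
    store[doc["_id"]] = doc
    return doc["_id"]

store = {}
upsert(store, build_attribute_doc(
    "apparel", "size", ["Small", "M", "Large", "XL"], "LLM_SORT"))
```

A stable `_id` is what makes verification, manual overwrites, and resynchronization cheap: re-running a job simply replaces the same documents.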
Hybrid Control: AI Meets Merchant Decisions
Not every attribute requires AI intelligence. Therefore, each category could be tagged:
LLM_SORT: The model makes sorting decisions
MANUAL_SORT: Merchants define the order manually
This dual tagging system built trust. Humans retained control over critical business attributes, while AI handled routine work without interrupting the pipeline.
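The tag-based dispatch can be sketched as follows. The function names and the llm_sort stub are assumptions; the point is only that a MANUAL_SORT tag short-circuits the model entirely.

```python
def llm_sort(values):
    # Stand-in for the model call; here just alphabetical.
    return sorted(values)

def sort_category_attribute(values, tag, manual_order=None):
    if tag == "MANUAL_SORT":
        # The merchant-defined order wins; values the merchant
        # did not list fall to the end.
        order = {v: i for i, v in enumerate(manual_order or [])}
        return sorted(values, key=lambda v: order.get(v, len(order)))
    if tag == "LLM_SORT":
        return llm_sort(values)
    raise ValueError(f"unknown sort tag: {tag}")
```

For example, a merchant order of ["Small", "M", "Large"] reorders ["Large", "Small", "M"] deterministically, with no model call involved.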
Data Cleanup as a Foundation
Before applying AI, a critical preprocessing step was performed:
Trim whitespace
Remove empty values
Deduplicate values
Standardize category contexts
This seemingly simple cleanup dramatically improved LLM accuracy. Clean inputs led to consistent results—a fundamental principle at scale.
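The cleanup steps above fit in one small function. This is a minimal sketch under one assumption not stated in the text: deduplication ignores case while keeping the first spelling seen.

```python
def clean_values(raw_values):
    """Trim whitespace, drop empty values, deduplicate case-insensitively."""
    seen = set()
    cleaned = []
    for v in raw_values:
        v = v.strip()          # trim whitespace
        if not v:              # remove empty values
            continue
        key = v.lower()        # "XL" and "xl" count as duplicates
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(v)
    return cleaned

clean_values([" XL ", "", "xl", "Small"])  # -> ["XL", "Small"]
```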
Transformation in Practice
The pipeline transformed chaotic raw data into structured outputs:
Attribute | Raw Values                                      | Sorted Output
Size      | XL, Small, 12cm, Large, M, S                    | Small, M, Large, XL, 12cm
Color     | RAL 3020, Crimson, Red, Dark Red                | Red, Dark Red, Crimson, Red (RAL 3020)
Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric   | 5cm, 12cm, 2cm, 20cm                            | 2cm, 5cm, 12cm, 20cm
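The numeric row is where the deterministic fallback applies: no LLM is needed to order "5cm, 12cm, 2cm, 20cm". A sketch of that rule, with the regex pattern as an assumption about how numeric values are recognized:

```python
import re

# One number, optionally with a decimal part, followed by an optional unit.
NUMERIC = re.compile(r"^(\d+(?:\.\d+)?)\s*([a-zA-Z]*)$")

def try_numeric_sort(values):
    """Sort by the numeric part if every value matches; else defer to the LLM."""
    if not all(NUMERIC.match(v) for v in values):
        return None  # not a purely numeric attribute
    return sorted(values, key=lambda v: float(NUMERIC.match(v).group(1)))

try_numeric_sort(["5cm", "12cm", "2cm", "20cm"])  # -> ["2cm", "5cm", "12cm", "20cm"]
```

Sorting on the parsed number rather than the string avoids the classic lexicographic trap where "12cm" would land before "2cm".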
Business Impact
The results were substantial:
Consistent attribute sorting across 3M+ SKUs
Predictable numeric sorting via deterministic logic
Improved search relevance
Intuitive filters on product pages
Increased customer trust and higher conversion rates
This was not just a technical victory—it was a win for user experience and revenue.
Key Takeaways
Hybrid surpasses pure AI: Guardrails are essential at scale
Context is king: Better context = significantly better LLM results
Offline architecture creates resilience: Background jobs are fundamental for throughput
Persistence without loss of control: Human override mechanisms build trust
Clean inputs = reliable outputs: Data quality determines AI success
Conclusion
Sorting attribute values may seem trivial, but it becomes a real problem with millions of products. By combining LLM intelligence, explicit rules, persistence, and merchant control, an elegant system was created that addresses complex, hidden challenges. It reminds us that the greatest successes often come from solving boring, overlooked problems—those that impact every product page.