How Hybrid AI Architectures Manage Millions of Product Attributes Consistently

The Hidden Problem of E-Commerce Scaling

Most discussions around scaling in e-commerce focus on distributed search and recommendation engines. But beneath the surface lies a more persistent, often overlooked problem: managing attribute values in product catalogs. In a catalog of over 3 million SKUs, it quickly becomes a systemic issue.

Attribute values are the foundation of product discovery. They drive filters, comparisons, and search rankings. But in practice, they are fragmented: “XL”, “Small”, “12cm”, and “Large” mixed in one field, or colors like “RAL 3020”, “Crimson”, “Red”, and “Dark Red” without a consistent structure. Multiply these inconsistencies across dozens of attributes per product, and the problem compounds across the entire catalog.

Filters behave unpredictably, search relevance drops, and customer navigation becomes frustrating. At the same time, merchants drown in manual data cleanup.

The Solution: Intelligent Hybrid Pipelines with Control Mechanisms

Instead of a black-box AI that sorts data arbitrarily, the architecture was built on three pillars:

  • Explainability: Every decision is traceable
  • Predictability: The system behaves consistently
  • Human Control: Merchandisers can manually set critical attributes

The result was a hybrid pipeline combining LLM intelligence with clear rules and data persistence. It acts intelligently but remains controllable: AI with guardrails, not an uncontrolled black box.
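As a rough sketch of what these pillars can look like in code (the names below are illustrative, not taken from the original system), each sorting decision can be stored together with its source and a human-readable rationale, so it remains auditable and overridable later:

    from dataclasses import dataclass
    from enum import Enum

    class DecisionSource(Enum):
        RULE = "rule"      # deterministic fallback, e.g. numeric sorting
        LLM = "llm"        # model-generated ordering
        MANUAL = "manual"  # merchant override

    @dataclass
    class SortDecision:
        attribute: str
        sorted_values: list[str]
        source: DecisionSource
        rationale: str  # human-readable explanation, kept for auditing

    decision = SortDecision(
        attribute="Size",
        sorted_values=["Small", "M", "Large", "XL", "12cm"],
        source=DecisionSource.LLM,
        rationale="Ordered by conventional apparel size progression.",
    )

Because every record names its source, a reviewer can always answer the question "why is this value first?" without reverse-engineering the model.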

Offline Processing Instead of Real-Time Pipelines

A critical design decision was to favor background jobs over real-time processing in live systems. This may sound like a compromise, but it was strategically sound:

Real-time processing would mean:

  • Unpredictable latency
  • Fragile system dependencies
  • Costly peaks in computation
  • Operational complications

Offline jobs offered:

  • Massive throughput without affecting customer traffic
  • Resilience: failures never impacted live systems
  • Cost control through scheduled processing
  • Isolation from LLM latency
  • Atomic, predictable updates

Separating customer-facing systems from data processing pipelines is crucial at the millions-of-SKUs scale.
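A minimal sketch of this job pattern, assuming hypothetical load_categories, sort_attributes, and persist_results callables and an arbitrary batch size (none of these names are from the original system):

    import logging
    import time

    BATCH_SIZE = 500  # assumed batch size, tuned for cost and throughput

    def run_attribute_sort_job(load_categories, sort_attributes, persist_results):
        """Scheduled background job: process all categories in batches.

        Failures are logged and skipped so one bad batch never blocks the
        rest; the live shop only ever sees fully persisted updates.
        """
        categories = load_categories()
        for i in range(0, len(categories), BATCH_SIZE):
            batch = categories[i:i + BATCH_SIZE]
            try:
                results = sort_attributes(batch)
                persist_results(results)  # write back one batch at a time
            except Exception:
                logging.exception("Batch %d failed; continuing", i // BATCH_SIZE)
            time.sleep(0.1)  # throttle to smooth out computation peaks

Because the job runs entirely off the serving path, an LLM timeout or a failed batch costs a retry, not a customer-facing outage.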

Architecture with Persistence and Consistency

All data persistence was handled via MongoDB as the central operational store:

  • Attribute extraction: The first job pulled raw values and category context
  • AI service: The LLM received cleaned data plus context info (category breadcrumbs, metadata)
  • Deterministic fallbacks: Numeric ranges and simple sets were recognized automatically and sorted by deterministic rules
  • Persistence: Sorted values, refined attribute names, and sort tags were stored in MongoDB
  • Search integration: Updated data flowed into Elasticsearch (keyword search) and Vespa (semantic search)

This persistence structure enabled easy verification, overwrites, and resynchronization with other systems.
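A sketch of the persistence step using pymongo follows; the database, collection, and field names are assumptions for illustration, not the original schema:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    attributes = client["catalog"]["attribute_sorts"]  # hypothetical database/collection

    def persist_sorted_attribute(category_id, attribute, sorted_values, sort_tag):
        """Upsert one sorted attribute; a downstream job syncs it to search."""
        attributes.update_one(
            {"category_id": category_id, "attribute": attribute},
            {
                "$set": {
                    "sorted_values": sorted_values,
                    "sort_tag": sort_tag,       # e.g. "LLM_SORT" or "MANUAL_SORT"
                    "needs_search_sync": True,  # flag for the Elasticsearch/Vespa sync job
                }
            },
            upsert=True,
        )

Keeping one document per category/attribute pair makes verification and targeted overwrites simple: a merchant correction is just another upsert.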

Hybrid Control: AI Meets Merchant Decisions

Not every attribute requires AI intelligence. Therefore, each category could be tagged:

  • LLM_SORT: The model makes sorting decisions
  • MANUAL_SORT: Merchants define the order manually

This dual tagging system built trust. Humans retained control over critical business attributes, while AI handled the routine work, without interrupting the pipeline.
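A sketch of how this dual tagging could gate the pipeline (the CategoryConfig shape and the injected llm_sort callable are hypothetical, not from the original system):

    from dataclasses import dataclass, field

    LLM_SORT = "LLM_SORT"
    MANUAL_SORT = "MANUAL_SORT"

    @dataclass
    class CategoryConfig:
        breadcrumbs: list[str]
        sort_tags: dict = field(default_factory=dict)      # attribute -> tag
        manual_orders: dict = field(default_factory=dict)  # attribute -> merchant order

    def sort_attribute_values(category, attribute, values, llm_sort):
        """Route sorting by tag: merchant decisions always win over the model."""
        tag = category.sort_tags.get(attribute, LLM_SORT)
        if tag == MANUAL_SORT:
            # Merchant-defined order is authoritative and never overwritten by AI.
            return category.manual_orders[attribute]
        # LLM_SORT: the model proposes an order, given the category context.
        return llm_sort(values, category.breadcrumbs)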

Data Cleanup as a Foundation

Before applying AI, a critical preprocessing step was performed:

  • Trim whitespace
  • Remove empty values
  • Remove duplicates
  • Standardize category contexts

This seemingly simple cleanup dramatically improved LLM accuracy. Clean inputs led to consistent results, a fundamental principle at scale.
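The cleanup itself is plain, deterministic code. A minimal version of the steps above might look like this (case-insensitive deduplication is an added assumption):

    def clean_values(raw_values):
        """Trim whitespace, drop empties, and deduplicate while preserving order."""
        seen = set()
        cleaned = []
        for value in raw_values:
            value = value.strip()        # trim whitespace
            if not value:
                continue                 # remove empty values
            key = value.casefold()       # treat "XL" and "xl" as duplicates
            if key in seen:
                continue
            seen.add(key)
            cleaned.append(value)
        return cleaned

    # clean_values([" XL ", "", "xl", "Small"]) -> ["XL", "Small"]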

Transformation in Practice

The pipeline transformed chaotic raw data into structured outputs:

Attribute | Raw Values                                      | Sorted Output
----------|-------------------------------------------------|----------------------------------------
Size      | XL, Small, 12cm, Large, M, S                    | Small, M, Large, XL, 12cm
Color     | RAL 3020, Crimson, Red, Dark Red                | Red, Dark Red, Crimson, Red (RAL 3020)
Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric   | 5cm, 12cm, 2cm, 20cm                            | 2cm, 5cm, 12cm, 20cm
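The Numeric row illustrates the deterministic fallback: when every value parses as a number plus a unit, no LLM call is needed. A sketch of such a rule (the regex and function name are illustrative):

    import re

    _NUMERIC = re.compile(r"^(\d+(?:[.,]\d+)?)\s*([a-zA-Z%]*)$")

    def try_numeric_sort(values):
        """Sort by magnitude if ALL values are numeric, else return None."""
        parsed = []
        for value in values:
            match = _NUMERIC.match(value.strip())
            if not match:
                return None  # mixed set: hand off to the LLM path instead
            number = float(match.group(1).replace(",", "."))
            parsed.append((number, value))
        return [value for _, value in sorted(parsed)]

    # try_numeric_sort(["5cm", "12cm", "2cm", "20cm"]) -> ["2cm", "5cm", "12cm", "20cm"]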

Business Impact

The results were substantial:

  • Consistent attribute sorting across 3M+ SKUs
  • Predictable numeric sorting via deterministic logic
  • Improved search relevance
  • Intuitive filters on product pages
  • Increased customer trust and higher conversion rates

This was not just a technical victory; it was a win for user experience and revenue.

Key Takeaways

  • Hybrid surpasses pure AI: Guardrails are essential at scale
  • Context is king: Better context = significantly better LLM results
  • Offline architecture creates resilience: Background jobs are fundamental for throughput
  • Persistence without loss of control: Human override mechanisms build trust
  • Clean inputs = reliable outputs: Data quality determines AI success

Conclusion

Sorting attribute values may seem trivial, but at the scale of millions of products it becomes a real problem. Combining LLM intelligence, explicit rules, persistence, and merchant control produced an elegant system that addresses a complex, hidden challenge. It is a reminder that the greatest successes often come from solving the boring, overlooked problems that affect every product page.
