How hybrid AI architectures manage millions of product attributes consistently

The Hidden Problem of E-Commerce Scaling

Most discussions around scaling in e-commerce focus on distributed search and recommendation engines. But beneath the surface lies a more persistent, often overlooked problem: managing attribute values in product catalogs. With over 3 million SKUs, this quickly becomes a systemic issue.
Attribute values are the foundation of product discovery. They drive filters, comparisons, and search rankings. But in practice, they are fragmented: “XL”, “Small”, “12cm”, and “Large” mixed in one field. Or colors like “RAL 3020”, “Crimson”, “Red”, and “Dark Red” without a consistent structure. Multiply these inconsistencies across dozens of attributes per product, and the problem grows exponentially.
Filters behave unpredictably, search relevance drops, and customer navigation becomes frustrating. At the same time, merchants drown in manual data cleanup.
The Solution: Intelligent Hybrid Pipelines with Control Mechanisms
Instead of a black-box AI that arbitrarily sorts data, an architecture with three pillars was developed:
Explainability: Every decision is traceable
Predictability: The system behaves consistently
Human Control: Merchandisers can manually set critical attributes
The result was a hybrid pipeline combining LLM intelligence with clear rules and data persistence. It acts intelligently but remains controllable: AI with guardrails, not an uncontrolled black box.
Offline Processing Instead of Real-Time Pipelines
A critical design decision was to run the pipeline as background jobs rather than as a real-time service inside the live system. This may sound like a compromise, but it was strategically sound:
Real-time processing would mean:
Unpredictable latency
Fragile system dependencies
Costly peaks in computation
Operational complications
Offline jobs offered:
Massive throughput without affecting customer traffic
Resilience: failures never impacted live systems
Cost control through scheduled processing
Isolation from LLM latency
Atomic, predictable updates
Separating customer-facing systems from data processing pipelines is crucial at the millions-of-SKUs scale.
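The batching idea above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the function names, the in-memory catalog, and the placeholder sorting step are all assumptions standing in for the real MongoDB-backed jobs.

```python
def sort_attribute_values(values):
    # Placeholder for the hybrid LLM/rule-based sorting step
    # described later; here it just sorts alphabetically.
    return sorted(values)

def run_offline_job(catalog, batch_size=500):
    """Process attributes in fixed-size batches, isolated from live traffic."""
    attrs = list(catalog.items())
    for start in range(0, len(attrs), batch_size):
        batch = attrs[start:start + batch_size]
        # Each batch is an atomic unit: results are written back only
        # after the whole batch succeeds, so a mid-run failure never
        # leaves the catalog with partially sorted data.
        results = {name: sort_attribute_values(vals) for name, vals in batch}
        catalog.update(results)

catalog = {"size": ["XL", "S", "M"], "length": ["12cm", "2cm"]}
run_offline_job(catalog, batch_size=1)
```

Because the job runs against its own store and writes complete batches, a crash simply means re-running the job; the customer-facing systems never see an intermediate state.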
Architecture with Persistence and Consistency
All data persistence was handled via MongoDB as the central operational store:
Attribute extraction: The first job pulled raw values and category context
AI service: The LLM received cleaned data plus context info (category breadcrumbs, metadata)
Deterministic fallbacks: Numeric ranges and simple sets were automatically recognized and sorted by deterministic rules
Persistence: Sorted values, refined attribute names, and sort tags were stored in MongoDB
Search integration: Updated data flowed into Elasticsearch (keyword search) and Vespa (semantic search)
This persistence structure enabled easy verification, overwrites, and resynchronization with other systems.
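The persistence step can be pictured as one document per category/attribute pair. The document shape below is an assumption for illustration; field names like sort_tag are not the production schema, and a plain dict stands in for the MongoDB collection.

```python
def build_attribute_doc(category, attribute, sorted_values, sort_tag):
    return {
        "_id": f"{category}:{attribute}",  # stable key enables idempotent upserts
        "category": category,
        "attribute": attribute,
        "sorted_values": sorted_values,
        "sort_tag": sort_tag,              # e.g. "LLM_SORT" or "MANUAL_SORT"
    }

def upsert(store, doc):
    # With pymongo this would be roughly:
    #   collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
    store[doc["_id"]] = doc
    return doc["_id"]

store = {}
upsert(store, build_attribute_doc(
    "apparel", "size", ["Small", "M", "Large", "XL"], "LLM_SORT"))
```

A stable `_id` is what makes verification, manual overwrites, and resynchronization cheap: re-running a job simply replaces the same documents.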
Hybrid Control: AI Meets Merchant Decisions
Not every attribute requires AI intelligence. Therefore, each category could be tagged:
LLM_SORT: The model makes sorting decisions
MANUAL_SORT: Merchants define the order manually
This dual tagging system built trust. Humans retained control over critical business attributes, while AI handled routine work without interrupting the pipeline.
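The tag-based dispatch can be sketched as follows. The function names and the llm_sort stub are assumptions; the point is only that a MANUAL_SORT tag short-circuits the model entirely.

```python
def llm_sort(values):
    # Stand-in for the model call; here just alphabetical.
    return sorted(values)

def sort_category_attribute(values, tag, manual_order=None):
    if tag == "MANUAL_SORT":
        # The merchant-defined order wins; values the merchant
        # did not list fall to the end.
        order = {v: i for i, v in enumerate(manual_order or [])}
        return sorted(values, key=lambda v: order.get(v, len(order)))
    if tag == "LLM_SORT":
        return llm_sort(values)
    raise ValueError(f"unknown sort tag: {tag}")
```

For example, a merchant order of ["Small", "M", "Large"] reorders ["Large", "Small", "M"] deterministically, with no model call involved.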
Data Cleanup as a Foundation
Before applying AI, a critical preprocessing step was performed:
Trim whitespace
Remove empty values
Deduplicate values
Standardize category contexts
This seemingly simple cleanup dramatically improved LLM accuracy. Clean inputs led to consistent results—a fundamental principle at scale.
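The cleanup steps above fit in one small function. This is a minimal sketch under one assumption not stated in the text: deduplication ignores case while keeping the first spelling seen.

```python
def clean_values(raw_values):
    """Trim whitespace, drop empty values, deduplicate case-insensitively."""
    seen = set()
    cleaned = []
    for v in raw_values:
        v = v.strip()          # trim whitespace
        if not v:              # remove empty values
            continue
        key = v.lower()        # "XL" and "xl" count as duplicates
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(v)
    return cleaned

clean_values([" XL ", "", "xl", "Small"])  # -> ["XL", "Small"]
```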
Transformation in Practice
The pipeline transformed chaotic raw data into structured outputs:
Attribute | Raw Values                                      | Sorted Output
Size      | XL, Small, 12cm, Large, M, S                    | Small, M, Large, XL, 12cm
Color     | RAL 3020, Crimson, Red, Dark Red                | Red, Dark Red, Crimson, Red (RAL 3020)
Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric   | 5cm, 12cm, 2cm, 20cm                            | 2cm, 5cm, 12cm, 20cm
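The numeric row is where the deterministic fallback applies: no LLM is needed to order "5cm, 12cm, 2cm, 20cm". A sketch of that rule, with the regex pattern as an assumption about how numeric values are recognized:

```python
import re

# One number, optionally with a decimal part, followed by an optional unit.
NUMERIC = re.compile(r"^(\d+(?:\.\d+)?)\s*([a-zA-Z]*)$")

def try_numeric_sort(values):
    """Sort by the numeric part if every value matches; else defer to the LLM."""
    if not all(NUMERIC.match(v) for v in values):
        return None  # not a purely numeric attribute
    return sorted(values, key=lambda v: float(NUMERIC.match(v).group(1)))

try_numeric_sort(["5cm", "12cm", "2cm", "20cm"])  # -> ["2cm", "5cm", "12cm", "20cm"]
```

Sorting on the parsed number rather than the string avoids the classic lexicographic trap where "12cm" would land before "2cm".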
Business Impact
The results were substantial:
Consistent attribute sorting across 3M+ SKUs
Predictable numeric sorting via deterministic logic
Improved search relevance
Intuitive filters on product pages
Increased customer trust and higher conversion rates
This was not just a technical victory—it was a win for user experience and revenue.
Key Takeaways
Hybrid surpasses pure AI: Guardrails are essential at scale
Context is king: Better context = significantly better LLM results
Offline architecture creates resilience: Background jobs are fundamental for throughput
Persistence without loss of control: Human override mechanisms build trust
Clean inputs = reliable outputs: Data quality determines AI success
Conclusion
Sorting attribute values may seem trivial, but it becomes a real problem with millions of products. By combining LLM intelligence, explicit rules, persistence, and merchant control, an elegant system was created that addresses complex, hidden challenges. It reminds us that the greatest successes often come from solving boring, overlooked problems—those that impact every product page.