The Invisible Chaos: How Inconsistent Product Attributes Sabotage E-Commerce at Scale

When retailers talk about scaling, they think of search engines, real-time inventory, and checkout optimization. These are visible problems. But beneath the surface lurks a more stubborn one: attribute values that simply don’t match. In real product catalogs, these values are rarely consistent. They are formatted differently, semantically ambiguous, or just incorrect. And when you multiply this across millions of products, a small annoyance becomes a systemic disaster.
The Problem: Small individually, but massive at scale
Let’s take concrete examples:
Size: “XL”, “Small”, “12cm”, “Large”, “M”, “S” — all mixed together
Color: “Red”, “Dark Red”, “Crimson”, “RAL 3020” — free text next to industry codes
Material: “Steel”, “Stainless”, “Stainless Steel”, “Carbon Steel” — overlapping, partly redundant names
Each of these examples seems harmless on its own. But once you’re working with more than 3 million SKUs, each with dozens of attributes, a real problem arises:
Filters behave unpredictably
Search engines lose relevance
Customer searches become frustrating
Teams drown in manual data cleaning
This is the silent suffering lurking behind almost every large e-commerce catalog.
The approach: AI with guardrails instead of chaos algorithms
I didn’t want a black-box solution that sorts things in ways nobody understands. Instead, I aimed for a hybrid pipeline that:
remains explainable
works predictably
truly scales
can be controlled by humans
The result: AI that thinks intelligently but always remains transparent.
The architecture: Offline jobs instead of real-time madness
All attribute processing runs in the background—not in real time. This was not a quick fix but a strategic design decision.
Real-time pipelines sound tempting but lead to:
unpredictable delays
expensive compute peaks
fragile dependencies
operational chaos
Offline jobs, on the other hand, provide:
Massive throughput (huge data volumes without stressing live systems)
Fault tolerance (failures never reach customers)
Cost control (computation scheduled during off-peak hours)
Consistency (atomic, predictable updates)
Separating customer-facing systems from data processing is crucial at this scale.
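As a minimal sketch, the job shape might look like this, assuming a generic nightly scheduler; the function names are illustrative, not the production code:

```python
import logging

def process_batch(batch):
    """Placeholder for the real pipeline: clean -> sort -> persist to MongoDB."""
    ...

def nightly_attribute_job(categories, batch_size=500):
    """Walk the catalog in fixed-size batches during off-peak hours.
    A failed batch is logged and retried on the next run; it never
    touches a customer-facing system."""
    for start in range(0, len(categories), batch_size):
        batch = categories[start:start + batch_size]
        try:
            process_batch(batch)
        except Exception:
            # Fault tolerance: isolate the failure, keep the job running.
            logging.exception("Batch starting at %d failed", start)
```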
The process: From trash to clean data
Before AI touches the data, a critical cleaning step occurs:
Trim whitespace
Remove empty values
Remove duplicates
Format category context as clean strings
This guarantees that the LLM works with clean inputs. The principle is simple: Garbage in, garbage out. Small errors at this scale lead to big problems later.
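As a sketch, the normalization step might look like this; the function and names are illustrative, not the actual implementation:

```python
import re

def clean_attribute_values(raw_values, category_path):
    """Normalize raw attribute values before they reach the LLM."""
    cleaned, seen = [], set()
    for value in raw_values:
        value = re.sub(r"\s+", " ", value.strip())  # trim and collapse whitespace
        if not value:                               # drop empty values
            continue
        key = value.lower()
        if key in seen:                             # drop duplicates
            continue
        seen.add(key)
        cleaned.append(value)
    # Format category context as a clean breadcrumb string.
    context = " > ".join(part.strip() for part in category_path if part.strip())
    return cleaned, context

values, context = clean_attribute_values(
    ["  XL", "Small", "", "12cm", "small ", "Large"],
    ["Apparel", "Men", "Shirts"],
)
# values -> ['XL', 'Small', '12cm', 'Large'], context -> 'Apparel > Men > Shirts'
```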
The LLM service: Smarter than just sorting
The LLM doesn’t work blindly alphabetically. It thinks contextually.
It receives:
Cleaned attribute values
Category breadcrumbs
Attribute metadata
With this context, the model understands:
That “Voltage” in power tools is numeric
That “Size” in clothing follows a known progression
That “Color” may follow RAL standards
That “Material” has semantic relationships
It returns:
Ordered values
Refined attribute names
A decision: deterministic or AI-driven sorting
This allows handling different attribute types without coding each category individually.
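For illustration, the contract with the LLM service might look like this; the field names and prompt wording are assumptions, not the real API:

```python
import json

# Hypothetical request payload for one attribute in one category.
request = {
    "attribute": "Size",
    "values": ["XL", "Small", "12cm", "Large", "M", "S"],
    "category_breadcrumb": "Apparel > Men > Shirts",
    "metadata": {"value_type": "string"},
}

prompt = (
    "Sort these attribute values in the order a shopper would expect, "
    "given the category context.\n"
    f"{json.dumps(request, indent=2)}\n"
    'Respond as JSON: {"sorted_values": [...], "refined_name": "...", '
    '"strategy": "llm" or "deterministic"}'
)

# Expected shape of the model's structured answer:
response = {
    "sorted_values": ["S", "Small", "M", "Large", "XL", "12cm"],
    "refined_name": "Size",
    "strategy": "llm",
}
```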
Deterministic fallbacks: Not everything needs AI
Many attributes work better without artificial intelligence: purely numeric values such as “5cm, 12cm, 2cm, 20cm”, for example, can be parsed and ordered directly.
The pipeline automatically detects these cases and uses deterministic logic, as sketched below. This keeps the system efficient and avoids unnecessary LLM calls.
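A minimal sketch of such a detector, assuming simple number-plus-unit values:

```python
import re

NUMERIC_RE = re.compile(r"^(\d+(?:\.\d+)?)\s*([a-zA-Z%]*)$")

def try_deterministic_sort(values):
    """Sort uniformly numeric values (e.g. '2cm', '12cm') without an LLM call.
    Returns None when values are mixed or free-text, signalling the
    pipeline to fall back to the LLM. A sketch, not the production detector."""
    parsed, units = [], set()
    for value in values:
        match = NUMERIC_RE.match(value.strip())
        if not match:
            return None                      # free-text value: not our case
        parsed.append((float(match.group(1)), value))
        units.add(match.group(2).lower())
    if len(units) > 1:
        return None                          # mixed units: let the LLM decide
    return [value for _, value in sorted(parsed)]

print(try_deterministic_sort(["5cm", "12cm", "2cm", "20cm"]))
# -> ['2cm', '5cm', '12cm', '20cm']
print(try_deterministic_sort(["XL", "Small", "12cm"]))  # -> None
```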
Human vs. machine: Dual control
Retailers need control over critical attributes. Therefore, each category can be marked as:
LLM_SORT — the model decides
MANUAL_SORT — merchants define the order
This system distributes the workload: AI handles the bulk, humans make final decisions. It also builds trust, as teams can override the model when needed.
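A sketch of how this routing could look; the types and names are illustrative:

```python
from dataclasses import dataclass
from enum import Enum

class SortMode(Enum):
    LLM_SORT = "LLM_SORT"        # the model decides the order
    MANUAL_SORT = "MANUAL_SORT"  # merchants define the order

@dataclass
class Category:
    id: str
    sort_mode: SortMode

def resolve_order(category, values, manual_orders, llm_sort):
    """Route one category to the right authority. `manual_orders` maps
    category id -> merchant-defined list; `llm_sort` is the LLM call."""
    if category.sort_mode is SortMode.MANUAL_SORT:
        return manual_orders[category.id]    # humans make the final call
    return llm_sort(values)                  # AI handles the bulk
```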
Infrastructure: Simple, centralized, scalable
All results are stored directly in a MongoDB database—the only operational storage for:
Sorted attribute values
Refined attribute names
Category tags
Product-specific sort orders
This makes it easy to review changes, overwrite values, reprocess categories, and synchronize with other systems.
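For illustration, a stored document and its write might look like this with pymongo; the schema shown is an assumption, not the actual one:

```python
from pymongo import MongoClient

db = MongoClient()["catalog"]

attribute_doc = {
    "category_id": "power-tools/drills",
    "attribute_name": "Voltage",              # refined attribute name
    "sorted_values": ["12V", "18V", "20V"],
    "sort_mode": "LLM_SORT",                  # or "MANUAL_SORT"
}

# An upsert keeps reprocessing a category idempotent: rerunning the job
# simply overwrites the previous result.
db.attribute_orders.replace_one(
    {"category_id": attribute_doc["category_id"],
     "attribute_name": attribute_doc["attribute_name"]},
    attribute_doc,
    upsert=True,
)
```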
Search integration: Where quality becomes visible
After sorting, values flow into two search systems:
Elasticsearch for keyword search
Vespa for semantic and vector-based search
This ensures:
Filters appear in logical order
Product pages show consistent attributes
Search engines rank more accurately
Customers navigate categories more easily
Here, in search, good attribute sorting becomes visible.
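One simple way the stored order can drive the storefront: sort facet buckets by the precomputed rank instead of alphabetically. A sketch with made-up bucket counts:

```python
# The sorted order from the pipeline, attached to the "Size" filter.
sorted_values = ["Small", "M", "Large", "XL"]
rank = {value: i for i, value in enumerate(sorted_values)}

# Facet buckets as a search engine might return them (value, doc_count):
buckets = [("XL", 120), ("Small", 480), ("Large", 210), ("M", 350)]
buckets.sort(key=lambda b: rank.get(b[0], len(rank)))  # unknowns go last
print(buckets)
# -> [('Small', 480), ('M', 350), ('Large', 210), ('XL', 120)]
```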
The results: From chaos to clarity
Attribute
Raw values
Sorted output
Size
XL, Small, 12cm, Large, M, S
Small, M, Large, XL, 12cm
Color
RAL 3020, Crimson, Red, Dark Red
Red, Dark Red, Crimson, RAL 3020(
Material
Steel, Carbon Steel, Stainless, Stainless Steel
Steel, Stainless Steel, Carbon Steel
Numeric
5cm, 12cm, 2cm, 20cm
2cm, 5cm, 12cm, 20cm
The impact was measurable:
Consistent sorting across 3M+ SKUs
Predictable numeric sequences
Full merchant control via tagging
Intuitive filters and cleaner pages
Better search relevance
Higher customer conversion
Key lessons
Hybrid beats pure AI: Guardrails are critical at scale
Context is king: It dramatically improves model accuracy
Offline processing wins: Necessary for throughput and reliability
Human control builds trust: Override mechanisms are not bugs, they are features
Clean inputs are foundational: No shortcuts in data cleaning
Sorting attribute values may seem trivial, but it becomes a real challenge with millions of products. Combining LLM intelligence with clear rules and merchant control creates a system that transforms invisible chaos into scalable clarity.