The Invisible Chaos: How Inconsistent Product Attributes Sabotage E-Commerce at Scale

When retailers talk about scaling, they think of search engines, real-time inventory, and checkout optimization. These are visible problems. But beneath the surface lurks a more stubborn one: attribute values that simply don’t match. In real product catalogs, these values are rarely consistent. They are formatted differently, semantically ambiguous, or just incorrect. And when you multiply this across millions of products, a small annoyance becomes a systemic disaster.

The Problem: Small individually, enormous at scale

Let’s take concrete examples:

  • Size: “XL”, “Small”, “12cm”, “Large”, “M”, “S” — all mixed together
  • Color: “RAL 3020”, “Crimson”, “Red”, “Dark Red” — partly standards, partly colloquial
  • Material: “Steel”, “Carbon Steel”, “Stainless”, “Stainless Steel” — redundant and unclear

Each of these examples seems harmless on its own. But once you’re working with more than 3 million SKUs, each with dozens of attributes, a real problem arises:

  • Filters behave unpredictably
  • Search engines lose relevance
  • Customer searches become frustrating
  • Teams drown in manual data cleaning

This is the quiet problem hiding behind almost every large e-commerce catalog.

The approach: AI with guardrails instead of chaos algorithms

I didn’t want a black-box solution that sorts mysterious things nobody understands. Instead, I aimed for a hybrid pipeline that:

  • remains explainable
  • works predictably
  • truly scales
  • can be controlled by humans

The result: AI that thinks intelligently but always remains transparent.

The architecture: Offline jobs instead of real-time madness

All attribute processing runs in the background—not in real time. This was not a quick fix but a strategic design decision.

Real-time pipelines sound tempting but lead to:

  • unpredictable delays
  • expensive compute peaks
  • fragile dependencies
  • operational chaos

Offline jobs, on the other hand, provide:

  • Massive throughput (huge data volumes without stressing live systems)
  • Fault tolerance (failures never reach customers)
  • Cost control (computations run during low-traffic periods)
  • Consistency (atomic, predictable updates)

Separating customer-facing systems from data processing is crucial at this scale.
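
As a rough illustration of the fault-tolerance point, an offline job can process each category independently and record failures instead of aborting. This is a minimal sketch, not the author's implementation: run_offline_job and process_category are hypothetical names, and a real job would add logging, retries, and scheduling.

```python
from typing import Callable, Iterable


def run_offline_job(
    category_ids: Iterable[str],
    process_category: Callable[[str], None],
) -> dict[str, list[str]]:
    """Process categories one by one; a failure in one never stops the rest."""
    report: dict[str, list[str]] = {"succeeded": [], "failed": []}
    for category_id in category_ids:
        try:
            process_category(category_id)
        except Exception:
            # The failure stays inside the batch job and can be retried
            # in the next run; nothing customer-facing is affected.
            report["failed"].append(category_id)
        else:
            report["succeeded"].append(category_id)
    return report
```

The returned report gives operators a list of categories to reprocess, which fits a background job better than raising on the first error.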

The process: From trash to clean data

Before AI touches the data, a critical cleaning step occurs:

  • Trim whitespace
  • Remove empty values
  • Remove duplicates
  • Format category context as clean strings

This guarantees that the LLM works with clean inputs. The principle is simple: Garbage in, garbage out. Small errors at this scale lead to big problems later.
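
The cleaning steps listed above could look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, duplicates are compared case-insensitively, and " > " is an assumed breadcrumb separator. None of these details is specified in the original pipeline.

```python
from typing import Iterable, Optional


def clean_values(raw_values: Iterable[Optional[str]]) -> list[str]:
    """Trim whitespace, drop empty values, and remove duplicates (first one wins)."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for value in raw_values:
        if value is None:
            continue
        # split/join trims the ends and collapses repeated inner whitespace
        text = " ".join(str(value).split())
        if not text:
            continue
        key = text.casefold()  # assumption: "XL" and "xl" are the same value
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


def format_category_context(breadcrumbs: Iterable[str]) -> str:
    """Join category breadcrumbs into one clean string."""
    parts = (" ".join(part.split()) for part in breadcrumbs)
    return " > ".join(part for part in parts if part)
```

For example, clean_values(["  XL ", "", "xl", "Small"]) keeps only "XL" and "Small", and format_category_context([" Tools ", "Power  Tools"]) yields "Tools > Power Tools".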

The LLM service: Smarter than just sorting

The LLM does not simply sort alphabetically. It reasons about context.

It receives:

  • Cleaned attribute values
  • Category breadcrumbs
  • Attribute metadata

With this context, the model understands:

  • That “Voltage” in power tools is numeric
  • That “Size” in clothing follows a known progression
  • That “Color” may follow RAL standards
  • That “Material” has semantic relationships

It returns:

  • Ordered values
  • Refined attribute names
  • A decision: deterministic or AI-driven sorting

This allows handling different attribute types without coding each category individually.
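
To make the inputs and outputs above concrete, here is one possible request and response contract. Everything in it is an assumption for illustration: the class names, the JSON field names, the prompt wording, and the check that the model returns no values it was not given. The actual call to a model is left out on purpose.

```python
import json
from dataclasses import dataclass


@dataclass
class SortRequest:
    attribute_name: str
    values: list[str]                 # already cleaned
    category_breadcrumbs: list[str]
    metadata: dict[str, str]


@dataclass
class SortResult:
    ordered_values: list[str]
    refined_name: str
    strategy: str                     # "deterministic" or "llm"


def build_prompt(request: SortRequest) -> str:
    """Serialize the context the model receives into a single prompt string."""
    payload = {
        "attribute_name": request.attribute_name,
        "values": request.values,
        "category": " > ".join(request.category_breadcrumbs),
        "metadata": request.metadata,
    }
    return (
        "Order the attribute values for display in a filter. "
        "Reply with JSON containing ordered_values, refined_name, and strategy.\n"
        + json.dumps(payload, ensure_ascii=False)
    )


def parse_response(raw: str, request: SortRequest) -> SortResult:
    """Parse the model's JSON reply and reject values it was never given."""
    data = json.loads(raw)
    ordered = list(data["ordered_values"])
    unknown = set(ordered) - set(request.values)
    if unknown:  # assumed guardrail: the model must not invent values
        raise ValueError(f"model returned unknown values: {sorted(unknown)}")
    if data["strategy"] not in ("deterministic", "llm"):
        raise ValueError(f"unexpected strategy: {data['strategy']!r}")
    return SortResult(ordered, data["refined_name"], data["strategy"])
```

Validating the reply before it is stored is one way to keep the model's output explainable and safe to apply in bulk.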

Deterministic fallbacks: Not everything needs AI

Many attributes work better without artificial intelligence:

  • Numeric ranges (e.g., 5cm, 12cm, 20cm sort themselves)
  • Unit-based values
  • Simple quantities

These receive:

  • Faster processing
  • Predictable sorting
  • Lower costs
  • Zero ambiguity

The pipeline automatically detects these cases and uses deterministic logic. This keeps the system efficient and avoids unnecessary LLM calls.
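
A simple way to detect such cases is to check whether every value parses as a number with one shared unit. The sketch below assumes exactly that rule. The regular expression, the function names, and the choice to hand mixed units back to the LLM path are illustrative, not taken from the original system.

```python
import re
from typing import Optional

# Assumption: a deterministic value is a number with an optional unit,
# such as "12cm", "1.5 kg", or "20".
_NUMERIC = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([a-zA-Z%]*)\s*$")


def parse_numeric(value: str) -> Optional[tuple[float, str]]:
    """Return (number, unit) or None if the value is not purely numeric."""
    match = _NUMERIC.match(value)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


def sort_deterministically(values: list[str]) -> Optional[list[str]]:
    """Sort numerically, or return None to signal that the LLM path is needed."""
    parsed = [parse_numeric(value) for value in values]
    if not parsed or any(item is None for item in parsed):
        return None
    units = {item[1] for item in parsed}
    if len(units) != 1:
        return None  # mixed units would need conversion, which this sketch skips
    ordered = sorted(zip(parsed, values), key=lambda pair: pair[0][0])
    return [value for _, value in ordered]
```

With this rule, ["5cm", "12cm", "2cm", "20cm"] is sorted without any model call, while a mixed list such as ["XL", "Small", "12cm"] returns None and falls through to the LLM.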

Human vs. machine: Dual control

Retailers need control over critical attributes. Therefore, each category can be marked as:

  • LLM_SORT — the model decides
  • MANUAL_SORT — merchants define the order

This system distributes the workload: AI handles the bulk, humans make final decisions. It also builds trust, as teams can override the model when needed.
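
The two tags can be sketched as a small dispatch step. The tag names LLM_SORT and MANUAL_SORT come from the text; the function signature, the llm_sort callback, and the rule that values missing from the merchant's list go last are assumptions.

```python
from enum import Enum
from typing import Callable


class SortMode(Enum):
    LLM_SORT = "LLM_SORT"
    MANUAL_SORT = "MANUAL_SORT"


def apply_sort(
    values: list[str],
    mode: SortMode,
    manual_order: list[str],
    llm_sort: Callable[[list[str]], list[str]],
) -> list[str]:
    """Use the merchant's order when tagged MANUAL_SORT, otherwise ask the model."""
    if mode is SortMode.MANUAL_SORT:
        rank = {value: index for index, value in enumerate(manual_order)}
        # Values the merchant did not list keep their relative order and go last
        # (sorted() is stable, and they all share the same fallback rank).
        return sorted(values, key=lambda value: rank.get(value, len(rank)))
    return llm_sort(values)
```

Because the merchant's order is applied after the fact, an override never requires retraining or re-prompting the model.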

Infrastructure: Simple, centralized, scalable

All results are stored directly in a MongoDB database—the only operational storage for:

  • Sorted attribute values
  • Refined attribute names
  • Category tags
  • Product-specific sort orders

This makes it easy to review changes, overwrite values, reprocess categories, and synchronize with other systems.
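
One possible shape for a stored record is shown below as plain dictionaries. The field names (category_id, attribute_name, sort_mode, and so on) are hypothetical, and the original schema is not described in the text. The function only builds the filter and update documents and does not talk to a database.

```python
from datetime import datetime, timezone


def build_upsert(
    category_id: str,
    attribute_name: str,
    refined_name: str,
    ordered_values: list[str],
    sort_mode: str,
) -> tuple[dict, dict]:
    """Return (filter, update) documents for an idempotent upsert."""
    filter_doc = {"category_id": category_id, "attribute_name": attribute_name}
    update_doc = {
        "$set": {
            "refined_name": refined_name,
            "ordered_values": ordered_values,
            "sort_mode": sort_mode,  # e.g. "LLM_SORT" or "MANUAL_SORT"
            "updated_at": datetime.now(timezone.utc),
        }
    }
    return filter_doc, update_doc
```

With PyMongo, the pair could be passed to collection.update_one(filter_doc, update_doc, upsert=True), so reprocessing a category overwrites its previous result instead of creating a duplicate.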

Search integration: Where quality becomes visible

After sorting, values flow into two search systems:

  • Elasticsearch for keyword search
  • Vespa for semantic and vector-based search

This ensures:

  • Filters appear in logical order
  • Product pages show consistent attributes
  • Search engines rank more accurately
  • Customers navigate categories more easily

Here, in search, good attribute sorting becomes visible.

The results: From chaos to clarity

Attribute | Raw values                                      | Sorted output
Size      | XL, Small, 12cm, Large, M, S                    | Small, M, Large, XL, 12cm
Color     | RAL 3020, Crimson, Red, Dark Red                | Red, Dark Red, Crimson, RAL 3020
Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric   | 5cm, 12cm, 2cm, 20cm                            | 2cm, 5cm, 12cm, 20cm

The impact was measurable:

  • Consistent sorting across 3M+ SKUs
  • Predictable numeric sequences
  • Full merchant control via tagging
  • Intuitive filters and cleaner pages
  • Better search relevance
  • Higher customer conversion

Key lessons

  1. Hybrid beats pure AI: Guardrails are critical at scale
  2. Context is king: It dramatically improves model accuracy
  3. Offline processing wins: Necessary for throughput and reliability
  4. Human control builds trust: Override mechanisms are not bugs, they are features
  5. Clean inputs are foundational: No shortcuts in data cleaning

Sorting attribute values may seem trivial, but it becomes a real challenge with millions of products. Combining LLM intelligence with clear rules and merchant control creates a system that transforms invisible chaos into scalable clarity.
