MDPs and POMDPs in Retail: Sequential Decisions, Reward Design, and Failure Modes

Series: Foundations of Agentic AI for Retail (Part 4 of 10)
Based on the book: Foundations of Agentic AI for Retail

A one-week decision can create a three-week hangover.

You discount to clear inventory. Demand jumps. The DC drains faster than expected. Next week you're short in stores, and now every "smart" move is a trade: margin, availability, brand, calendar.

That's sequential decision-making. Today's action changes tomorrow's state.

MDPs and POMDPs are a disciplined way to write that reality as a contract: define state, actions, transitions, reward (and, for POMDPs, observations), then choose policies that perform over time, not just today.

Where projects usually fail isn't the math. It's reward design and the simulator you trust too early.

Jump to: Definition | MDP contract | Reward design | Simulation | 30-day checklist

TL;DR

  • MDPs are useful when today's action changes tomorrow's options (pricing, inventory risk, promotions).
  • POMDPs matter when you cannot observe the true state (true demand, substitution, intent).
  • Reward design and simulation fidelity are the two most common failure points.

The One-Sentence Definition

An MDP (or POMDP) is a decision contract for sequential retail problems: define state, actions, transitions, reward, and then choose policies that perform well over time, not just today.

If your decision has a tomorrow, you need to model the consequences of today.

Two Canonical Retail Sequential Problems

  1. Inventory risk over time: ordering today affects OOS risk and cash tied up tomorrow.
  2. Pricing/promo dynamics: price changes influence demand, competitor response, and future inventory position.

If the decision has consequences that last multiple periods, you are in sequential territory.

MDP vs POMDP (Plain English)

  • MDP: you assume you can observe the state you need. Retail intuition: you know on-hand, price, seasonality, and constraints.
  • POMDP: you assume the true state is hidden and must be inferred. Retail intuition: you do not know true demand, substitution, or intent.

In retail, many problems are "MDP-like" and "POMDP-ish" at the edges.

So, start with the minimum contract. Not the math. The contract.

The Minimum MDP Contract (State, Action, Reward)

This is the scaffolding you need before any RL or planning discussion.

from dataclasses import dataclass
from typing import Literal

Action = Literal['raise_price', 'lower_price', 'hold_price']

@dataclass(frozen=True)
class State:
    on_hand: int
    price: float
    season_week: int
    competitor_index: float


def reward(state: State, action: Action) -> float:
    # Stub: encode your actual trade-offs.
    # Think: margin - stockout_penalty - volatility_penalty.
    return 0.0

What teams miss: the state is not "everything". It is "everything required to choose the next safe action."

A Typed MDP Contract (Copy/Paste)

In production, I prefer to make the contract explicit in code.

Two notes that will save you pain:

  • Constraints belong in a policy gate, not hidden inside a reward.
  • Reward should be decomposed so you can debug why a policy prefers one action over another.

export type MdpState = {
  onHand: number;
  price: number;
  seasonWeek: number;
  competitorIndex: number;
};

export type MdpAction =
  | { kind: 'raise_price'; deltaPct: number }
  | { kind: 'lower_price'; deltaPct: number }
  | { kind: 'hold_price' };

export type RewardBreakdown = {
  marginContribution: number;
  stockoutPenalty: number;
  volatilityPenalty: number;
  total: number;
};

export function reward(b: Omit<RewardBreakdown, 'total'>): number {
  // Derive the total from its parts so the breakdown can never drift out of sync
  // with the number the policy actually optimizes.
  return b.marginContribution - b.stockoutPenalty - b.volatilityPenalty;
}

The contract is the point. The model can change later.
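
The note about constraints deserves a concrete shape: a policy gate filters candidate actions against hard rules before anything is scored. Here is a minimal Python sketch, where the thresholds (min_margin_pct, max_weekly_change_pct) are illustrative assumptions, not real business rules:

from dataclasses import dataclass

@dataclass(frozen=True)
class PriceAction:
    kind: str          # 'raise_price' | 'lower_price' | 'hold_price'
    delta_pct: float   # signed weekly change in percent; 0 for hold

def is_allowed(action: PriceAction, *, unit_cost: float, current_price: float,
               min_margin_pct: float = 10.0, max_weekly_change_pct: float = 15.0) -> bool:
    # Hard constraints live here, not inside the reward.
    new_price = current_price * (1 + action.delta_pct / 100)
    margin_pct = (new_price - unit_cost) / new_price * 100
    if margin_pct < min_margin_pct:
        return False   # blocked: price would fall below the margin floor
    if abs(action.delta_pct) > max_weekly_change_pct:
        return False   # blocked: move is too large for a single week
    return True

def gated_actions(candidates: list[PriceAction], **context) -> list[PriceAction]:
    # Score only what survives the gate; the reward never has to "learn" the rules.
    return [a for a in candidates if is_allowed(a, **context)]

The gate is deliberately boring: hard rules stay explicit and auditable, and a clever policy cannot trade them away.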

Then comes the part that quietly determines whether you ship: reward design.

Reward Design (Where Retail Projects Quietly Fail)

If your reward is wrong, the policy can be brilliantly wrong.

A practical decomposition:

reward = margin_contribution
       - stockout_penalty
       - inventory_risk_penalty
       - rule_violation_penalty (infinite/blocked)
       - volatility_penalty ("do not thrash")

In retail, you often want a reward that is:

  • monotonic where possible (more violations -> worse)
  • bounded (avoid one term dominating due to units)
  • aligned with KPIs you can measure and explain
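
A minimal Python sketch of that decomposition with those properties in mind; the caps and the blocked flag are illustrative assumptions, not recommended values:

from dataclasses import dataclass

def bounded(x: float, cap: float) -> float:
    # Keep each term bounded so one component cannot dominate just because of its units.
    return max(-cap, min(cap, x))

@dataclass(frozen=True)
class RewardBreakdown:
    margin_contribution: float
    stockout_penalty: float        # monotonic: more expected stockouts -> larger penalty
    inventory_risk_penalty: float
    volatility_penalty: float      # "do not thrash": grows with week-over-week swings
    blocked: bool                  # rule violations are blocked, never merely penalized

    @property
    def total(self) -> float:
        if self.blocked:
            return float("-inf")
        return (
            bounded(self.margin_contribution, cap=100.0)
            - bounded(self.stockout_penalty, cap=100.0)
            - bounded(self.inventory_risk_penalty, cap=100.0)
            - bounded(self.volatility_penalty, cap=100.0)
        )

Keeping the components visible, rather than collapsing them into one scalar upstream, is what makes "why did the policy prefer this action?" answerable in a review.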

Transitions and Simulation (The Non-Negotiable Requirement)

Sequential models require a transition story: how does the world evolve when you act?

You do not need a perfect simulator. You need one that is:

  • good enough to rank policy choices
  • honest about uncertainty
  • connected to a baseline policy
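
A toy transition function is often enough to start ranking policies. The sketch below assumes a crude constant-elasticity demand curve with made-up numbers; the point is the shape of the loop (act, transition, observe), not the forecast:

import random
from dataclasses import dataclass

REFERENCE_PRICE = 10.0   # assumed everyday price for the toy demand curve
BASE_DEMAND = 100.0      # assumed weekly units sold at the reference price
ELASTICITY = -1.5        # made-up constant price elasticity

@dataclass(frozen=True)
class SimState:
    on_hand: int
    price: float
    week: int

def step(state: SimState, delta_pct: float, rng: random.Random) -> tuple[SimState, dict]:
    # One week forward: apply the price move, draw demand, update inventory.
    new_price = state.price * (1 + delta_pct / 100)
    expected = BASE_DEMAND * (new_price / REFERENCE_PRICE) ** ELASTICITY
    demand = max(0, round(rng.gauss(expected, 0.2 * expected)))
    sales = min(demand, state.on_hand)   # observed sales are censored by on-hand stock
    nxt = SimState(on_hand=state.on_hand - sales, price=new_price, week=state.week + 1)
    return nxt, {"demand": demand, "sales": sales, "lost_sales": demand - sales}

def rollout(policy, start: SimState, weeks: int, seed: int) -> int:
    # Total units sold over the horizon under a given policy.
    rng = random.Random(seed)
    state, total = start, 0
    for _ in range(weeks):
        state, info = step(state, policy(state), rng)
        total += info["sales"]
    return total

def hold_price(s: SimState) -> float:        # baseline: do nothing
    return 0.0

def clearance(s: SimState) -> float:         # discount 5% only while overstocked
    return -5.0 if s.on_hand > 300 else 0.0

start = SimState(on_hand=600, price=10.0, week=1)
print(rollout(hold_price, start, weeks=8, seed=42),
      rollout(clearance, start, weeks=8, seed=42))

Running the candidate and the baseline on the same seeds keeps the comparison fair; adding more seeds is how you stay honest about uncertainty.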

When Not to Use an MDP/POMDP

If you cannot answer these, pause:

  • Can we simulate transitions at a useful fidelity?
  • Can we constrain the action space to be safe?
  • Can we evaluate offline before acting online?

If the answer is "no" to any of them, you might be better off with planning/optimization and explicit guardrails.

POMDPs: Belief State Is the Real State

In a POMDP, the agent maintains a belief about hidden variables.

Retail example: you do not observe "true demand". You observe sales that are censored by OOS.

Hidden state: true demand, substitution patterns, intent
Observations: sales, OOS flags, traffic, competitor promos
Belief: a probability distribution over hidden state
Action: price / replenishment / promo choice

If you ignore partial observability, you will overfit to what is easy to measure.
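
Here is a minimal belief update for that censored-demand example, assuming a discrete prior over weekly demand (the numbers are illustrative):

def update_belief(prior: dict[int, float], sales: int, on_hand: int) -> dict[int, float]:
    # If sales < on_hand, the week was uncensored and sales reveal demand exactly.
    # If sales == on_hand, we only learn that demand was at least on_hand (censored by OOS).
    posterior = {}
    for demand, p in prior.items():
        if sales < on_hand:
            likelihood = 1.0 if demand == sales else 0.0
        else:
            likelihood = 1.0 if demand >= on_hand else 0.0
        posterior[demand] = p * likelihood
    z = sum(posterior.values())
    if z == 0:
        return prior   # observation impossible under this prior: a sign of model mismatch
    return {d: q / z for d, q in posterior.items()}

# Illustrative prior over weekly demand for one SKU.
prior = {80: 0.20, 100: 0.50, 120: 0.30}
# We stocked 100 units and sold all 100: we cannot tell demand of 100 from 120 apart.
print(update_belief(prior, sales=100, on_hand=100))   # {80: 0.0, 100: 0.625, 120: 0.375}

This is the practical meaning of "belief state is the real state": the agent acts on a distribution over demand, not on the raw sales number.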

Failure Modes (And What To Watch For)

  • Reward hacking: the policy exploits proxy metrics. Mitigation: add penalties, constraints, and audits.
  • Simulator mismatch: offline wins, online losses. Mitigation: baseline comparisons and stress tests.
  • Action explosion: an unbounded action space. Mitigation: discretize, add guardrails, require approvals.
  • Non-stationarity ignored: the policy decays quickly. Mitigation: monitoring plus periodic retraining/retuning.

Implementation Checklist (30 Days)

  • Choose one sequential decision (inventory risk or pricing dynamics).
  • Write the MDP contract: state, actions, reward, constraints.
  • Build a toy simulator and validate it against a baseline policy.
  • Run offline evaluation (backtests) before any live execution.
  • Ship as gated autonomy: propose actions -> approve -> execute.

FAQ

Do I need RL to use MDPs?
No. MDPs are a modeling language. Planning and dynamic programming are valid approaches too.
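
For scale, here is a tabular value iteration over a tiny, made-up inventory MDP; it is planning, not learning, and every number is illustrative:

# Toy inventory MDP solved by value iteration: dynamic programming, no learning.
# Margin 5/unit sold, order cost 2/unit, holding cost 1/unit (all made up).
MAX_STOCK = 5
ACTIONS = range(0, 4)                  # units to order this period
DEMAND = {0: 0.2, 1: 0.5, 2: 0.3}      # assumed weekly demand distribution
GAMMA = 0.9                            # discount factor

def q_value(v: list[float], stock: int, order: int) -> float:
    stocked = min(MAX_STOCK, stock + order)
    ordered = stocked - stock          # cannot stock beyond capacity
    q = -2.0 * ordered
    for d, p in DEMAND.items():
        sales = min(d, stocked)
        left = stocked - sales
        q += p * (5.0 * sales - 1.0 * left + GAMMA * v[left])
    return q

v = [0.0] * (MAX_STOCK + 1)
for _ in range(200):                   # iterate the Bellman update to (near) convergence
    v = [max(q_value(v, s, a) for a in ACTIONS) for s in range(MAX_STOCK + 1)]

policy = [max(ACTIONS, key=lambda a: q_value(v, s, a)) for s in range(MAX_STOCK + 1)]
print(policy)                          # one order quantity per starting stock level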

Are POMDPs too academic?
Not if you treat them as a reminder: the world is partially observable, and belief matters.

What is the fastest retail win from this mindset?
Reward decomposition + guardrails + a baseline policy. Those three prevent most failures.

Talk Abstract (You Can Reuse)

Retail decisions are rarely one-shot. A one-week move can create a three-week hangover.

This talk explains MDPs and POMDPs in plain language for retail, then gets practical: define the minimum contract (state, action, reward, constraints) and design an offline evaluation story before you touch production. The focus is not math theater. It is the real failure modes: reward hacking, simulator mismatch, action space explosion, and partial observability.

Talk title ideas:

  • MDPs for Retail: Sequential Decisions Without the Math Theater
  • Reward Design: The Hidden Killer of Agent Projects
  • POMDP Thinking: When You Do Not Observe the Real State

Next in the Series

Next: Planning vs Reinforcement Learning in Retail: A Production Ladder (Simulation, Guardrails, Hybrid Systems)
