Definition

A state machine, in the context of an agentic system, is the orchestration contract that governs what the system is allowed to do at every step of execution.

This note covers state machines as the runtime control plane for agents: durability, idempotency, retries, approval gates, tool authority, and recovery.

It does not define the state machine formalism itself — states, events, transitions, guards, terminal states, and the comparison to alternative behavior formalisms (Hierarchical State Machines, Behavior Trees, Rule Engines, PDDL, HTN) are covered in 1.2.1 Finite State Machines (FSM).

Note | Lens | Question answered
1.1.1 | Distributed systems thinking | How is the state machine run safely?
1.2.1 | Planning + decision systems | What is the formal model? Why FSM vs others?

The Orchestration Question

The structural question for any agentic system is:

Who controls the next transition?

Three valid answers:

  1. The workflow controls it. (Deterministic orchestration.)
  2. The LLM proposes it and the workflow validates it. (Bounded LLM control.)
  3. The LLM controls it inside a sandbox. (Autonomous loop.)

Most production agentic systems use answer 2.

Each answer corresponds to one of the patterns below.


Architectural Patterns

Deterministic workflow with LLM as worker

The workflow controls every transition. The LLM performs bounded tasks (classify, extract, summarize, draft) inside predefined states.

Use for:

  1. regulated workflows
  2. support automation
  3. document assistants
  4. approval systems
  5. production operations

This is the safest default.


LLM-controlled transitions

The workflow exposes a finite list of allowed transitions. The LLM selects one.

Prompt contract:

{
  "current_state": "planning",
  "allowed_transitions": [
    "retrieve_context",
    "ask_clarifying_question",
    "request_approval",
    "fail"
  ]
}

Any output outside allowed_transitions is rejected at the event boundary.
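A minimal sketch of this rejection check in TypeScript. The snapshot and proposal shapes follow the prompt contract above; the function name and return shape are illustrative, not a real API:

```typescript
// Reject any LLM-proposed transition not in the snapshot's allow-list.
interface StateSnapshot {
  current_state: string;
  allowed_transitions: string[];
}

interface TransitionProposal {
  transition: string;
}

function validateTransition(
  snapshot: StateSnapshot,
  proposal: TransitionProposal
): { ok: true; event: string } | { ok: false; reason: string } {
  if (!snapshot.allowed_transitions.includes(proposal.transition)) {
    // The proposal never reaches the state machine; it is rejected here.
    return {
      ok: false,
      reason: `"${proposal.transition}" not allowed in ${snapshot.current_state}`,
    };
  }
  return { ok: true, event: proposal.transition };
}
```

A proposal of `issue_refund` against a `planning` snapshot that only allows `retrieve_context` and `fail` comes back as `{ ok: false, ... }` and is never applied.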

Use for:

  1. dynamic routing
  2. flexible research workflows
  3. triage
  4. multi-step tool selection

Do not use for:

  1. irreversible actions
  2. financial movement
  3. production changes without approval
  4. security-sensitive operations

LLM-planned, machine-executed

The LLM produces a plan. The workflow validates and executes each step under deterministic rules.

{
  "plan": [
    { "step": "search_docs", "query": "refund policy" },
    { "step": "read_customer_record", "customer_id": "123" },
    { "step": "draft_refund_decision" }
  ]
}

The workflow validates each step:

  1. Is the operation allowed?
  2. Does the user have permission?
  3. Is approval required?
  4. Are the arguments valid?
  5. Is the risk level acceptable?

The LLM proposes. The orchestration disposes.
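The per-step checks can be sketched as follows. The tool registry, permission names, and risk levels are illustrative assumptions; argument/schema validation (check 4) is elided for brevity:

```typescript
// Validate one plan step against a deterministic tool registry.
type Risk = 'low' | 'high';

interface PlanStep {
  step: string;
  [arg: string]: unknown;
}

const toolRegistry: Record<string, { risk: Risk; requiredPermission: string }> = {
  search_docs:           { risk: 'low',  requiredPermission: 'read' },
  read_customer_record:  { risk: 'low',  requiredPermission: 'read' },
  draft_refund_decision: { risk: 'high', requiredPermission: 'write' },
};

function validateStep(
  step: PlanStep,
  userPermissions: string[]
): { allowed: boolean; needsApproval: boolean } {
  const tool = toolRegistry[step.step];
  if (!tool) return { allowed: false, needsApproval: false };   // 1. operation allowed?
  if (!userPermissions.includes(tool.requiredPermission)) {
    return { allowed: false, needsApproval: false };            // 2. user permission?
  }
  const needsApproval = tool.risk === 'high';                   // 3 + 5. risk gates approval
  return { allowed: true, needsApproval };
}
```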


Supervisor and actors

A supervisor state machine coordinates one or more LLM actor agents.

supervisor
├── researcher_agent
├── drafter_agent
├── checker_agent
└── tool_agent

The supervisor owns transitions, retries, approval gates, and termination.

Each actor has a bounded role.

Microsoft's Azure agent orchestration guidance: use the lowest level of complexity that meets requirements. Multi-agent systems add coordination overhead, latency, and failure modes; justify them only when a single agent is insufficient.


LLM ↔ Orchestration Communication

State snapshot

A state snapshot is the structured subset of orchestration state exposed to the LLM for a single decision.

The LLM receives only decision-relevant state.

{
  "current_state": "planning",
  "user_goal": "answer employee question about refund policy",
  "available_context": ["employee_handbook", "refund_policy"],
  "allowed_transitions": ["retrieve_context", "ask_clarifying_question", "draft_answer"],
  "forbidden_actions": ["issue_refund", "change_policy"]
}

The state snapshot is input. It is not authority.


Structured LLM output

The LLM emits a structured event proposal.

{
  "event": "NEED_RETRIEVAL",
  "payload": { "query": "refund policy for damaged products", "sources": ["policy_docs"] },
  "confidence": 0.82
}

The orchestration converts the proposal into an event the state machine accepts.

The LLM does not mutate state directly.


Event boundary

The event boundary is the deterministic interface where LLM output becomes a state-machine input.

Required checks:

  1. schema validation
  2. allowed-event validation
  3. permission validation
  4. tool-argument validation
  5. risk classification
  6. audit logging

const allowedEvents = ['NEED_RETRIEVAL', 'ASK_USER', 'DRAFT_READY', 'FAIL'];
if (!allowedEvents.includes(llmOutput.event)) {
  send({ type: 'LLM_OUTPUT_REJECTED' });
}

Memory vs state

State is the operational execution record.

Memory is information used by the agent across steps or sessions.

They must be separated.

State examples:

  1. current step
  2. retry count
  3. approval status
  4. tool result
  5. active plan

Memory examples:

  1. user preference
  2. prior conversation summary
  3. company policy fact
  4. retrieved document excerpt

LangGraph treats these as separate concerns to support durable execution, human-in-the-loop, and replay.
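The separation can be made concrete as two independent structures, persisted and updated on different schedules. Field names follow the examples above and are illustrative:

```typescript
// State: the operational execution record. Drives transitions and replay.
interface ExecutionState {
  currentStep: string;
  retryCount: number;
  approvalStatus: 'pending' | 'approved' | 'rejected' | null;
  toolResults: unknown[];
  activePlan: string[] | null;
}

// Memory: information the agent uses across steps or sessions.
// Injected into prompts as context; never consulted for transition decisions.
interface AgentMemory {
  userPreferences: Record<string, string>;
  conversationSummary: string;
  retrievedExcerpts: string[];
}
```

Keeping the two apart means a replay can reconstruct execution from `ExecutionState` alone, while `AgentMemory` can be summarized or pruned without affecting durability.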


Distributed Systems Concerns

Durability

State and context must persist after every transition. A crash in awaiting_tool_result should resume in awaiting_tool_result, not restart planning.

Persisted records:

  1. current state
  2. pending event
  3. context (counters, tool results, plan)
  4. approval decisions
  5. transition log

Storage:

  • Local: SQLite, Postgres, Redis
  • Google: Firestore, Cloud SQL, BigQuery
  • Workflow-native: Temporal, AWS Step Functions, Google Cloud Workflows, Inngest, Trigger.dev

Idempotency

Network retries cause tools to execute more than once.

Either the tool is idempotent, or the orchestration tracks an attempt ID and the tool deduplicates.

Non-idempotent operations require stronger approval controls.
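The attempt-ID pattern can be sketched as follows. The store is in-memory here for illustration; a real system would deduplicate against durable storage on the tool side:

```typescript
// Deduplicate tool executions by a stable attempt ID assigned by the
// orchestration. A retried call returns the recorded result instead of
// re-running the side effect.
const completedAttempts = new Map<string, unknown>();

async function executeOnce<T>(
  attemptId: string,
  run: () => Promise<T>
): Promise<T> {
  if (completedAttempts.has(attemptId)) {
    return completedAttempts.get(attemptId) as T; // retry path: no side effect
  }
  const result = await run();
  completedAttempts.set(attemptId, result);
  return result;
}
```

A network retry that replays `executeOnce('refund-123', ...)` executes the refund once and returns the same result both times.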


Out-of-order events

Tool results may arrive after the orchestration has moved on. The state machine must accept or reject events based on its current state rather than applying them blindly.

Time itself is an event source. timeout, tick, and deadline_passed are first-class inputs, not afterthoughts.

A large share of production agent failures trace back to missed timeouts.
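Treating time as an event source can be sketched by racing the tool call against a deadline, so that a timeout arrives as an ordinary event rather than a hang. The event names are illustrative:

```typescript
// Race a tool call against a deadline; both outcomes become events.
type AgentEvent =
  | { type: 'TOOL_RESULT'; payload: unknown }
  | { type: 'TIMEOUT' };

function withDeadline(toolCall: Promise<unknown>, ms: number): Promise<AgentEvent> {
  const timeout = new Promise<AgentEvent>((resolve) =>
    setTimeout(() => resolve({ type: 'TIMEOUT' }), ms)
  );
  const result = toolCall.then(
    (payload): AgentEvent => ({ type: 'TOOL_RESULT', payload })
  );
  return Promise.race([result, timeout]);
}
```

The state machine then handles `TIMEOUT` like any other input: transition to a retry or repair state instead of waiting indefinitely.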


Tool authority

Tool authority is the deterministic permission boundary that determines whether a model-proposed action may execute.

The LLM may propose:

{ "tool": "refund_customer", "amount": 500 }

The orchestration checks:

  1. Is refund_customer allowed in the current state?
  2. Is the amount below the auto-approval threshold?
  3. Does the current user have permission?
  4. Is the customer record loaded?
  5. Is the action idempotent?
  6. Is human approval required?

Only then can the tool run.
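A sketch of these checks for the refund proposal. The threshold, state name, and permission names are illustrative assumptions:

```typescript
// Deterministic authority check for a model-proposed refund.
interface AuthorityContext {
  currentState: string;
  userPermissions: string[];
  customerRecordLoaded: boolean;
}

const AUTO_APPROVE_LIMIT = 100; // refunds above this require a human

function authorizeRefund(
  ctx: AuthorityContext,
  proposal: { tool: string; amount: number }
): 'execute' | 'needs_approval' | 'reject' {
  if (proposal.tool !== 'refund_customer') return 'reject';
  if (ctx.currentState !== 'ready_to_execute_sensitive_tool') return 'reject'; // allowed in this state?
  if (!ctx.userPermissions.includes('refund')) return 'reject';                // user permission?
  if (!ctx.customerRecordLoaded) return 'reject';                              // precondition met?
  return proposal.amount <= AUTO_APPROVE_LIMIT ? 'execute' : 'needs_approval'; // threshold check
}
```

With this shape, the `{ "tool": "refund_customer", "amount": 500 }` proposal above lands in `needs_approval` rather than executing directly.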


Human approval gates

Human approval is a durable transition checkpoint, not a prompt instruction.

ready_to_execute_sensitive_tool
→ awaiting_human_approval
→ approved
→ execute_tool

The orchestration must persist at the approval state and resume without replaying prior agent work.

Mandatory human-in-the-loop gates make orchestration synchronous at that step.


Tool-call loops and termination

A tool-call loop is repeated tool selection without progress toward completion.

Required controls:

  1. max iterations
  2. max tool calls
  3. max cost
  4. repeated-tool detection
  5. progress scoring
  6. forced termination state

Microsoft guidance: guard against infinite tool-call loops with iteration limits.
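The controls above can be sketched as a budget checked before every iteration. The limit values are illustrative:

```typescript
// Deterministic loop budget checked before each LLM/tool iteration.
interface LoopBudget {
  iterations: number;
  toolCalls: number;
  costUsd: number;
  lastTools: string[]; // recent tool names, for repeat detection
}

const LIMITS = { maxIterations: 20, maxToolCalls: 30, maxCostUsd: 2.0, maxRepeats: 3 };

// Returns a termination reason, or null to continue.
function shouldTerminate(b: LoopBudget): string | null {
  if (b.iterations >= LIMITS.maxIterations) return 'max_iterations';
  if (b.toolCalls >= LIMITS.maxToolCalls) return 'max_tool_calls';
  if (b.costUsd >= LIMITS.maxCostUsd) return 'max_cost';
  const tail = b.lastTools.slice(-LIMITS.maxRepeats);
  if (tail.length === LIMITS.maxRepeats && tail.every((t) => t === tail[0])) {
    return 'repeated_tool'; // same tool N times in a row: no progress
  }
  return null;
}
```

A non-null reason routes the machine into a forced termination state rather than another LLM call.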


State compression

State compression transforms accumulated execution history into a compact representation for model context.

Use when:

  1. workflows are long
  2. context windows fill
  3. tool outputs are large
  4. multi-agent transcripts accumulate noise

Store full state externally. Pass compressed state to the LLM.

{
  "summary": "User asked about refund eligibility. Policy says damaged products qualify within 30 days. Customer order is 18 days old.",
  "open_decision": "Determine whether refund needs manager approval.",
  "allowed_transitions": ["request_approval", "draft_response", "fail"]
}
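A minimal compression sketch: full history stays in external storage; only a truncated view reaches the model. A production system would typically summarize with an LLM; plain truncation stands in for that here, and all names are illustrative:

```typescript
// Build a compact model-facing view from the full execution history.
interface HistoryEntry {
  state: string;
  note: string; // e.g. a tool output or transition annotation
}

function compressForModel(
  history: HistoryEntry[],
  allowedTransitions: string[],
  maxEntries = 3,
  maxNoteChars = 120
) {
  return {
    recent_steps: history.slice(-maxEntries).map((h) => ({
      state: h.state,
      note: h.note.slice(0, maxNoteChars), // trim large tool outputs
    })),
    allowed_transitions: allowedTransitions,
  };
}
```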

Production Architectures

Node / TypeScript

  1. XState (orchestration)
  2. Fastify or NestJS
  3. Postgres (state persistence)
  4. Redis
  5. BullMQ (durable queues)
  6. OpenTelemetry
  7. Docker

Google stack

  1. Cloud Run
  2. Pub/Sub
  3. Firestore
  4. BigQuery
  5. Vertex AI
  6. Cloud Logging
  7. IAM

Workflow-native

  1. Temporal
  2. AWS Step Functions
  3. Google Cloud Workflows
  4. Inngest
  5. Trigger.dev

The orchestration model remains the same.

Only execution infrastructure changes.


Architecture Selection

Pattern | LLM role | Orchestration role | Use when
Deterministic workflow | Worker | Full controller | Compliance, support, operations
LLM-controlled transitions | Chooser | Validator and executor | Flexible routing, triage
LLM-planned workflow | Planner | Plan validator and executor | Variable-step tasks
Supervisor + actors | Specialist workers | Coordinator | Multi-agent systems
Fully autonomous loop | Controller | Minimal guard | Research demos, low-risk sandboxes

Minimal Orchestration Shape

A durable orchestration shape using XState. The model itself (states, events, transitions, guards) is covered in 1.2.1; this snippet shows the runtime concerns: async invocation, validation, retry, terminal states.

import { createMachine, assign } from 'xstate';

const agentMachine = createMachine({
  id: 'agent',
  initial: 'planning',
  context: {
    userGoal: '',
    plan: null,
    proposedEvent: null,
    toolResults: [],
    retryCount: 0
  },
  states: {
    planning: {
      invoke: {
        src: 'callPlannerLLM',
        onDone: {
          target: 'validatingLLMOutput',
          actions: assign({ proposedEvent: ({ event }) => event.output })
        },
        onError: 'failed'
      }
    },

    validatingLLMOutput: {
      always: [
        { guard: 'isAllowedRetrievalRequest', target: 'retrieving' },
        { guard: 'isAllowedToolRequest',      target: 'awaitingApproval' },
        { guard: 'isValidFinalAnswer',        target: 'completed' },
        { target: 'repairing' }
      ]
    },

    retrieving:       { invoke: { src: 'retrieveContext', onDone: 'planning', onError: 'repairing' } },
    awaitingApproval: { on: { APPROVED: 'executingTool', REJECTED: 'completed' } },
    executingTool:    { invoke: { src: 'executeTool', onDone: 'planning', onError: 'repairing' } },
    repairing:        { always: [{ guard: 'canRetry', target: 'planning' }, { target: 'failed' }] },

    completed: { type: 'final' },
    failed:    { type: 'final' }
  }
});

The shape is:

LLM output → proposed event → validation → transition

Not:

LLM output → direct state mutation

Final Rule

The orchestration question is not "do we use a state machine."

It is "who controls the next transition."

For enterprise agentic systems, the workflow validates LLM-proposed transitions. The LLM does not control authority.

For the formal model itself — states, events, transitions, guards, terminal states, FSM versus alternatives — see 1.2.1 Finite State Machines (FSM).


References

  1. Anthropic — Building Effective Agents — anthropic.com/research/building-effective-agents
  2. LangGraph overview — docs.langchain.com/oss/python/langgraph/overview
  3. Microsoft Azure Architecture Center — AI Agent Design Patterns — learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/ai-agent-design-patterns
  4. Temporal — temporal.io
  5. XState — stately.ai/docs/xstate