Error Handling
Fail fast, inspect step output, and recover from workflow execution issues.
Workflow reliability depends on honest failure handling. Overseer should stop on real execution failure, not gloss over it.
Fail Fast
Workflow runs currently fail fast: when an underlying step fails, the run stops instead of moving on to later steps. That is better than continuing on bad assumptions and producing a polished but unreliable result.
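The fail-fast behavior can be illustrated with a minimal sketch. This is not Overseer's actual runner: `StepResult` and `run_fail_fast` are hypothetical names, and steps are assumed to be plain callables that raise on failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    status: str  # "completed" or "failed"
    output: str


def run_fail_fast(steps: List[Tuple[str, Callable[[], str]]]) -> List[StepResult]:
    """Run steps in order and stop at the first one that raises."""
    results: List[StepResult] = []
    for name, fn in steps:
        try:
            output = fn()
        except Exception as exc:
            # Record the failure with its cause, then stop the run.
            results.append(StepResult(name, "failed", f"{type(exc).__name__}: {exc}"))
            break  # later steps never run on top of a failed one
        results.append(StepResult(name, "completed", output))
    return results
```

Steps after the failing one are absent from the result list rather than marked as skipped or completed, so the output cannot suggest that they ran.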
Step Status
Each step should reflect the actual execution status of the orchestrated run. A completed label without evidence is a bug, not a convenience.
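One way to keep labels honest is to derive the status from recorded evidence instead of setting it by hand. The sketch below assumes a hypothetical `StepRecord` shape; the field names and status strings are illustrative, not part of Overseer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepRecord:
    name: str
    exit_ok: Optional[bool] = None  # None means the step never reported back
    evidence: List[str] = field(default_factory=list)  # e.g. tool output, trace event ids


def derive_status(record: StepRecord) -> str:
    """Return a status label that is only 'completed' when evidence backs it."""
    if record.exit_ok is None:
        return "unknown"
    if not record.exit_ok:
        return "failed"
    if not record.evidence:
        # The step claims success but left nothing to verify it with.
        return "unverified"
    return "completed"
```

With this approach a step that reports success but attaches no evidence is shown as unverified, which makes the gap visible to the operator.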
Common Failures
- Model runtime unavailable, such as Ollama not running.
- Tool execution timeouts or malformed inputs.
- Supervisor returns unverified output when evidence is expected.
- Persistence issues that prevent run state from being stored cleanly.
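The categories above could be mapped from raised exceptions roughly as follows. This is a sketch under assumptions: the mapping from Python built-in exceptions to categories is a guess at how such failures might surface, the category strings are made up, and `UnverifiedOutputError` is a hypothetical exception type.

```python
class UnverifiedOutputError(Exception):
    """Hypothetical: raised when a supervisor result lacks required evidence."""


def classify_failure(exc: BaseException) -> str:
    """Map an exception to one of the common failure categories."""
    # Order matters: ConnectionError and TimeoutError are subclasses of OSError.
    if isinstance(exc, ConnectionError):
        return "model_runtime_unavailable"  # e.g. connection refused by the model server
    if isinstance(exc, TimeoutError):
        return "tool_timeout"
    if isinstance(exc, (ValueError, TypeError)):
        return "malformed_input"
    if isinstance(exc, UnverifiedOutputError):
        return "unverified_output"
    if isinstance(exc, OSError):
        return "persistence_error"  # e.g. disk full or permission denied
    return "unknown"
```

Anything that does not match falls through to "unknown" so that unrecognized failures are never filed under a misleading category.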
Recovery
Start recovery by reading the failed step's output and the trace event timeline. Fix the concrete runtime or routing problem first, then rerun. Do not edit the workflow until you have confirmed what failed.
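Finding the earliest failure in a timeline can be sketched as below. The event shape is an assumption: each event is taken to be a dict with a numeric `ts` timestamp and a `status` string, which may not match the real trace format.

```python
from typing import Dict, List, Optional


def first_failure(events: List[Dict]) -> Optional[Dict]:
    """Return the earliest event whose status is 'failed', or None."""
    # Sort by timestamp so the root failure comes before its knock-on effects.
    ordered = sorted(events, key=lambda e: e["ts"])
    for event in ordered:
        if event.get("status") == "failed":
            return event
    return None
```

Looking at the earliest failed event first tends to point at the root cause, since later failures are often consequences of it.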
Next Steps
Continue with Trace Analysis to make failure investigation part of the normal operator loop.