Simulations are easy to run. Simulations that you can explain and reproduce six months later? That’s harder than it sounds.
Here’s the scenario: a regulatory question comes in about a study you ran last year. Or a customer disputes a result. Or a colleague wants to build on your analysis but gets different numbers.
If you can’t reproduce the exact run, you’re in trouble. Not “looks roughly similar” — the exact run.
Why reproducibility fails
It’s rarely malice or incompetence. It’s death by a thousand cuts:
- Someone edited the input file “just to test something” and forgot to revert
- The solver was upgraded and nobody pinned the version
- Random seeds weren’t set, so reruns give different results
- The “final” dataset turned out to have three more versions after it
- Parameters were tweaked in a Jupyter notebook that nobody saved
Each of these seems minor in isolation. Together, they make your simulation results indefensible.
The minimum viable reproducibility setup
Version everything. Inputs, configuration, code. Not just “in Git somewhere” — linked to specific runs. When you look at a result, you should be able to trace back to the exact versions that produced it.
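One way to do that: have each run record the exact code commit and a hash of every input it read. A minimal Python sketch, assuming a Git checkout; the file names and manifest layout are illustrative, not a prescribed format:

```python
# Minimal sketch: record the code commit and input-file hashes for a run.
import hashlib
import json
import subprocess
from pathlib import Path

# Illustrative input files for this run.
INPUT_FILES = [Path("inputs/network_model.json"), Path("inputs/demand_forecast.csv")]

def sha256_of(path: Path) -> str:
    """Content hash identifies the exact file version that was used."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

run_manifest = {
    # The commit that produced this result.
    "code_commit": subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip(),
    # The exact inputs that went in.
    "inputs": {str(p): sha256_of(p) for p in INPUT_FILES},
}

Path("run_manifest.json").write_text(json.dumps(run_manifest, indent=2))
```

Store the manifest next to the results, and "trace back to the exact versions" becomes a lookup instead of an investigation.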
Pin your environment. Containerize. Specify solver versions. If you can’t run the same code on the same dependencies two years from now, your reproducibility is an illusion.
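Containers and lock files do the heavy lifting here, but it also helps to fail fast when the runtime environment has drifted. A minimal Python sketch; the package names and versions are placeholders, not a real requirements set:

```python
# Minimal sketch: abort the run if installed versions differ from the pinned ones.
from importlib.metadata import version

PINNED = {"numpy": "1.26.4", "pandas": "2.2.2", "pyomo": "6.7.1"}  # illustrative

for pkg, expected in PINNED.items():
    actual = version(pkg)
    if actual != expected:
        raise RuntimeError(
            f"{pkg} is {actual}, expected {expected}; environment has drifted"
        )
```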
Control randomness. Set seeds explicitly. If your simulation has stochastic elements, they should be deterministic given the same seed.
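With NumPy, for example, a single explicitly seeded generator can drive every stochastic input, so the same seed always reproduces the same draws. A minimal sketch; the function and noise model are illustrative:

```python
# Minimal sketch: one seeded generator makes the stochastic parts deterministic.
import numpy as np

def run_monte_carlo(seed: int, n_draws: int = 1000) -> np.ndarray:
    rng = np.random.default_rng(seed)              # one generator, seeded once
    demand_noise = rng.normal(0.0, 0.05, n_draws)  # illustrative stochastic input
    return demand_noise

# Same seed, same draws -- every time.
assert np.array_equal(run_monte_carlo(seed=42), run_monte_carlo(seed=42))
```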
Store inputs immutably. Raw data goes in a bucket and doesn’t change. Derived datasets get versioned. If someone needs to update an input, they create a new version — they don’t edit in place.
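One lightweight way to enforce this is content-addressed versioning: a dataset's version is derived from its contents, so an edited file automatically becomes a new version rather than silently replacing the old one. A minimal sketch, with an illustrative local store path and helper name:

```python
# Minimal sketch: register an input under a content hash; never overwrite.
import hashlib
import shutil
from pathlib import Path

STORE = Path("data_store")  # illustrative local stand-in for a bucket

def register_input(path: Path) -> str:
    digest = hashlib.sha256(path.read_bytes()).hexdigest()[:12]
    versioned = STORE / f"{path.stem}.{digest}{path.suffix}"
    if not versioned.exists():              # existing versions are never touched
        STORE.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, versioned)
    return versioned.name                   # e.g. "demand_forecast.3fa2c1d9e0ab.csv"

version_id = register_input(Path("demand_forecast.csv"))
```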
Link outputs to inputs. Every result should include metadata: scenario ID, input versions, code commit, configuration hash. The goal is “click here to see exactly what produced this.”
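A simple pattern is a sidecar metadata file written next to every output. A minimal sketch; the field names and values are illustrative:

```python
# Minimal sketch: write a result plus a sidecar that says exactly what produced it.
import hashlib
import json
from pathlib import Path

def write_result(result_path: Path, data: str, *, scenario_id: str,
                 input_versions: dict, code_commit: str, config: dict) -> None:
    result_path.write_text(data)
    sidecar = {
        "scenario_id": scenario_id,
        "input_versions": input_versions,
        "code_commit": code_commit,
        # Hash of the full configuration, so "same config" is checkable.
        "config_hash": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest(),
    }
    result_path.with_suffix(".meta.json").write_text(json.dumps(sidecar, indent=2))

write_result(
    Path("q3_constraints.csv"), "line,loading\nL1,0.87\n",
    scenario_id="q3-grid-constraint",
    input_versions={"demand_forecast": "3fa2c1d9e0ab"},
    code_commit="abc1234",
    config={"storage_mw": 50.0},
)
```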
Scenario management
Don’t let scenarios proliferate without structure. Maintain a catalog; a sketch of one entry follows the list:
- Scenario name and purpose
- Input datasets (with versions)
- Configuration parameters
- Expected outputs (for regression testing)
- Owner and approval status
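One way to keep catalog entries honest is to make them code that lives in version control. A minimal Python sketch whose fields mirror the list above; the class, field names, and example values are illustrative:

```python
# Minimal sketch: a scenario catalog entry as a frozen dataclass in version control.
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    purpose: str
    input_datasets: dict[str, str]      # dataset name -> version
    parameters: dict[str, float]
    expected_outputs: dict[str, float]  # used for regression testing
    owner: str
    approved: bool = False

q3_constraint_study = Scenario(
    name="q3-grid-constraint",
    purpose="Q3 constraint analysis for the new storage site",
    input_datasets={"network_model": "v14", "demand_forecast": "2024-06"},
    parameters={"storage_mw": 50.0, "storage_mwh": 200.0},
    expected_outputs={"peak_constraint_mw": 12.3},   # placeholder value
    owner="grid-team",
    approved=True,
)
```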
When someone asks “which scenario did we use for the Q3 analysis?” you should be able to answer in seconds, not hours.
The grid constraint study example
A team runs a grid constraint study for a new storage site. They need:
- Network model (versioned, stored in artifact registry)
- Demand forecast (versioned, with clear provenance)
- Storage parameters (explicit configuration file)
The simulation runs in a container with pinned solver versions. Output includes scenario ID, input versions, and key results.
Three months later, a question comes up. They rerun the exact scenario. Same inputs, same code, same environment. Numbers match. Question answered. No scrambling.
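Once the metadata and archived results exist, the reproduction check itself can be trivial. A minimal sketch, assuming archived and rerun results are stored as JSON at illustrative paths:

```python
# Minimal sketch: confirm a rerun reproduces the archived result exactly.
import json
from pathlib import Path

archived = json.loads(Path("runs/q3-grid-constraint/results.json").read_text())
rerun = json.loads(Path("runs/q3-grid-constraint-rerun/results.json").read_text())

assert rerun == archived, "rerun does not reproduce the archived result"
```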
Validation isn’t optional
Don’t just run simulations — sanity-check them.
- Energy balance: does energy in equal energy out plus losses?
- Constraint violations: are physical limits respected?
- Historical comparison: do results align with known baselines?
If any check fails, the run is flagged. Results don’t get published until someone understands why.
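These checks are easy to automate as a gate in the pipeline. A minimal Python sketch; the field names, tolerances, and example values are illustrative, not a standard:

```python
# Minimal sketch: sanity checks that gate publication of a run.
def validate_run(results: dict) -> list[str]:
    failures = []

    # Energy balance: generation should equal demand plus losses, within tolerance.
    imbalance = results["generation_mwh"] - (results["demand_mwh"] + results["losses_mwh"])
    if abs(imbalance) > 1e-3 * results["demand_mwh"]:
        failures.append(f"energy balance off by {imbalance:.3f} MWh")

    # Constraint violations: physical limits must be respected.
    if results["max_line_loading_pct"] > 100.0:
        failures.append("line loading exceeds thermal limit")

    # Historical comparison: stay within an expected band of the known baseline.
    if abs(results["peak_demand_mw"] - results["baseline_peak_mw"]) > 0.1 * results["baseline_peak_mw"]:
        failures.append("peak demand deviates >10% from baseline")

    return failures   # empty list means the run can be published

failures = validate_run({
    "generation_mwh": 1050.0, "demand_mwh": 1000.0, "losses_mwh": 50.0,
    "max_line_loading_pct": 87.0, "peak_demand_mw": 410.0, "baseline_peak_mw": 400.0,
})
assert not failures
```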
Common failure modes
- Editing inputs in place without versioning
- Running on local machines with unpinned dependencies
- Forgetting random seeds (non-deterministic runs)
- Storing outputs without linking them to inputs
- No regression tests, so pipeline changes break silently
The payoff
Reproducibility takes effort upfront. But when you need to defend a result, extend an analysis, or debug a discrepancy, that effort pays back 10x.
The alternative is spending days reconstructing what you did, hoping you can match the original numbers, and explaining to stakeholders why “it’s close enough.”
Close enough isn’t good enough when decisions depend on it.
XIThing builds simulation and data pipelines for energy companies. Get in touch if you’re tired of irreproducible results.




