Field Manual

SEC-001

Data Provenance

It is hard to argue with the idea that knowing where your data came from is simply good hygiene.

The Problem With Trust

Provenance is an old concept. Archivists and art historians have been tracing the chain of custody of documents and paintings for centuries. The word itself comes from the French provenir — to come from. When a museum acquires a painting, it wants to know every hand that held it, every wall it hung on, every auction it passed through. Not because those facts change the painting, but because they change how much you can trust the painting is what it claims to be.

Data works the same way. The difference is scale.

A single Sentinel-2 satellite generates around 1.6 terabytes of data per day. Add Landsat, MODIS, SAR constellations, and commercial providers like Planet and Maxar, and the combined output of the global Earth observation infrastructure is staggering. Every one of those datasets passes through processing chains, gets derived into products, gets shared across institutions, and ends up informing real decisions about real places.

Now ask yourself: at the end of that chain, how do you know what you're looking at?

If someone hands you a GeoTIFF and says it represents the normalized difference vegetation index for a particular region on a particular date, you are taking their word for it. You can check if the file looks reasonable. You can compare it against your expectations. But you cannot verify from the file itself that the atmospheric correction was applied correctly, that the cloud mask didn't clip valid data, that the resampling method didn't introduce artifacts, or that the coordinate reference system is what the metadata claims.

You are trusting the pipeline. And in most of the geospatial world today, that trust is implicit.


What Provenance Actually Tracks

A proper provenance record for a geospatial dataset answers a series of basic questions. They seem obvious, but the fact that most processing systems do not answer them reliably is the whole problem.

Where did the input data come from? Not just "Sentinel-2," but which specific granule, from which orbit, captured at what time, downloaded from which archive, with what processing level? If multiple sources were fused, what were all of them?

What transformations were applied? Every reprojection, resampling, correction, classification, masking, clipping, and fusion operation. In what order. With what parameters. Using what software version.

When did processing occur? Timestamps matter because the same algorithm run on different dates might use different calibration coefficients, different ancillary data, or different model weights.

Who or what performed the processing? A human analyst? An automated pipeline? Which version of which code? On what infrastructure?

What was the output? A hash of the final product, so that any subsequent alteration — even a single flipped bit — can be detected.

Taken together, these answers form a chain. That chain is only as strong as its weakest link, which is why partial provenance (recording some steps but not others) provides a false sense of security that may be worse than no provenance at all.
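
The chain described above can be sketched directly. The following is a minimal illustration, not a production design: each step records its operation, parameters, software version, and input, and commits to the digest of the previous step, so that altering any earlier entry breaks every digest after it. All field names and values are illustrative.

```python
import hashlib
import json

def record_step(chain, operation, params, software, input_digest):
    """Append one processing step to a provenance chain.

    Each entry commits to the previous entry's digest, so editing any
    earlier step invalidates everything downstream of it.
    """
    prev_digest = chain[-1]["digest"] if chain else ""
    entry = {
        "operation": operation,
        "params": params,
        "software": software,
        "input": input_digest,
        "prev": prev_digest,
    }
    # Canonical JSON (sorted keys) so the digest is reproducible.
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["digest"] = hashlib.sha256(payload).hexdigest()
    chain.append(entry)
    return entry

def verify_chain(chain):
    """Recompute every digest; return False if any entry was altered."""
    prev = ""
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "digest"}
        if body["prev"] != prev:
            return False
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != entry["digest"]:
            return False
        prev = entry["digest"]
    return True
```

This is also why partial provenance fails quietly: a chain with an unrecorded step in the middle verifies just as cleanly as one with no gaps, because the verifier has no way to know a link is missing.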


Why This Is Hard

If the problem were simply record-keeping, it would have been solved decades ago. The reason provenance remains a persistent gap in geospatial data infrastructure has less to do with technology and more to do with how the ecosystem evolved.

Remote sensing grew up in institutions. Space agencies built their own processing pipelines, developed their own formats, maintained their own archives. Each pipeline was internally consistent, but interoperability between pipelines was never a design priority. When ESA processes a Sentinel-2 granule and NASA processes a Landsat scene, the provenance metadata they generate is structured differently, stored differently, and in some cases captures different information entirely.

The commercial sector added another layer. Private satellite operators process data through proprietary pipelines where the transformation steps are trade secrets. The customer receives a clean product with limited metadata about what happened inside the black box. This is not malicious; it is standard practice in an industry where processing algorithms are competitive advantages. But it means that downstream consumers are, by design, unable to verify the chain.

And then there is the integration problem. Most real-world applications do not use a single dataset. Flood mapping might combine SAR imagery, optical imagery, terrain models, and hydrological data. Each of those inputs has its own provenance chain. The moment you fuse them, you need a provenance system that can represent not just linear chains but branching, merging graphs of transformation. Very few systems do this well.
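
The shape of that fusion problem is easy to show. In the sketch below, assumed dataset names stand in for real products; the point is that provenance for a fused product is a graph walk back to every raw observation, not a walk down a single chain.

```python
# Provenance as a graph rather than a chain: each product lists its
# direct inputs. All dataset names here are hypothetical.
provenance = {
    "flood_map_v1": ["sar_backscatter", "optical_composite", "terrain_model"],
    "sar_backscatter": ["s1_grd_granule"],
    "optical_composite": ["s2_granule_a", "s2_granule_b"],
    "terrain_model": ["srtm_tile"],
    # Leaves: raw observations with no upstream inputs.
    "s1_grd_granule": [],
    "s2_granule_a": [],
    "s2_granule_b": [],
    "srtm_tile": [],
}

def root_sources(product, graph):
    """Collect every raw observation a derived product depends on."""
    inputs = graph.get(product, [])
    if not inputs:
        return {product}
    sources = set()
    for parent in inputs:
        sources |= root_sources(parent, graph)
    return sources
```

A provenance system that only models linear chains cannot represent even this small example without losing information about which input contributed what.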

The result is an industry where the most consequential data products, the ones that inform disaster response, agricultural subsidies, carbon credit verification, and military planning, often have the weakest provenance records.


The Difference Between a Claim and a Proof

There is a useful distinction to draw here. Most provenance systems that exist today produce claims. They generate metadata that asserts what happened to a dataset. Those claims might be accurate. They might even be detailed and well-structured. But they are still claims, statements made by the system about itself, which could in principle be altered, fabricated, or simply wrong.

A claim says: "This dataset was atmospherically corrected using LaSRC v3.2 on January 15th, 2026."

A proof says: "Here is a cryptographic signature, generated by hardware that cannot be tampered with, confirming that this specific transformation code was applied to this specific input data at this specific time, and the hardware environment was verified to be unmodified."

That distinction, between provenance-as-metadata and provenance-as-proof, is where the field is heading. Cryptographic techniques, hardware security modules, and reproducible processing environments are beginning to make it possible to generate provenance records that are not just detailed but verifiable. Not "we wrote down what happened" but "here is mathematical evidence of what happened, and you can check it yourself."
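That distinction has a simple mechanical core, which the sketch below illustrates. A real attestation system would use an asymmetric signature anchored in tamper-resistant hardware; here a keyed HMAC stands in for that signature, purely to show the shape of the check: the record is bound to an exact byte sequence, and any edit breaks verification.

```python
import hashlib
import hmac
import json

# Stand-in for a hardware-held signing key. In a real attestation
# system this would be an asymmetric key that never leaves the device.
SIGNING_KEY = b"hardware-protected-key"

def sign_record(record):
    """Produce a signature bound to the record's exact contents."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()

def verify_record(record, signature):
    """Recompute and compare; any edit to the record breaks the match."""
    return hmac.compare_digest(sign_record(record), signature)
```

The difference between this and a plain metadata field is that the signature cannot be regenerated by anyone who lacks the key, so the record stops being a self-reported claim.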

This matters most in contexts where the stakes are high and the trust is low. Defense and intelligence applications, where data might be deliberately manipulated. Insurance and finance, where the incentive to misrepresent conditions is real. Carbon credit markets, where the entire value proposition depends on whether the satellite-derived measurement is accurate. Climate monitoring, where policy decisions rest on long-term data integrity.

In these contexts, a claim is not enough. You need a proof.


Standards and the State of Play

The geospatial community has not been asleep on this problem. Several standards and frameworks have emerged to address aspects of data provenance.

The W3C PROV data model provides a general-purpose framework for representing provenance information (entities, activities, and agents) that has been adapted for scientific data workflows. The Open Geospatial Consortium (OGC) has worked on provenance specifications through its various working groups. STAC (SpatioTemporal Asset Catalog) has become a de facto standard for cataloging geospatial assets, and its extension mechanism allows provenance metadata to be attached, though the depth and consistency of that metadata varies wildly across implementations.
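To make the PROV model concrete: it describes provenance as entities (the datasets), activities (the processing runs), and agents (the people or software responsible), linked by relations such as used, wasGeneratedBy, and wasAssociatedWith. The record below loosely follows the shape of the PROV-JSON serialization; all identifiers and values are illustrative.

```python
# A minimal provenance record shaped after the W3C PROV model.
# Identifiers (ex:..., _:...) and timestamps are hypothetical, and the
# structure only loosely follows the PROV-JSON serialization.
prov_record = {
    "entity": {
        "ex:s2_granule": {"prov:label": "Sentinel-2 L1C granule"},
        "ex:ndvi_product": {"prov:label": "NDVI composite"},
    },
    "activity": {
        "ex:ndvi_run": {"prov:startTime": "2026-01-15T09:00:00Z"},
    },
    "agent": {
        "ex:pipeline_v2": {"prov:type": "prov:SoftwareAgent"},
    },
    "used": {
        "_:u1": {"prov:activity": "ex:ndvi_run",
                 "prov:entity": "ex:s2_granule"},
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:activity": "ex:ndvi_run",
                 "prov:entity": "ex:ndvi_product"},
    },
    "wasAssociatedWith": {
        "_:a1": {"prov:activity": "ex:ndvi_run",
                 "prov:agent": "ex:pipeline_v2"},
    },
}
```

Even this toy record answers three of the questions from earlier: which input, which process, which agent. What it does not do, on its own, is prove any of it.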

ISO 19115, the international standard for geographic metadata, includes provisions for lineage information, a record of the processing steps applied to a dataset. In practice, lineage fields are often either empty or filled with boilerplate text that describes the general type of processing without the specificity needed to reproduce or verify it.

The gap is not in the standards themselves. It is in adoption, enforcement, and the tooling that would make comprehensive provenance tracking a default rather than an afterthought.


What Good Provenance Looks Like

A well-provenanced dataset is one where an independent party, with no prior relationship to the data producer, can verify the complete chain from raw observation to final product.

This means the provenance record is machine-readable, not buried in a PDF report that accompanies the dataset. It means every transformation step is recorded with enough specificity to be reproduced. It means the record itself is immutable — once written, it cannot be quietly edited. And ideally, it means the record is cryptographically bound to the data it describes, so that you cannot separate the provenance from the product.

In practical terms, this looks like a processing receipt that travels with the data. Open it, and you can trace every input, every operation, every intermediate product, all the way back to the original sensor reading. If someone altered the data after processing, the receipt's signature would break and you would know.
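The binding between receipt and data is the key property, and it reduces to something very small. This sketch (field names assumed, not drawn from any particular standard) shows why even a single flipped bit is detectable: the receipt carries a digest of the exact bytes it describes.

```python
import hashlib

def make_receipt(data, steps):
    """A processing receipt bound to the bytes it describes.

    'steps' is a free-form list of operation names for illustration;
    a real receipt would carry the full signed chain.
    """
    return {"steps": steps, "sha256": hashlib.sha256(data).hexdigest()}

def check_receipt(data, receipt):
    """True only if the data still matches the digest in the receipt."""
    return hashlib.sha256(data).hexdigest() == receipt["sha256"]
```

Flip one bit of the data and the check fails; the receipt and the product can no longer be separated without that separation being visible.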

This is not science fiction. The cryptographic primitives exist. The challenge is building processing systems that generate these records natively, rather than bolting provenance on as an afterthought.


Provenance and Reproducibility

There is a natural connection between provenance and scientific reproducibility. If your provenance record is detailed enough, someone else should be able to take the same inputs, apply the same transformations, and arrive at the same outputs. If they cannot, something in the chain was not recorded — or something in the chain was not what it claimed to be.
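The verification loop this implies is mechanical: replay the recorded step from the recorded inputs and parameters, then compare output digests. The sketch below uses a trivial stand-in transformation to show the pattern; any real pipeline step would slot into the same structure, provided it is deterministic.

```python
import hashlib

def apply_threshold(values, threshold):
    """A deterministic stand-in for a pipeline step: classify by threshold."""
    return bytes(1 if v >= threshold else 0 for v in values)

def digest(data):
    return hashlib.sha256(data).hexdigest()

# The producer's record: input bytes, parameters, and the output digest.
# (Values are illustrative.)
recorded = {"input": bytes(range(10)), "threshold": 5}
recorded["output_sha256"] = digest(
    apply_threshold(recorded["input"], recorded["threshold"])
)

# An independent party replays the recorded step and compares digests.
replayed = digest(apply_threshold(recorded["input"], recorded["threshold"]))
```

If the digests differ, either the record is incomplete or the process was not deterministic; either way, the provenance claim has failed a concrete test rather than merely being doubted.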

This is where provenance meets open science. The push for reproducible research in remote sensing, driven by the same concerns about methodological transparency that have swept through other scientific disciplines, is fundamentally a push for better provenance. Every time a paper describes a processing methodology in prose rather than publishing the actual pipeline, an opportunity for verification is lost.

The ideal is a world where every geospatial data product carries enough provenance information that anyone, anywhere, can independently verify or reproduce the result. We are not there yet. But the trajectory is clear, and the tools are arriving faster than the culture is shifting to use them.


Further Reading

W3C PROV Data Model — The foundational specification for representing provenance as entities, activities, and agents. The starting point for anyone designing provenance systems. w3.org/TR/prov-dm

STAC Specification — The SpatioTemporal Asset Catalog is becoming the common language for geospatial data cataloging. Its extension model is where provenance metadata increasingly lives. stacspec.org

ISO 19115: Geographic Information — Metadata — The international standard governing geospatial metadata, including lineage provisions. Worth understanding even if the implementation gap is wide.

"Provenance in Earth Science Data" (Tilmes et al.) — An overview of provenance challenges specific to Earth science, including the tension between comprehensive tracking and practical overhead.

OGC Standards Baseline — The Open Geospatial Consortium maintains the interoperability standards that underpin most geospatial data exchange. Provenance work intersects with several of their working groups. ogc.org/standards