Add project overview #1

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
69 changes: 69 additions & 0 deletions project-overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
## Background

Data providers and users need to have confidence in their data. I've found that within earth science, this need has often been met by data providers building bespoke, highly specific validation tools that are often inaccessible to users. Deviations from standards and expectations (i.e., data quirks) still slip through the cracks. The status quo for downstream users is to design a pipeline around a dataset, watch it fail part-way through, and then debug to find out why. Occasionally, datasets are accompanied by technical documents describing known issues, but this isn't especially common, and when such documents do exist, they still require human reading and interpretation and may not be comprehensive.

ndquirk is a proposal for an extensible tool that quickly identifies quirks in N-dimensional datasets, where a 'quirk' is a deviation from a defined set of expectations.
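To make the core concept concrete, here is a minimal sketch of what defining and checking expectations might look like. All names here (`find_quirks`, the expectation tuples) are illustrative assumptions, not a proposed API; the example uses a plain NumPy array, though the tool would target N-D datasets more generally.

```python
import numpy as np

# Hypothetical sketch: an 'expectation' is a named predicate over a dataset,
# and a 'quirk' is reported whenever an expectation fails. These names are
# illustrative only, not a real or proposed API.

def find_quirks(data, expectations):
    """Return the names of the expectations that `data` violates."""
    return [name for name, check in expectations if not check(data)]

expectations = [
    ("dtype is float64", lambda a: a.dtype == np.float64),
    ("no NaN values", lambda a: not np.isnan(a).any()),
    ("2-dimensional", lambda a: a.ndim == 2),
]

data = np.array([[1.0, 2.0], [np.nan, 4.0]])
print(find_quirks(data, expectations))  # only the NaN expectation fails
```

A real tool would need richer reporting (where the quirk occurred, severity, suggested fixes) rather than a bare list of names, but the predicate-over-data shape is the core idea.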

## Goals
Primary goal
- Provide an accessible, extensible, and performant tool for identifying and communicating deviations from a set of expectations for N-D datasets.

Sub-goals
- Provide simple mechanisms to run validation proximal to the data.
- Provide mechanisms for distributed data validation.
- Leverage existing software and validators whenever possible.
- Minimize the number of GET requests and the amount of data transferred during data validation.
- Provide simple storage option configuration (i.e., no pass-through kwargs).
- Provide simple run-time configuration.
- Provide accessible documentation and demonstrations for all common use cases.
- Maintain 100% coverage in testing utilities.
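The sub-goal of minimizing GET requests and data transfer suggests that many expectations could be checked from a store's metadata alone, without fetching any chunk data. A sketch of that idea, assuming a Zarr-v2-style `.zarray` metadata document (the document below is hardcoded for illustration; the specific checks are assumptions, not a defined API):

```python
import json

# Sketch of metadata-only validation: chunking, dtype, and compression
# expectations can all be checked from a single small metadata document,
# with zero chunk data transferred. The structure below mirrors a
# Zarr v2 ".zarray" document; the checks themselves are illustrative.

zarray = json.loads("""{
    "shape": [7200, 3600],
    "chunks": [720, 360],
    "dtype": "<f8",
    "compressor": {"id": "zlib", "level": 1}
}""")

quirks = []
if zarray["dtype"] != "<f8":
    quirks.append("dtype is not float64")
if zarray["compressor"]["id"] != "zlib":
    quirks.append("compressor is not zlib")
if any(s % c for s, c in zip(zarray["shape"], zarray["chunks"])):
    quirks.append("chunks do not evenly divide the shape")

print(quirks or "no quirks found")
```

For a remote store, one HTTP request for the metadata document (or a consolidated-metadata equivalent) would be enough to run all three checks.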

## Partners and Stakeholders

This utility would be relevant for any data provider or user of N-D data. For example, the goals should include making the tool accessible and useful for data providers at NASA, NOAA, USGS, ESA, etc.

This proposal has grown out of many conversations over the years with people who would likely be either partners in or stakeholders of the project. These people include:

- @abarciauskas-bgse
- @sharkinsspatial
- @jhamman
- @norlandrhagen
- @rabernat
- @briannapagan
- @andersy005
- @eni-awowale
- @omshinde

## Milestones

- [ ] Define project goals
- [ ] Construct a list of use cases for the library
- [ ] Define stakeholders and communication pathways for soliciting feedback
- [ ] Complete an API design document, including establishing whether a new library is necessary or an extension to Great Expectations would be sufficient
- [ ] Implement a mechanism for defining and connecting to a data source
- [ ] Implement, test, and document mechanisms for defining expectations
- [ ] Demonstrate defining a data format expectation (e.g., Cloud-Optimized GeoTIFF)
- [ ] Define mechanisms for versioning data and metadata formats expectations
- [ ] Demonstrate defining a chunk schema expectation (e.g., constant chunk sizes)
- [ ] Demonstrate defining an encoding expectation (e.g., entirely zlib compression)
- [ ] Demonstrate defining a metadata standard expectation (e.g., CF, Croissant)
- [ ] Demonstrate defining data hierarchy expectations (e.g., Dataset vs. DataTree)
- [ ] Demonstrate defining data type expectations (e.g., all float64)
- [ ] Demonstrate defining data value expectations (e.g., no NaNs)
- [ ] Demonstrate validating expectations in parallel
- [ ] Demonstrate running the expectations using data proximate computing
- [ ] Demonstrate building a browser-based UI for defining and validating expectations
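The parallel-validation milestone above could build on the fact that many expectations decompose per chunk. A minimal sketch using a thread pool (a real tool would more likely hand this to a distributed scheduler such as Dask; the chunking and check here are illustrative assumptions):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Sketch of running expectations in parallel: each chunk of a larger
# array is validated independently, so per-chunk checks can be farmed
# out to a thread pool (or, in a real tool, a distributed scheduler).

def check_chunk(index_and_chunk):
    index, chunk = index_and_chunk
    # A single per-chunk expectation: no NaN values.
    return (index, bool(np.isnan(chunk).any()))

data = np.ones((8, 8))
data[5, 5] = np.nan

# Split into four 4x4 chunks, keyed by position in the chunk grid.
chunks = {
    (i, j): data[i * 4:(i + 1) * 4, j * 4:(j + 1) * 4]
    for i in range(2) for j in range(2)
}

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(check_chunk, chunks.items()))

quirky_chunks = [idx for idx, has_nan in results.items() if has_nan]
print(quirky_chunks)  # only the chunk containing the NaN is flagged
```

Reporting quirks at chunk granularity also tells a user exactly which region of the dataset to inspect, rather than a single pass/fail for the whole array.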

### Resources

- [https://github.com/great-expectations/great_expectations/issues/1942](https://github.com/great-expectations/great_expectations/issues/1942)
- [https://cfchecker.ncas.ac.uk/](https://cfchecker.ncas.ac.uk/)
- [https://github.com/zarr-developers/pydantic-zarr](https://github.com/zarr-developers/pydantic-zarr)
- [https://discourse.pangeo.io/t/tool-for-validating-geo-data-services-moved-to-the-cloud/4118](https://discourse.pangeo.io/t/tool-for-validating-geo-data-services-moved-to-the-cloud/4118)
- [https://github.com/briannapagan/quirky-data-checker](https://github.com/briannapagan/quirky-data-checker)
- [https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-maturity-levels](https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-maturity-levels)
- [https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2017RG000562](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2017RG000562)
- [https://www.earthdata.nasa.gov/about/competitive-programs/access/data-quality-screening-service](https://www.earthdata.nasa.gov/about/competitive-programs/access/data-quality-screening-service)
- [https://podaac-tools.jpl.nasa.gov/mcc/](https://podaac-tools.jpl.nasa.gov/mcc/)
- [https://wiki.earthdata.nasa.gov/display/ESDSWG/Dataset+Interoperability+Recommendations+for+Earth+Science](https://wiki.earthdata.nasa.gov/display/ESDSWG/Dataset+Interoperability+Recommendations+for+Earth+Science)
- [https://github.com/xarray-contrib/xarray-schema](https://github.com/xarray-contrib/xarray-schema)

A reviewer commented on the Great Expectations issue linked above:

> I think if you took xarray-schema and added support for types of validation that xarray could not perform (e.g. union types, functional validators) then you would end up with something that's pretty similar to calling great expectations on a pandas dataframe
>
> https://docs.greatexpectations.io/docs/0.18/oss/guides/connecting_to_your_data/fluent/in_memory/connect_in_memory_data/

> Oh or like Pandera I guess

The proposal author replied:

> This approach would work off-the-bat for a subset of the expectations, but I don't expect it to be entirely sufficient. xarray-schema currently relies on having loaded or at least opened an xarray dataset for validation. IMO a lot of the value of this tool would be providing explanations for why datasets cannot be simply opened with Xarray (e.g., https://github.com/briannapagan/quirky-data-checker/blob/main/results/results_GES_DISC_total_quirks.png)