Add project overview #1
Open
maxrjones wants to merge 5 commits into main from overview
## Background

Data providers and users need to have confidence in their data. I've found that within earth science this need has often been met by data providers developing bespoke, highly specific validation tools that are often not accessible to users. Deviations from standards and expectations (i.e., data quirks) still slip through the cracks, leading to a status quo in which downstream users design pipelines around a dataset, watch them fail partway through, and then debug to find out why. Occasionally, technical documents accompanying datasets describe known issues, but this isn't especially common. Even when these documents exist, they still require human reading and interpretation and may not be comprehensive.

ndquirk is a proposal for an extensible tool that quickly identifies quirks in N-dimensional datasets, where a 'quirk' is a deviation from a defined set of expectations.
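
To make that concrete, here is a minimal, self-contained sketch of the core idea: an expectation is a callable over an Xarray dataset, and a quirk is any deviation it reports. All names here (`Quirk`, `check_dataset`, the example expectations) are illustrative only, not an existing API.

```python
# Minimal sketch: an "expectation" is a callable over a dataset, and a
# "quirk" is any deviation it reports. Every name below is hypothetical.
from dataclasses import dataclass
from typing import Callable

import numpy as np
import xarray as xr


@dataclass
class Quirk:
    expectation: str  # name of the expectation that was violated
    detail: str       # human-readable explanation of the deviation


def expect_float64(ds: xr.Dataset) -> list[Quirk]:
    """Expect every data variable to be float64."""
    return [
        Quirk("all_float64", f"{name} has dtype {var.dtype}")
        for name, var in ds.data_vars.items()
        if var.dtype != np.float64
    ]


def expect_no_nans(ds: xr.Dataset) -> list[Quirk]:
    """Expect no NaN values in any data variable."""
    return [
        Quirk("no_nans", f"{name} contains NaNs")
        for name, var in ds.data_vars.items()
        if bool(var.isnull().any())
    ]


def check_dataset(ds: xr.Dataset, expectations: list[Callable]) -> list[Quirk]:
    """Run every expectation and collect the quirks it reports."""
    return [quirk for expect in expectations for quirk in expect(ds)]


# A small dataset with two quirks: an integer variable and a NaN.
ds = xr.Dataset(
    {
        "temp": ("x", np.array([1.0, np.nan, 3.0])),
        "count": ("x", np.array([1, 2, 3])),
    }
)
for quirk in check_dataset(ds, [expect_float64, expect_no_nans]):
    print(f"{quirk.expectation}: {quirk.detail}")
```

A real implementation would load expectations from declarative configuration and evaluate them against remote stores, but expectation-as-callable is the essential shape.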

## Goals

Primary goal
- Provide an accessible, extensible, and performant tool for identifying and communicating data deviations from a set of expectations for N-D datasets.

Sub-goals
- Provide simple mechanisms to run validation proximal to the data.
- Provide mechanisms for distributed data validation.
- Leverage existing software and validators whenever possible.
- Minimize as much as possible the number of GET requests and amount of data transferred during data validation.
- Provide simple storage option configuration (i.e., no pass-through kwargs).
- Provide simple run-time configuration.
- Provide accessible documentation and demonstrations for all common use cases.
- Maintain 100% coverage in testing utilities.

## Partners and Stakeholders

This utility would be relevant for any data provider or user of N-D data. For example, the goals should include making the tool accessible and useful for data providers at NASA, NOAA, USGS, ESA, etc.

This proposal has grown from many conversations over the years with people who would likely be either partners or stakeholders in the project, including:

- @abarciauskas-bgse
- @sharkinsspatial
- @jhamman
- @norlandrhagen
- @rabernat
- @briannapagan
- @andersy005
- @eni-awowale
- @omshinde

## Milestones

- [ ] Define project goals
- [ ] Construct a list of use cases for the library
- [ ] Define stakeholders and communication pathways for soliciting feedback
- [ ] Complete an API design document, including establishing whether a new library is necessary or an extension for Great Expectations would be sufficient
- [ ] Implement a mechanism for defining and connecting to a data source
- [ ] Implement, test, and document mechanisms for defining expectations
- [ ] Demonstrate defining a data format expectation (e.g., Cloud-Optimized GeoTIFF); see the sketch after this list
- [ ] Define mechanisms for versioning data and metadata format expectations
- [ ] Demonstrate defining a chunk schema expectation (e.g., constant chunk sizes)
- [ ] Demonstrate defining an encoding expectation (e.g., entirely zlib compression)
- [ ] Demonstrate defining metadata standards (e.g., CF, Croissant)
- [ ] Demonstrate defining data hierarchy expectations (e.g., Dataset vs. DataTree)
- [ ] Demonstrate defining data type expectations (e.g., all float64)
- [ ] Demonstrate defining data value expectations (e.g., no NaNs)
- [ ] Demonstrate validating expectations in parallel
- [ ] Demonstrate running the expectations using data proximate computing
- [ ] Demonstrate building a browser-based UI for defining and validating expectations
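
As a sketch of how the format-expectation milestone could follow the "leverage existing software and validators" sub-goal, the check below wraps rio-cogeo's COG validator rather than reimplementing it. `example.tif` is a placeholder path, and the `(is_valid, errors, warnings)` return shape reflects rio-cogeo's documented behavior; treat the exact call as an assumption to verify against the installed version.

```python
# Sketch of a data format expectation that delegates to an existing
# validator (rio-cogeo's COG check). "example.tif" is a placeholder.
from rio_cogeo.cogeo import cog_validate


def expect_cog(path: str) -> list[str]:
    """Return quirks: reasons the file fails Cloud-Optimized GeoTIFF validation."""
    is_valid, errors, warnings = cog_validate(path)
    return [] if is_valid else [f"not a valid COG: {err}" for err in errors]


print(expect_cog("example.tif"))
```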

### Resources

- https://github.com/great-expectations/great_expectations/issues/1942
- https://cfchecker.ncas.ac.uk/
- https://github.com/zarr-developers/pydantic-zarr
- https://discourse.pangeo.io/t/tool-for-validating-geo-data-services-moved-to-the-cloud/4118
- https://github.com/briannapagan/quirky-data-checker
- https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-maturity-levels
- https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2017RG000562
- https://www.earthdata.nasa.gov/about/competitive-programs/access/data-quality-screening-service
- https://podaac-tools.jpl.nasa.gov/mcc/
- https://wiki.earthdata.nasa.gov/display/ESDSWG/Dataset+Interoperability+Recommendations+for+Earth+Science
- https://github.com/xarray-contrib/xarray-schema
I think if you took xarray-schema and added support for types of validation that xarray-schema could not perform (e.g., union types, functional validators), then you would end up with something that's pretty similar to calling Great Expectations on a pandas DataFrame:
https://docs.greatexpectations.io/docs/0.18/oss/guides/connecting_to_your_data/fluent/in_memory/connect_in_memory_data/
Oh or like Pandera I guess
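
To illustrate the suggestion in the comment above, here is a sketch combining a structural check via xarray-schema with a plain-Python "functional validator" for logic a schema can't express. It assumes xarray-schema is installed; the exact `DataArraySchema` keyword arguments are taken from its README and should be treated as an assumption.

```python
# Sketch: structural validation via xarray-schema plus a hand-written
# functional validator, per the suggestion in the comment above.
import numpy as np
import xarray as xr
from xarray_schema import DataArraySchema

da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("y", "x"),
    coords={"x": [0.0, 1.0, 2.0]},
)

# Structural expectation: dtype and dimension names.
schema = DataArraySchema(dtype=np.float64, dims=["y", "x"])
schema.validate(da)  # raises a SchemaError on mismatch


# Functional validator: arbitrary logic beyond what a schema expresses.
def expect_monotonic_x(da: xr.DataArray) -> None:
    if not bool((da.x.diff("x") > 0).all()):
        raise ValueError("coordinate 'x' is not strictly increasing")


expect_monotonic_x(da)
```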
This approach would work off the bat for a subset of the expectations, but I don't expect it to be entirely sufficient. xarray-schema currently relies on having loaded, or at least opened, an xarray dataset for validation. IMO a lot of the value of this tool would be providing explanations for why datasets cannot simply be opened with Xarray (e.g., https://github.com/briannapagan/quirky-data-checker/blob/main/results/results_GES_DISC_total_quirks.png).
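
To illustrate that last point, a sketch of a storage-level check: read raw Zarr v2 array metadata through fsspec and report a quirk without calling `xr.open_dataset` at all, so the report survives even when opening would fail outright. The store URL is a placeholder, and the zlib expectation is borrowed from the milestones above.

```python
# Sketch: flag quirks from raw Zarr v2 metadata via fsspec, without
# opening the store in Xarray. "store_url" is a placeholder.
import json

import fsspec

store_url = "s3://example-bucket/example.zarr"  # placeholder

fs, root = fsspec.core.url_to_fs(store_url)
quirks = []
for meta_path in fs.glob(f"{root}/*/.zarray"):
    meta = json.loads(fs.cat(meta_path))
    # Example expectation from the milestones: entirely zlib compression.
    compressor = (meta.get("compressor") or {}).get("id")
    if compressor != "zlib":
        quirks.append(f"{meta_path}: compressor is {compressor!r}, expected 'zlib'")

print("\n".join(quirks) or "no quirks found")
```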