
Create geosoft_gxf_io.py #539

Open · wants to merge 9 commits into main
Conversation

@ThomasMGeo

GXF (Grid eXchange Format) is a common file type for gravity/magnetic data in USGS releases. It's an ASCII format with variable header structures.

On the dataset I tested this on, the grid comes out upside down when decoding to xarray. This is not impossible to fix (a flip of the row order, sketched below), but I do like keeping the option for numpy and metadata output.
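A minimal sketch of such a flip, on a stand-in array:

import numpy as np

data = np.arange(6).reshape(3, 2)  # stand-in for the decoded grid
data_fixed = np.flipud(data)       # reverse the row order (flip north/south)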

I bet this could be optimized further; I'll try to find more datasets to test with (the smaller the better!). Any comments/optimizations are welcome.

Relevant issues/PRs: This addresses #538, which has some more information about the format itself.

@ThomasMGeo
Author

I found a small file that works from here: https://pubs.usgs.gov/of/2000/ofr-00-0198/html/wyoming.htm

Specifically: mag5001_gxf.gz | 2000-04-24 16:50 | 57K

Potential citation:

Kucks, R.P., and Hill, P.L., 2000, Wyoming aeromagnetic and gravity maps and data—A web site for distribution of data: U.S. Geological Survey Open-File Report 00-0198, https://pubs.usgs.gov/of/2000/ofr-00-0198/html/wyoming.htm

@ThomasMGeo
Author

To do: check the licensing of the data file from USGS.

@ThomasMGeo
Author

To do: add attribution to the gist.

@ThomasMGeo
Author

For USGS data, it looks like we just need to add attribution as well: https://www.usgs.gov/faqs/are-usgs-reportspublications-copyrighted

@ThomasMGeo
Author

@santisoler I added some text about the USGS; the attribution to the gist was already there. Let me know if I need to do anything else! I probably won't edit this further until I have some more feedback.

@ThomasMGeo
Author

@santisoler, is there anything else you want to see?

@santisoler
Member

Hi @ThomasMGeo. Thanks for pushing these changes. Let me take a look at this and I'll come back to you.

@santisoler (Member) left a comment

Thanks @ThomasMGeo for opening this PR!

I took a quick look at it and I have the following comments.

I'm not sure what the benefit is of not making read_gxf always return an xarray.DataArray. Once we have the DataArray, it's trivial to get the underlying NumPy array and the metadata as a dict (see the sketch below). And by doing so, we only need to add one public function to Harmonica (read_gxf), instead of read_gxf and gxf_to_xarray (which has only a single use). In my opinion we should make read_gxf behave similarly to the other readers in Harmonica and return a DataArray.
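For example (a sketch, assuming read_gxf gets exposed as a public function returning a DataArray; the filename is just for illustration):

import harmonica as hm

grid = hm.read_gxf("mag5001.gxf")  # hypothetical: returns an xarray.DataArray
data = grid.values                 # the underlying NumPy array
metadata = dict(grid.attrs)        # the header metadata as a plain dict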

BTW, we need to remember to add any new public function to harmonica/__init__.py and to the API reference in doc/api/index.rst.

One little thing to note is that the docstrings should follow numpydoc's style, so they can be nicely rendered in Sphinx.
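For reference, a numpydoc-style docstring would look something like this (a sketch with a hypothetical signature):

def read_gxf(fname):
    """
    Read a GXF (Grid eXchange Format) file.

    Parameters
    ----------
    fname : str or pathlib.Path
        Path to the GXF file to read.

    Returns
    -------
    grid : xarray.DataArray
        Grid values with the header metadata stored in ``attrs``.
    """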

I left some more comments below.

Let me know what you think. We can iterate over it later on! And feel free to ask for help if you need it.

Comment on lines +10 to +15
1. Grid eXchange Format (*.gxf)

GXF (Grid eXchange File) is a standard ASCII file format for
exchanging gridded data among different software systems.
Software that supports the GXF standard will be able to import
properly formatted GXF files and export grids in GXF format.
@santisoler (Member)

I suspect that these paragraphs are coming from this PDF: https://pubs.usgs.gov/of/1999/of99-514/grids/gxf.pdf. Am I right?

Do we have permission to reproduce the same text here? Can we double-check the license of those files? If in doubt, I would remove these lines from this file; a reference should be enough. If we do have permission we can leave them, but I would make sure we are compliant with the license (reference the document, acknowledgement, include the license text, etc.).

@ThomasMGeo (Author)

I am happy to remove it.

harmonica/_io/geosoft_gxf_io.py (resolved comment)
Comment on lines +187 to +188
if reading_data:
    data_list.append(line)
@santisoler (Member)

From what I understood, these lines populate data_list with the values of the grid. Since the number of data lines is usually going to be much larger than the number of header lines, why not use np.loadtxt to read them? It performs much faster than appending lines to a list, it already returns an array in the right shape, and it has some sanity checks (that the number of elements per row is consistent) that we then don't have to take care of.

I think we could start reading the header lines until we find the #GRID line, keep a record of the number of header rows, and stop the iteration. Then we can use np.loadtxt to read the rest of the file by passing that number of rows to skiprows (+/- 1, we need to figure out the correct indexing). See the sketch below.
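A rough sketch of that approach (the filename and variable names are illustrative):

import numpy as np

fname = "mag5001.gxf"  # illustrative path

# Count header lines until the #GRID marker.
header_lines = 0
with open(fname) as gxf_file:
    for line in gxf_file:
        header_lines += 1
        if line.strip().upper() == "#GRID":
            break

# Let np.loadtxt handle the (much larger) data block.
data = np.loadtxt(fname, skiprows=header_lines)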

Comment on lines +238 to +240
# Get grid dimensions
nrows = int(headers['ROWS'])
ncols = int(headers['POINTS'])
@santisoler (Member)

In lines https://github.com/fatiando/harmonica/pull/539/files#diff-5ac31ec9d8d11b47ff39f59bf25e1cad43484857580dd7b6c65cddd8e0f504cfR219-R229 we populate the metadata dictionary with parsed values (float, int, stripped str). Isn't it better to use metadata to get the number of rows and columns, instead of having to parse them again as ints?

That being said, maybe we want the keys in metadata to be lowercase (as you do in the following lines with nx, ny, x_inc, etc.). That would be a one-liner, sketched below.
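A sketch, using a stand-in header dict:

headers = {"ROWS": "512", "POINTS": "256"}  # stand-in for the parsed header
metadata = {key.lower(): value for key, value in headers.items()}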

Comment on lines +283 to +308
def get_grid_info(metadata: Dict[str, Any]) -> None:
    """
    Print comprehensive information about the GXF grid.

    Parameters:
        metadata (dict): Metadata dictionary from read_gxf
    """
    print("=== Grid Information ===")
    print(f"Title: {metadata.get('TITLE', 'Not specified')}")
    print(f"Grid size: {metadata['nx']} x {metadata['ny']} points")
    print(f"Cell size: {metadata['x_inc']} x {metadata['y_inc']}")
    print(f"Origin: ({metadata['x_min']}, {metadata['y_min']})")
    print(f"Rotation: {metadata.get('ROTATION', 0)}")
    print(f"Dummy value: {metadata.get('DUMMY', 'Not specified')}")

    if 'projection' in metadata:
        print("\n=== Projection Information ===")
        print(f"Type: {metadata['projection']['type']}")
        print(f"Units: {metadata['projection']['units']}")

        params = metadata['projection']['parameters']
        if any(params.values()):
            print("\nProjection Parameters:")
            for key, value in params.items():
                if value is not None:
                    print(f"{key}: {value}")
@santisoler (Member)

I think this function is nice for debugging the reader, but I hardly see users using it (and we would need to expose it as well). Instead, if read_gxf returns a DataArray, then we don't need this function: xarray can nicely display all the data.

What do you think? Is there any special reason to include this function?

@ThomasMGeo
Author

Hi @santisoler, thanks for the review. I just want to touch on a few comments before I dig into the rest of the issues.

> I'm not sure what the benefit is of not making read_gxf always return an xarray.DataArray

I am happy to have it be a very plain DataArray without the CRS (usually the CRS is not given as a code in the files that I have seen). For some of my work, I have had to do some weird warping to get them to match, but this might be rare and isolated.

The get_grid_info function was just used to figure out the projections (again, these are not standard CRSs with modern EPSG codes, from what I have seen). So my general workflow would be to load the file as a numpy array, look at that output, and then load it into xarray with all of the EPSG codes attached.

Let me know what you think is best, and I can tackle it!

@santisoler
Member

Hi @ThomasMGeo. I see your point. Since GXF files don't have a strict specification for setting the CRS, I think we should just add whatever information we can find in the header of the file to the attrs of the xarray.DataArray. If that information is enough to determine the CRS, that's cool. But if it's not, too bad. I don't think there's anything the reader function can do to solve that, am I right?

I do like having the information about the projection in the metadata. A nested dict (like the one you already implemented) should be fine (we might want to make sure that the structure follows the CF metadata conventions).

Another thing we were discussing today during the meeting is to use easting and northing for the x and y directions, but x and y in case the grid is rotated (very much like the read_oasis_grd function does). A sketch of how that could look is below.
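A sketch of building the DataArray that way (all variables are stand-ins for values parsed from the header):

import numpy as np
import xarray as xr

# Stand-ins for values parsed from a GXF header:
data = np.zeros((3, 4))
easting = np.linspace(0, 300, 4)
northing = np.linspace(0, 200, 3)
metadata = {"title": "example grid"}

grid = xr.DataArray(
    data,
    coords={"northing": northing, "easting": easting},
    dims=("northing", "easting"),
    attrs=metadata,  # header fields, including any projection info
)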

@ThomasMGeo
Author

Awesome, that all makes sense. Let me work on it for a bit and get back to you. Thanks for giving it a look and for all the feedback! :)
