Support for higher frequency precipitation data #132

Closed
ptoews opened this issue Oct 24, 2022 · 5 comments

@ptoews

ptoews commented Oct 24, 2022

Are there any plans to support providing higher frequency data, e.g. 1-minute intervals from here: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/recent/ ?
The directory structure seems different, but the file content structure looks similar.

@jdemaeyer
Owner

Hi again @ptoews!

I don't think we'll be supporting this dataset through the JSON API as it doesn't fit very well into our current hourly-data structure and would pretty much explode the size of our current production database from ~25 GB to somewhere in the vicinity of a terabyte.

However, as you note, the structure of the data is quite similar, so you can re-use the parsing components in brightsky.parsers to parse these files locally, e.g. like this:

# dwd_parsing.py

import datetime

from brightsky.parsers import ObservationsParser
from dateutil.tz import tzutc


class MinutelyPrecipitationParser(ObservationsParser):

    elements = {
        'precipitation': 'RS_01',
    }

    def parse_station_id(self, zf):
        return None

    def parse_lat_lon_history(self, zf, dwd_station_id):
        return {}

    def parse_reader(self, filename, reader, lat_lon_history):
        for row in reader:
            timestamp = datetime.datetime.strptime(
                row['MESS_DATUM'], '%Y%m%d%H%M').replace(tzinfo=tzutc())
            yield {
                'timestamp': timestamp,
                **self.parse_elements(row, None, None, None),
            }


def parse_1min(url):
    parser = MinutelyPrecipitationParser(url=url)
    parser.download()
    try:
        # Materialize all records before cleaning up
        return list(parser.parse())
    finally:
        # Run cleanup even if parsing fails
        parser.cleanup()

which can be used like this:

In [1]: from dwd_parsing import parse_1min

In [2]: url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/recent/1minutenwerte_nieder_00902_akt.zip'

In [3]: parse_1min(url)[0]
Out[3]:
{'observation_type': 'historical',
 'dwd_station_id': None,
 'wmo_station_id': None,
 'timestamp': datetime.datetime(2022, 1, 1, 0, 0, tzinfo=tzutc()),
 'precipitation': 0.0}
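
If the minutely values need to line up with hourly data, the parsed records can be summed per clock hour. This is only a rough sketch: `hourly_totals` is a hypothetical helper, not part of brightsky, and it assumes each record is a dict with a `timestamp` and a `precipitation` value (possibly `None` for missing measurements), as in the output above. Truncating the timestamp to the start of the hour is an assumption about how the hours should be bucketed.

```python
import datetime
from collections import defaultdict


def hourly_totals(records):
    """Sum minutely precipitation values into one total per clock hour."""
    totals = defaultdict(float)
    for record in records:
        value = record.get('precipitation')
        if value is None:
            # Skip missing measurements instead of counting them as 0
            continue
        # Assumption: bucket each record by the start of its clock hour
        hour = record['timestamp'].replace(minute=0, second=0, microsecond=0)
        totals[hour] += value
    return dict(sorted(totals.items()))
```

Calling `hourly_totals(parse_1min(url))` would then give a dict that maps each hour to its summed precipitation.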

I'm keeping this ticket open nonetheless to gauge whether there's a lot of community interest in retrieving this data through the API.

@ptoews
Author

ptoews commented Oct 31, 2022

Hi @jdemaeyer, thank you for the detailed example, that's already helpful!

To be honest, I wasn't even aware that all the data goes through your databases. I've read through the Readme and some of the code now, am I understanding correctly that the purpose is to have an index of station locations, to find the nearest ones for a given query? But then why store the weather data as well, for efficiency/to reduce DWD API load?

Since you're understandably not going to integrate this kind of data into the databases for now, what I think would help me is a tool that takes a location and a time, and returns the weather data from DWD. The tool would need an index to find the closest station, and then simply build a DWD file url consisting of precipitation interval, station id, and time. The file at this URL would be parsed as you described already.

Am I missing something? Do you think this would be something that would be useful as part of brightsky?
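
The "index to find the closest station" part of this idea could look roughly like the sketch below. It assumes a small local list of stations with `lat` and `lon` keys; the station entries and the helper names `haversine_km` and `nearest_station` are hypothetical and only serve as illustration. A linear scan with the haversine distance is used here for simplicity, which is not necessarily how Bright Sky finds stations.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_station(stations, lat, lon):
    """Return the station dict closest to the given location."""
    return min(
        stations,
        key=lambda s: haversine_km(lat, lon, s['lat'], s['lon']),
    )


# Hypothetical index entries, for illustration only
stations = [
    {'id': 'north', 'lat': 53.5, 'lon': 10.0},
    {'id': 'south', 'lat': 48.1, 'lon': 11.6},
]
closest = nearest_station(stations, 52.5, 13.4)
```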

@jdemaeyer
Owner

> [...] am I understanding correctly that the purpose is to have an index of station locations, to find the nearest ones for a given query? But then why store the weather data as well, for efficiency/to reduce DWD API load?

Both of these, although the station index only plays a minor role compared to the performance of retrieving weather records once we know which stations to look at.

Transparently (re-)loading and (re-)parsing the data from the DWD server for every request would be a massive waste of resources. A typical weather record contains data from nine different files on the DWD server, each of which is somewhere between 50 and 100 kilobytes (because each file holds multi-year measurements for a different parameter for that station). Bright Sky currently receives a little shy of a thousand requests per minute. So if we didn't store the data in our own database, we'd be requesting around a terabyte per day from the DWD server (roughly 100 Mbit/s). And that hasn't even gotten us started on the majestic amount of CPU power required to parse the same files over and over again, particularly if we want to reply within 12 ms on average like we currently do.
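
For anyone who wants to check the numbers, here is the back-of-the-envelope calculation as a small script. It rounds "a little shy of a thousand" up to 1000 requests per minute and assumes 1 KB = 1000 bytes, so treat the result as an order-of-magnitude estimate only.

```python
FILES_PER_RECORD = 9         # files touched per weather record (from the text)
FILE_SIZE_KB = (50, 100)     # lower and upper bound per file (from the text)
REQUESTS_PER_MINUTE = 1000   # rounded up from "a little shy of a thousand"


def daily_traffic_bytes(file_size_kb):
    """Bytes requested from the DWD server per day without a local database."""
    requests_per_day = REQUESTS_PER_MINUTE * 60 * 24
    # Assumption: 1 KB = 1000 bytes
    return requests_per_day * FILES_PER_RECORD * file_size_kb * 1000


for size_kb in FILE_SIZE_KB:
    per_day = daily_traffic_bytes(size_kb)
    mbit_per_s = per_day * 8 / 86_400 / 1e6
    print(f'{size_kb} KB files: {per_day / 1e12:.2f} TB/day, '
          f'{mbit_per_s:.0f} Mbit/s')
# 50 KB files: 0.65 TB/day, 60 Mbit/s
# 100 KB files: 1.30 TB/day, 120 Mbit/s
```

Both bounds bracket the "around a terabyte per day" and "roughly 100 Mbit/s" figures.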

> Am I missing something? Do you think this would be something that would be useful as part of brightsky?

While the approach you outline should work (but will be very inefficient for the reasons above), I don't think it'll land in Bright Sky. Particularly because it violates fair use principles: we would be building a service that eats more and more of someone else's resources as it grows (in this case the DWD's storage and bandwidth).

The /sources endpoint allows querying Bright Sky's lat-lon-to-station-id mapping without retrieving any weather records. Maybe that can help you? From there you could easily build the URL to the 1-minute precipitation data and parse it like in my post above.
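
A rough sketch of that lookup, using only the standard library. Several things here are assumptions rather than confirmed details: the base URL `https://api.brightsky.dev/sources`, the `lat`/`lon` query parameters, the `sources` key in the response, and a `dwd_station_id` field that already holds the zero-padded ID used in the file name (like `00902` above). The helpers `fetch_sources`, `minutely_url` and `candidate_urls` are hypothetical names. Not every station necessarily publishes 1-minute precipitation, so a generated URL may not exist.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public Bright Sky endpoint and parameter names
SOURCES_URL = 'https://api.brightsky.dev/sources'
DWD_1MIN_RECENT = (
    'https://opendata.dwd.de/climate_environment/CDC/observations_germany/'
    'climate/1_minute/precipitation/recent/'
)


def fetch_sources(lat, lon):
    """Query the /sources endpoint for stations near a location."""
    query = urllib.parse.urlencode({'lat': lat, 'lon': lon})
    with urllib.request.urlopen(f'{SOURCES_URL}?{query}') as response:
        # Assumption: the JSON body has a top-level 'sources' list
        return json.load(response)['sources']


def minutely_url(dwd_station_id):
    """Build the 'recent' 1-minute precipitation URL for a station ID."""
    return f'{DWD_1MIN_RECENT}1minutenwerte_nieder_{dwd_station_id}_akt.zip'


def candidate_urls(sources):
    """Return one URL per distinct DWD station ID, keeping the given order."""
    station_ids = []
    for source in sources:
        station_id = source.get('dwd_station_id')
        if station_id and station_id not in station_ids:
            station_ids.append(station_id)
    return [minutely_url(station_id) for station_id in station_ids]
```

Something like `candidate_urls(fetch_sources(52.0, 7.6))` would then yield URLs to try with `parse_1min`, falling back to the next one if a download fails.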

@ptoews
Author

ptoews commented Nov 4, 2022

That makes sense, that's a lot of data. The sources endpoint is very helpful, will use that for sure. Thanks!

@jdemaeyer
Owner

(Closing in favour of #148, which contains more alternatives)
