Support for higher frequency precipitation data #132
Hi again @ptoews! I don't think we'll be supporting this dataset through the JSON API: it doesn't fit very well into our current hourly-data structure, and it would blow up our production database from ~25 GB to somewhere in the vicinity of a terabyte. However, as you note, the structure of the data is quite similar, so you can re-use the parsing components:

# dwd_parsing.py
import datetime

from brightsky.parsers import ObservationsParser
from dateutil.tz import tzutc


class MinutelyPrecipitationParser(ObservationsParser):

    # Map the output field name to the column name in the DWD file
    elements = {
        'precipitation': 'RS_01',
    }

    def parse_station_id(self, zf):
        # The station ID is not extracted in this example
        return None

    def parse_lat_lon_history(self, zf, dwd_station_id):
        # No location history is parsed in this example
        return {}

    def parse_reader(self, filename, reader, lat_lon_history):
        for row in reader:
            # MESS_DATUM has minute resolution, e.g. 202201010000
            timestamp = datetime.datetime.strptime(
                row['MESS_DATUM'], '%Y%m%d%H%M').replace(tzinfo=tzutc())
            yield {
                'timestamp': timestamp,
                **self.parse_elements(row, None, None, None),
            }


def parse_1min(url):
    parser = MinutelyPrecipitationParser(url=url)
    parser.download()
    records = list(parser.parse())
    parser.cleanup()
    return records

used like:

In [1]: from dwd_parsing import parse_1min
In [2]: url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/recent/1minutenwerte_nieder_00902_akt.zip'
In [3]: parse_1min(url)[0]
Out[3]:
{'observation_type': 'historical',
 'dwd_station_id': None,
 'wmo_station_id': None,
 'timestamp': datetime.datetime(2022, 1, 1, 0, 0, tzinfo=tzutc()),
 'precipitation': 0.0}

I'm keeping this ticket open nonetheless to gauge whether there's much community interest in retrieving this data through the API.
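
Since the minutely records don't fit the hourly-data structure directly, one way to bridge the gap on the client side is to sum them per hour. The following is only a sketch: `hourly_totals` is a hypothetical helper (not part of Bright Sky), it assumes records shaped like the output above, and it labels each total with the start of its hour, which may not match the convention used in the DWD's hourly files.

```python
import datetime
from collections import defaultdict


def hourly_totals(records):
    """Sum 1-minute precipitation values per hour (hypothetical helper).

    Assumes each record is a dict with a 'timestamp' (datetime) and a
    'precipitation' value that may be None when no measurement exists.
    """
    totals = defaultdict(float)
    for record in records:
        value = record.get('precipitation')
        if value is None:
            # Skip missing measurements instead of counting them as zero
            continue
        # Truncate the timestamp to the start of its hour
        hour = record['timestamp'].replace(minute=0, second=0, microsecond=0)
        totals[hour] += value
    return dict(totals)
```

With the parser above, this could be called as `hourly_totals(parse_1min(url))`.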
Hi @jdemaeyer, thank you for the detailed example, that's already helpful!

To be honest, I wasn't even aware that all the data goes through your databases. I've now read through the README and some of the code. Am I understanding correctly that the purpose is to have an index of station locations, so you can find the nearest ones for a given query? But then why store the weather data as well? For efficiency, or to reduce load on the DWD API?

Since you're understandably not going to integrate this kind of data into the databases for now, what I think would help me is a tool that takes a location and a time, and returns the weather data from the DWD. The tool would need an index to find the closest station, and would then simply build a DWD file URL consisting of precipitation interval, station ID, and time. The file at this URL would be parsed as you described already.

Am I missing something? Do you think this would be useful as part of Bright Sky?
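
The tool proposed here could be sketched roughly as follows. This is an illustration under assumptions, not an existing Bright Sky feature: the URL template is inferred only from the single example URL earlier in this thread (five-digit, zero-padded station ID in the `recent` directory), so it ignores the time component and the historical files; `nearest_station` and `recent_url` are hypothetical helpers; and the station index is expected to be supplied by the caller as a list of dicts with `id`, `lat`, and `lon` keys.

```python
import math

# Assumption: pattern inferred from one example URL in this thread,
# valid only for files in the "recent" directory.
RECENT_URL = (
    'https://opendata.dwd.de/climate_environment/CDC/observations_germany/'
    'climate/1_minute/precipitation/recent/'
    '1minutenwerte_nieder_{station_id}_akt.zip'
)

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(stations, lat, lon):
    """Return the station dict closest to (lat, lon).

    `stations` is a hypothetical index: dicts with 'id', 'lat', 'lon'.
    """
    return min(
        stations,
        key=lambda s: haversine_km(lat, lon, s['lat'], s['lon']),
    )


def recent_url(station_id):
    """Build the (assumed) URL of a station's recent 1-minute file."""
    return RECENT_URL.format(station_id=str(station_id).zfill(5))
```

The resulting URL could then be passed to `parse_1min` from the earlier comment. A linear scan over the station index is used for simplicity; a spatial index would only matter for a very large number of lookups.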
Both of these, although the station index only plays a minor role compared to the performance of retrieving weather records once we know which stations to look at. Transparently (re-)loading and (re-)parsing the data from the DWD server for every request would be a massive waste of resources. A typical weather record contains data from nine different files on the DWD server, each of which is somewhere between 50 and 100 kilobytes (because each file holds multi-year measurements of a different parameter for that station). Bright Sky currently receives a little shy of a thousand requests per minute. So if we didn't store the data in our own database, we'd be requesting around a terabyte per day from the DWD server (roughly 100 Mbit/s). And that hasn't even gotten us started on the majestic amount of CPU power required to parse the same files over and over again, particularly if we want to reply within 12 ms on average like we currently do.
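
The traffic estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted above, rounding "a little shy of a thousand" up to 1000 requests per minute and assuming decimal units (1 KB = 1000 bytes); the constant and function names are made up for illustration.

```python
FILES_PER_RECORD = 9            # files on the DWD server per weather record
FILE_SIZE_KB = (50, 100)        # lower and upper bound per file
REQUESTS_PER_MINUTE = 1000      # assumption: rounded up from "a little shy of"
SECONDS_PER_DAY = 24 * 60 * 60


def daily_traffic_bytes(file_size_kb):
    """Bytes fetched per day if every request re-downloaded all files."""
    requests_per_day = REQUESTS_PER_MINUTE * 60 * 24
    return requests_per_day * FILES_PER_RECORD * file_size_kb * 1000


def mbit_per_second(bytes_per_day):
    """Average bandwidth in Mbit/s for a given daily volume."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


for size_kb in FILE_SIZE_KB:
    volume = daily_traffic_bytes(size_kb)
    print(f'{size_kb} KB files: {volume / 1e12:.2f} TB/day, '
          f'{mbit_per_second(volume):.0f} Mbit/s')
```

This gives a range of roughly 0.65 to 1.3 TB per day (60 to 120 Mbit/s), which brackets the "around a terabyte per day" and "roughly 100 Mbit/s" quoted above.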
While the approach you outline should work (although it will be very inefficient for the reasons above), I don't think it'll land in Bright Sky, particularly because it would violate fair use principles: we would be building a service that eats more and more of someone else's resources as it grows (in this case the DWD's storage and bandwidth). The […]
That makes sense, that's a lot of data. The sources endpoint is very helpful, will use that for sure. Thanks!
(Closing in favour of #148, which contains more alternatives)
Are there any plans to support providing higher frequency data, e.g. 1-minute intervals from here: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/recent/ ?
The directory structure seems different, but the file content structure looks similar.