As a User, I would like HydroTools to expose public api tooling at the subpackage level #177
Related to #141.
This is a bit too abstract for me and could use more concrete examples (ha!). It seems we could be discussing multiple related issues. I generally agree that flat package structures are better than deep ones. But as it pertains to public and private APIs, I'm not clear on the distinction being drawn here.
😏
Ah right, I should have been more complete and included a definition of sorts. My use of public vs. private API in this context means functions, methods, and constants that are or aren't exposed to users of the library.

I think what I am really trying to get at is that the most frequently used tools (here meaning classes and methods; which ones qualify is subjective and will require collective agreement) should be the first things a user is presented with. So, as you were alluding to in your comment, @jarq6c, my critiques are more about code organization and code style.
I think I see what you mean, but I also think further discussion would be helpful. The "helper" methods you listed in this example will be necessary for many forms of event detection, not just the current one. I've also been meaning to add Eckhardt's filter for a while.
Why not resurrect a two-year-old issue? I thought I would add the method below:

```python
from hydrotools.nwis_client.iv import IVDataService
import pandas as pd
import matplotlib.pyplot as plt

def separate_baseflow(
    series: pd.Series,
    output_time_scale: pd.Timedelta,
    recession_time_scale: pd.Timedelta = "1D",
    minimum_recession_length: pd.Timedelta = "5D",
    random_error: float = 0.01
) -> pd.Series:
    """Applies the digital recursive filter from Eckhardt (2005)
    to series and returns a baseflow time series. A linear reservoir recession
    constant (a) is estimated using the method found in Eckhardt (2008).
    The maximum baseflow index (BFI_max) is computed using the reverse filtering
    method found in Collischonn & Fan (2013).

    Args:
        series (pandas.Series): float Series of streamflow values with a DateTimeIndex.
        output_time_scale (pandas.Timedelta): Desired output time scale of the
            resulting baseflow time series.
        recession_time_scale (pandas.Timedelta): Time scale over which to evaluate
            potential baseflow recessions. Defaults to '1D' as in Eckhardt (2008).
            This parameter can be informed by catchment scale and
            "signal-to-noise ratio." Smaller catchments with faster responding
            streams may be able to use a shorter recession_time_scale (say '12h').
            However, smaller values will be more susceptible to random
            fluctuations in streamflow.
        minimum_recession_length (pandas.Timedelta): Minimum length of time to
            consider when determining potential recessions. This is the amount
            of time over which consecutive streamflow values should all decrease.
            Defaults to '5D' as in Eckhardt (2008). Should be some multiple of
            recession_time_scale; 3 or 5 times the recession_time_scale works
            for many cases.
        random_error (float): Expected random error of a single streamflow value
            at the recession_time_scale. Defaults to 1% (0.01) as in Eckhardt (2008).

    Returns:
        pandas.Series: A separated baseflow time series derived from the input at
            the desired output_time_scale.
    """
    # Identify recessions
    resampled = series.resample(recession_time_scale).first().interpolate()
    difference = resampled.diff()
    is_recession = difference.rolling(
        window=minimum_recession_length,
        center=True).apply(lambda w: (w < 0.0).all())
    previous = resampled.shift(1)

    # Fit a line through the origin and the "upper bound" of the data to
    # estimate the recession constant (a). This assumes a linear reservoir.
    x = previous[is_recession == 1.0]
    y = resampled[is_recession == 1.0]
    recession_constant = max(y / x) * (1.0 - 2.0 * random_error)

    # Convert slope to output time scale
    periods = pd.Timedelta(recession_time_scale) / pd.Timedelta(output_time_scale)
    recession_constant = 1 - ((1 - recession_constant) / periods)

    # Estimate maximum baseflow index (BFI_max) with a reverse pass.
    # Copy the reversed series so the backward filter does not alias baseflow.
    baseflow = series.resample(output_time_scale).first().interpolate()
    maximum_baseflow = baseflow[::-1].copy()
    for idx in range(1, maximum_baseflow.size):
        maximum_baseflow.iloc[idx] = min(
            maximum_baseflow.iloc[idx-1] / recession_constant,
            maximum_baseflow.iloc[idx])
    max_baseflow_index = sum(maximum_baseflow) / sum(baseflow)

    # Estimate baseflow with the forward filter
    for idx in range(1, baseflow.size):
        estimate = (((1 - max_baseflow_index) * recession_constant * baseflow.iloc[idx-1] +
                     (1 - recession_constant) * max_baseflow_index * baseflow.iloc[idx]) /
                    (1 - recession_constant * max_baseflow_index))
        baseflow.iloc[idx] = min(estimate, baseflow.iloc[idx])
    return baseflow

# Get observed data
site_code = "01646500"
service = IVDataService()
observed = service.get(
    sites=site_code,
    startDT="2023-01-01",
    endDT="2024-01-01"
)

# Preprocess data
observed = observed[['value_time', 'value']]
observed = observed.drop_duplicates(['value_time'])
observed = observed.set_index('value_time')

# Baseflow separation
observed = observed.resample("15min").first()
observed["baseflow"] = separate_baseflow(observed["value"], "15min")
observed.plot()
plt.show()
```
Currently, we are a little loose with the way we organize our "public" APIs. For example, in hydrotools's metrics subpackage, a user must reach into submodules to access the "public" functions. This can lead to confusion when "non-public" methods are exposed in subpackage submodules alongside "public" methods.
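To make the confusion concrete, here is a hypothetical sketch of a submodule that mixes a "public" metric with an internal helper; the function names are illustrative, not HydroTools' actual API. To a user browsing `dir()` or tab completion, both names look equally public:

```python
import types

# Build a stand-in submodule in memory (illustrative names only).
metrics = types.ModuleType("metrics")
exec(
    "def nash_sutcliffe_efficiency(obs, sim):\n"
    "    return 1.0  # placeholder body\n"
    "def validate_inputs(obs, sim):\n"
    "    return True  # internal helper, but it looks public\n",
    metrics.__dict__,
)

# Without an underscore prefix or an explicit __all__, the helper is
# indistinguishable from the intended public surface:
visible = [name for name in dir(metrics) if not name.startswith("_")]
```

Here `visible` contains both the metric and the helper side by side, which is exactly the ambiguity the user story describes.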
The aforementioned user story is from @Castronova's experience (please chime in with further context as you see fit).
This movement to flatter Python package structures is apparent in many large Python libraries (e.g. pandas). With this in mind, my point is not to just copy what others are doing, even if these practices are becoming idiomatic. Instead, I would like to talk about the pros and cons of moving towards flattening subpackage structures to expose APIs at the subpackage level.

Given our scientific audience, I want to avoid introducing ambiguity into what someone might infer a method does: a bare, generic name exposed at the top level is not clear enough, IMO. However, I think there is a middle ground where we adhere to scientific integrity and maximize usability.
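As a hypothetical illustration of that naming concern (all names here are invented, not actual HydroTools API), compare a generic flat export with a qualified one:

```python
# A flat, generic export hides the method:
#     from hydrotools.baseflow import separate           # which method is this?
# whereas a qualified path carries the scientific context:
#     from hydrotools.baseflow.eckhardt import separate  # method is explicit
#
# The same trade-off, shown with plain functions:
def separate(series):
    """Generic name: the reader cannot tell which separation method runs."""
    return eckhardt_separate(series)

def eckhardt_separate(series):
    """Explicit name: the technique (Eckhardt digital filter) is visible."""
    return [min(series)] * len(series)  # placeholder body, not the real filter

baseflow = separate([5.0, 3.0, 4.0])
```

The middle ground might be flat exposure of names that are themselves unambiguous, rather than flat exposure of generic verbs.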
Much of what I am trying to describe, sklearn concretizes in their package layout. There is an initial categorization in their subpackage structure (this mirrors ours), with the majority of "public" methods being exposed in a flat structure at the subpackage level.

I don't want to talk implementation details (yet, if required); instead I'd rather hear what others think and where you all deem it appropriate and inappropriate. I'm looking forward to hearing your thoughts.
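That sklearn-style layout can be sketched as follows. This is a hypothetical mock-up built in memory (the module and function names, including `hydrotools.metrics._flow_metrics`, are invented for illustration): implementation lives in a "private" module, and the subpackage `__init__` re-exports the agreed-upon public names flatly.

```python
import types

# Stand-in for a private implementation module (invented name).
_impl = types.ModuleType("hydrotools.metrics._flow_metrics")
exec(
    "def mean_error(obs, sim):\n"
    "    return sum(s - o for o, s in zip(obs, sim)) / len(obs)\n",
    _impl.__dict__,
)

# What the subpackage __init__.py would do:
metrics = types.ModuleType("hydrotools.metrics")
metrics.mean_error = _impl.mean_error  # flat re-export of the public function
metrics.__all__ = ["mean_error"]       # explicit public surface

result = metrics.mean_error([1.0, 2.0], [2.0, 2.0])
```

With this pattern, users import directly from the subpackage, while helpers stay tucked away in underscore-prefixed modules.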
Pinging a few people who might want to provide feedback (@jarq6c, @hellkite500, @christophertubbs, @amaes3owp, @ZacharyWills, @jameshalgren)