-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend variable expressions for tables #22
base: main
Are you sure you want to change the base?
Conversation
That's a very interesting proposal - I kind of like it, especially because I don't prefer working with dataframe wrappers ;) This works by parsing the string and calling to frame as appropriate? So there might be some limitations on column names, but nothing onerous? And I presume asking for buildings.* would actually be strongly discouraged as you wouldn't get any of the benefits of the dependency tracking or performance improvements? Also, does this have some relationship to orca_test? I mean, it seems like you can a priori go through the list of steps and check if dependencies are met this way, so is kind of a schema checker, but wouldn't do type checking or valid values like orca test does. Does that sound right? |
@fscottfoti, yes to all of the above. I haven't really worked with orca_test yet, but you're right, there does seem to be some overlap and perhaps that would be a better place to express the dependencies. The other thing I've been kicking around, but haven't fully fleshed out yet, is to allow for an injectable name in the expression to define the columns, for example: orca.add_injectable('some_cols', ['blah1', 'blah2'])
@orca.table()
def my_table(df='my_table.@some_cols'):
# ... Here the @ (or some other predefined symbol we might use) is used to denote that we will be pulling the column names, stored in the injectable some_cols, from the environment. This could then be used to automatically derive the columns from a configured model: def get_cols_needed(model, table):
return list(set(model.columns_used()) & set(table.columns))
@orca.injectable()
def hlcm_hh_cols(hlcm, households):
return get_cols_needed(hlcm, households)
@orca.injectable()
def hlcm_bldg_cols(hlcm, buildings):
return get_cols_needed(hlcm, buildings)
@orca.step()
def run_hlcm(hh='households.@hlcm_hh_cols', bldgs='buildings.@hlcm_bldg_cols'):
# ... I guess this is still a little cumbersome, but it allows us to express the dependencies fairly dynamically without having to hard-code anything beyond what's in the yaml file. |
This is a really interesting extension to the variable expressions, and in general I like this new optional functionality! Saves a to_frame() call and sets the stage for a column-level dependency graph. Will try this out in a bit. Agreed that the injectable name in the expressions to define the columns is an interesting idea too- the model-columns-needed example you give is cool. Just a heads up, we'll be cutting an orca release soon, and plan to start getting into a more regular release cycle (and pushing each to pip / conda). It's been awhile since the last one! Since this variable expression PR represents more significant functionality and also so as to give plenty of time for discussion/evaluation of this, we're thinking of cutting the release prior to this getting folded in so as to have a tagged release that captures the last couple year's worth of more minor changes. |
I've been using this branch of orca for a little while for testing, and I like this new functionality quite a bit. Saves on a lot of to_frame() calls and opens up interesting possibilities! Has anyone else had a chance to try this out? |
The more I look at this, I feel like the whole expressions approach should be moved out of the function defintion and into the step/injectable/table/column definition. To me, it's kind of confusing to have the functions's default values be defined as strings but then expect the ultimate value to be something else. It also renders functions inoperable outside of the sim framework. So instead of: @orca.table()
def some_table(df='my_table.[col1, col2]', s='other_table.col1'):
"""
Does something.
Parameters:
-------------
some_table: ?? a table view that looks like a string?
s: a pandas.Series that looks like a string?
"""
# ... I think this should instead be: @orca.table(df='my_table.[col1, col2]', s='other_table.col1')
def some_table(df, s):
"""
Does something.
Parameters:
-------------
some_table: orca table view
s: pandas.Series
"""
# ... Or for a non-decorated version: orca.add_table(some_table, df='my_table.[col1, col2]', s='other_table.col1') |
This PR extends the existing variable expressions framework, to allow table to_frame() calls to be evaluated at the time of injection. This is intended to partially address #15, with the primary goal of making it easier to specify column-level dependencies so that we can eventually build dependency graphs with more dynamic caches. This is intended to provide a new potential workflow, and shouldn't impact existing workflows and configurations.
Typically, when using tables, we do something like this:
This is nice because it's super flexible, but it's really difficult to capture column-level dependencies, since at the time injection, only the wrapper is collected and any column could be used later within the function. Using the new expressions, we essentially move all to_frame calls into the arguments so we know exactly which columns are being used:
Note that what is injected here is a Pandas DataFrame, not an orca TableWrapper. However, the data frame is appended with the following, so it can update the underlying data:
The following expressions are supported for table to_frame() calls:
hh='households.*'
hh='households.local'
hh='households[col1, col2]'
hh='households[local, col2]'
This can also be used outside of a function argument as shorthand for to_frame calls:
See the tests for worked examples:
Thoughts?