Extend variable expressions for tables #22

Open
wants to merge 2 commits into
base: main
from


bridwell
Contributor

This PR extends the existing variable expressions framework so that table to_frame() calls can be evaluated at the time of injection. It is intended to partially address #15; the primary goal is to make column-level dependencies easier to specify, so that we can eventually build dependency graphs with more dynamic caches. It adds a new optional workflow and shouldn't impact existing workflows or configurations.

Typically, when using tables, we do something like this:

@orca.step
def my_step(households, buildings):
  # get data from the wrapper
  hh_df = households.to_frame(['col1', 'col2'])
  bldgs_df = buildings.to_frame()
  
  # some logic using the fetched data
  new_col = hh_df['col2'] / 2

  # update the underlying data
  households.update_col('col2', new_col)

This is nice because it's very flexible, but it makes column-level dependencies hard to capture: at the time of injection only the wrapper is collected, and any column could be used later within the function. With the new expressions, we essentially move all to_frame calls into the arguments, so we know exactly which columns are being used:

@orca.step
def my_step(hh_df='households[col1, col2]', bldgs_df='buildings.*'):
  # some logic ...
  new_col = hh_df['col2'] / 2

  # update the underlying data
  hh_df.update_col('col2', new_col)
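
To illustrate why this helps with dependency tracking, here is a minimal sketch (not code from this branch) of how the declared expressions could be read off a function without running it. The helper name declared_expressions is hypothetical, and the step is shown undecorated so the snippet has no orca dependency:

```python
import inspect


def declared_expressions(func):
    """Map each argument name to its string default (the table expression)."""
    params = inspect.signature(func).parameters
    return {
        name: param.default
        for name, param in params.items()
        if isinstance(param.default, str)
    }


# Undecorated stand-in for the step above; the body is irrelevant here.
def my_step(hh_df='households[col1, col2]', bldgs_df='buildings.*'):
    pass


exprs = declared_expressions(my_step)
```

Here exprs maps hh_df and bldgs_df to their expression strings, which is the raw material a column-level dependency graph would need.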

Note that what is injected here is a pandas DataFrame, not an orca TableWrapper. However, the following attributes are attached to the data frame so that it can update the underlying data:

  • .wrapper: returns a reference to the orca table wrapper
  • .update_col: calls TableWrapper.update_col
  • .update_col_from_series: calls TableWrapper.update_col_from_series

The following expressions are supported for table to_frame() calls:

  • fetch all columns: hh='households.*'
  • fetch only local columns: hh='households.local'
  • fetch specific columns: hh='households[col1, col2]'
  • fetch all local columns and specific extra columns: hh='households[local, col2]'
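
For anyone curious how the four forms above might be parsed, here is a rough sketch. It is an illustration only and may not match the implementation in this branch; the function name parse_table_expression, the regular expression, and the choice to return None for "all columns" are all assumptions:

```python
import re

# Hypothetical grammar covering only the four forms listed above.
_EXPR = re.compile(
    r'^(?P<table>\w+)(?:\.(?P<attr>\*|local)|\[(?P<cols>[^\]]*)\])$'
)


def parse_table_expression(expr):
    """Return (table_name, columns).

    columns is None for "all columns"; otherwise it is a list of names in
    which the literal 'local' stands for "all local columns".
    """
    match = _EXPR.match(expr.strip())
    if match is None:
        raise ValueError('unsupported table expression: %r' % expr)
    table = match.group('table')
    if match.group('attr') == '*':
        return table, None
    if match.group('attr') == 'local':
        return table, ['local']
    cols = [c.strip() for c in match.group('cols').split(',') if c.strip()]
    return table, cols
```

With a grammar like this one, column names containing commas or square brackets could not be expressed, which is the sort of naming limitation a string-based approach implies.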

This can also be used outside of a function argument as shorthand for to_frame calls:

hh_view = orca.get_table_view('households', ['local', 'col1'])
new_col = hh_view['col1'] + hh_view['col2']
hh_view.update_col('col3', new_col)

See the tests for worked examples:

Thoughts?

@coveralls

Coverage Status

Coverage increased (+0.1%) to 96.628% when pulling 920aa3e on AZMAG:variable_expressions into f49398e on UDST:master.

@fscottfoti
Contributor

That's a very interesting proposal - I kind of like it, especially because I don't prefer working with dataframe wrappers ;)

This works by parsing the string and calling to_frame as appropriate? So there might be some limitations on column names, but nothing onerous?

And I presume asking for buildings.* would actually be strongly discouraged as you wouldn't get any of the benefits of the dependency tracking or performance improvements?

Also, does this have some relationship to orca_test? It seems like you could go through the list of steps a priori and check whether dependencies are met this way, so it is a kind of schema checker, but it wouldn't do type checking or check for valid values like orca_test does. Does that sound right?

@bridwell
Contributor Author

bridwell commented Mar 29, 2017

@fscottfoti, yes to all of the above. I haven't really worked with orca_test yet, but you're right, there does seem to be some overlap and perhaps that would be a better place to express the dependencies.

The other thing I've been kicking around, but haven't fully fleshed out yet, is to allow for an injectable name in the expression to define the columns, for example:

orca.add_injectable('some_cols', ['blah1', 'blah2'])

@orca.table()
def my_table(df='my_table.@some_cols'):
  # ...

Here the @ (or some other predefined symbol we might use) is used to denote that we will be pulling the column names, stored in the injectable some_cols, from the environment. This could then be used to automatically derive the columns from a configured model:

def get_cols_needed(model, table):
  return list(set(model.columns_used()) & set(table.columns))

@orca.injectable()
def hlcm_hh_cols(hlcm, households):
  return get_cols_needed(hlcm, households)

@orca.injectable()
def hlcm_bldg_cols(hlcm, buildings):
  return get_cols_needed(hlcm, buildings)

@orca.step()
def run_hlcm(hh='households.@hlcm_hh_cols', bldgs='buildings.@hlcm_bldg_cols'):
  # ... 

I guess this is still a little cumbersome, but it allows us to express the dependencies fairly dynamically without having to hard-code anything beyond what's in the yaml file.
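
As a rough sketch of the lookup step, assuming the '.@name' syntax above and using a plain dict as a stand-in for the orca environment (the function names here are hypothetical and nothing below exists in orca):

```python
def split_injectable_expression(expr):
    """Split 'table.@name' into (table, name)."""
    table, sep, name = expr.partition('.@')
    if not sep or not table or not name:
        raise ValueError('expected "table.@injectable", got %r' % expr)
    return table, name


def columns_for(expr, injectables):
    """Look up the column list named by the expression.

    injectables is a plain dict standing in for the orca environment; a
    value may be a list of column names or a zero-argument callable that
    returns one, so the columns can be computed lazily.
    """
    table, name = split_injectable_expression(expr)
    value = injectables[name]
    cols = value() if callable(value) else value
    return table, list(cols)


injectables = {'some_cols': ['blah1', 'blah2']}
result = columns_for('my_table.@some_cols', injectables)
```

The callable branch is what would let something like hlcm_hh_cols be derived from a configured model at run time rather than being hard-coded.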

@janowicz
Contributor

This is a really interesting extension to the variable expressions, and in general I like this new optional functionality! It saves a to_frame() call and sets the stage for a column-level dependency graph. Will try this out in a bit. Agreed that using an injectable name in the expression to define the columns is an interesting idea too; the model-columns-needed example you give is cool.

Just a heads up, we'll be cutting an orca release soon, and plan to start getting into a more regular release cycle (and pushing each release to pip / conda). It's been a while since the last one! Since this variable expression PR represents more significant functionality, and to give plenty of time for discussion and evaluation, we're thinking of cutting the release before this gets folded in, so that we have a tagged release that captures the last couple of years' worth of more minor changes.

@janowicz
Contributor

I've been using this branch of orca for a little while for testing, and I like this new functionality quite a bit. Saves on a lot of to_frame() calls and opens up interesting possibilities!

Has anyone else had a chance to try this out?

@bridwell
Contributor Author

bridwell commented Jul 31, 2017

The more I look at this, the more I feel that the whole expressions approach should be moved out of the function definition and into the step/injectable/table/column definition. To me, it's kind of confusing to have the function's default values be defined as strings but then expect the ultimate value to be something else. It also renders functions inoperable outside of the sim framework.

So instead of:

@orca.table()
def some_table(df='my_table.[col1, col2]', s='other_table.col1'):
  """
  Does something.

  Parameters:
  -------------
  df:  ?? a table view that looks like a string?
  s:  a pandas.Series that looks like a string?

  """
  # ...

I think this should instead be:

@orca.table(df='my_table.[col1, col2]', s='other_table.col1')
def some_table(df, s):
  """
  Does something.

  Parameters:
  -------------
  df:  orca table view
  s:  pandas.Series
  """
  # ...

Or for a non-decorated version:

orca.add_table(some_table, df='my_table.[col1, col2]', s='other_table.col1')

Base automatically changed from master to main March 25, 2021 19:44
4 participants