Extend variable expressions for tables #22

bridwell · 2017-03-28T18:44:09Z

This PR extends the existing variable expressions framework to allow table to_frame() calls to be evaluated at injection time. It is intended to partially address #15; the primary goal is to make column-level dependencies easier to specify, so that we can eventually build dependency graphs with more dynamic caches. It provides a new optional workflow and shouldn't impact existing workflows or configurations.

Typically, when using tables, we do something like this:

@orca.step
def my_step(households, buildings):
  # get data from the wrapper
  hh_df = households.to_frame(['col1', 'col2'])
  bldgs_df = buildings.to_frame()
  
  # some logic using the fetched data
  new_col = hh_df['col2'] / 2

  # update the underlying data
  households.update_col('col2', new_col)

This is nice because it's super flexible, but it's really difficult to capture column-level dependencies: at the time of injection only the wrapper is collected, and any column could be used later within the function. Using the new expressions, we essentially move all to_frame calls into the arguments so we know exactly which columns are being used:

@orca.step
def my_step(hh_df='households[col1, col2]', bldgs_df='buildings.*'):
  # some logic ...
  new_col = hh_df['col2'] / 2

  # update the underlying data
  hh_df.update_col('col2', new_col)

Note that what is injected here is a pandas DataFrame, not an orca TableWrapper. However, the data frame is extended with the following attributes, so it can still update the underlying data:

.wrapper: returns a reference to the orca table wrapper
.update_col: calls TableWrapper.update_col
.update_col_from_series: calls TableWrapper.update_col_from_series

The following expressions are supported for table to_frame() calls:

fetch all columns: hh='households.*'
fetch only local columns: hh='households.local'
fetch specific columns: hh='households[col1, col2]'
fetch all local columns and specific extra columns: hh='households[local, col2]'
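To make the four forms above concrete, here is a minimal parsing sketch. It is not the parser from this PR: the function name `parse_table_expression`, the regular expression, and the returned tuple layout are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a table name followed by `.*`, `.local`, or `[a, b, ...]`.
_EXPR = re.compile(r"^(?P<table>\w+)(?:\.(?P<kw>\*|local)|\[(?P<cols>[^\]]*)\])$")


def parse_table_expression(expr):
    """Return (table_name, columns, include_local) for a table expression.

    columns is None when every column is requested ('table.*').
    """
    match = _EXPR.match(expr.strip())
    if match is None:
        raise ValueError("not a table expression: %r" % expr)
    table = match.group("table")
    if match.group("kw") == "*":
        return table, None, True
    if match.group("kw") == "local":
        return table, [], True
    # Bracket form: split on commas and treat `local` as a keyword.
    names = [c.strip() for c in match.group("cols").split(",") if c.strip()]
    include_local = "local" in names
    columns = [c for c in names if c != "local"]
    return table, columns, include_local
```

In this sketch a column literally named `local`, or one containing a comma or a closing bracket, could not be requested by name, which is the kind of column-name limitation a string-based syntax brings with it.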

This can also be used outside of a function argument as shorthand for to_frame calls:

hh_view = orca.get_table_view('households', ['local', 'col1'])
new_col = hh_view['col1'] + hh_view['col2']
hh_view.update_col('col3', new_col)

See the tests for worked examples:

Thoughts?

coveralls · 2017-03-28T18:49:11Z

Coverage increased (+0.1%) to 96.628% when pulling 920aa3e on AZMAG:variable_expressions into f49398e on UDST:master.

fscottfoti · 2017-03-29T17:45:17Z

That's a very interesting proposal - I kind of like it, especially because I don't prefer working with dataframe wrappers ;)

This works by parsing the string and calling to_frame as appropriate? So there might be some limitations on column names, but nothing onerous?

And I presume asking for buildings.* would actually be strongly discouraged as you wouldn't get any of the benefits of the dependency tracking or performance improvements?

Also, does this have some relationship to orca_test? I mean, it seems like you can a priori go through the list of steps and check whether dependencies are met this way, so it is kind of a schema checker, but it wouldn't do type checking or valid-value checks like orca_test does. Does that sound right?
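As a rough illustration of that a priori check, the sketch below reads the string defaults from a function signature and reports any requested column that a known table lacks. Everything in it is hypothetical: `KNOWN_COLUMNS` stands in for whatever the framework knows about registered tables, and `missing_columns` is not an orca or orca_test function.

```python
import inspect
import re

# Hypothetical stand-in for the registered tables and their column names.
KNOWN_COLUMNS = {
    "households": {"col1", "col2"},
    "buildings": {"zone_id"},
}

# Only the bracket form names specific columns, so that is all we match here.
_COLS = re.compile(r"^(\w+)\[([^\]]*)\]$")


def missing_columns(func, known):
    """List (argument, table, column) triples that the known tables lack."""
    problems = []
    for name, param in inspect.signature(func).parameters.items():
        default = param.default
        if not isinstance(default, str):
            continue  # no expression on this argument
        match = _COLS.match(default.strip())
        if match is None:
            continue  # e.g. 'table.*', which names no specific column
        table, cols = match.group(1), match.group(2)
        for col in (c.strip() for c in cols.split(",")):
            if col and col != "local" and col not in known.get(table, set()):
                problems.append((name, table, col))
    return problems


def my_step(hh_df="households[col1, col3]", bldgs_df="buildings.*"):
    pass
```

Calling `missing_columns(my_step, KNOWN_COLUMNS)` flags `col3` on `households` without running the step. The `buildings.*` argument is skipped because it names no columns, which is also why it gives no dependency information.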

bridwell · 2017-03-29T19:47:45Z

@fscottfoti, yes to all of the above. I haven't really worked with orca_test yet, but you're right, there does seem to be some overlap and perhaps that would be a better place to express the dependencies.

The other thing I've been kicking around, but haven't fully fleshed out yet, is to allow for an injectable name in the expression to define the columns, for example:

orca.add_injectable('some_cols', ['blah1', 'blah2'])

@orca.table()
def my_table(df='my_table.@some_cols'):
  # ...

Here the @ (or some other predefined symbol we might use) is used to denote that we will be pulling the column names, stored in the injectable some_cols, from the environment. This could then be used to automatically derive the columns from a configured model:

def get_cols_needed(model, table):
  return list(set(model.columns_used()) & set(table.columns))

@orca.injectable()
def hlcm_hh_cols(hlcm, households):
  return get_cols_needed(hlcm, households)

@orca.injectable()
def hlcm_bldg_cols(hlcm, buildings):
  return get_cols_needed(hlcm, buildings)

@orca.step()
def run_hlcm(hh='households.@hlcm_hh_cols', bldgs='buildings.@hlcm_bldg_cols'):
  # ...

I guess this is still a little cumbersome, but it allows us to express the dependencies fairly dynamically without having to hard-code anything beyond what's in the yaml file.
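A minimal sketch of how a `table.@name` expression might be resolved, assuming a plain dictionary in place of orca's injectable registry. The names `INJECTABLES` and `resolve_columns` are hypothetical, and the `.@` separator simply follows the examples above.

```python
# Hypothetical registry standing in for orca's injectables.
INJECTABLES = {"some_cols": ["blah1", "blah2"]}


def resolve_columns(expr, injectables):
    """Split 'table.@name' and look the column list up by injectable name."""
    table, sep, ref = expr.partition(".@")
    if not sep or not table or not ref:
        raise ValueError("expected 'table.@injectable', got %r" % expr)
    if ref not in injectables:
        raise KeyError("no injectable named %r" % ref)
    value = injectables[ref]
    # In this sketch an injectable may be a plain value or a zero-argument
    # function; functions are only called when the expression is resolved.
    columns = value() if callable(value) else value
    return table, list(columns)
```

One detail worth noting about `get_cols_needed`: a list built from a set intersection has no guaranteed order, so wrapping the result in `sorted(...)` would make the injected column order reproducible between runs.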

janowicz · 2017-03-30T17:06:38Z

This is a really interesting extension to the variable expressions, and in general I like this new optional functionality! It saves a to_frame() call and sets the stage for a column-level dependency graph. Will try this out in a bit. Agreed that the injectable name in the expressions to define the columns is an interesting idea too; the model-columns-needed example you give is cool.

Just a heads up, we'll be cutting an orca release soon, and plan to start getting into a more regular release cycle (and pushing each to pip / conda). It's been a while since the last one! Since this variable expression PR represents more significant functionality, and to give plenty of time for discussion and evaluation, we're thinking of cutting the release before this gets folded in, so as to have a tagged release that captures the last couple of years' worth of more minor changes.

janowicz · 2017-06-21T16:19:11Z

I've been using this branch of orca for a little while for testing, and I like this new functionality quite a bit. Saves on a lot of to_frame() calls and opens up interesting possibilities!

Has anyone else had a chance to try this out?

bridwell · 2017-07-31T20:12:18Z

The more I look at this, the more I feel that the whole expressions approach should be moved out of the function definition and into the step/injectable/table/column definition. To me, it's kind of confusing to have the function's default values be defined as strings but then expect the ultimate value to be something else. It also renders functions inoperable outside of the sim framework.

So instead of:

@orca.table()
def some_table(df='my_table.[col1, col2]', s='other_table.col1'):
  """
  Does something.

  Parameters:
  -------------
  df:  ?? a table view that looks like a string?
  s:  a pandas.Series that looks like a string?

  """
  # ...

I think this should instead be:

@orca.table(df='my_table.[col1, col2]', s='other_table.col1')
def some_table(df, s):
  """
  Does something.

  Parameters:
  -------------
  df:  orca table view
  s:  pandas.Series
  """
  # ...

Or for a non-decorated version:

orca.add_table(some_table, df='my_table.[col1, col2]', s='other_table.col1')
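The decorator-argument form could be sketched roughly as follows. This is an illustration under stated assumptions, not orca's implementation: `table`, `REGISTERED`, and `run` are hypothetical names, and `resolve` stands in for whatever turns an expression string into a table view or series.

```python
# Hypothetical registry; a real framework would keep this internally.
REGISTERED = {}


def table(**expressions):
    """Register a function together with the expressions for its arguments."""
    def decorator(func):
        REGISTERED[func.__name__] = (func, expressions)
        return func  # returned unchanged, so it stays callable on its own
    return decorator


def run(name, resolve):
    """Call a registered function, resolving each expression via `resolve`."""
    func, expressions = REGISTERED[name]
    kwargs = {arg: resolve(expr) for arg, expr in expressions.items()}
    return func(**kwargs)


@table(df="my_table.[col1, col2]", s="other_table.col1")
def some_table(df, s):
    return df, s
```

Because the decorator hands back the original function, `some_table(df, s)` can be called directly with real data in a test or notebook, while the expressions live only in the registration. That addresses both concerns raised above: the signature no longer has string defaults, and the function works outside the framework.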

bridwell added 2 commits March 27, 2017 13:57

add table view and tests

0dab264

extend expressions for table views

920aa3e

hanase mentioned this pull request Apr 4, 2017

Allow the "run" function to store local columns only #23

Merged

Base automatically changed from master to main March 25, 2021 19:44
