Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack_formula() #518

Open
sumny opened this issue Oct 10, 2020 · 1 comment
Open

pack_formula() #518

sumny opened this issue Oct 10, 2020 · 1 comment
Assignees

Comments

@sumny
Copy link
Member

sumny commented Oct 10, 2020

provide a helper that strips an environment of a formula of all variables that are not referred to by the formula.
useful for i.e., PipeOpMutate and #410

@sumny sumny self-assigned this Oct 10, 2020
@mb706
Copy link
Collaborator

mb706 commented Oct 12, 2020

This is a good idea. One thing in particular I was worried about was things where the user constructs a formula where he refers to a certain variable, and then changes the variable value before executing the pipeop.
A few pointers:

  • this should be a helper for the user, I don't think we should be doing this internally
  • there is the all.vars() function that determines what variables are used inside an expression. The problem is, it does not recognize if the reference is as a function call or as an ordinary variable. This makes trouble in case the user has a function and a variable with both the same name (typical candidate here is c). I think we won't get around having to walk the AST to do this well.
  • a possible problem is that even if we recognize a value is being used inside a formula, it could be that the value is "masked" by the task. So the user does ~ x == y, and y occurs both in the environment of the formula, but also in the task, the task's y will have precedence, but the function won't know that at the time of packing
  • there is the general problem that variables in a formula could be "surprise masked" by the task, e.g. if the user doesn't expect there to be a y column in the task. Or if the task changes and suddenly the user has to rename his variables just because of name clashes. Maybe we want to offer a ..() function (like cd .. in a shell) that lets the user access things outside the formula. variables in ..() get precedence before columns in a task. Variables that are not in a ..() are dropped by pack_formula() and are only available if they are in a package (so function calls still work).
    I think a way to implement this would be to walk the AST, check for ..()-calls, check if they are at the place of a function call or a value access, and get the respective value. The formula itself should not be changed, instead the ..() function should be an actual function that gets a value from the environment above the task column environment. This way
    1. the pack_formula() doesn't actually modify the formula
    2. it is not necessary to use pack_formula(), e.g. when the user doesn't worry about formula environments
    3. the ..() thing is still useful even if pack_formula() is not used (i.e. to avoid surprise variable name clashes)
  1. What do you think of the design?
  2. Should I write this or do you feel up to the task?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants