-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(api): support rollup
, cube
and grouping_sets
APIs
#9945
base: main
Are you sure you want to change the base?
Conversation
This seems like a reasonable approach to me. I think it would be nice to allow passing in bare strings, values, or deferreds to ibis.grouping_sets(("a", "b"), "a", "b") this would better mimic the SQL syntax (god, i can't believe I'm suggesting that...) and also avoids the awkward spelling of ibis.grouping_sets(("a", "b"), ("a",), ("b",)) And on the naming front, I thing we should take a pointer from DuckDB and rename but overall seems solid! |
All good points, definitely changing |
For a full featured and correct implementation, grouping sets will depend on the earliest version of sqlglot contains tobymao/sqlglot#3985. |
Ok, now that I think about this some more, I think having separate What do people think about making these keyword arguments to t.group_by(
...,
rollup=(_.a, "b"),
grouping_sets=("a", "b", ("a", "b")),
cube=(_.c, _.d)
) The separate objects would make sense if there were some way to use this functionality outside of a group by, but AFAICT they are strictly only valid in that case. This would also eliminate having to do the partitioning, since the API requires the user spell it out. |
This would break users who having columns named |
I slightly dislike that. Not because it might break existing users (seems unlikely), but because it's harder to visually distinguish keyword-arguments serving as aliases in t.group_by(one=t.a, cube=(t.b,), three=t.c) # intentionally terrible formatting Another potential downside is it doesn't let users as easily use multiple of these constructs in the same query. Following this line from the docs:
With separate objects, I'd to be able to pass in multiple of the same kind of object, while kwargs would force the user to handle the composition of them themselves (confession - I'm new to this whole concept, so apologies if this is a dumb example): t.group_by(ibis.cube("a", "b"), ibis.cube("c", "d")) Individual objects also may be easier for programmatic usage - a system would only need a single list of things to group by, rather than splitting out the various grouping set kind of things. None of these are blockers of course, and all can be worked around. Just noting a few potential downsides. |
Hm, all good points. I think I'll stick with the objects for now. |
1719532
to
176adc8
Compare
orderings: VarTuple[ops.SortKey] = () | ||
havings: VarTuple[ops.Value[dt.Boolean]] = () | ||
|
||
grouping_sets: VarTuple[VarTuple[VarTuple[ir.Value]]] = () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me sad, and I'll add a comment.
return GroupedTable(self, groups) | ||
groups, grouping_sets, rollups, cubes = partition_groups(*by, **key_exprs) | ||
if not (groups or grouping_sets or rollups or cubes): | ||
raise com.IbisInputError("No grouping keys provided") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we were using an Annotated
type hint to enforce that groups
was non-empty, but we can't do that anymore because it can be empty if the user provides at least one grouping set, rollup, or cube and we don't have a way to spell that kind of relationship in the type system (nor do I think we need to build that right now).
@@ -1040,6 +1069,9 @@ def aggregate( | |||
metrics: Sequence[ir.Scalar] | None = (), | |||
by: Sequence[ir.Value] | None = (), | |||
having: Sequence[ir.BooleanValue] | None = (), | |||
grouping_sets=(), | |||
rollups=(), | |||
cubes=(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like exposing these arguments in the aggregate
. Here's why:
- I have to bind again, similar to what I am doing in the
group_by
method. - Putting the grouping set arguments alongside
by
means there's potential for confusion and mixing: e.g., someone can passingibis.cube
intoby
, and/or they can pass it intocube
. If we ultimately go this route then we'll have to handle that.
I suppose neither of these things is the actual worst, but I'm open to suggestions on how to avoid exposing these additional arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't get away from having to bind again (and actually adds a bit more work), but what if we just have all grouping expressions passed in to by
?
agg(_.some_col.sum(), by=_.a, _.b, ibis.cube(_.a, _.b))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll poke at it a bit. It might be easier if there were bound and unbound versions of the rollup/cube/grouping sets objects. Bound would be used internally only, while the user facing APIs are all unbound. The current ones are effectively Unbound.
@@ -1128,8 +1166,29 @@ def aggregate( | |||
else: | |||
metrics[metric.name] = metric | |||
|
|||
# construct the aggregate node | |||
agg = ops.Aggregate(node, groups, metrics).to_expr() | |||
keys = frozendict( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit is far too optimistic and very proof-of-concept. More unit tests are required to throw some chaos at this code.
gs = ibis.cube("a", "b") | ||
expr = t.group_by(gs).agg(n=_.count()) | ||
result = ibis.to_sql(expr, dialect="duckdb") | ||
assert len(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are woefully naive and optimistic, need to make them grumpier.
Keeping this as draft until I can at least add some less sophomoric tests. |
86d21fa
to
441ec7a
Compare
441ec7a
to
1c0553c
Compare
I wish I hadn't learned this while working through reviewing this, but I guess it's valid to pass a |
Which of course means you can nest grouping sets arbitrarily.
|
This is a special level of hell. |
91612a7
to
5f18749
Compare
394b5e8
to
bf14a1e
Compare
e7e11a6
to
db0ebd8
Compare
0363d07
to
38ad719
Compare
38ad719
to
418401f
Compare
Another "fun" thing with these constructs is that some backends implement rollup/cube/grouping sets on an empty table such that the "grand total" group (usually spelled For example: create table t (c int);
select * from t group by rollup (c); |
I think I'm going to cut this from the 10.0 milestone since it's not a breaking change and we need to get a release out. |
Description of changes
WIP PR implementing support for
rollup
,cube
andgrouping sets
.For background on what these are used for and how they work, see
https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS.
In discussing the implementation, I'm going to refer to groupings sets only,
but everything applies to rollups and cubes as well (rollups and cubes are
shorthand for a longer form
GROUPING SET
).These concepts represent a new kind of "thing" for Ibis. They are neither
column nor scalar expressions that can be evaluated, nor are they tables
really.
The approach I took here was to add a non-
Value
operation for each construct.These are desugared into
tuple
s oftuple
s ofValue
expressions (becausemultiple grouping sets specifications are allowed) wherever grouping sets are
supported.
One tricky thing is that I had to partition the key specifically requested in
group by, as distinct from those unique value expressions in grouping sets, to
support constructs like this:
which is not equivalent to
Right now, I'm looking for feedback on the approach before fleshing out the
test suite.
Issues closed