-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change dot notation in add column documentation to tuple #1433
Change dot notation in add column documentation to tuple #1433
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jeppe-dos for fixing this 🙌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there might be a bug with this change. I tried to follow the docs
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, DoubleType, LongType
warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
"default",
**{
"uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
"warehouse": f"file://{warehouse_path}",
},
)
schema = Schema(
NestedField(1, "city", StringType(), required=False),
NestedField(2, "lat", DoubleType(), required=False),
NestedField(3, "long", DoubleType(), required=False),
)
catalog.create_namespace_if_not_exists("default")
try:
catalog.drop_table("default.locations")
except:
pass
table = catalog.create_table("default.locations", schema)
# with table.update_schema() as update:
# # In a struct
# update.add_column("details.confirmed_by", StringType(), "Name of the exchange")
with table.update_schema() as update:
update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")
errors
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/kevinliu/repos/iceberg-python/pyiceberg/table/update/schema.py", line 192, in add_column
parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kevinliu/repos/iceberg-python/pyiceberg/schema.py", line 215, in find_field
raise ValueError(f"Could not find field with name {name_or_id}, case_sensitive={case_sensitive}")
ValueError: Could not find field with name details, case_sensitive=True
Heres where the errors happens iceberg-python/pyiceberg/table/update/schema.py Lines 184 to 192 in b0ea716
And some debugging statements:
is where it errors. Seems like we're missing the case where no "parent" is present |
Yes, the struct has to exist before you can insert anything into it. This can be adjusted in the code to automatically create the parent. For now, it is detailed in the documentation changes. Should I write more explicitly? |
ah i see, that makes sense. in that case, can we edit the example so that it works out of the box? Also i think its valuable to move the comment to the top level docs of "Add Column". We can include both the details about dot notation and struct parent |
i found another dot notion in |
I assume so. I'll test and update accordingly. |
The struct is now created first in the add column section. I have also changed from dot to tuple in move and rename column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might still be some issues, i wasn't able to run the example
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, IntegerType, NestedField, StringType, StructType
warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
"default",
uri=f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", warehouse=f"file://{warehouse_path}",
)
schema = Schema(
NestedField(1, "city", StringType(), required=False),
NestedField(2, "lat", DoubleType(), required=False),
NestedField(3, "long", DoubleType(), required=False),
)
catalog.create_namespace_if_not_exists("default")
try:
catalog.drop_table("default.locations")
except:
pass
table = catalog.create_table("default.locations", schema)
with table.update_schema() as update:
update.add_column("retries", IntegerType(), "Number of retries to place the bid")
# In a struct
update.add_column("details", StructType())
update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")
print(table.refresh().schema())
with table.update_schema() as update:
update.rename_column("retries", "num_retries")
# This will rename `confirmed_by` to `exchange`
update.rename_column("properties.confirmed_by", "exchange")
print(table.refresh().schema())
with table.update_schema() as update:
update.move_first("symbol")
# This will move `bid` after `ask`
update.move_after("bid", "ask")
# This will move `confirmed_by` before `exchange` in the `details` struct
update.move_before(("details", "confirmed_by"), ("details", "exchange"))
print(table.refresh().schema())
with table.update_schema(allow_incompatible_changes=True) as update:
update.delete_column("some_field")
# In a struct
update.delete_column(("details", "confirmed_by"))
print(table.refresh().schema())
What if you create the struct first, and then add the nested field like so:
|
that works, but i think the first example should work too. We can track this in a separate issue.
|
Agreed. Should I open an issue on this? Thank you for reviewing the changes. |
@jeppe-dos that would be great! |
im having trouble running the new statements in the docs, could you give it a try ? |
The code doesn't work, as "confirmed_by" has been changed to "exchange". Exchange can therefore not move before confirmed_by as it no longer exist. I have changed the renamed field to processed_by to make it a bit more clear. In your opinion, should you be able to copy the whole documentation and make it work en sequence? It wasn't the case before, but I can change it to be the case, if you would like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your opinion, should you be able to copy the whole documentation and make it work en sequence? It wasn't the case before, but I can change it to be the case, if you would like.
ah looks like the documentation is sectioned by functionality.
https://py.iceberg.apache.org/api/#schema-evolution
It would be nice to have them work in sequence. If im exploring the documentation, i can just copy and paste each statements.
LGTM!
Co-authored-by: Kevin Liu <[email protected]>
mkdocs/docs/api.md
Outdated
# This will rename `confirmed_by` to `exchange` | ||
update.rename_column("properties.confirmed_by", "exchange") | ||
# This will rename `confirmed_by` to `processed_by` in the `details` struct | ||
update.rename_column(("details", "confirmed_by"), ("detail", "processed_by")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example doesnt work
>>> with table.update_schema() as update:
... update.rename_column(("details", "confirmed_by"), ("detail", "processed_by"))
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/kevinliu/repos/iceberg-python/pyiceberg/table/update/schema.py", line 278, in rename_column
self._updates[field_from.field_id] = NestedField(
^^^^^^^^^^^^
File "/Users/kevinliu/repos/iceberg-python/pyiceberg/types.py", line 333, in __init__
super().__init__(**data)
File "/Users/kevinliu/Library/Caches/pypoetry/virtualenvs/pyiceberg-Is5Rt7Ah-py3.12/lib/python3.12/site-packages/pydantic/main.py", line 214, in __init__
validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for NestedField
name
Input should be a valid string [type=string_type, input_value=('detail', 'processed_by'), input_type=tuple]
For further information visit https://errors.pydantic.dev/2.10/v/string_type
even though table has details.confirmed_by
>>> print(table.schema())
table {
1: city: optional string
2: lat: optional double
3: long: optional double
4: details: optional struct<5: confirmed_by: optional string (Name of the exchange)>
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course. The renamed field shouldn't be in a tuple. I have fixed it now. Each part now individually works, except for add_column, as discussed.
Thanks for your help improving the docs @jeppe-dos! |
A tuple must be used to make columns in structs as described in add_column:
"Because "." may be interpreted as a column path separator or may be used in field names, it is not allowed to add nested column by passing in a string. To add to nested structures or to add fields with names that contain "." use a tuple instead to indicate the path."
This PR corrects the documentation to use tuples instead of dot notation.
From issue 1407