Change dot notation in add column documentation to tuple #1433

jeppe-dos · 2024-12-16T11:02:26Z

A tuple must be used to make columns in structs as described in add_column:
"Because "." may be interpreted as a column path separator or may be used in field names, it is not allowed to add nested column by passing in a string. To add to nested structures or to add fields with names that contain "." use a tuple instead to indicate the path."
This PR corrects the documentation to use tuples instead of dot notation.
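
To make the quoted rule concrete, here is a minimal sketch in plain Python. `normalize_path` is a hypothetical helper written for illustration and is not pyiceberg's code; it assumes that a dotted string is rejected outright and that a tuple is taken as an explicit path.

```python
from typing import Tuple, Union

ColumnPath = Union[str, Tuple[str, ...]]


def normalize_path(path: ColumnPath) -> Tuple[str, ...]:
    """Hypothetical helper: turn a column reference into an explicit tuple path."""
    if isinstance(path, str):
        if "." in path:
            # Ambiguous: is this the field "a.b", or field "b" inside struct "a"?
            raise ValueError(f"Ambiguous column name {path!r}; pass a tuple instead")
        return (path,)
    return tuple(path)


print(normalize_path("retries"))
print(normalize_path(("details", "confirmed_by")))
# A tuple element may itself contain dots without any ambiguity.
print(normalize_path(("details", "name.with.dot")))
```

With a tuple, each element is one field name, so the nesting level never has to be guessed from the dots.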

From issue 1407

Fokko

Thanks @jeppe-dos for fixing this 🙌

kevinjqliu

Looks like there might be a bug with this change. I tried to follow the docs

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, DoubleType, LongType

warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)
catalog.create_namespace_if_not_exists("default")
try:
    catalog.drop_table("default.locations")
except Exception:
    # The table may not exist yet on a first run
    pass

table = catalog.create_table("default.locations", schema)

# with table.update_schema() as update:
#     # In a struct
#     update.add_column("details.confirmed_by", StringType(), "Name of the exchange")

with table.update_schema() as update:
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

This fails with the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/table/update/schema.py", line 192, in add_column
    parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/schema.py", line 215, in find_field
    raise ValueError(f"Could not find field with name {name_or_id}, case_sensitive={case_sensitive}")
ValueError: Could not find field with name details, case_sensitive=True

kevinjqliu · 2024-12-17T18:03:27Z

Here's where the error happens:

iceberg-python/pyiceberg/table/update/schema.py

Lines 184 to 192 in b0ea716

    
        name = path[-1]
        parent = path[:-1]
        full_name = ".".join(path)
        parent_full_path = ".".join(parent)
        parent_id: int = TABLE_ROOT_ID
        if len(parent) > 0:
            parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)

And some debugging statements:

(Pdb) path
('details', 'confirmed_by')
(Pdb) name
'confirmed_by'
(Pdb) parent
('details',)
(Pdb) parent_full_path
'details'
(Pdb) parent_id
-1
(Pdb) len(parent) > 0
True

parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)

is where it errors.

Seems like we're missing the case where no "parent" is present
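
The lookup traced above can be reproduced without a catalog. The sketch below is a simplified stand-in, not the pyiceberg implementation: `resolve_parent_id` and the `field_ids` mapping are hypothetical names, and the committed schema is modelled as a plain dict from dotted names to field IDs.

```python
from typing import Dict, Tuple

TABLE_ROOT_ID = -1  # same sentinel value as shown in the pdb session


def resolve_parent_id(path: Tuple[str, ...], field_ids: Dict[str, int]) -> int:
    """Return the parent's field ID, or TABLE_ROOT_ID for a top-level column.

    `field_ids` is a hypothetical stand-in for the committed schema.
    """
    parent = path[:-1]
    if len(parent) == 0:
        return TABLE_ROOT_ID
    parent_full_path = ".".join(parent)
    if parent_full_path not in field_ids:
        # Mirrors the failure mode in the traceback: the parent is unknown.
        raise ValueError(f"Could not find field with name {parent_full_path}")
    return field_ids[parent_full_path]


committed = {"city": 1, "lat": 2, "long": 3}

# A top-level column needs no parent lookup.
print(resolve_parent_id(("retries",), committed))

# A nested column fails while `details` is missing from the schema.
try:
    resolve_parent_id(("details", "confirmed_by"), committed)
except ValueError as exc:
    print(exc)
```

Once `details` is present in the mapping, the same call returns its ID instead of raising.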

jeppe-dos · 2024-12-20T13:55:55Z

Yes, the struct has to exist before you can insert anything into it. The code could be adjusted to create the parent automatically. For now, this is described in the documentation changes. Should I state it more explicitly?

kevinjqliu · 2024-12-20T16:04:44Z

Yes, the struct has to exist before you can insert anything into it.

Ah, I see, that makes sense. In that case, can we edit the example so that it works out of the box?

Also, I think it would be valuable to move the comment to the top-level docs of "Add Column". We can include both the details about dot notation and the requirement that the parent struct already exists.

kevinjqliu · 2024-12-20T16:05:26Z

I found another use of dot notation in Move column. Do we need to change this too?
https://py.iceberg.apache.org/api/#move-column

jeppe-dos · 2025-01-06T09:53:42Z

I assume so. I'll test and update accordingly.

jeppe-dos · 2025-01-06T10:59:24Z

The struct is now created first in the add column section. I have also changed from dot to tuple in move and rename column.

kevinjqliu

There might still be some issues. I wasn't able to run the example:

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, IntegerType, NestedField, StringType, StructType

warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    uri=f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", warehouse=f"file://{warehouse_path}",
)

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)
catalog.create_namespace_if_not_exists("default")
try:
    catalog.drop_table("default.locations")
except Exception:
    # The table may not exist yet on a first run
    pass

table = catalog.create_table("default.locations", schema)

with table.update_schema() as update:
    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
    # In a struct
    update.add_column("details", StructType())
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

print(table.refresh().schema())

with table.update_schema() as update:
    update.rename_column("retries", "num_retries")
    # This will rename `confirmed_by` to `exchange`
    update.rename_column("properties.confirmed_by", "exchange")
print(table.refresh().schema())

with table.update_schema() as update:
    update.move_first("symbol")
    # This will move `bid` after `ask`
    update.move_after("bid", "ask")
    # This will move `confirmed_by` before `exchange` in the `details` struct
    update.move_before(("details", "confirmed_by"), ("details", "exchange"))
print(table.refresh().schema())

with table.update_schema(allow_incompatible_changes=True) as update:
    update.delete_column("some_field")
    # In a struct
    update.delete_column(("details", "confirmed_by"))
print(table.refresh().schema())

jeppe-dos · 2025-01-06T15:14:21Z

What if you create the struct first, and then add the nested field like so:

with table.update_schema() as update:
    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
    # In a struct
    update.add_column("details", StructType())

with table.update_schema() as update:
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

kevinjqliu · 2025-01-06T15:17:37Z

That works, but I think the first example should work too. We can track this in a separate issue.

>>> with table.update_schema() as update:
...     update.add_column("retries", IntegerType(), "Number of retries to place the bid")
...     # In a struct
...     update.add_column("details", StructType())
...
<pyiceberg.table.update.schema.UpdateSchema object at 0x11fc59880>
<pyiceberg.table.update.schema.UpdateSchema object at 0x11fc59880>
>>> with table.update_schema() as update:
...     update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")
...
<pyiceberg.table.update.schema.UpdateSchema object at 0x1189d9370>
>>> print(table.refresh().schema())
table {
  1: city: optional string
  2: lat: optional double
  3: long: optional double
  4: retries: optional int (Number of retries to place the bid)
  5: details: optional struct<6: confirmed_by: optional string (Name of the exchange)>
}
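
One plausible reading of why the single-block version fails, assuming the parent lookup only consults the already committed schema (as in the `find_field` call quoted earlier), can be sketched with a toy model. `ToySchemaUpdate` is hypothetical and deliberately minimal; it is not how pyiceberg is implemented.

```python
from typing import List, Set, Tuple


class ToySchemaUpdate:
    """Toy model (not pyiceberg code): parents are resolved against committed fields only."""

    def __init__(self, committed: Set[str]) -> None:
        self.committed = set(committed)
        self.pending: List[str] = []

    def add_column(self, path: Tuple[str, ...]) -> "ToySchemaUpdate":
        parent = ".".join(path[:-1])
        # Assumption: pending additions are invisible to the parent lookup.
        if parent and parent not in self.committed:
            raise ValueError(f"Could not find field with name {parent}")
        self.pending.append(".".join(path))
        return self

    def commit(self) -> Set[str]:
        self.committed.update(self.pending)
        self.pending.clear()
        return self.committed


fields = {"city", "lat", "long"}

# One update: the nested add fails because `details` is still pending.
update = ToySchemaUpdate(fields).add_column(("details",))
try:
    update.add_column(("details", "confirmed_by"))
    same_update_ok = True
except ValueError:
    same_update_ok = False

# Two updates: commit `details` first, then add the nested field.
fields = ToySchemaUpdate(fields).add_column(("details",)).commit()
fields = ToySchemaUpdate(fields).add_column(("details", "confirmed_by")).commit()
print(same_update_ok, sorted(fields))
```

Under that assumption, making the single-block form work would mean letting the lookup also see fields added earlier in the same update.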

jeppe-dos · 2025-01-06T15:26:57Z

Agreed. Should I open an issue on this?

Thank you for reviewing the changes.

kevinjqliu · 2025-01-06T15:30:01Z

@jeppe-dos that would be great!

kevinjqliu · 2025-01-06T15:36:25Z

I'm having trouble running the new statements in the docs. Could you give it a try?

jeppe-dos · 2025-01-06T17:22:04Z

The code doesn't work because "confirmed_by" has already been renamed to "exchange" by that point, so "confirmed_by" cannot be moved before "exchange": it no longer exists. I have changed the new name to "processed_by" to make this a bit clearer.

In your opinion, should it be possible to copy the whole documentation and run it in sequence? That wasn't the case before, but I can change it if you would like.

kevinjqliu

Ah, it looks like the documentation is sectioned by functionality.
https://py.iceberg.apache.org/api/#schema-evolution

It would be nice to have the sections work in sequence. If I'm exploring the documentation, I can then just copy and paste each statement.

LGTM!

kevinjqliu · 2025-01-08T04:36:12Z

mkdocs/docs/api.md

-    # This will rename `confirmed_by` to `exchange`
-    update.rename_column("properties.confirmed_by", "exchange")
+    # This will rename `confirmed_by` to `processed_by` in the `details` struct
+    update.rename_column(("details", "confirmed_by"), ("detail", "processed_by"))


This example doesn't work:

>>> with table.update_schema() as update:
...     update.rename_column(("details", "confirmed_by"), ("detail", "processed_by"))
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/table/update/schema.py", line 278, in rename_column
    self._updates[field_from.field_id] = NestedField(
                                         ^^^^^^^^^^^^
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/types.py", line 333, in __init__
    super().__init__(**data)
  File "/Users/kevinliu/Library/Caches/pypoetry/virtualenvs/pyiceberg-Is5Rt7Ah-py3.12/lib/python3.12/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for NestedField
name
  Input should be a valid string [type=string_type, input_value=('detail', 'processed_by'), input_type=tuple]
    For further information visit https://errors.pydantic.dev/2.10/v/string_type

even though the table has details.confirmed_by:

>>> print(table.schema())
table {
  1: city: optional string
  2: lat: optional double
  3: long: optional double
  4: details: optional struct<5: confirmed_by: optional string (Name of the exchange)>
}

jeppe-dos · 2025-01-08

Of course. The new name shouldn't be given as a tuple. I have fixed it now. Each part now works individually, except for add_column, as discussed.
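
The shape of that fix can be illustrated with a toy rename on a nested dict. This is only a sketch under simplifying assumptions: `rename_field` is a hypothetical function, not part of pyiceberg, and the schema is modelled as plain dictionaries. The point is that the tuple locates the field, while the new name is a single leaf name.

```python
from typing import Any, Dict, Tuple


def rename_field(schema: Dict[str, Any], path: Tuple[str, ...], new_name: str) -> None:
    """Rename the field at `path` in a nested-dict schema, in place (toy model)."""
    if not isinstance(new_name, str):
        # A rename keeps the field in its struct, so only the leaf name is needed.
        raise TypeError("new_name must be a plain string, not a path")
    parent = schema
    for part in path[:-1]:
        parent = parent[part]
    parent[new_name] = parent.pop(path[-1])


schema = {"city": "string", "details": {"confirmed_by": "string"}}
rename_field(schema, ("details", "confirmed_by"), "processed_by")
print(schema)

# Passing a tuple as the new name is rejected, much like the validation error above.
try:
    rename_field(schema, ("details", "processed_by"), ("detail", "exchange"))
except TypeError as exc:
    print(exc)
```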

kevinjqliu · 2025-01-09T18:14:27Z

Thanks for your help improving the docs @jeppe-dos!

Change dot notation in add column documentation to tuple

c377bff

jeppe-dos mentioned this pull request Dec 16, 2024

[Request] Area of Improvements for Documentation #1407

Open

3 tasks

Fokko approved these changes Dec 17, 2024

kevinjqliu requested changes Dec 17, 2024

Update move and rename column struct in api.md

95f44bc

kevinjqliu reviewed Jan 6, 2025

kevinjqliu reviewed Jan 6, 2025

kevinjqliu reviewed Jan 6, 2025

Correct rename_column, move_before and delete_column in api.md

841d4dc

Change exchange to processed by on rename_column in api.md

ac6ba09

kevinjqliu approved these changes Jan 6, 2025

Update mkdocs/docs/api.md

51f9c15

Co-authored-by: Kevin Liu <[email protected]>

jeppe-dos closed this Jan 7, 2025

jeppe-dos reopened this Jan 7, 2025

kevinjqliu reviewed Jan 7, 2025

kevinjqliu reviewed Jan 8, 2025

kevinjqliu mentioned this pull request Jan 8, 2025

UpdateSchema does not respect transaction abort #1497

Open

jeppe-dos and others added 2 commits January 8, 2025 10:18

Fix rename column in api.md

2fef448

Update mkdocs/docs/api.md

fe1af10

kevinjqliu reviewed Jan 9, 2025

Update mkdocs/docs/api.md

d536cd8

kevinjqliu merged commit a95f9ee into apache:main Jan 9, 2025
3 checks passed
