Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change dot notation in add column documentation to tuple #1433

Merged
merged 8 commits into from
Jan 9, 2025

Conversation

jeppe-dos
Copy link
Contributor

A tuple must be used to make columns in structs as described in add_column:
"Because "." may be interpreted as a column path separator or may be used in field names, it is not allowed to add nested column by passing in a string. To add to nested structures or to add fields with names that contain "." use a tuple instead to indicate the path."
This PR corrects the documentation to use tuples instead of dot notation.

From issue 1407

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jeppe-dos for fixing this 🙌

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there might be a bug with this change. I tried to follow the docs

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, DoubleType, LongType

warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)
catalog.create_namespace_if_not_exists("default")
try:
    catalog.drop_table("default.locations")
except:
    pass

table = catalog.create_table("default.locations", schema)

# with table.update_schema() as update:
#     # In a struct
#     update.add_column("details.confirmed_by", StringType(), "Name of the exchange")

with table.update_schema() as update:
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

errors


Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/table/update/schema.py", line 192, in add_column
    parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/schema.py", line 215, in find_field
    raise ValueError(f"Could not find field with name {name_or_id}, case_sensitive={case_sensitive}")
ValueError: Could not find field with name details, case_sensitive=True

@kevinjqliu
Copy link
Contributor

Heres where the errors happens

name = path[-1]
parent = path[:-1]
full_name = ".".join(path)
parent_full_path = ".".join(parent)
parent_id: int = TABLE_ROOT_ID
if len(parent) > 0:
parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)

And some debugging statements:

(Pdb) path
('details', 'confirmed_by')
(Pdb) name
'confirmed_by'
(Pdb) parent
('details',)
(Pdb) parent_full_path
'details'
(Pdb) parent_id
-1
(Pdb) len(parent) > 0
True
parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)

is where it errors.

Seems like we're missing the case where no "parent" is present

@jeppe-dos
Copy link
Contributor Author

Looks like there might be a bug with this change. I tried to follow the docs

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, DoubleType, LongType

warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)
catalog.create_namespace_if_not_exists("default")
try:
    catalog.drop_table("default.locations")
except:
    pass

table = catalog.create_table("default.locations", schema)

# with table.update_schema() as update:
#     # In a struct
#     update.add_column("details.confirmed_by", StringType(), "Name of the exchange")

with table.update_schema() as update:
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

errors


Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/table/update/schema.py", line 192, in add_column
    parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/schema.py", line 215, in find_field
    raise ValueError(f"Could not find field with name {name_or_id}, case_sensitive={case_sensitive}")
ValueError: Could not find field with name details, case_sensitive=True

Yes, the struct has to exist before you can insert anything into it. This can be adjusted in the code to automatically create the parent. For now, it is detailed in the documentation changes. Should I write more explicitly?

@kevinjqliu
Copy link
Contributor

kevinjqliu commented Dec 20, 2024

Yes, the struct has to exist before you can insert anything into it.

ah i see, that makes sense. in that case, can we edit the example so that it works out of the box?

Also i think its valuable to move the comment to the top level docs of "Add Column". We can include both the details about dot notation and struct parent

@kevinjqliu
Copy link
Contributor

i found another dot notion in Move column, do we need to change this too?
https://py.iceberg.apache.org/api/#move-column

@jeppe-dos
Copy link
Contributor Author

I assume so. I'll test and update accordingly.

@jeppe-dos
Copy link
Contributor Author

The struct is now created first in the add column section. I have also changed from dot to tuple in move and rename column.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might still be some issues, i wasn't able to run the example

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, IntegerType, NestedField, StringType, StructType

warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    uri=f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", warehouse=f"file://{warehouse_path}",
)

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)
catalog.create_namespace_if_not_exists("default")
try:
    catalog.drop_table("default.locations")
except:
    pass

table = catalog.create_table("default.locations", schema)

with table.update_schema() as update:
    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
    # In a struct
    update.add_column("details", StructType())
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

print(table.refresh().schema())

with table.update_schema() as update:
    update.rename_column("retries", "num_retries")
    # This will rename `confirmed_by` to `exchange`
    update.rename_column("properties.confirmed_by", "exchange")
print(table.refresh().schema())

with table.update_schema() as update:
    update.move_first("symbol")
    # This will move `bid` after `ask`
    update.move_after("bid", "ask")
    # This will move `confirmed_by` before `exchange` in the `details` struct
    update.move_before(("details", "confirmed_by"), ("details", "exchange"))
print(table.refresh().schema())

with table.update_schema(allow_incompatible_changes=True) as update:
    update.delete_column("some_field")
    # In a struct
    update.delete_column(("details", "confirmed_by"))
print(table.refresh().schema())

@jeppe-dos
Copy link
Contributor Author

What if you create the struct first, and then add the nested field like so:

with table.update_schema() as update:
    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
    # In a struct
    update.add_column("details", StructType())

with table.update_schema() as update:
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

@kevinjqliu
Copy link
Contributor

that works, but i think the first example should work too. We can track this in a separate issue.

>>> with table.update_schema() as update:
...     update.add_column("retries", IntegerType(), "Number of retries to place the bid")
...     # In a struct
...     update.add_column("details", StructType())
...
<pyiceberg.table.update.schema.UpdateSchema object at 0x11fc59880>
<pyiceberg.table.update.schema.UpdateSchema object at 0x11fc59880>
>>> with table.update_schema() as update:
...     update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")
...
<pyiceberg.table.update.schema.UpdateSchema object at 0x1189d9370>
>>> print(table.refresh().schema())
table {
  1: city: optional string
  2: lat: optional double
  3: long: optional double
  4: retries: optional int (Number of retries to place the bid)
  5: details: optional struct<6: confirmed_by: optional string (Name of the exchange)>
}

mkdocs/docs/api.md Outdated Show resolved Hide resolved
mkdocs/docs/api.md Outdated Show resolved Hide resolved
mkdocs/docs/api.md Outdated Show resolved Hide resolved
@jeppe-dos
Copy link
Contributor Author

jeppe-dos commented Jan 6, 2025

that works, but i think the first example should work too. We can track this in a separate issue.

>>> with table.update_schema() as update:
...     update.add_column("retries", IntegerType(), "Number of retries to place the bid")
...     # In a struct
...     update.add_column("details", StructType())
...
<pyiceberg.table.update.schema.UpdateSchema object at 0x11fc59880>
<pyiceberg.table.update.schema.UpdateSchema object at 0x11fc59880>
>>> with table.update_schema() as update:
...     update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")
...
<pyiceberg.table.update.schema.UpdateSchema object at 0x1189d9370>
>>> print(table.refresh().schema())
table {
  1: city: optional string
  2: lat: optional double
  3: long: optional double
  4: retries: optional int (Number of retries to place the bid)
  5: details: optional struct<6: confirmed_by: optional string (Name of the exchange)>
}

Agreed. Should I open an issue on this?

Thank you for reviewing the changes.

@kevinjqliu
Copy link
Contributor

@jeppe-dos that would be great!

@kevinjqliu
Copy link
Contributor

im having trouble running the new statements in the docs, could you give it a try ?

@jeppe-dos
Copy link
Contributor Author

im having trouble running the new statements in the docs, could you give it a try ?

The code doesn't work, as "confirmed_by" has been changed to "exchange". Exchange can therefore not move before confirmed_by as it no longer exist. I have changed the renamed field to processed_by to make it a bit more clear.

In your opinion, should you be able to copy the whole documentation and make it work en sequence? It wasn't the case before, but I can change it to be the case, if you would like.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your opinion, should you be able to copy the whole documentation and make it work en sequence? It wasn't the case before, but I can change it to be the case, if you would like.

ah looks like the documentation is sectioned by functionality.
https://py.iceberg.apache.org/api/#schema-evolution

It would be nice to have them work in sequence. If im exploring the documentation, i can just copy and paste each statements.

LGTM!

mkdocs/docs/api.md Outdated Show resolved Hide resolved
@jeppe-dos jeppe-dos closed this Jan 7, 2025
@jeppe-dos jeppe-dos reopened this Jan 7, 2025
# This will rename `confirmed_by` to `exchange`
update.rename_column("properties.confirmed_by", "exchange")
# This will rename `confirmed_by` to `processed_by` in the `details` struct
update.rename_column(("details", "confirmed_by"), ("detail", "processed_by"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example doesnt work

>>> with table.update_schema() as update:
...     update.rename_column(("details", "confirmed_by"), ("detail", "processed_by"))
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/table/update/schema.py", line 278, in rename_column
    self._updates[field_from.field_id] = NestedField(
                                         ^^^^^^^^^^^^
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/types.py", line 333, in __init__
    super().__init__(**data)
  File "/Users/kevinliu/Library/Caches/pypoetry/virtualenvs/pyiceberg-Is5Rt7Ah-py3.12/lib/python3.12/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for NestedField
name
  Input should be a valid string [type=string_type, input_value=('detail', 'processed_by'), input_type=tuple]
    For further information visit https://errors.pydantic.dev/2.10/v/string_type

even though table has details.confirmed_by

>>> print(table.schema())
table {
  1: city: optional string
  2: lat: optional double
  3: long: optional double
  4: details: optional struct<5: confirmed_by: optional string (Name of the exchange)>
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course. The renamed field shouldn't be in a tuple. I have fixed it now. Each part now individually works, except for add_column, as discussed.

mkdocs/docs/api.md Outdated Show resolved Hide resolved
@kevinjqliu kevinjqliu merged commit a95f9ee into apache:main Jan 9, 2025
3 checks passed
@kevinjqliu
Copy link
Contributor

Thanks for your help improving the docs @jeppe-dos!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants