Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial commit of MongoDB backend #2

Merged
merged 13 commits into from
May 7, 2024
Merged

Conversation

timgraham
Copy link
Collaborator

No description provided.

@timgraham timgraham marked this pull request as ready for review April 23, 2024 16:28
@timgraham timgraham force-pushed the init branch 2 times, most recently from f2cd190 to d154f74 Compare April 25, 2024 15:03
@blink1073 blink1073 mentioned this pull request Apr 26, 2024
Copy link
Collaborator

@Jibola Jibola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving not to hold up merge.
Will retroactively ask more questions as things progress!

Great work thus far!

Comment on lines +72 to +79
if attr in ["connection", "database"]:
assert not self.connected
self._connect()
return getattr(self, attr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we want connection or database raise an attribute error when the class has established a connection?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may have read it wrong. AttributeError is raised for any attribute besides connection and database. The purpose of this code is the connect to the database whenever connection or database are first accessed. This implementation is copied from Django MongoDB Engine and is different from other SQL backends. It may need to be reworked in the future or there might be good reasons for keeping it this way.

if not any(k in ["save", "delete", "update"] for k in self.operation_flags):
# Flags apply to all operations.
flags = self.operation_flags
self.operation_flags = {"save": flags, "delete": flags, "update": flags}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc, this will completely override ignore any flags for operations outside of save, delete, update. If these are the only three options available, that's fine, but if there are other operations that get flags and need to be set, is it okay to do that here? For instance is {"OPERATIONS": {'read': flags}} was set, this would be ignored.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was copied uncritically from Django MongoDB Engine. See documentation: https://django-mongodb-engine.readthedocs.io/en/latest/reference/settings.html#acknowledged-operations. If it looks deficient, we can omit it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR:

  • We'll leave it for now, but can you leave a comment under the function to re-evaluate? Let it be a bonafide JIRA ticket for ease of tracking.

Got it. I don't know the best strategy here yet, so we'll pencil it for now. I think propagating the same flags to each write operation is fine, it's really the fact that we can propagate other flags and would want to keep them. The docs you linked (and thank you for linking them) show me that this was really meant for just save/delete/update operations at first.

Comment on lines 91 to 92
"document_class": dict,
"tz_aware": False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"document_class": dict,
"tz_aware": False,

These are the default connection options, so I'd say unless the options provide changes we can remove this part.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove.

if user and password and not self.database.authenticate(user, password):
raise ImproperlyConfigured("Invalid username or password.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Since mongodb usually takes a URI and aforementioned URI generally has the user & password already provided, I'm wondering if having these pieces of information are even worthwhile because any secure cloud connection needs the username and password already in the URI.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. This code is copied from Django MongoDB Engine. We can discuss it more later. Django doesn't normally accept a URL for its configuration, but there's a third-party package to allow it: https://pypi.org/project/dj-database-url/

Comment on lines +125 to +121
def cursor(self):
return Cursor()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping this comment here just as a note for later.
Need to understand the differences between a Django Cursor and a MongoDB Cursor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a quick hack to prevent the need to override some Django internals where a "nodb" cursor is created but (for MongoDB purposes) not used. It may or may not stay around long-term.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to find more information on cursors in django, any good documentation reference?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to the database driver (psycopg, mysqlclient, etc.) to implement, but the interface is described in PEP 249.

@timgraham timgraham force-pushed the init branch 5 times, most recently from 44e490b to 2be3bc0 Compare April 28, 2024 01:34

def __getattr__(self, attr):
if attr in ["connection", "database"]:
assert not self.connected
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not getting the point. I think it should be 'if' instead of 'assert'. Why should it fail if I call 'self.connection' twice?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__getattr__() is only called the first time when the attribute doesn't exist.

from .query import MongoQuery, safe_call


class SQLCompiler(compiler.SQLCompiler):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think It would be good to rename this class, I know that some of the methods are public interface so we can't override, execute_sql is one of the thing that We cannot rename, can we?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jib made the same comment shortly after you. Here is my reply in case you missed that thread: These names are names are set by convention (hardcoded) in Django's code, e.g. https://github.com/django/django/blob/85c154da2f07485a1cdc4d886eee4c1a1ef56137/django/db/models/sql/query.py#L226

django_mongodb/compiler.py Show resolved Hide resolved
Comment on lines 92 to 102
Queries requiring JOINs (many-to-many relations, abstract model bases,
or model spanning filtering), using DISTINCT (QuerySet.distinct()) or
using the SQL-specific QuerySet.extra() aren't supported.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure there's explicit documentation on the queries we will NOT be supporting. We do have an equivalent for JOIN behavior, but I think that should be deferred to later development.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few things to the README's" Known issues and limitations" section. In time it'll become clearer what we can and cannot support.

from django.utils.translation import gettext_lazy as _


class MongoAutoField(AutoField):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is this automatically generating an ObjectId? Or is it remaining empty UNTIL the database write occurs and an ObjectId is returned?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter

django_mongodb/features.py Show resolved Hide resolved
return getattr(collection, method)(criteria, update_spec, **options).matched_count


class SQLAggregateCompiler(SQLCompiler):
Copy link
Collaborator

@WaVEV WaVEV Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not support annotations ($group or group by in SQL), does it?
If It so, are we going to support it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll try to support at least some simple aggregations, but I think it'll be complicated. I haven't looked into the details yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, aggregate operations are pretty interesting and need a full design evaluation. I'd say we defer this and focus on the aforementioned compilers for now. Leaving this class makes sense.

Comment on lines +9 to +11
@classmethod
def settings_to_cmd_args_env(cls, settings_dict, parameters):
raise NotImplementedError
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's raise an issue for this ticket as well. Not sure to what extent this will get used, but I do see this as the proxy for making any command line calls.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is #3. During our last call, I suggested this as a good starter task for @WaVEV.

Copy link
Collaborator

@Jibola Jibola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last section to look at is query.py but I've left more comments. Let me know what you think. Great work from what I've seen!

def execute_sql(
self, result_type=MULTI, chunked_fetch=False, chunk_size=GET_ITERATOR_CHUNK_SIZE
):
self.pre_sql_setup()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read into this step. This preps the query to have:
SELECT
WHERE
HAVING
QUALIFY
ORDERBY
GROUPBY

and order it if available.

I think this function could be more useful down the line as of right now, I don't think it contributes to any functionality.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Django's SQL compilers have a function with the same name that each compiler calls. This backend overrides it and each custom compiler similarly calls it.

Comment on lines 78 to 91
converters = self.connection.ops.get_db_converters(col) + field.get_db_converters(
self.connection
)
for converter in converters:
value = converter(value, col, self.connection)
result.append(value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know if these get_db_converter calls are mutually exclusive? Or is it that it should always convert field.get_db_converters as an override to the connection.ops.get_db_converters.

Noting to see if there would be a performance win by making this an if statement that calls convert if one OR the other is present rather than a for loop.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was on my mind to optimize this a bit more. Take a look at commit "fetch database converters once per query rather than for each result." To answer your question, there could be both types of converters. The logic comes from django.db.models.sql.SQLCompiler.get_converters().

Comment on lines 220 to 231
if len(doc) == 1:
# insert with empty model
doc.clear()
else:
raise DatabaseError("Can't save entity with _id set to None")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this, the case where someone does:
{_id: None} will upload to a collection as {} --> {_id: ObjectId}

Should we not just raise the DatabaseError regardless?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll also note that _id doesn't have to be the exclusive primary key in every collection, but that's a fix to put in later. This should probably have some class specific assignment such as self.pk or self.index

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment one: This code seems to have been obsoleted since its introduction. The test now passes without it.

Comment two: Django doesn't support composite primary keys.

Comment on lines 229 to 240
if returning_fields:
return [collection.insert_one(doc, **options).inserted_id]

collection.insert_one(doc, **options)
return []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if returning_fields:
return [collection.insert_one(doc, **options).inserted_id]
collection.insert_one(doc, **options)
return []
inserted_id = collection.insert_one(doc, **options).inserted_id
return [inserted_id] if returning_fields else []

return getattr(collection, method)(criteria, update_spec, **options).matched_count


class SQLAggregateCompiler(SQLCompiler):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, aggregate operations are pretty interesting and need a full design evaluation. I'd say we defer this and focus on the aforementioned compilers for now. Leaving this class makes sense.

Comment on lines +272 to +258
if field.unique:
multi = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside of this, is there any way to disambiguate updating a unique field? I would have done so by specifying the id, but I can't tell where I'd put the find vs. the update part in the build_query step. I'll come back to this.

Copy link
Collaborator Author

@timgraham timgraham May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like something like Model.objects.filter(id=1).update(name="foo") will use update_many. We could certainly try to take the filtering fields into account but is the complication worth it?

Actually, I'm not certain this current logic that uses field.unique is appropriate.
Under the current logic, if name is unique, then Model.objects.update(name="foo") will update just one (random) document. In SQL-land that QuerySet tries to update all rows but fails and raises IntegrityError due to violation of the unique constraint. Created #14 for this bug.

return value
if lookup in ("in", "range", "year"):
return [
self._prep_lookup_value(subvalue, field, field_kind, db_type, lookup)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually expect this to call a mixture of self.prep_lookup_value rather than self._prep_lookup_value just to capture all the possible field-level conversions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DatabaseOperations.prep_lookup_value() is called with value as a single value or a list of values. _prep_lookup_value() processes each value.

prep_lookup_value() isn't a standard Django method. It's what's left of Mongo Engine's DatabaseOperations._value_from_db(). I started out copying the entire thing but later found much of it was unnecessary. It may be expanded or obsoleted as we continue to triage the Django test suite.

Comment on lines 14 to 24
def safe_call(func):
@wraps(func)
def wrapper(*args, **kwargs):
try:
return func(*args, **kwargs)
except DuplicateKeyError as e:
raise IntegrityError from e
except PyMongoError as e:
raise DatabaseError from e

return wrapper
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why the naming of safe_call is the purpose just to communicate that it's raising an error that won't crash the backend server?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is copied from the original Mongo Engine but it's not a good one. I will rename to wrap_database_errors which is the same name used for a similar purpose in SQL backends.

"lte": lambda val: {"$lte": val},
"in": lambda val: {"$in": val},
"range": lambda val: {"$gte": val[0], "$lte": val[1]},
"isnull": lambda val: None if val else {"$ne": None},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be {"$eq": None}?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I tested, the queries {'field': None} and {'field': {'$eq': None}} give the same results. Is that your concern?

Copy link
Collaborator

@Jibola Jibola May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I misread the output. Yes, that's fine.

So showing {"$ne": None} probably occurs when the query evaluation is something like...
Models.object.filter(field__isnull=False)

Something to note:
I know in MongoDB the query for { bar: None }
Will surface

{
 _id: 1234,
 team: "bar"
},
{
 _id: 1235,
 team: "baz",
 bar: None
}

in its results. In the relational world, this isn't an issue because all fields exist; this is doing the same thing, functionally, that its relational counterpart will do, but the field doesn't exist which has historically been a problem for some ODMs that expect the field. No changes needed, just wanted to note the logical behavior.

Copy link
Collaborator

@Jibola Jibola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last of the comments!

The last function that looks like it needs clarification is MongoQuery._decode_child as it works with a lot of assumptions about the data structure coming through

Comment on lines 184 to 187
if lookup_type in ("month", "day"):
raise DatabaseError("MongoDB does not support month/day queries.")
if self._negated and lookup_type == "range":
raise DatabaseError("Negated range lookups are not supported.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have these be part of the lambda functions? The lambda result is raising the DatabaseError, etc?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

month/day is dead code. In modern versions of Django, an error is raised elsewhere for these lookups (and will be fixed in #9).

And I'm not sure why the restriction about range queries was added. Perhaps MongoDB has added support since 2011.
IntegerModel.objects.exclude(integer__range=(2, 4)) translates to {'integer': {'$not': {'$gte': 2, '$lte': 4}}} which seems to work just fine.

Comment on lines +192 to +188
if self._negated and lookup_type in NEGATED_OPERATORS_MAP:
op_func = NEGATED_OPERATORS_MAP[lookup_type]
already_negated = True
else:
op_func = OPERATORS_MAP[lookup_type]
if self._negated:
already_negated = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking here for understanding:
If we're negating an operation -- using NOT-- this will check that the lookup_type is supported and then say we've run a negation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

cursor.limit(int(self.query.high_mark - self.query.low_mark))
return cursor

def add_filters(self, filters, query=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should cut a task to break up and simplify this function. It's far too lengthy and confusing, especially as we get into the negation logic. It isn't encapsulated well, so it gets hard to remember what was happening previously.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created #13

Comment on lines 210 to 258
if not isinstance(existing, dict):
if not self._negated:
# {'a': o1} + {'a': o2} --> {'a': {'$all': [o1, o2]}}
assert not isinstance(lookup, dict)
subquery[column] = {"$all": [existing, lookup]}
else:
# {'a': o1} + {'a': {'$not': o2}} --> {'a': {'$all': [o1], '$nin': [o2]}}
if already_negated:
assert lookup.keys() == ["$ne"]
lookup = lookup["$ne"]
assert not isinstance(lookup, dict)
subquery[column] = {"$all": [existing], "$nin": [lookup]}
else:
not_ = existing.pop("$not", None)
if not_:
assert not existing
if isinstance(lookup, dict):
assert lookup.keys() == ["$ne"]
lookup = lookup.values()[0]
assert not isinstance(lookup, dict), (not_, lookup)
if self._negated:
# {'not': {'a': o1}} + {'a': {'not': o2}} --> {'a': {'nin': [o1, o2]}}
subquery[column] = {"$nin": [not_, lookup]}
else:
# {'not': {'a': o1}} + {'a': o2} -->
# {'a': {'nin': [o1], 'all': [o2]}}
subquery[column] = {"$nin": [not_], "$all": [lookup]}
else:
if isinstance(lookup, dict):
if "$ne" in lookup:
if "$nin" in existing:
# {'$nin': [o1, o2]} + {'$ne': o3} --> {'$nin': [o1, o2, o3]}
assert "$ne" not in existing
existing["$nin"].append(lookup["$ne"])
elif "$ne" in existing:
# {'$ne': o1} + {'$ne': o2} --> {'$nin': [o1, o2]}
existing["$nin"] = [existing.pop("$ne"), lookup["$ne"]]
else:
existing.update(lookup)
else:
if "$in" in lookup and "$in" in existing:
# {'$in': o1} + {'$in': o2} --> {'$in': o1 union o2}
existing["$in"] = list(set(lookup["$in"] + existing["$in"]))
else:
# {'$gt': o1} + {'$lt': o2} --> {'$gt': o1, '$lt': o2}
assert all(key not in existing for key in lookup), [
lookup,
existing,
]
existing.update(lookup)
else:
key = "$nin" if self._negated else "$all"
existing.setdefault(key, []).append(lookup)

if filters.connector == OR and subquery:
or_conditions.append(subquery)

if filters.negated:
self._negated = not self._negated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think the logic here works, but I think we should heavily consider refactoring this for readability.

  1. Placing it in its own function with a clear statement of function
  2. Adding docstring to explain what it's doing (combining queries and handling negation logic)

Comment on lines 334 to 316
# Remove percent signs added by Field.get_db_prep_lookup() for LIKE
# queries.
if lookup_type in ("startswith", "istartswith"):
value = value[:-1]
elif lookup_type in ("endswith", "iendswith"):
value = value[1:]
elif lookup_type in ("contains", "icontains"):
value = value[1:-1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably still check if there's a percent sign there. That or could we subclass Field to remove this layer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it wouldn't be much extra computation to be defensive, I can't imagine a way it could break, especially in a way that wouldn't be caught by the test suite.

The comment is actually out of date. The percent signs are now added by PatternLookup.process_rhs(). I don't think we can monkey patch that without breaking usage of this backend alongside a SQL backend (and the same comment applies if the logic was in Field.get_db_prep_lookup()).

@timgraham timgraham force-pushed the init branch 6 times, most recently from 69e5f46 to 8ce323c Compare May 7, 2024 00:17
@timgraham timgraham merged commit 28f52b1 into mongodb-labs:main May 7, 2024
3 checks passed
@timgraham timgraham deleted the init branch May 7, 2024 00:36
@blink1073
Copy link
Member

🎉 thanks @timgraham!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants