Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve query speed for IslandoraUtils::getTermForUri() #1067

Open
wants to merge 1 commit into
base: 2.x
Choose a base branch
from

Conversation

joecorall
Copy link
Member

@joecorall joecorall commented Jan 17, 2025

Continuation of #997
Closes Islandora/documentation#2272

What does this Pull Request do?

Improve performance for IslandoraUtils::getTermForUri()

How should this be tested?

Made a simple drush script

$ cat << EOF > scripts/1067.php
<?php

\$i = 0;
while (++\$i < 100) {
  \$term = \Drupal::service('islandora.utils')->getTermForUri("http://purl.org/coar/resource_type/c_18cc");
}
EOF

Then ran it and it took 17s to complete

$ time drush scr scripts/1067.php 
real    0m 17.46s
user    0m 0.52s
sys     0m 0.38s

Applied this PR and ran again and execution dropped to 1s

$ cd path/to/drupal/root
$ jq '.extra.patches."drupal/islandora" += {"Speed up getTermForUri": "https://github.com/Islandora/islandora/pull/1067.patch"}' composer.json > composer.json.tmp
$ mv composer.json.tmp composer.json
$ composer install
$ time drush scr scripts/1067.php 
real    0m 0.99s
user    0m 0.46s
sys     0m 0.18s

It's definitely the query here causing the slowdown, as if I replace the query with $results = [12]; I get similar 1s executions.

Here's the query that entityTypeManager is running:

SELECT t.revision_id, tid FROM taxonomy_term_data t
LEFT JOIN taxonomy_term__field_authority_link a ON a.entity_id = t.tid
LEFT JOIN taxonomy_term__field_external_uri e ON e.entity_id = t.tid
WHERE (a.field_authority_link_uri = 'http://purl.org/coar/resource_type/c_18cc') or (e.field_external_uri_uri = 'http://purl.org/coar/resource_type/c_18cc')

I think due to the orGroup this is slowing down these queries. It appears be more efficient to run several queries.

Documentation Status

  • Does this change existing behaviour that's currently documented?
  • Does this change require new pages or sections of documentation?
  • Who does this need to be documented for?
  • Associated documentation pull request(s): ___ or documentation issue ___

Additional Notes:

Any additional information that you think would be helpful when reviewing this
PR.

Interested parties

Tag (@ mention) interested parties or, if unsure, @Islandora/committers

Comment on lines 265 to 274
foreach ($fields as $field) {
$orGroup->condition("$field.uri", $uri);
}

$results = $query
->accessCheck(TRUE)
->condition($orGroup)
->execute();

if (empty($results)) {
return NULL;
$query = $storage->getQuery();
$results = $query
->accessCheck(TRUE)
->condition("$field.uri", $uri)
->execute();
if (!empty($results)) {
return $storage->load(reset($results));
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a general solution to the issue? Sure, it might be faster if it happens to try querying with a field containing the URI earlier; however, if the field happens to be last (i.e, EXTERNAL_URI_FIELD, with (m)any other populated authority_link fields) , it would still have to perform all the other queries first? Would be entirely up to the given sites configuration and content?

I feel like the mention of improving the indexing of uri values was very much overlooked (as originally mentioned in #997 (comment)); Only the first 30 characters are indexed by default. With this default configuration, longer values cannot be uniquely identified by the index, probably falling back to table scans.

Another potential easy optimization might be to provide a limit to the query? If we only expect one value (and it is all we plan to use), the DB engine might be able to return early after it finds a matching value. Vague thoughts about avoiding sorts (which would require the full result set in order to calculate the ordering) or possibly reworking to use UNION queries across the fields; however, unsure if it is really necessary to go that far.

Really, the indexing is probably the big thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it might be faster if it happens to try querying with a field containing the URI earlier; however, if the field happens to be last

FWIW in my particular example case it is the second table that gets called in my little test script that has the value. I'd suspect whatever N is (N being the number of EXTERNAL_URI_FIELD fields on the site) will grow the query speed with the current implementation or the proposal in this PR. Tested this by manually adding 8 new authority field and timing it.

N=2

Current implementation

time drush scr scripts/1067.php 
real    0m 10.39s
user    0m 0.54s
sys     0m 0.26s

This PR

time drush scr scripts/1067.php 
real    0m 1.18s
user    0m 0.52s
sys     0m 0.25s

N=10

Current implementation

time drush scr scripts/1067.php 
real    0m 26.47s
user    0m 0.57s
sys     0m 0.24s

This PR

(confirmed the found result was in the last table out of 10 in the iteration)

time drush scr scripts/1067.php 
real    0m 1.19s
user    0m 0.57s
sys     0m 0.18s

So we actually saw worse performance as N grew with the current implementation.

I feel like the mention of improving the indexing of uri values was very much overlooked

I tried manually adding indexes to the two tables that get called with this function on my site and it had no effect on improving the query speed

e.g.

CREATE TABLE `taxonomy_term__field_external_uri` (
  `bundle` varchar(128) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL DEFAULT '' COMMENT 'The 
  ...
  ...
  KEY `field_external_uri_uri_2` (`field_external_uri_uri`(768))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci COMMENT='Data storage for taxonomy_term field field_external_uri.';
time drush scr scripts/1067.php 
real    0m 10.39s
user    0m 0.54s
sys     0m 0.26s

Also tried adding ->range(0, 1) to the entity query with no effect to query speed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... things might be more from the left joins used by the query builder to make things the or-group happen. The UNION approach would allow to move away from them, but isn't supported by Drupal "entity query" mechanism, so would be somewhat of a bigger refactor with different access checking.

Just yeah, I find that I'm really not a fan of the idea of starting to make arbitrarily many queries as more fields are introduced. Might end up looking something like: 2.x...adam-vessey:islandora:fix/union-query ? Will probably give this a bit more of a look next week... not seeing the same slowness here, but unsure if due to different DB or fields or what.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I don't have an opinion on the implementation details as long as it makes this hot path faster.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked somewhat wider (looking to cover #1055 as well), submitted as a PR to get tests run on it more generally: #1068

@joecorall joecorall closed this Jan 22, 2025
@joecorall joecorall deleted the improve-query branch January 22, 2025 22:16
@joecorall joecorall restored the improve-query branch January 22, 2025 22:38
@joecorall joecorall reopened this Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: Cache terms returned by IslandoraUtils::getTermForUri()
2 participants