Update log handling to ensure metrics are calculated correctly across versions #4904

Closed
NateWr opened this issue Jul 9, 2019 · 28 comments

Comments

@NateWr
Contributor

NateWr commented Jul 9, 2019

The code which reads access logs and stores metrics data needs to be updated to ensure stats are calculated correctly across different versions.

The main URLs for a submission should stay the same. However, new URLs will be introduced for each version and its galleys (see #4870). Visits to these URLs should go toward a submission's total.

Also, COUNTER may have some rules regarding counting duplicate visits within a time period to the same resource. We need to figure out what these rules specify and how to correctly count visits to two versions of the same item in a short period.

@NateWr
Contributor Author

NateWr commented Jul 9, 2019

@ctgraham if you have any thoughts on this, that'd be much appreciated!

@NateWr NateWr added this to the OJS/OMP 3.2 milestone Jul 9, 2019
@ctgraham
Collaborator

ctgraham commented Jul 9, 2019

This may be a bit broad for this ticket, but I think it is strongly relevant....

James and I chatted with Paul Needham of IRUS-UK last month about some of the considerations in log processing. This includes:

  • Processing logfiles locally vs. transmitting raw logs for external processing
  • Processing bot traffic for exclusion
    • via User Agent
    • via IP
    • via crawl pattern
  • Internal datastructures for representing usage at scale
    • We use a flat table amalgamating multiple statistical dimensions
    • IRUS-UK uses tables: suppliers, items, and statistics, with statistics represented by item-month.
  • Increased interdependency in R5 of COUNTER-SUSHI as a unified webservice.

There might be opportunities to collaborate with IRUS-UK to architect a common library which can process logs into statistics (or perhaps just perform the bot-exclusions). It also made me wonder why there wouldn't be a shared library for representing statistics on top of COUNTER-SUSHI R5 (or why this wouldn't be outsourced to an Electronic Resource Management system), but that didn't seem to get much uptake in our conversation.

I can definitely check in on what the interpretation of "double clicks" of versioned items is from a COUNTER perspective.

@NateWr
Contributor Author

NateWr commented Jan 8, 2020

To my surprise, I think we are ok-ish on the versioning front without any changes. However, there is one possible issue I uncovered while looking into it.

From my investigations, here's how the URLs are parsed:

  • Anything with article/view is considered ASSOC_TYPE_SUBMISSION.
  • Anything with article/download is considered ASSOC_TYPE_SUBMISSION_FILE.

This works for PDF/HTML galleys because even though they load at article/view/<submission-id>/<galley-id>, the PDF or HTML file is loaded in an iframe at article/download/<submission-id>/<galley-id>.

Our versioned URLs look like article/view/<submission-id>/version/<publication-id>[/<galley-id>], but the underlying PDF or HTML file is still loaded in article/download/<submission-id>/<galley-id>. So visits to the landing page and galleys of old versions are counted correctly.

However, this means that when someone directly visits the page of a PDF or HTML galley, without going through the article landing page, they record two entries in the logs: article/view/<submission-id>/<galley-id> and article/download/<submission-id>/<galley-id>.

UsageStatsLoader::_getUrlMatches() returns both lines as a match, one for submission and one for the file (what we call in the backend stats area an "abstract" hit and a "files" hit). However, once the file is parsed, it appears in the usage_stats_temporary_records table as a single hit to the submission (not the file):

mysql> select * from usage_stats_temporary_records where day="20191219";
+----------+------------+----------+------------+--------+------------+--------+------+---------------------------+-----------+
| assoc_id | assoc_type | day      | entry_time | metric | country_id | region | city | load_id                   | file_type |
+----------+------------+----------+------------+--------+------------+--------+------+---------------------------+-----------+
|        5 |    1048585 | 20191219 | 1576762274 |      1 | NULL       | NULL   | NULL | usage_events_20191219.log |         0 |
+----------+------------+----------+------------+--------+------------+--------+------+---------------------------+-----------+

This is not a new issue with versioning, but I thought I would double-check with you @asmecher and @ctgraham to see if this is intended behaviour. It seems to me like this should count as a single view of the file, not the submission.

To test this I constructed the following fake log to parse, which includes only two URL hits. One to the URL to view a PDF (article/view/5/1) and one to the URL that actually loads the PDF (article/download/5/1/15). I believe that this simulates what a single request to view a PDF would generate in a real usage log.

127.0.0.1 administrative 1 "2019-12-19 14:32:08" http://localhost:8000/publicknowledge/article/view/5/1 200 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"
127.0.0.1 administrative 1 "2019-12-19 14:32:08" http://localhost:8000/publicknowledge/article/download/5/1/15 200 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"

(To use this for your own testing you would need to update the submission, galley and file ids to ones that exist in your system.)
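
As a rough illustration of the parsing described above, here is a minimal Python sketch that splits the two sample lines into fields and applies the simple rule that article/view means a submission hit and article/download means a file hit. The regular expression and the names LOG_LINE, classify and parse are assumptions made for this example; they are not taken from UsageStatsLoader.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical pattern for the log format shown above:
# ip, two fields we ignore, quoted timestamp, URL, status, quoted user agent.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ "(?P<time>[^"]+)" (?P<url>\S+) '
    r'(?P<status>\d{3}) "(?P<ua>[^"]*)"$'
)


class Hit(NamedTuple):
    url: str
    kind: str  # "submission" or "submission_file"


def classify(url: str) -> Optional[str]:
    """Apply the naive rule: view = submission, download = file."""
    if "/article/download/" in url:
        return "submission_file"
    if "/article/view/" in url:
        return "submission"
    return None


def parse(lines: List[str]) -> List[Hit]:
    """Return one Hit per log line that matches the pattern and a known URL."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        kind = classify(match["url"])
        if kind is not None:
            hits.append(Hit(match["url"], kind))
    return hits


UA = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"
SAMPLE = [
    '127.0.0.1 administrative 1 "2019-12-19 14:32:08" '
    'http://localhost:8000/publicknowledge/article/view/5/1 200 "' + UA + '"',
    '127.0.0.1 administrative 1 "2019-12-19 14:32:08" '
    'http://localhost:8000/publicknowledge/article/download/5/1/15 200 "' + UA + '"',
]

for hit in parse(SAMPLE):
    print(hit.kind, hit.url)
```

Under this naive rule the galley view page is classified as a submission hit, which is the behaviour being questioned here.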

@asmecher
Member

asmecher commented Jan 9, 2020

I think this is probably tied up with COUNTER, which I'm not very familiar with -- it has specifications for things like debouncing. And I'm not sure whether COUNTER business rules are applied when metrics are recorded, or when they're processed. @ctgraham, are you familiar with this aspect? If not, maybe I could follow up with Bozana.

@ctgraham
Collaborator

ctgraham commented Jan 9, 2020

I don't think COUNTER cares about abstract views, nor has it in my memory. I reviewed COUNTER R5, R4, and R3 and each focuses on fulltext downloads/views. Back in R3 there was a distinction between fulltext HTML, fulltext PDF, and fulltext "other", but that distinction goes away in later releases.

If a user views fulltext via HTML and then fulltext via PDF, or downloads the fulltext via the same medium twice, this counts as only one view.

If I were more helpful I would check to make sure the COUNTER reports are using only ASSOC_TYPE_SUBMISSION_FILE for calculations (I think this is the case), check for legacy reports in OJS which describe abstract views separately from fulltext views (hopefully not), and would verify that the inline display of HTML fulltext is registered correctly (I think I recall fancy jiggering in a plugin for this). But first I want to get that Crossref testing out the door, and even that simple task is eluding me right now.

@NateWr
Contributor Author

NateWr commented Jan 13, 2020

Thanks @ctgraham. I think this is not an urgent question to address for 3.2, because it is not something newly introduced. I'll defer it from 3.2 for now but would like to keep this conversation open.

I'm not sure if I understand you correctly, but perhaps there is some divergence here between how OJS keeps statistics and how you describe COUNTER expectations.

First, OJS does make a distinction between abstract and file views. It tracks both separately in the metrics type by the assoc_type and we display them independently to editors in the new article stats UI.

Second, from my brief test, it appears that OJS is treating a single visit to the view PDF page as a single visit to the article abstract page. If COUNTER ignores abstract views entirely, usage may be under-reported. (This was an isolated test. More investigation would be necessary to see what happens in different scenarios and how the rows in a metrics table get compiled into COUNTER reports.)

@bozana
Collaborator

bozana commented Feb 4, 2021

Maybe I can take a further look at the problems described here, i.e. what needs to be solved for the next release...
First I will test a few things to establish the current status. I will do it incrementally because I can only work a little bit every day; I will write the partial results here and then come back to you all at the end. So please ignore the text until my testing is finished.

OJS:

  1. Galley view and download:
    "However, this means that when someone directly visits the page of a PDF or HTML galley, without going through the article landing page, they record two entries in the logs: article/view/<submission-id>/<galley-id> and article/download/<submission-id>/<galley-id>"
    When going to a galley directly, e.g. http://.../index.php/publicknowledge/article/view/1/1 I only see one log entry: ... http://.../index.php/publicknowledge/article/download/1/1/2 200 ...
    i.e. the galley landing page 'http://.../index.php/publicknowledge/article/view/1/1' is never logged.
    This conforms to the code in the UsageEventPlugin: https://github.com/pkp/ojs/blob/master/plugins/generic/usageEvent/UsageEventPlugin.inc.php#L57-L71.
    However, journals might use Apache logs, where the galley view page is logged, so I will also check how the log file is processed. It should actually follow the same logic, but... it does not:
    Here https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L418-L427 the logic from https://github.com/pkp/ojs/blob/master/plugins/generic/usageEvent/UsageEventPlugin.inc.php#L67-L71 is not applied. So yes, the galley view page will count as a submission abstract page :-( This should be corrected so that the galley view page is not counted.
    I do not understand why Nate's example from above would count only as an abstract view. Looking at the code, and given the bug above, it should count both (abstract view and file download). I will test that.

  2. Abstract pages and downloads:
    We log both, i.e.: journal landing page, issue landing page, article landing/abstract page, as well as file downloads (article and issue galleys).
    The following file types are considered: pdf, html, doc, and other.
    How are they considered in the current report tools?

  • PKP Usage statistics report exports everything collected (much the same as the DB table metrics), aggregated by month.
  • COUNTER Reports seem to consider only article file (PDF, HTML and other) downloads, no abstracts. Journal Report 4.1 does still differentiate between ft_pdf, ft_html and other. The Article Report 4.1 does not, it has just ft_total. Is this correct?
  • View Report considers all article counts: abstract views and all existing file (types) downloads.
  • Custom Report Generator allows a report about everything collected, i.e. journal, issue and article landing page, as well as article (and issue) file downloads.
  • Article statistics (provided for the journal managers and editors) provide (similar to the View Report) all article counts: abstract views, file (PDF, HTML and other) downloads, as well as total counts.
  • Paperbuzz plugin considers only PDF, HTML and other downloads.
    Thus, this all seems to be fine, I think. Internally we like to have the landing pages too, but for COUNTER (and other) reports only file downloads are relevant.
  3. Considering versioning:
    The 'version' URL paths seem not to be considered. It works for file downloads because the same download hook is used, but an article abstract page of an old version (http://.../index.php/publicknowledge/article/view/2/version/3) is not logged and thus not counted.
    This should be corrected.
    Currently we do not differentiate the statistics for different article versions. If needed, the file downloads can be aggregated by version, because we can probably determine which version/publication a stored file ID belongs to. This is not possible for article abstract pages -- once the problem above is corrected there will be just one total for article abstracts, without knowing which version was viewed. Can this stay as it is, or should we maybe add a new column to the DB table metrics that will store e.g. the publication ID?
    (The question in No. 5b -- whether a version is a unique item -- is only relevant when implementing Release 5).

  4. Processing rules for COUNTER Release 4 reports, see https://www.projectcounter.org/code-of-practice-sections/data-processing/
    a) Double click filtering (s. section 7.2):
    "When two requests are made for one and the same article within the above time limits (10 seconds for HTML, 30 seconds for PDF), the first request should be removed and the second retained."
    This is also how we implement it. It is implemented here: https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L194-L224.
    b) Protocol for internet robots and crawlers
    COUNTER maintains the current list of internet robots and crawlers at https://github.com/atmire/COUNTER-Robots.
    We use it as module in lib/pkp/lib/counterBots, assign the file to the variable COUNTER_USER_AGENTS_FILE (https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L23) and implement the function isUserAgentBot in https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L100. The function is then used when the log files are processed (https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L170).
    We should define the strategy when we get the most recent version of the list.
    Thus, everything seems to be OK for now, for the COUNTER Release 4 that we currently support.

  5. COUNTER Release 5:
    It seems that Release 5, with lots of changes, is out there. Here is a guide for journals: https://www.projectcounter.org/wp-content/uploads/2020/08/Module_2_Journal_Usage_20200811.pdf.
    So maybe we should only fix the problems Nate encountered here in this release, and then implement the support for the new R5.
    5.1. Processing rules for COUNTER 5 reports, s. https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/:
    a) Double click filtering (s. section 7.2):
    This is implemented here: https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L194-L224. We differentiate between the access of HTML, PDF and other. This seems not to be needed any more -- We can change it to consider 30 seconds for any link i.e. file?
    b) Unique Items (s. section 7.3):
    In our case an Item is an article. The matching report is AR1. And the rule is: "If multiple transactions qualifying for the Metric_Type in question represent the same item and occur in the same user-sessions, only one unique activity MUST be counted for that item." The user-session seems to be defined as one hour, as far as I understand it?
    This is what @ctgraham wrote above: "If a user views fulltext via HTML and then fulltext via PDF, or downloads the fulltext via the same medium twice, this counts as only one view.".
    We will need to implement it. At the moment I do not know how best to do it -- I believe that the journal managers and editors would like to have separate counts for each file, to see how they are used.
    The question of whether the article versions belong to the same Item is still open. Given the way we represent them internally, I would say they do belong to the same Item.
    c) Unique Titles (s. section 7.4):
    In the case of a journal Title = a journal and the report = Title Master Report. Similar to the rule for the unique item above, the rule here is: "If multiple transactions qualifying for the Metric_Type in question represent the same title and occur in the same user-session only one unique activity MUST be counted for that title.". Where the user-session seems to be defined for an hour? I.e. here, if a user accesses one article and then another in the same session, it would only count once.
    This rule i.e. report seems not to be used for single journals -- introduced mostly for books.
    d) Internet Robots and Crawlers (s. section 7.8):
    Same as for Release 4.
    COUNTER maintains the current list of internet robots and crawlers at https://github.com/atmire/COUNTER-Robots.
    We use it as module in lib/pkp/lib/counterBots, assign the file to the variable COUNTER_USER_AGENTS_FILE (https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L23) and implement the function isUserAgentBot in https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L100. The function is then used when the log files are processed (https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L170).
    We should define the strategy when we get the most recent version of the list.

  6. The other points @ctgraham mentioned above in #4904 (comment) are surely worth considering. I would first need to know more about them. If there is something we could consider 'quickly' with these changes, please tell me. Else, I would leave them for some other time.
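
To make the unique item rule from 5b more concrete, here is a minimal Python sketch of one way it could be counted while still keeping separate per-file totals for editors. It is only an illustration: the Request fields, the session_id value and the function name are hypothetical, and how a user-session is actually delimited (for example the one hour mentioned above) is left out on purpose.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Request(NamedTuple):
    session_id: str  # hypothetical: however a user-session ends up being identified
    item_id: int     # the article (Item)
    file_type: str   # "html", "pdf", "other"


def count_requests(requests: Iterable[Request]) -> Tuple[Counter, Counter]:
    """Return (per-file totals, unique item counts).

    Per-file totals count every request, keyed by (item_id, file_type).
    Unique item counts add one per (session, item) pair, so HTML followed
    by PDF of the same article in one session counts once.
    """
    per_file: Counter = Counter()
    unique_items: Counter = Counter()
    seen = set()
    for request in requests:
        per_file[(request.item_id, request.file_type)] += 1
        key = (request.session_id, request.item_id)
        if key not in seen:
            seen.add(key)
            unique_items[request.item_id] += 1
    return per_file, unique_items


sample = [
    Request("session-a", 1, "html"),
    Request("session-a", 1, "pdf"),
    Request("session-b", 1, "pdf"),
]
per_file, unique_items = count_requests(sample)
print(per_file)      # separate counts for each file type
print(unique_items)  # one count per session and article
```

Keeping both counters side by side is one possible answer to the concern above: the unique count would feed the R5 report while the per-file totals stay available for the editorial statistics.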

@bozana
Collaborator

bozana commented Feb 5, 2021

Hi @NateWr and @ctgraham, there are some things to correct here: the problems that Nate found (No. 1 and 3 above) should be fixed. The changes in the log processing (No. 5b and 5c above) -- above all the immediate access to HTML and then PDF, or to different article versions -- are related to the new COUNTER Release 5. So we will need to implement them when we implement the support for R5.
I could not find anything specific about different article versions in R5, i.e. whether they are considered a unique item. I could try writing to the email address on the COUNTER web site, if you have no better idea. Or should/can we maybe just define them as one item for now?
Any other comments/thoughts are of course very welcome.
Thanks a lot!!!

@bozana
Collaborator

bozana commented Feb 7, 2021

Hi @NateWr and @ctgraham, reading the document https://www.projectcounter.org/wp-content/uploads/2020/08/Module_2_Journal_Usage_20200811.pdf: it seems that the unique item and unique title rules (5b and 5c above) were first introduced with COUNTER Release 5. Release 4, which we currently support, still counts HTML and PDF separately, I think. Also, R5 considers abstract views as Investigations. Also, SUSHI support is mandatory for compliance with COUNTER Release 5 (see https://www.projectcounter.org/wp-content/uploads/2019/05/Release_5_TechNotes_PDFX_20190509-Revised.pdf).
Thus, I would suggest that we only fix the problems Nate encountered for this issue (possibly also for this OJS/OMP/OPS release?) and then rethink and re-implement the statistics so that we support Release 5. What do you think?
Can one still use the R4 reports?

Note about the unique title from 5c above:
The Journal Report changes -- e.g. the JR1 seems to exclude the OA-Gold -- so it seems we should then provide the Title Master Report (and possibly TR_J3, by Access_Type?)... see https://www.projectcounter.org/wp-content/uploads/2019/05/Release_5_Providers_20190509-Revised.pdf. Those journal reports do not seem to include the unique title numbers from 5c above? -- I only see them in the document for librarians (https://www.projectcounter.org/wp-content/uploads/2019/05/Release_5_Librarians_20190509-Revised-Edition.pdf) and it seems they are above all meant for (some) books. So I think we could ignore 5c above for journals?
(some examples of reports https://www.projectcounter.org/appendix-i-sample-counter-repor/)

@bozana
Collaborator

bozana commented Feb 8, 2021

Hmmm... This Release 5 seems to have been out there for almost 3 years now, so I suppose we should support it very soon...
Do we know journals that use the COUNTER reports?
It was nice to be able to say that our (internal) statistics are COUNTER compatible, so this would be good to keep in the future -- although possibly not many journals are actually using the COUNTER reports...
Having a look at the Release 5 reviews (https://www.projectcounter.org/counter-release-5-an-independent-review/) it seems it is still not implemented by most of the publishers/members. And it seems the OA is a little bit 'neglected' in this release.

@ctgraham
Collaborator

ctgraham commented Feb 8, 2021

COUNTER R4 is still fairly widely used around Libraries, though it ought to be phased out in favor of R5; Pitt ULS uses COUNTER in communicating usage statistics to Plum Analytics.

@shanu17 from Pitt ULS is working on SUSHI/COUNTER R5 for PKP. I owe him better definitions of each report so that he can map the report requirements against our Statistics Service / MetricsDAO. There remain some gaps in our internal statistics harvesting for non-OA usages, e.g. mapping access against institutional subscription and counting access denied requests.

@bozana
Collaborator

bozana commented Feb 8, 2021

Hi @ctgraham and @shanu17, that is great to hear -- that you are working on the support for R5! :-)))
I would then only fix the processing problems @NateWr encountered here, for R4, now.
If I can somehow help for the support of R5, just let me know. (Maybe we can open a new issue for that and ... )

@bozana bozana self-assigned this Feb 22, 2021
bozana added a commit to bozana/pkp-lib that referenced this issue Feb 23, 2021
bozana added a commit to bozana/usageStats that referenced this issue Feb 23, 2021
bozana added a commit to bozana/ojs that referenced this issue Feb 23, 2021
bozana added a commit to bozana/ojs that referenced this issue Feb 23, 2021
bozana added a commit to bozana/ojs that referenced this issue Feb 23, 2021
@bozana
Collaborator

bozana commented Feb 23, 2021

The coming PRs consider the following issues:

  1. URLs of versioned article abstract pages were not logged. Now they are logged, e.g. .../article/view/1/version/1.

  2. If there is a log entry of a galley view page, it will not be counted.
    The log file example above:

127.0.0.1 administrative 1 "2019-12-19 14:32:08" http://localhost:8000/publicknowledge/article/view/5/1 200 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"
127.0.0.1 administrative 1 "2019-12-19 14:32:08" http://localhost:8000/publicknowledge/article/download/5/1/15 200 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"

counts only the last entry, the file download.

  3. The representation ID was calculated using file->getAssocId(). Because a file can now be associated with several representations in OJS, this could lead to saving the wrong representation ID in the table metrics. This is corrected -- now the representation ID is passed through, from the log file URL processing via the table usage_stats_temporary_records to the table metrics.
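
The three changes above can be sketched roughly as follows. This is an illustrative Python sketch only, not the code from the PRs: the regular expressions and the classify_url name are assumptions, and the real loader works differently in detail. It treats the second path segment of a download URL as the representation (galley) ID and passes it through unchanged.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns, written for this example only.
# File download: article/download/<submission-id>/<galley-id>[/<file-id>]
DOWNLOAD = re.compile(r"/article/download/(\d+)/(\d+)(?:/(\d+))?/?$")
# Abstract page, current or old version: article/view/<submission-id>[/version/<publication-id>]
ABSTRACT = re.compile(r"/article/view/(\d+)(?:/version/(\d+))?/?$")


def classify_url(url: str) -> Optional[Tuple[str, int, Optional[int]]]:
    """Return (kind, submission_id, representation_id), or None if not counted.

    Galley view pages (article/view/<submission-id>/<galley-id>, with or
    without a version segment) match neither pattern and are skipped.
    """
    match = DOWNLOAD.search(url)
    if match:
        return ("file", int(match.group(1)), int(match.group(2)))
    match = ABSTRACT.search(url)
    if match:
        return ("abstract", int(match.group(1)), None)
    return None


for url in (
    "http://localhost:8000/publicknowledge/article/view/1/version/1",
    "http://localhost:8000/publicknowledge/article/view/5/1",
    "http://localhost:8000/publicknowledge/article/download/5/1/15",
):
    print(url, "->", classify_url(url))
```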

NOTE about the double click processing for versions:
The double click processing uses assocType and assocId -- if they are equal, the double click rule is applied. Earlier those two variables were defining also the unique URL, now with versioning this is not the case any more. The R4 says here https://www.projectcounter.org/code-of-practice-sections/data-processing/:

All users’ double-clicks on an http-link should be counted as only 1 request.

When two requests are made for one and the same article within the above time limits (10 seconds for HTML, 30 seconds for PDF), the first request should be removed and the second retained. Any additional requests for the same article within these time limits should be treated identically: always remove the first and retain the second. (For further information on the implementation of this protocol, see Appendix D: Guidelines for Implementation)

Thus, I am not sure if this applies to the same URLs or content objects.

(The R5 is more precise, I believe, and mentions only URLs, s. https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/#doubleclick. -- The uniqueness is handled extra.)

Currently, that means for versioning:
For the file download, e.g. the following log file entries when two different versions have the same file:

article/download/2/3/4
article/download/2/4/4

are counted as only 1.

If the file changes in a new version, e.g. the following log file entries

article/download/1/1/2
article/download/1/10/18

are counted = 2.

This seems to be OK -- if the file does not change and the new version contains the same file (with the same file ID) it is considered for double click.
But if the abstract changes (which we do not know) and we have these two URLs in the log file:

article/view/1/version/1
article/view/1/

they are counted = 1.

Is this all OK for now? (For R5 we would then change the way double click processing works, to consider only the same URLs)
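
For reference, the double click behaviour described in this note can be sketched like this in Python. It is a simplified illustration under stated assumptions, not the UsageStatsLoader code: the Event fields, the user key and the 10 second default for anything that is neither HTML nor PDF are all hypothetical. Requests are grouped by user, assocType and assocId, and a later request inside the time window replaces the earlier one.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    user: str        # hypothetical user key, e.g. IP plus user agent
    assoc_type: str  # "submission" or "file"
    assoc_id: int    # submission ID or file ID
    time: int        # Unix timestamp in seconds
    file_type: str   # "html", "pdf" or "other"


# Time limits in seconds, taken from the R4 text quoted above.
WINDOW = {"html": 10, "pdf": 30}
DEFAULT_WINDOW = 10  # assumption for other types, not from the specification


def filter_double_clicks(events: Iterable[Event]) -> List[Event]:
    """Drop the earlier of two requests for the same object within the window."""
    kept: List[Event] = []
    last_index: Dict[Tuple[str, str, int], int] = {}
    for event in sorted(events, key=lambda e: e.time):
        key = (event.user, event.assoc_type, event.assoc_id)
        index = last_index.get(key)
        window = WINDOW.get(event.file_type, DEFAULT_WINDOW)
        if index is not None and event.time - kept[index].time <= window:
            kept[index] = event  # remove the first, retain the second
        else:
            last_index[key] = len(kept)
            kept.append(event)
    return kept


# article/download/2/3/4 and article/download/2/4/4 share file ID 4.
same_file = [Event("u", "file", 4, 100, "pdf"), Event("u", "file", 4, 110, "pdf")]
# article/download/1/1/2 and article/download/1/10/18 have different file IDs.
new_file = [Event("u", "file", 2, 100, "pdf"), Event("u", "file", 18, 105, "pdf")]
print(len(filter_double_clicks(same_file)), len(filter_double_clicks(new_file)))
```

Because the key is built from assocType and assocId rather than the URL, the two abstract URLs article/view/1/version/1 and article/view/1/ collapse into one count in the same way. Switching the key to the URL would give the R5 behaviour mentioned above.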

bozana added a commit to bozana/ops that referenced this issue Mar 3, 2021
bozana added a commit to bozana/ops that referenced this issue Mar 3, 2021
@bozana
Collaborator

bozana commented Mar 3, 2021

Thanks a lot @NateWr! I would then merge the main branch changes as soon as all tests are successfully run, ok?

I also did the PRs for stable-3_3_0 (see above). Those contain only the fixes for problems No. 1 and No. 2 from #4904 (comment). (No. 3 requires a DB change, so it is not going into the stable branch.) Also, as we agreed, I did not implement PKPSubmissionDAO::exists(), so as not to change it in the stable branch, but I did add PKPPublicationDAO::exists(), because this check is newly added here.
Would you like to take a look and possibly test it?
🙏

bozana added a commit that referenced this issue Mar 3, 2021
bozana added a commit to pkp/usageStats that referenced this issue Mar 3, 2021
bozana added a commit to pkp/ojs that referenced this issue Mar 3, 2021
bozana added a commit to pkp/omp that referenced this issue Mar 3, 2021
bozana added a commit to pkp/ops that referenced this issue Mar 3, 2021
@NateWr
Contributor Author

NateWr commented Mar 4, 2021

Looks good, go ahead and merge to stable.

bozana added a commit to bozana/usageStats that referenced this issue Mar 4, 2021
bozana added a commit to bozana/pkp-lib that referenced this issue Mar 4, 2021
bozana added a commit to bozana/omp that referenced this issue Mar 4, 2021
bozana added a commit to bozana/omp that referenced this issue Mar 4, 2021
bozana added a commit to bozana/ops that referenced this issue Mar 4, 2021
bozana added a commit to bozana/ops that referenced this issue Mar 4, 2021
bozana added a commit to pkp/usageStats that referenced this issue Mar 4, 2021
bozana added a commit to pkp/ojs that referenced this issue Mar 4, 2021
bozana added a commit to pkp/omp that referenced this issue Mar 4, 2021
bozana added a commit to pkp/ops that referenced this issue Mar 4, 2021
bozana added a commit that referenced this issue Mar 4, 2021
@bozana
Collaborator

bozana commented Mar 4, 2021

Everything merged from this issue, thus closing...

@bozana bozana closed this as completed Mar 4, 2021
@asmecher asmecher modified the milestones: 3.3.1, 3.3.0-9 Oct 19, 2021