COUNTER Release 5 #6781

Closed
11 of 12 tasks
bozana opened this issue Feb 22, 2021 · 101 comments
Labels
Enhancement:3:Major A new feature or improvement that will take a month or more to complete. Meta Issue An issue that groups and describes a collection of other issues.

@bozana
Collaborator

bozana commented Feb 22, 2021

Implement the COUNTER Release 5 for OJS/OMP/OPS usage statistics.
Here we can collect everything we decide is necessary. We can have a discussion below and every time we decide something we can summarize it here.

Release 5 is out, and it brings a lot of changes.
Here is a guide for journals: https://www.projectcounter.org/wp-content/uploads/2020/08/Module_2_Journal_Usage_20200811.pdf.

  1. Processing rules for COUNTER 5 reports, see https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/:
    a) Double-click filtering (see section 7.2):
    This is implemented here: https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L194-L224. Until now we differentiated between access to HTML, PDF and other files. This no longer seems to be needed -- we can apply the 30 seconds to any link, i.e. file.
    We should also change our implementation so that only identical URLs are considered (and not assocType + assocID, as until now). Uniqueness is treated differently:
    b) Unique Items (see section 7.3):
    In our case an Item is an article. The matching report is AR1. The rule is: "If multiple transactions qualifying for the Metric_Type in question represent the same item and occur in the same user-sessions, only one unique activity MUST be counted for that item." The user session seems to be defined as one hour, as far as I understand it.
    Whether article versions belong to the same Item is still an open question. Given the way we represent them internally, I would say they do.
    c) Unique Titles (see section 7.4):
    In the case of a journal, Title = a journal and the report = Title Master Report. Similar to the rule for unique items above, the rule here is: "If multiple transactions qualifying for the Metric_Type in question represent the same title and occur in the same user-session only one unique activity MUST be counted for that title.". Again the user session seems to be defined as one hour. I.e. if a user accesses one article and then another in the same session, it would count only once.
    This rule, i.e. report, does not seem to be used for single journals -- it was introduced mostly for books. Do we need it (e.g. for libraries and multi-journal installations)?
    d) Internet Robots and Crawlers (see section 7.8):
    Same as for Release 4.
    COUNTER maintains the current list of internet robots and crawlers at https://github.com/atmire/COUNTER-Robots.
    We use it as a module in lib/pkp/lib/counterBots, assign the file to the variable COUNTER_USER_AGENTS_FILE (https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L23) and implement the function isUserAgentBot in https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L100. The function is then used when the log files are processed (https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L170).
    We should define a strategy for when we pull in the most recent version of the list.

  2. Because R5 now supports/counts abstract views (in the total views count), shall we consider the galley view pages too?

  3. SUSHI support is mandatory for compliance with COUNTER Release 5 (see https://www.projectcounter.org/wp-content/uploads/2019/05/Release_5_TechNotes_PDFX_20190509-Revised.pdf).

  4. Which reports would we need/like to support/provide: AR1, Journal Master Report, X?
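
The double-click rule in item 1a can be sketched as follows. This is only an illustration in Python, not the OJS implementation: the function name `filter_double_clicks`, the `(user_key, url, timestamp)` tuple layout and the idea of using something like IP plus user agent as `user_key` are assumptions made for the example. It also assumes that the later request of a double click is the one that is kept, which is one reading of section 7.2 and should be checked against the Code of Practice.

```python
from datetime import datetime, timedelta

# 30-second window named in the issue for any link, i.e. file
DOUBLE_CLICK_WINDOW = timedelta(seconds=30)


def filter_double_clicks(entries):
    """Collapse repeated requests for the same URL by the same user.

    entries: iterable of (user_key, url, timestamp) tuples (hypothetical layout).
    A request that follows the previously retained request for the same
    (user_key, url) within 30 seconds replaces it, so only the later one counts.
    """
    kept = []
    last_index = {}  # (user_key, url) -> position in `kept` of the retained request
    for user_key, url, ts in sorted(entries, key=lambda e: e[2]):
        key = (user_key, url)
        idx = last_index.get(key)
        if idx is not None and ts - kept[idx][2] <= DOUBLE_CLICK_WINDOW:
            # Double click: drop the earlier request, keep the later one.
            kept[idx] = (user_key, url, ts)
        else:
            last_index[key] = len(kept)
            kept.append((user_key, url, ts))
    return kept
```

Note that the comparison is made on the URL alone, as proposed above, rather than on assocType + assocID.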

@bozana bozana added the Enhancement:1:Minor A new feature or improvement that can be implemented in less than 3 days. label Feb 22, 2021
@bozana
Collaborator Author

bozana commented Feb 22, 2021

Hi @NateWr, @ctgraham, and @shanu17, I've opened this issue so we can see everything that has to be done for COUNTER R5 support. I've started with a few things I identified above, but the list is still to be filled in. It would also be great to know what exactly Pitt ULS is working on, so that we can work on other things and arrange the work accordingly.
Closely related to these changes for R5 are some improvements discussed here: #6782.

@bozana bozana added this to the OJS/OMP/OPS 3.4 milestone Feb 22, 2021
@NateWr
Contributor

NateWr commented Feb 22, 2021

Until now we differentiated between access to HTML, PDF and other files. This no longer seems to be needed -- we can apply the 30 seconds to any link, i.e. file.

We will probably want to continue to track total views between different kinds of full text, because journals will want to know that. So we'll just need to make sure that we're counting appropriately for R5 while not losing some specificity we already have.

@bozana
Collaborator Author

bozana commented Feb 22, 2021

Until now we differentiated between access to HTML, PDF and other files. This no longer seems to be needed -- we can apply the 30 seconds to any link, i.e. file.

We will probably want to continue to track total views between different kinds of full text, because journals will want to know that. So we'll just need to make sure that we're counting appropriately for R5 while not losing some specificity we already have.

👍
(The above is about double-click processing, which was different in R4 and now it is the same -- 30 seconds -- for any files)

@bozana
Collaborator Author

bozana commented Jul 12, 2021

Hi all, especially @asmecher and @NateWr, but maybe also @ctgraham (especially regarding the COUNTER R5 rules) :-)
I implemented the major part of the new UsageStatsLoader (the function processFile()), which takes COUNTER R5 into account. Could you take a look at it and let me know if you have better ideas or suggestions?
Here is a short summary:

On the way from log file processing to the data in the DB tables: at which point do you think I should check whether the object with the given ID exists? For current log files this is surely not necessary, but it matters if someone wants to reprocess old files. I was thinking of the moment we load the data from the temp tables into the actual ones.

Thanks a lot!

@bozana
Collaborator Author

bozana commented Jul 12, 2021

@ctgraham, earlier we logged the administrative and/or user name, but I do not think this was taken into account in any way for the usage stats numbers and COUNTER. As far as I can see we do not need them now, and we do not need to differentiate/consider/remove such access, correct?

@bozana
Collaborator Author

bozana commented Jul 12, 2021

And maybe one more question @ctgraham: I think we do not need unique_title metric type, correct? -- We would/could consider books as submissions in OJS?

@NateWr
Contributor

NateWr commented Jul 13, 2021

Thanks @bozana, I've left some comments on the commit.

extends FileLoader (that is still only used by/for usage stats)

I think this is fine for now. Ideally, we would migrate this to use the new FileService and Jobs Queue to handle the staging and processing of files. We would probably benefit from breaking this down into several smaller jobs, but that can be done another time.

unique item: the day is sliced in 24 pieces

Is that really how the COUNTER spec works!? 😮 So if I view something at 7:59 and 8:01 these are considered unique, but not if I view it at 8:01 and 8:03?
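
The hour-slice behaviour asked about here can be illustrated with a small sketch. It is written in Python for illustration only and is not the OJS code; the names `hour_slice_key` and `count_unique_items`, and the use of IP address plus user agent to identify a visitor, are assumptions for the example.

```python
from datetime import datetime


def hour_slice_key(ip, user_agent, item_id, ts):
    """Key that is identical for all requests by one visitor to one item
    within the same clock hour (the day is sliced into 24 pieces)."""
    return (ip, user_agent, item_id, ts.date(), ts.hour)


def count_unique_items(requests):
    """requests: iterable of (ip, user_agent, item_id, timestamp) tuples.
    Each distinct hour-slice key is counted as one unique item access."""
    return len({hour_slice_key(*r) for r in requests})


visitor = ("203.0.113.5", "ExampleBrowser/1.0", "article-42")
across_boundary = [
    visitor + (datetime(2021, 7, 13, 7, 59),),
    visitor + (datetime(2021, 7, 13, 8, 1),),
]
same_hour = [
    visitor + (datetime(2021, 7, 13, 8, 1),),
    visitor + (datetime(2021, 7, 13, 8, 3),),
]
print(count_unique_items(across_boundary), count_unique_items(same_hour))
```

With fixed clock-hour slices, views at 7:59 and 8:01 land in different slices and count twice, while views at 8:01 and 8:03 share a slice and count once.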

@bozana
Collaborator Author

bozana commented Jul 14, 2021

Thanks a lot @NateWr!
Yes, we would need to move to Laravel scheduling and the jobs queue as well, but I agree to do that later, when everything else is done...
Yes, what you say about uniqueness is true, and maybe @ctgraham can confirm?
Actually the uniqueness is connected with/based on one user session, but if no session exists (e.g. if the user is not logged in), then it works that way, see https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/.

@ctgraham
Collaborator

earlier we had administrative and/or user name logged, but I do not think this was considered in a way for the usage stats numbers and COUNTER.

If we had a way to exclude administrative usage counts from COUNTER statistics, we would be responsible to do so. If we could exclude counting access generated via the Issue Preview, this would be appropriate (but might not be readily done). In general, just the fact that a user was logged in should not be a consideration for COUNTER.

I think we do not need unique_title metric type, correct? -- We would/could consider books as submissions in OJS?

The unique_title metric shouldn't be relevant for OJS. Perhaps for OMP, if individual chapters can be presented?

Actually the uniqueness is connected with/based on one user session.

Yes, the fuzzy definition of "per hour" is only relevant if the user session itself cannot be identified.

@bozana
Collaborator Author

bozana commented Jul 16, 2021

I haven't thought much about OMP yet, but: A book/submission can just contain the files or it can contain chapters. So it could be a problem if we would have different Items (per COUNTER definition) within one press -- in the first case the book/submission and in the second chapters -- correct? So my first thought was to simplify all this and say the book/submission is the item, and the chapters would be seen as just files... 🤔

@bozana
Collaborator Author

bozana commented Jul 16, 2021

Yes, the fuzzy definition of "per hour" is only relevant if the user session itself cannot be identified.

Here as well I tended to (over)simplify it and have only considered the hour slices 😅 So I should consider/log the user session, if there is one?
I will see how long our sessions last...
Somehow I do not like this COUNTER 'rule' either -- different systems can have sessions of different lengths... :-P

@bozana
Collaborator Author

bozana commented Jul 16, 2021

Regarding the administrative access:

  • we can check if the user is admin or editor or so, but this does not necessarily mean the access is administrative i.e. we maybe should not do that, right?
  • the issue preview uses the same function 'view' but we could fire the usage event only when the object (issue or submission) is published, ok?

@ctgraham
Collaborator

Regarding the administrative access:

* we can check if the user is admin or editor or so, but this does not necessarily mean the access is administrative i.e. we maybe should not do that, right?

* the issue preview uses the same function 'view' but we could fire the usage event only when the object (issue or submission) is published, ok?

Agreed, and agreed.

@bozana
Collaborator Author

bozana commented Jul 16, 2021

@asmecher, do I understand correctly that our user sessions, depending on the setting in the config file, either 'never' expire (30 days, with the session ID unchanged) or expire with the browser session, i.e. when the browser is closed?

@asmecher
Member

SessionManager.inc.php contains:

ini_set('session.cookie_lifetime', 0);
...
ini_set('session.gc_maxlifetime', 60 * 60);

...so by my read, sessions unused for 1 hour become eligible for garbage collection, which is stochastic.

These policies haven't been changed for a long time, and I suspect there are some best practices we could adopt. So I'm open to change on this.

@bozana
Collaborator Author

bozana commented Jul 19, 2021

That sounds good to me -- I am just trying to figure out if we can rely on our session ID for usage stats...
For some reason I am always logged in (with the same session ID), even after 1 hour of not using the site (with 0 set in the config and 'remember me' deactivated)... Do others experience the same?

@bozana
Collaborator Author

bozana commented Jul 19, 2021

Hmmm... It seems that relying on those two settings is not dependable and we should implement the session timeout ourselves, see https://stackoverflow.com/questions/520237/how-do-i-expire-a-php-session-after-30-minutes.
So maybe we could add that check (whether the last usage was a long time ago) here, before these lines: https://github.com/pkp/pkp-lib/blob/main/classes/session/SessionManager.inc.php#L109-L110. Maybe somewhere else too?
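
The idea of an application-level inactivity timeout can be sketched as below. This is a language-neutral illustration in Python, not PHP session code and not the change that was eventually committed; `touch_session`, the `last_activity` key and the plain dict standing in for session storage are all hypothetical.

```python
import time

# One hour, mirroring the 60 * 60 value quoted from SessionManager.inc.php
SESSION_TIMEOUT_SECONDS = 60 * 60


def touch_session(session, now=None, timeout=SESSION_TIMEOUT_SECONDS):
    """Check a session for inactivity and record the current request.

    session: a dict standing in for session storage (assumption).
    Returns True if the session is still valid; in that case the
    last-activity timestamp is refreshed. Returns False if the last
    activity is older than `timeout` seconds; the session data is then
    cleared so the caller can start a fresh session with a new ID.
    """
    now = time.time() if now is None else now
    last = session.get("last_activity")
    if last is not None and now - last > timeout:
        session.clear()
        return False
    session["last_activity"] = now
    return True
```

The point of checking on every request is that the expiry no longer depends on when garbage collection happens to run.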

@bozana
Collaborator Author

bozana commented Jul 19, 2021

But even if we implement expiring the user session after 30 minutes or 1 hour of inactivity:
For COUNTER usage stats:
If a logged-in user uses the journal site for the whole day, it would count as only 1 unique submission access, unlike the counts for users who are not logged in, where we use the 24 hourly slices of the day.
Somehow I tend to always use those 24 slices for usage stats...
@ctgraham and @NateWr, what do you think?
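
The discrepancy described in this comment can be made concrete with a short sketch. It is an illustration in Python under stated assumptions, not the OJS logic: `uniqueness_key` is a hypothetical helper that uses the session ID when one is available and falls back to an hour slice (IP, user agent, date, hour) otherwise.

```python
from datetime import datetime
from typing import Optional


def uniqueness_key(item_id, ts, ip, user_agent, session_id: Optional[str] = None):
    """Return the key under which a request counts as one unique item access."""
    if session_id:
        # Identifiable session: every access to the item in that session is one unique access.
        return ("session", session_id, item_id)
    # No session: fall back to one of the 24 hourly slices of the day.
    return ("hour", ip, user_agent, ts.date(), ts.hour, item_id)


# The same article opened once per hour from 08:00 to 17:00.
times = [datetime(2021, 7, 19, hour, 15) for hour in range(8, 18)]

logged_in = {uniqueness_key("article-42", t, "203.0.113.5", "UA", session_id="abc123") for t in times}
anonymous = {uniqueness_key("article-42", t, "203.0.113.5", "UA") for t in times}
print(len(logged_in), len(anonymous))
```

For the same browsing pattern the session-based key yields a single unique access while the hour-slice fallback yields one per hour, which is the inconsistency that makes always using the slices tempting.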

@bozana
Collaborator Author

bozana commented Jul 19, 2021

Thanks to a suggestion from @NateWr I moved the double-click and unique-item processing, i.e. removal, to the database, doing it with SQL -- all log entries are inserted into the temporary tables and then the removal of double and unique clicks is done there, see
https://github.com/bozana/pkp-lib/blob/6782/classes/statistics/UsageStatsTotalTemporaryRecordDAO.inc.php#L95
and
https://github.com/bozana/pkp-lib/blob/6782/classes/statistics/UsageStatsUniqueTemporaryRecordDAO.inc.php#L94
Now the processing in the UsageStatsLoader is slim, see https://github.com/bozana/pkp-lib/blob/6782/classes/task/UsageStatsLoader.inc.php#L87.
Maybe @asmecher and @jonasraoni could have a look at those SQL statements too?

@asmecher
Member

@bozana, I think it should be possible to formulate a query that works for both MySQL and PostgreSQL using DELETE FROM xxx WHERE yyy IN (subquery) -- but you'd need to test it against both to be sure, as I remember seeing complaints about self-joins but don't recall the conditions. I have some PostgreSQL test datasets from various versions, and could either send you those, or test potential queries, whatever's most helpful.
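
The `DELETE FROM xxx WHERE yyy IN (subquery)` shape can be tried out in a small, self-contained sketch. It uses Python's built-in `sqlite3` module purely so that it runs anywhere; it says nothing about how MySQL or PostgreSQL treat the same statement, which, as noted above, would have to be tested on both. The table `usage_tmp` and its columns are hypothetical and are not the schema of the temporary tables in the linked DAOs.

```python
import sqlite3


def remove_double_clicks(conn, window_seconds=30):
    """Delete every row that is followed by a later request from the same
    user for the same URL within `window_seconds` (the later row is kept)."""
    conn.execute(
        """
        DELETE FROM usage_tmp
        WHERE id IN (
            SELECT a.id
            FROM usage_tmp a
            JOIN usage_tmp b
              ON a.user_key = b.user_key
             AND a.url = b.url
             AND (b.ts > a.ts OR (b.ts = a.ts AND b.id > a.id))
             AND b.ts - a.ts <= ?
        )
        """,
        (window_seconds,),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_tmp (id INTEGER PRIMARY KEY, user_key TEXT, url TEXT, ts INTEGER)"
)
conn.executemany(
    "INSERT INTO usage_tmp (user_key, url, ts) VALUES (?, ?, ?)",
    [
        ("u1", "/a", 0),    # followed by ts=10 within 30 s: removed
        ("u1", "/a", 10),   # followed by ts=35 within 30 s: removed
        ("u1", "/a", 35),   # next request is 65 s later: kept
        ("u1", "/a", 100),  # kept
        ("u2", "/a", 5),    # different user: kept
        ("u1", "/b", 12),   # different URL: kept
    ],
)
remove_double_clicks(conn)
remaining = conn.execute(
    "SELECT user_key, url, ts FROM usage_tmp ORDER BY ts"
).fetchall()
print(remaining)
```

The subquery self-joins the table being deleted from, which is exactly the kind of construct worth checking against each target database before relying on it.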

@bozana
Collaborator Author

bozana commented Jul 20, 2021

@ctgraham, just to be sure: do you think we should consider user session when possible or use only 24 slices?

@bozana
Collaborator Author

bozana commented Jul 20, 2021

If we decide to use the session ID when possible, shall it expire after 1 hour of inactivity, or 1/2 hour?

@bozana
Collaborator Author

bozana commented Jul 20, 2021

@asmecher, would this code be OK for the session expiration: bozana@e16bbfa, as said above?

bozana added a commit to bozana/pkp-lib that referenced this issue Sep 12, 2022
bozana added a commit to bozana/ojs that referenced this issue Sep 12, 2022
bozana added a commit to bozana/ojs that referenced this issue Sep 12, 2022
bozana added a commit to bozana/omp that referenced this issue Sep 12, 2022
bozana added a commit to bozana/ops that referenced this issue Sep 12, 2022
bozana added a commit that referenced this issue Sep 12, 2022
#6781 Opt-out for public SUSHI API
bozana added a commit to pkp/ojs that referenced this issue Sep 12, 2022
bozana added a commit to pkp/ops that referenced this issue Sep 12, 2022
bozana added a commit to pkp/omp that referenced this issue Sep 12, 2022
@bozana bozana closed this as completed Sep 12, 2022
Repository owner moved this from Under Development to Done in Statistics Sep 12, 2022
@bozana
Collaborator Author

bozana commented Sep 19, 2022

@ctgraham, the major functionality is in the main branch. It would be great and I would be very happy if you would like and have some time to take a look/test... but... no pressure, of course... :-)

@bozana
Collaborator Author

bozana commented Nov 1, 2022

PR that considers the first date published of a context when calculating the SUSHI start date:
pkp-lib: #8390
ojs: pkp/ojs#3605 (only submodule update)
omp: pkp/omp#1240 (only submodule update)
ops: pkp/ops#386 (only submodule update)

bozana added a commit to bozana/pkp-lib that referenced this issue Nov 3, 2022
bozana added a commit to bozana/omp that referenced this issue Nov 3, 2022
bozana added a commit to bozana/ops that referenced this issue Nov 3, 2022
bozana added a commit to bozana/ojs that referenced this issue Nov 3, 2022
bozana added a commit that referenced this issue Nov 3, 2022
#6781 consider first date published of a context for the S…
bozana added a commit to pkp/omp that referenced this issue Nov 3, 2022
pkp/pkp-lib#6781 submodule update ##bozana/6781##
bozana added a commit to pkp/ops that referenced this issue Nov 3, 2022
pkp/pkp-lib#6781 submodule update ##bozana/6781##
bozana added a commit to pkp/ojs that referenced this issue Nov 3, 2022
pkp/pkp-lib#6781 submodule update ##bozana/6781##
withanage pushed a commit to withanage/ojs that referenced this issue Dec 14, 2022

5 participants