Improve usage statistics handling in the background/code #6782
Comments
See also this: #4904 (comment) |
The thoughts/decisions on the new usage stats log format:
|
@NateWr, @ctgraham and @asmecher, I have a question regarding the following columns in the DB table metrics:
Depending what we decide, I would eventually consider them in the log format (see above)... |
JSON I/O in PHP is pretty lightning-fast; I don't foresee any issues there. Do we want to leave room here for an "institution" field when that's available -- ideally (but probably rarely) a ROR? |
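As a rough illustration of the JSON idea, here is a minimal sketch of one log entry per line with room for an optional institution field. Every field name here (`time`, `contextId`, `submissionId`, `representationId`, `fileId`, `institutionIds`) is an assumption made for the example, not the format that was agreed on.

```php
<?php
// Hypothetical usage-log entry; all field names are assumptions for illustration.
function encodeUsageLogEntry(array $entry): string
{
    // One JSON object per line keeps the log appendable and easy to read line by line.
    return json_encode($entry, JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR) . PHP_EOL;
}

function decodeUsageLogEntry(string $line): array
{
    $entry = json_decode(trim($line), true, 512, JSON_THROW_ON_ERROR);
    // Entries written before an institution field existed get an empty list.
    $entry['institutionIds'] ??= [];
    return $entry;
}

$line = encodeUsageLogEntry([
    'time' => '2021-03-04 10:15:00',
    'contextId' => 1,
    'submissionId' => 42,
    'representationId' => 7,
    'fileId' => 99,
]);
$entry = decodeUsageLogEntry($line);
```

Defaulting the institution list on read means older log lines stay parseable if the field is added later.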
I agree with this proposal. The current statistics service class that delivers the visual statistics uses
Should we pack this into the |
COUNTER reporting would totally be interested in downloads by published item by institution (subscriber). |
Yes, the institution will definitely come, this is my next step, will then come back to you regarding this again :-) So, would you agree that we remove assoc_object_type, assoc_object_id and pkp_section_id from DB table metrics? |
Hmm, maybe it would be clearer to just call it |
Wouldn't this be a privacy concern? I thought that libraries were big on not tracking the specific material that someone is reading. It seems to me like individual resource tracking at the institutional level would frequently result in very low counts, which would make them prone to deanonymisation. In many cases, maybe only one person has visited a specific article from an institution. Also, it's my understanding that institutional reporting exists to assess ROI. But what relationship does ROI have with article-level stats? |
The "why track at the article level?" is interesting. I was coming at it from the protocol requirements perspective: COUNTER promises a certain set of reports, and institutions are expected to be interested in any of those reports, so the protocol provides for any of the reports to be filtered at the Customer / Consortia Members level. But practically, a customer is probably more interested in the Platform Report or Database Report or Title Report than they are in an Item Report. And, a metrics service is probably more interested in Items, irrespective of customers. A counter example (no pun intended) where a library would be interested in item level access would be something like EZPAARSE, where proxy logs are post-processed to gather additional metadata for consultation events represented within the proxy log. This isn't related to COUNTER reports, but illustrates that sometimes we library folks are interested in (anonymized) usage of very specific things (even digital things). In terms of the risk of deanonymization, this would be the case if we had a record of who accessed a particular OJS instance, but not what they accessed, but pulled down the COUNTER report and, because of low usage, now could guess at what they accessed. This seems like an unlikely scenario. What data source would I be using to get a record of a specific user accessing a specific OJS site, which isn't also capturing requested URIs? |
I think you're right that I am deep into unlikely but worst-case scenarios. I think my concern is not so much on the institution's side -- the institution can probably track all activity on its network. I'm more concerned about OJS as a store of information on time, place and content. A worst-case scenario I can imagine is in a country where it is dangerous to access certain information. An academic in that country wants to access such articles safely, and uses a proxy set up by a colleague on the Pitt network to read them. As part of an investigation, the country learns about the proxy and acquires the OJS institutional stats (by hacking on the OJS or Pitt side). By comparing records of the academic's requests to the proxy server (which the state has) with records of institutional stats, the state can infer that the academic is accessing those records. That's probably pretty unlikely, and storing the stats at journal-level instead of article-level may not help much in such cases. But it illustrates how it can be very hard to prevent deanonymization whenever data about time, place and content is joined together. We can probably weave our way through this in some sense by offering options at the level of institutional configuration. A journal could say "for this institution, track at the [instance/journal/article] level", and the institution could be required to request finer-grained stats if desired. |
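The per-institution option described above could be sketched roughly as follows. This is only an illustration under assumptions: the granularity constants, the event keys and the function name are all hypothetical.

```php
<?php
// Hypothetical granularity levels a journal could configure per institution.
const GRANULARITY_INSTANCE = 'instance';
const GRANULARITY_JOURNAL = 'journal';
const GRANULARITY_ARTICLE = 'article';

/**
 * Drop the identifiers that are finer than the configured level before
 * the event is stored in the institutional stats.
 */
function coarsenForInstitution(array $event, string $granularity): array
{
    if ($granularity === GRANULARITY_INSTANCE) {
        $event['contextId'] = null;
        $event['submissionId'] = null;
    } elseif ($granularity === GRANULARITY_JOURNAL) {
        $event['submissionId'] = null;
    }
    // At article level the event is kept unchanged.
    return $event;
}

$event = ['institutionId' => 5, 'contextId' => 1, 'submissionId' => 42];
$stored = coarsenForInstitution($event, GRANULARITY_JOURNAL);
```

Discarding the finer identifiers before storage, rather than filtering at report time, means the article-level detail never exists in the institutional stats for that institution.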
Interestingly, COUNTER is currently soliciting feedback from publishers and institutional users regarding reporting on the level of geographic subdivisions and institutional attribution of OA access. |
Hi all, I would like to discuss with you how we could/should improve the data model for usage statistics.
So first the question would be how to add the additional information needed. I tend to say that we have two DB tables, containing almost the same columns, one containing the total numbers and one containing the unique numbers (that would not consider assoc_type = article and issue files). The tables would have the additional columns institution_id and access_type. Rather than, or in addition to, separating the table(s) by assoc_type or institution or X into several tables, and what we should probably do however: I would very much love to hear your opinion on this! (@mfelczak, maybe you would like to follow the discussion here, or I can ping you when the final decision is there, so that you know what changes to expect, what would be relevant for your current project...) |
So maybe to have: DB table metrics_unique: DB table metrics_total_sum: DB table metrics_unique_sum: |
I've just spoken with @NateWr and we came to a few other conclusions:
Considering the requirements above, I will try to make a suggestion for the DB tables. Coming soon... |
DB table metrics_context&issue: DB table metrics_submission(abstract&files): DB table counter_metrics_submission(+Geo)_daily: DB table counter_metrics_submission(+Geo)_monthly: Table metrics_submission(abstract&files) and counter_metrics_submission(+Geo)_daily (for total metrics) contain some of the same data in a way, but we suppose that the Geo+Institution differentiation would produce fewer rows this way, especially later, when consolidated into the _monthly table 🤔 I have to admit that I mostly have OJS in mind, so I will have to double check it all for OMP and OPS, but I believe it would not change a lot... Any further thoughts would be very much appreciated! Especially on the performance... So maybe @asmecher and/or @jonasraoni would have some further suggestions/thoughts/ideas/tips... ? |
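To make the daily-to-monthly consolidation concrete, here is a minimal in-memory sketch. The row shape (`submission_id`, `country`, `date`, `metric`) is an assumption for illustration; the real consolidation would presumably run as SQL against the tables proposed above.

```php
<?php
/**
 * Sum daily rows into monthly rows. Each input row is assumed to look like
 * ['submission_id' => int, 'country' => string, 'date' => 'YYYY-MM-DD', 'metric' => int].
 */
function consolidateMonthly(array $dailyRows): array
{
    $monthly = [];
    foreach ($dailyRows as $row) {
        $month = substr($row['date'], 0, 7); // 'YYYY-MM'
        $key = $row['submission_id'] . '|' . $row['country'] . '|' . $month;
        if (!isset($monthly[$key])) {
            $monthly[$key] = [
                'submission_id' => $row['submission_id'],
                'country' => $row['country'],
                'month' => $month,
                'metric' => 0,
            ];
        }
        $monthly[$key]['metric'] += $row['metric'];
    }
    return array_values($monthly);
}

$monthly = consolidateMonthly([
    ['submission_id' => 42, 'country' => 'DE', 'date' => '2021-03-01', 'metric' => 2],
    ['submission_id' => 42, 'country' => 'DE', 'date' => '2021-03-15', 'metric' => 3],
    ['submission_id' => 42, 'country' => 'DE', 'date' => '2021-04-02', 'metric' => 1],
]);
```

Here three daily rows collapse into two monthly rows, which is the row reduction the monthly table is meant to provide.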
Thanks @bozana! For Alec and Jonas, do you know if there are any SQL tricks we can use for date tracking? At the moment each row includes a |
Thanks! I will read this through as soon as I can, starting from tomorrow I will be away for a week. |
Hi @bozana! I think the best for this case would be to use a time series database, but as it's not feasible, here are some ideas =]
Edit: About the logging/locking: https://bugs.php.net/bug.php?id=40897... And from this link, I saw another possibility (set the error_log output path to the usage log path, call the error_log, then restore the previous path). |
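For comparison with the error_log approach, a more direct way to serialize writes is PHP's `file_put_contents()` with the `FILE_APPEND` and `LOCK_EX` flags. This is a minimal sketch and not necessarily what was adopted; the function name and the temporary file are made up for the example.

```php
<?php
/**
 * Append one line to the usage log while holding an exclusive (advisory)
 * lock, so cooperating writers do not interleave partial lines.
 */
function appendUsageLogLine(string $path, string $line): void
{
    // Normalize to exactly one trailing newline per entry.
    $data = rtrim($line, "\r\n") . "\n";
    $written = file_put_contents($path, $data, FILE_APPEND | LOCK_EX);
    if ($written === false) {
        throw new RuntimeException("Could not write to usage log: {$path}");
    }
}

// Demo against a temporary file (hypothetical log content).
$path = tempnam(sys_get_temp_dir(), 'usage');
appendUsageLogLine($path, '{"submissionId":42}');
appendUsageLogLine($path, '{"submissionId":43}' . "\n");
$lines = file($path, FILE_IGNORE_NEW_LINES);
unlink($path);
```

The lock is advisory, so it only protects against other writers that also request it.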
I think we should have separate tables. Then we can get rid of
Instead of using I wonder if we should drop the
Do we need to differentiate this? I understand it's part of the COUNTER R5 spec, but I'm not sure it's a great fit for us. Most of our journals are not "Gold OA" but "Diamond OA". And where restrictions exist, they don't really match the description of "Controlled" that I read. I'm wondering if it would be more prudent for us to leave this out of our COUNTER reports.
The Also, I think that
I think we could probably get rid of this in every table. I don't think we're likely to support an alternative metric type in the future, and if we do, we probably need a separate table for it anyway. |
The differentiation is Controlled = Subscription; Gold OA = Open Access. The naming of Platinum/Diamond and other nuances are not recognized by COUNTER.
|
Hi all, Else, if we do separate the table by assoc_type: @jonasraoni, thanks for all the ideas! @ctgraham, thanks a lot for following and helping with those COUNTER requirements! I am very happy that somebody else has an eye on it too! :-) |
Maybe a note for those knowing better/dealing with the user requirements: Having the Geo and institution data only in the new counter metrics table, we will not be able to have abstract and file type usage and Geo/Institution data, also not issue usage and Geo/Institution data, e.g. we will not be able to say something like "There were X PDF downloads from Germany/University of Blabla". We would be able to say: "There were X total/unique investigations and Y total/unique requests from Germany/University of Blabla". |
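The difference between the four numbers can be sketched as follows, assuming the COUNTER R5 reading that every access to an item (abstract or file) is an investigation, only full-text file access is a request, and "unique" counts an item once per session. The event shape is hypothetical, and the sketch leaves out double-click filtering and other COUNTER processing rules.

```php
<?php
/**
 * Each event is assumed to look like
 * ['session' => string, 'submission_id' => int, 'type' => 'abstract'|'file'].
 */
function countCounterMetrics(array $events): array
{
    $totals = ['investigations' => 0, 'requests' => 0];
    $seen = ['investigations' => [], 'requests' => []];
    foreach ($events as $event) {
        $item = $event['session'] . '|' . $event['submission_id'];
        // Every access (abstract or file) counts as an investigation.
        $totals['investigations']++;
        $seen['investigations'][$item] = true;
        // Only full-text file access also counts as a request.
        if ($event['type'] === 'file') {
            $totals['requests']++;
            $seen['requests'][$item] = true;
        }
    }
    return [
        'metric_investigations' => $totals['investigations'],
        'metric_investigations_unique' => count($seen['investigations']),
        'metric_requests' => $totals['requests'],
        'metric_requests_unique' => count($seen['requests']),
    ];
}

$metrics = countCounterMetrics([
    ['session' => 'a', 'submission_id' => 1, 'type' => 'abstract'],
    ['session' => 'a', 'submission_id' => 1, 'type' => 'file'],
    ['session' => 'a', 'submission_id' => 1, 'type' => 'file'],
    ['session' => 'b', 'submission_id' => 1, 'type' => 'abstract'],
]);
```

Once events are reduced to these four counts, the abstract/file distinction is gone, which is why a statement like "X PDF downloads from Germany" cannot be recovered from such a table.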
I don't think we should make too many assumptions about what people want. We should support not just high-level yearly reports but also things like individual curiosity, internal market research, etc. That may still be possible when partitioning the table by year. But in the stats UI we will want people to be able to specify any date range and get results back, regardless of whether the start/end dates span multiple years or not (e.g. April 3, 2018 to July 17, 2020). If this can all be abstracted into a stats service class that can be used without too much difficulty, then that's fine with me. My main concern is ensuring that we have a stats service class that provides an easy-to-use tool for getting whatever stats we have.
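One way a stats service could hide year partitioning from callers is to split the requested range into per-year segments and query each partition in turn. A minimal sketch, assuming hypothetical yearly partitions; the function name and return shape are made up for the example.

```php
<?php
/**
 * Split an inclusive date range into per-year segments so that each
 * (hypothetical) yearly partition can be queried separately.
 */
function splitRangeByYear(string $dateStart, string $dateEnd): array
{
    $start = new DateTimeImmutable($dateStart);
    $end = new DateTimeImmutable($dateEnd);
    if ($start > $end) {
        throw new InvalidArgumentException('Start date must not be after end date.');
    }
    $segments = [];
    $lastYear = (int) $end->format('Y');
    for ($year = (int) $start->format('Y'); $year <= $lastYear; $year++) {
        $yearStart = new DateTimeImmutable("{$year}-01-01");
        $yearEnd = new DateTimeImmutable("{$year}-12-31");
        // Clamp the segment to the requested range.
        $from = $start > $yearStart ? $start : $yearStart;
        $to = $end < $yearEnd ? $end : $yearEnd;
        $segments[] = [
            'year' => $year,
            'from' => $from->format('Y-m-d'),
            'to' => $to->format('Y-m-d'),
        ];
    }
    return $segments;
}

$segments = splitRangeByYear('2018-04-03', '2020-07-17');
```

For the example range this yields three segments (2018, 2019, 2020), whose results the service would then sum before returning them to the UI.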
I really think we should use separate tables instead of using
I don't think we should try to track geographic stats and institutional stats in the same table. It should be two tables:
Schema::create('metrics_submission_geo', function (Blueprint $table) {
    $table->bigInteger('load_id');
    $table->bigInteger('context_id');
    $table->bigInteger('submission_id');
    $table->bigInteger('country_id');
    $table->bigInteger('region_id');
    $table->bigInteger('city_id');
    $table->integer('day');
    $table->tinyInteger('month');
    $table->bigInteger('access_type');
    $table->bigInteger('metric_investigations');
    $table->bigInteger('metric_investigations_unique');
    $table->bigInteger('metric_requests');
    $table->bigInteger('metric_requests_unique');
    $table->foreign('context_id')->references('journal_id')->on('journals');
    $table->foreign('submission_id')->references('submission_id')->on('submissions');
    $table->foreign('country_id')->references('country_id')->on('metrics_countries');
    $table->foreign('region_id')->references('region_id')->on('metrics_regions');
    $table->foreign('city_id')->references('city_id')->on('metrics_cities');
});
Schema::create('metrics_submission_institutions', function (Blueprint $table) {
    $table->bigInteger('load_id');
    $table->bigInteger('context_id');
    $table->bigInteger('submission_id');
    $table->bigInteger('institution_id');
    $table->bigInteger('access_type');
    $table->bigInteger('metric_investigations');
    $table->bigInteger('metric_investigations_unique');
    $table->bigInteger('metric_requests');
    $table->bigInteger('metric_requests_unique');
    $table->foreign('context_id')->references('journal_id')->on('journals');
    $table->foreign('submission_id')->references('submission_id')->on('submissions');
    $table->foreign('institution_id')->references('institution_id')->on('institutions');
}); |
#6782 do not use installation migrations in the upgrade
pkp/pkp-lib#6782 do not use installation migrations in the upgrade
Final PRs:
pkp-lib: #8109
ui-library: pkp/ui-library#213
ojs: pkp/ojs#3465
omp: pkp/omp#1161
ops: pkp/ops#313
plugins/generic/lensGalley: asmecher/lensGalley#60
Fix PRs:
see #8123
see #8125
do not use installation migrations in upgrade,
fix CompileMonthlyMetrics job and move it to the pkp-lib:
pkp-lib: #8184
ojs: pkp/ojs#3501
omp: pkp/omp#1180
ops: pkp/ops#331
TO-DOs:
(OJS and OPS: Improve usage statistics handling in the background/code #6782 (comment), and
OMP: Improve usage statistics handling in the background/code #6782 (comment))
Change ViewReport plugin (see ViewReport: provide PDF, HTML, Other stats instead of stats for each artilce galley #7384)
Migrate privacy policy and the opportunity to opt out of logging to a new place in the code (was a block plugin within the usageStats plugin) -- not needed
Copy usage stats display and its setting to all themes (this will be done in a separate issue)
Some improvements are desired.
See also #4904 (comment).
Different log format:
a) Instead of, or in addition to, the URL it would be good to have something like:
contextID: submissionID: representationID: publicationID: fileID
b) Consider everything needed for institutional subscription statistics.
Additional data model that would at least support aggregation of the old usage stats data, so that they can be removed from the DB table metrics.