
Improve usage statistics handling in the background/code #6782

Closed
31 of 35 tasks
bozana opened this issue Feb 22, 2021 · 129 comments
bozana (Collaborator) commented Feb 22, 2021

Final PRs:
pkp-lib: #8109
ui-library: pkp/ui-library#213
ojs: pkp/ojs#3465
omp: pkp/omp#1161
ops: pkp/ops#313
plugins/generic/lensGalley: asmecher/lensGalley#60

Fix PRs:
see #8123
see #8125

Do not use installation migrations in the upgrade;
fix the CompileMonthlyMetrics job and move it to pkp-lib:
pkp-lib: #8184
ojs: pkp/ojs#3501
omp: pkp/omp#1180
ops: pkp/ops#331

TO-DOs:

Some improvements are desired.
See also #4904 (comment).

  1. Different log format:
    a) Instead of, or in addition to, the URL it would be good to have something like:
    contextID: submissionID: representationID: publicationID: fileID
    b) Consider everything needed for institutional subscription statistics.

  2. An additional data model that would at least support aggregation of the old usage stats data, so that it can be removed from the metrics DB table.

bozana added the Enhancement:1:Minor label Feb 22, 2021
bozana mentioned this issue Feb 22, 2021
bozana added this to the OJS/OMP/OPS 3.4 milestone Feb 22, 2021
bozana (Collaborator, Author) commented Mar 4, 2021

See also: #4904 (comment)

bozana self-assigned this May 2, 2021
bozana (Collaborator, Author) commented May 3, 2021

The thoughts/decisions on the new usage stats log format:

  • Entries in JSON format (instead of space-separated): this will probably need slightly more processing, but will make it easier to add new elements if needed.
  • The following elements are minimally needed:
issueId (only in OJS)
chapterId (only in OMP)
time
ip (anonymized)
canonicalURL
contextId
submissionId
representationId
assocType
assocId
fileType
userAgent 
country
region
city
institutionIds
version

time, ip and canonicalURL will be used for double-click filtering (COUNTER R5).
time, ip, submissionId and assocType + assocId will be used for unique items (COUNTER R5).
userAgent will be used for robot filtering (COUNTER R5).
contextId, submissionId and representationId are logged so that they do not have to be recalculated later when processing the log file; because we already have them, we can simply write them to the DB table. Later they will be used for statistics aggregation (by journal and submission). representationId is needed for the association with the file (currently, having only the file ID, we cannot determine the representation ID in OJS).
assocType and assocId identify the actual object that is being accessed/visited/used.
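For illustration only, a single usage event could then be written as one JSON line per request. The field names follow the list above; the values, the ASSOC_TYPE_* constant shown and the $usageEventLogFile path are invented placeholders, not the final implementation:

$entry = [
    'time' => '2021-05-03 10:30:21',
    'ip' => hash('sha256', '192.0.2.1'),       // anonymized, e.g. hashed
    'canonicalURL' => 'https://journal.example.org/article/view/123/456',
    'contextId' => 1,
    'issueId' => 11,                           // only in OJS
    'submissionId' => 123,
    'representationId' => 456,
    'assocType' => ASSOC_TYPE_SUBMISSION_FILE, // the object actually being used
    'assocId' => 789,
    'fileType' => 'pdf',
    'userAgent' => 'Mozilla/5.0 ...',
    'country' => 'DE',
    'region' => 'BE',
    'city' => 'Berlin',
    'institutionIds' => [12],
    'version' => '1.0.0',
];
// One JSON object per line, appended to the current day's usage event log file.
file_put_contents($usageEventLogFile, json_encode($entry) . PHP_EOL, FILE_APPEND | LOCK_EX);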

bozana (Collaborator, Author) commented May 3, 2021

@NateWr, @ctgraham and @asmecher, I have a question regarding the following columns in the DB table metrics:
assoc_object_type and assoc_object_id -- these hold ASSOC_TYPE_ISSUE and the issue ID, and are only used in OJS when visiting an article, article files, the issue TOC and an issue galley.
pkp_section_id -- this is the section ID in OJS and OPS, and the series ID in OMP.
These are currently only used in reports, not to generate statistics for those objects.
I am not sure if I should keep them:

  • It might be good to keep them, to be on the safe side while making all the other changes; it would also reduce the changes needed for the current reports.
  • It might be better to remove them and keep the metrics DB table as small as possible, because we do not actually aggregate statistics by them. That way we would not need to worry when, e.g., an article changes its section or issue. However, this would mean more changes/adaptations for the current reports.

Depending on what we decide, I would possibly include them in the log format (see above)...

asmecher (Member) commented May 3, 2021

JSON I/O in PHP is pretty lightning-fast; I don't foresee any issues there.

Do we want to leave room here for an "institution" field when that's available -- ideally (but probably rarely) a ROR?

NateWr (Contributor) commented May 3, 2021

I agree with this proposal. The current statistics service class that delivers the visual statistics uses assoc_type and assoc_id to filter by sections and I think this should be sufficient.

Do we want to leave room here for an "institution" field when that's available -- ideally (but probably rarely) a ROR?

Should we pack this into the metrics table? Or would it make sense to have a separate metrics_institutions table that tracked stats by journal/institution? I don't think we'll need/want all of the same level of detail for institutional metrics, like hits-by-issue or by hits-by-submission, will we?

ctgraham (Collaborator) commented May 3, 2021

COUNTER reporting would totally be interested in downloads by published item by institution (subscriber).

bozana (Collaborator, Author) commented May 3, 2021

Yes, the institution field will definitely come; it is my next step, and I will come back to you regarding it again :-)

So, would you agree that we remove assoc_object_type, assoc_object_id and pkp_section_id from the DB table metrics?

asmecher (Member) commented May 3, 2021

those are assoc_type_issue and issue id, only used in OJS, when visiting article, article files, issue toc and issue galley.

Hmm, maybe it would be clearer to just call it issue_id in OJS and log it accordingly? We don't have a use for this in the other apps, and when the assoc_id/assoc_type pairings don't have extra utility, best to avoid them, I think.

NateWr (Contributor) commented May 4, 2021

COUNTER reporting would totally be interested in downloads by published item by institution (subscriber).

Wouldn't this be a privacy concern? I thought that libraries were big on not tracking the specific material that someone is reading. It seems to me like individual resource tracking at the institutional level would frequently result in very low counts, which would make them prone to deanonymisation. In many cases, maybe only one person has visited a specific article from an institution.

Also, it's my understanding that institutional reporting exists to assess ROI. But what relationship does ROI have with article-level stats?

ctgraham (Collaborator) commented May 6, 2021

The "why track at the article level?" is interesting. I was coming at it from the protocol requirements perspective: COUNTER promises a certain set of reports, and institutions are expected to be interested in any of those reports, so the protocol provides for any of the reports to be filtered at the Customer / Consortia Members level. But practically, a customer is probably more interested in the Platform Report or Database Report or Title Report than they are in an Item Report. And, a metrics service is probably more interested in Items, irrespective of customers.

A counter-example (no pun intended) where a library would be interested in item-level access would be something like EZPAARSE, where proxy logs are post-processed to gather additional metadata for consultation events represented within the proxy log. This isn't related to COUNTER reports, but it illustrates that sometimes we library folks are interested in (anonymized) usage of very specific things (even digital things).

In terms of the risk of deanonymization, this would be the case if we had a record of who accessed a particular OJS instance, but not what they accessed, and then pulled down the COUNTER report and, because of low usage, could now guess at what they accessed. This seems like an unlikely scenario. What data source would I be using to get a record of a specific user accessing a specific OJS site, which isn't also capturing requested URIs?

NateWr (Contributor) commented May 10, 2021

This seems like an unlikely scenario.

I think you're right that I am deep into unlikely but worst-case scenarios. I think my concern is not so much on the institution's side -- the institution can probably track all activity on its network. I'm more concerned about OJS as a store of information on time, place and content.

A worst-case scenario I can imagine is in a country where it is dangerous to access certain information. An academic in that country wants to access such articles safely, and uses a proxy set up by a colleague on the Pitt network to read them. As part of an investigation, the country learns about the proxy and acquires the OJS institutional stats (by hacking on the OJS or Pitt side). By comparing records of the academic's requests to the proxy server (which the state has) with records of institutional stats, the state can infer that the academic is accessing those records.

That's probably pretty unlikely, and storing the stats at journal-level instead of article-level may not help much in such cases. But it illustrates how it can be very hard to prevent deanonymization whenever data about time, place and content is joined together.

We can probably weave our way through this in some sense by offering options at the level of institutional configuration. A journal could say "for this institution, track at the [instance/journal/article] level", and the institution could be required to request finer-grained stats if desired.

ctgraham (Collaborator) commented:

Interestingly, COUNTER is currently soliciting feedback from publishers and institutional users regarding reporting at the level of geographic subdivisions and institutional attribution of OA access.
https://www.projectcounter.org/counter-consultation/

bozana (Collaborator, Author) commented May 25, 2021

Hi all, I would like to discuss with you how we could/should improve the data model for usage statistics.

  1. As noted above, some columns (issue_id and section_id) are only used for easier reporting. I understand from the discussion above that we would like to keep them.
  2. COUNTER R5 does not cover tracking of the issue TOC and issue galleys, or the journal landing page, but I suppose we would still like to track those for our users and keep/consider those assoc_types?
    EDIT: I would suggest removing the tracking of the context landing page; see the next comment.
  3. COUNTER R5 no longer differentiates between file formats (i.e. PDF, HTML, Other). I suppose we would still like to track usage per format for our users and thus keep that column?
  4. COUNTER R5 introduces new metrics: Total_Item_Investigations, Unique_Item_Investigations, Total_Item_Requests and Unique_Item_Requests. Except for the Item_Report they are all meant to be per journal.
    Investigations count every click that displays information about an article, and every click that downloads its content: "Information can be an abstract, for example, or information about the author. Download clicks are those that download the full text of the article. Most of these downloads are in either HTML or PDF format, but downloads of the full text in any format are counted." This means that our tracked supplementary files would also be counted/considered here.
    Requests mean downloads of an article (item) as PDF, HTML or other formats.
    The total number is the number we currently have -- all clicks minus double clicks.
    For the unique number, repeated clicks on the same article during a session (by one and the same user within one hour) are not added to the count, i.e. all clicks within an hour count as 1 (see the sketch after this list).
    Thus, we would need an additional DB table that contains the unique numbers. For the unique numbers, maybe only consider assoc_type submission?
    There are also Unique_Title_Investigations and Unique_Title_Requests, where "A title would usually be something like a book or journal—the wrapper around items of content". I am not sure if we need those. @ctgraham, what do you think?
  5. COUNTER R5 differentiates between Access_Type = Controlled and Gold_OA. This is a column we would also need to add. Items that were closed but later became OA are currently considered Controlled. Thus, for OJS, we could perhaps assume everything is Controlled for subscription journals and everything is Gold_OA for OA journals -- because I am not sure we can know whether an item was/is/will always be OA. For OMP I would need to investigate this more.
  6. And finally we have the institution_id as described here: IP location and institution service #6895 (comment).
  7. EDIT: do we need the metric_type column? Earlier there were other metric types (e.g. timeView), but for a long time there has only been ojs/omp/ops::counter...
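A minimal sketch of the "unique" counting rule from point 4, assuming log entries have already been parsed into arrays with time, ip, userAgent and submissionId keys; the function name and the bucketing by clock hour are illustrative, not the actual implementation:

function countUniqueItemRequests(array $entries): array
{
    $seen = [];   // user + submission + hour combinations already counted
    $counts = []; // submissionId => Unique_Item_Requests

    foreach ($entries as $entry) {
        // All clicks by the same user on the same submission within one hour count as 1.
        $hour = date('Y-m-d H', strtotime($entry['time']));
        $key = $entry['ip'] . '|' . $entry['userAgent'] . '|' . $entry['submissionId'] . '|' . $hour;
        if (!isset($seen[$key])) {
            $seen[$key] = true;
            $counts[$entry['submissionId']] = ($counts[$entry['submissionId']] ?? 0) + 1;
        }
    }
    return $counts;
}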

So the first question would be how to add the additional information needed. I tend towards having two DB tables containing almost the same columns: one with the total numbers and one with the unique numbers (which would not consider assoc_type = article and issue files). The tables would have the additional columns institution_id and access_type.
@NateWr suggested separating the institution stats. How would that look?
We could possibly split the different assoc_types into separate tables, so we would for example have a table for issue + issue galley statistics, a table for submission + submission file statistics, and a table only for journal index page access (or we could keep that in the issue table). @asmecher, do you think that would bring any performance improvement?

Rather than, or in addition to, splitting the table(s) by assoc_type, institution or something else, here is what we should probably do in any case:
We could/should summarize the old statistics. COUNTER R5 reports contain only monthly statistics (and I believe nobody really needs stats per day, or at least not for longer than a few months or at most a year), so we could regularly summarize them per month. And for even older statistics we could perhaps summarize them per year? This way we would have additional DB table(s) for those summarized counts and would remove the entries from the table(s) that contain daily numbers.
Maybe users with a large amount of data in the current metrics table could tell us in what form they would still need the old data, i.e. whether a sum per whole year would be enough (e.g. @ctgraham or @ajnyga)?
If we summarize the statistics per month or even per year, there could be a problem if somebody for some reason wanted to reprocess an old log file -- they would need to reprocess the files of at least the whole month, maybe even the whole year?

I would very much love to hear your opinion on this!

(@mfelczak, maybe you would like to follow the discussion here, or I can ping you when the final decision is made, so that you know what changes to expect and what would be relevant for your current project...)

bozana (Collaborator, Author) commented May 25, 2021

So maybe we could have:
DB table metrics_total:
load_id, context_id, pkp_section_id, issue_id, submission_id, representation_id, assoc_type, assoc_id, day, month, file_type, country_id, region, city, institution_id, access_type (Controlled or Gold_OA), metric_type?, metric
(Based on assoc_type we would know whether it is Total_Item_Investigations or Total_Item_Requests).

DB table metrics_unique:
(would only consider the COUNTER metrics)
load_id, context_id, pkp_section_id, issue_id, assoc_type, assoc_id, day, month, country_id, region, city, institution_id, access_type (Controlled or Gold_OA), r5_metric_type (Unique_Item_Investigations or Unique_Item_Requests), metric
(assoc_type = submission: all clicks on a submission by a user within an hour would count as 1;
if we would like/need to consider assoc_type = journal: all submission clicks for a journal by a user within an hour would count as 1, and this would be Unique_Title_Investigations or Unique_Title_Requests)

DB table metrics_total_sum:
context_id, pkp_section_id, issue_id, submission_id, representation_id, assoc_type, assoc_id, month, file_type, country_id, region, city, institution_id, access_type, metric_type?, metric

DB table metrics_unique_sum:
context_id, pkp_section_id, issue_id, assoc_type, assoc_id, month, country_id, region, city, institution_id, access_type, r5_metric_type, metric

bozana (Collaborator, Author) commented May 26, 2021

I've just spoken with @NateWr and we came to a few other conclusions:

  1. We will keep the format (PDF, HTML and other) granularity for our users.
  2. Also important for our UI is the differentiation between primary and supplementary files (dependent files are ignored). (Supplementary files have assoc_type = ASSOC_TYPE_SUBMISSION_FILE_COUNTER_OTHER; I could possibly change this to ASSOC_TYPE_SUBMISSION_FILE_SUPP or similar. I.e. this information is/will also be available in the DB table metrics.)
  3. We would like to keep the daily numbers for our UI.
  4. Because (as noted in No.3) we cannot then consolidate the numbers, we would like to split the DB table metrics by assoc_type, if possible.
  5. We would like to track Geo and Institution only on the submission level.
  6. We would like to have an extra DB table for the COUNTER R5 specific metric types (Total_Item_Investigations, Unique_Item_Investigations, Total_Item_Requests and Unique_Item_Requests), definitely for the unique types, but maybe also for the total types, so that we can have the Geo and Institution columns/data here.
  7. Because COUNTER R5 only requires monthly numbers, we would have another DB table into which the numbers from No. 6 are consolidated; they would then be removed from the No. 6 table.

Considering the requirements above, I will try to make a suggestion for the DB tables. Coming soon...

bozana (Collaborator, Author) commented May 26, 2021

DB table metrics_context&issue:
load_id, context_id, assoc_type, assoc_id, day, month, file_type?, metric_type?, metric
(assoc_type = context, issue)

DB table metrics_submission(abstract&files):
load_id, context_id, pkp_section_id, issue_id (ojs), submission_id, representation_id, assoc_type, assoc_id, day, month, file_type, metric_type?, metric
(assoc_type = submission, submission file, submission supp file)

DB table counter_metrics_submission(+Geo)_daily:
load_id, context_id, submission_id, day, month, country_id, region, city, institution_id, access_type (Controlled or Gold_OA), r5_metric_type, metric
(r5_metric_type = Total_Item_Investigations, Unique_Item_Investigations, Total_Item_Requests and Unique_Item_Requests)

DB table counter_metrics_submission(+Geo)_monthly:
context_id, submission_id, month, country_id, region, city, institution_id, access_type (Controlled or Gold_OA), r5_metric_type, metric
(once the numbers are consolidated here, the entries from the _daily table will be removed)
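To make the _daily table above concrete, a minimal Laravel migration sketch could look like the following; the column types and lengths are guesses for illustration, and the names follow the proposal above:

use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

Schema::create('counter_metrics_submission_daily', function (Blueprint $table) {
    $table->string('load_id');          // name of the processed log file
    $table->bigInteger('context_id');
    $table->bigInteger('submission_id');
    $table->date('day');
    $table->string('month', 6);         // YYYYMM
    $table->string('country_id', 2)->nullable();
    $table->string('region')->nullable();
    $table->string('city')->nullable();
    $table->bigInteger('institution_id')->nullable();
    $table->string('access_type');      // 'Controlled' or 'Gold_OA'
    $table->string('r5_metric_type');   // e.g. 'Total_Item_Requests'
    $table->bigInteger('metric');
});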

The tables metrics_submission(abstract&files) and counter_metrics_submission(+Geo)_daily (for the total metrics) contain some of the same data in a way, but we assume that the Geo + institution differentiation would produce fewer rows this way, especially later, when consolidated into the _monthly table 🤔

I have to admit that I mostly have OJS in mind, so I will have to double-check it all for OMP and OPS, but I believe it would not change a lot...

Any further thoughts would be very much appreciated! Especially on the performance... So maybe @asmecher and/or @jonasraoni would have some further suggestions/thoughts/ideas/tips... ?
Thanks a lot!

NateWr (Contributor) commented May 26, 2021

Thanks @bozana! For Alec and Jonas, do you know if there are any SQL tricks we can use for date tracking? At the moment each row includes a day (YYYYMMDD) column and a month (YYYYMM) column. This makes it easy to sum up records by month. But is there any way to do this in SQL with a single DATE column? Would it be possible to store the date in a single column but still sum results monthly?
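One possible answer, as a sketch: with a single DATE column, monthly sums can be produced at query time by grouping on a derived year-month expression. The table and column names below are illustrative, and the expression is MySQL syntax (PostgreSQL could use to_char(date, 'YYYY-MM') instead):

use Illuminate\Support\Facades\DB;

$monthlyTotals = DB::table('metrics_submission')
    ->selectRaw("DATE_FORMAT(date, '%Y-%m') AS month, SUM(metric) AS total")
    ->groupByRaw("DATE_FORMAT(date, '%Y-%m')")
    ->orderBy('month')
    ->get();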

ajnyga (Collaborator) commented May 26, 2021

Thanks! I will read this through as soon as I can; starting from tomorrow I will be away for a week.

jonasraoni (Contributor) commented May 26, 2021

Hi @bozana! I think the best option for this case would be a time-series database, but as that's not feasible, here are some ideas =]

  • As we're not looking to increase the time granularity, we could replace the day column with a set of columns, from day1 to day31; this way we would be able to pack a whole month into a single record.
  • Partitioning (both MySQL and PostgreSQL support it) might bring a great performance boost. If, for some reason, that's not feasible, we could still partition the table manually by creating "metrics_2019", "metrics_2020", etc. (and a "metrics_tables" 👀).
  • Depending on how the statistics are used, perhaps archiving is more interesting. Then we could have an option to offload previous years into a backup file.
  • Check the feasibility of allowing the user to set up the granularity (per year/month/day).
  • Indirectly related to this issue: perhaps we can remove the flock from the script that saves the logs. It's debatable whether small append writes are guaranteed to be atomic by the OS... I didn't research it much, but if that's true, we could at least skip locking when the write size is below a threshold, e.g. if (writeSize < threshold)... Another possibility is to just skip waiting for the lock using the LOCK_NB modifier, which would cause some events to be missed (see the sketch after this comment).

Edit: About the logging/locking: https://bugs.php.net/bug.php?id=40897... And from this link, I saw another possibility (set the error_log output path to the usage log path, call error_log, then restore the previous path).
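As a sketch of the LOCK_NB idea above (the $usageEventLogFile path and $entry array are placeholders; the trade-off is that events arriving while the lock is busy are dropped):

$fp = fopen($usageEventLogFile, 'ab');
if ($fp !== false) {
    // Try to take an exclusive lock without blocking the visitor's request.
    if (flock($fp, LOCK_EX | LOCK_NB)) {
        fwrite($fp, json_encode($entry) . PHP_EOL);
        flock($fp, LOCK_UN);
    }
    // else: the lock was busy, so this event is skipped rather than waiting.
    fclose($fp);
}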

NateWr (Contributor) commented May 27, 2021

DB table metrics_context&issue:
load_id, context_id, assoc_type, assoc_id, day, month, file_type?, metric_type?, metric
(assoc_type = context, issue)

I think we should have separate tables. Then we can get rid of assoc_type and use just context_id or issue_id with foreign keys. I don't think we would need file_type here, would we?

DB table metrics_submission(abstract&files):
load_id, context_id, pkp_section_id, issue_id (ojs), submission_id, representation_id, assoc_type, assoc_id, day, month, file_type, metric_type?, metric
(assoc_type = submission, submission file, submission supp file)

Instead of using assoc_type/assoc_id let's rely on the existing columns that identify what they are (we can use foreign keys too). So by adding a single column, is_primary_file, we can use the value of submission_id or representation_id to detect if it's a file or abstract. (I also wonder if it would just be better to have separate tables for submissions and representations.)

I wonder if we should drop the context_id, pkp_section_id and issue_id columns. I'd be curious to see what the performance cost is of using a join to get stats for submissions in a context/section/id. I also think that if we include section/issue we should include category, but a single submission can be in multiple categories so that will be a problem (OPS uses categories more than sections). In terms of performance, in general I think we're better off getting performance by caching some common stats than trying to bake too much into the database tables.

access_type (Controlled or Gold_OA)

Do we need to differentiate this? I understand it's part of the COUNTER R5 spec, but I'm not sure it's a great fit for us. Most of our journals are not "Gold OA" but "Diamond OA". And where restrictions exist, they don't really match the description of "Controlled" that I read. I'm wondering if it would be more prudent for us to leave this out of our COUNTER reports.

DB table counter_metrics_submission(+Geo)_daily:
load_id, context_id, submission_id, day, month, country_id, region, city, institution_id, access_type (Controlled or Gold_OA), r5_metric_type, metric
(r5_metric_type = Total_Item_Investigations, Unique_Item_Investigations, Total_Item_Requests and Unique_Item_Requests)

The r5_metric_type column would result in 4x the number of rows, right? (One row for each metric type.) I think it would be better to either have different tables or to consider each metric type as a column, so there is only one row per submission per day.

Also, I think that institution_id ought to be separated out. If I understand correctly, we'd need a separate row to track the institutional stats. But since there's never going to be a row with data for both city and institution_id, I'd prefer to see it in its own table.

metric_type?

I think we could probably get rid of this in every table. I don't think we're likely to support an alternative metric type in the future, and if we do, we probably need a separate table for it anyway.

ctgraham (Collaborator) commented:

Do we need to differentiate this? I understand it's part of the COUNTER R5 spec, but I'm not sure it's a great fit for us. Most of our journals are not "Gold OA" but "Diamond OA". And where restrictions exist, they don't really match the description of "Controlled" that I read. I'm wondering if it would be more prudent for us to leave this out of our COUNTER reports.

The differentiation is Controlled = Subscription; Gold OA = Open Access. The naming of Platinum/Diamond and other nuances are not recognized by COUNTER.

C.f. https://www.projectcounter.org/wp-content/uploads/2020/02/Webinar-understanding-the-use-of-OPEN-ACCESS-content.pdf

There are two Access_Types
Controlled: means that at the time of the usage the content item was not open because it was behind a paywall and access was restricted to authorized users.
OA_Gold: means at the time of the usage the content item was available under an open access license - that is, immediately and permanently available as open access.

bozana (Collaborator, Author) commented May 28, 2021

Hi all,
Thanks a lot!
I like the idea of partitioning the table by year, and, if possible, archiving it depending on how the statistics are used. I believe that most users, in most use cases, only want to see the statistics of the last year or so, but I might be wrong? If we went this way I would tend not to split the table by assoc_type -- the entries per year should be quite manageable.
What do others think about that?

Otherwise, if we do split the table by assoc_type:
I tend towards stopping the usage tracking for assoc_type = context -- it is only the context landing page, and the catalog index page for OMP. I do not think users are much interested in those?
Then we could have the table as @NateWr suggested, only with issue_id. The file_type in this case is the issue galley file type. I don't know if this is relevant, but I believe we could possibly remove that as well.
I would like to keep the context_id column in each table, because this is definitely an 'always'-used column/distinction. I agree to remove the issue_id and pkp_section_id columns: they are currently only used in reports, to get the issue identification and section name, but we can also get this information from the submission. If we wanted to filter by them, we could get all submission IDs of the desired issue and section/category and filter the metrics table by those IDs.
I think we have to keep the access_type for the COUNTER reports.
Yes, maybe have a column for each r5_metric_type -- I have to double-check it for OMP -- four more could be needed (Total_Title_Investigations, Total_Title_Requests, Unique_Title_Investigations and Unique_Title_Requests -- Title = Book)...
@NateWr, I do not understand what you mean regarding institution_id 🤔
Yes, maybe remove metric_type.

@jonasraoni, thanks for all the ideas!
I read that flock is not needed for fwrite, see the third note here: https://www.php.net/manual/en/function.fwrite.php. So I also thought about removing it. I will take a look at the bug link you posted -- I did not have enough time yet...
We do need a row per day -- we need the stats per day. I think @NateWr meant something like having a format/function so that we are able to make a query like:
... where day is in month January ...
We could use day >= ... and day <= ... (we would maybe also need group by month and order by month), but I prefer to keep the month column :-)

@ctgraham, thanks a lot for following and helping with those COUNTER requirements! I am very happy that somebody else has an eye on it too! :-)

bozana (Collaborator, Author) commented May 28, 2021

Maybe a note for those who know the user requirements better: with the Geo and institution data only in the new COUNTER metrics table, we will not be able to combine abstract and file-type usage with Geo/institution data, nor issue usage with Geo/institution data; e.g. we will not be able to say something like "There were X PDF downloads from Germany/University of Blabla". We would be able to say: "There were X total/unique investigations and Y total/unique requests from Germany/University of Blabla".
This should be sufficient, right?

NateWr (Contributor) commented May 31, 2021

I like the idea of partitioning the table by year. Also if possible archiving it depending on the statistics usage. I believe that most users only/in most use cases want to see the statistics of the last year or so, but I might be wrong?

I don't think we should make too many assumptions about what people want. We should support not just high-level yearly reports but also things like individual curiosity, internal market research, etc.

That may still be possible when partitioning the table by year. But in the stats UI we will want people to be able to specify any date range and get results back, regardless of whether the start/end dates span multiple years or not (eg - April 3 2018-July 17 2020). If this can all be abstracted into a stats service class that can be used without too much difficulty, then that's fine with me. My main concern is ensuring that we have a stats service class that provides an easy-to-use tool for getting whatever stats we have.

I would tend not to separate the table by assoc_type

I really think we should use separate tables instead of using assoc_type unless there's a significant improvement in performance or data storage size by not doing so. The main benefit of separating by assoc_type is that we can make use of foreign keys, which makes our database relationships more reliable and easier to understand.

@NateWr, I do not understand what you mean regarding institution_id 🤔

I don't think we should try to track geographic stats and institutional stats in the same table. It should be two tables:

use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Geographic stats: daily metrics per submission and location.
Schema::create('metrics_submission_geo', function (Blueprint $table) {
    $table->bigInteger('load_id');
    $table->bigInteger('context_id');
    $table->bigInteger('submission_id');
    $table->bigInteger('country_id');
    $table->bigInteger('region_id');
    $table->bigInteger('city_id');
    $table->integer('day');
    $table->tinyInteger('month');
    $table->bigInteger('access_type');
    $table->bigInteger('metric_investigations');
    $table->bigInteger('metric_investigations_unique');
    $table->bigInteger('metric_requests');
    $table->bigInteger('metric_requests_unique');

    $table->foreign('context_id')->references('journal_id')->on('journals');
    $table->foreign('submission_id')->references('submission_id')->on('submissions');
    $table->foreign('country_id')->references('country_id')->on('metrics_countries');
    $table->foreign('region_id')->references('region_id')->on('metrics_regions');
    $table->foreign('city_id')->references('city_id')->on('metrics_cities');
});

// Institutional stats: metrics per submission and subscribing institution.
Schema::create('metrics_submission_institutions', function (Blueprint $table) {
    $table->bigInteger('load_id');
    $table->bigInteger('context_id');
    $table->bigInteger('submission_id');
    $table->bigInteger('institution_id');
    $table->bigInteger('access_type');
    $table->bigInteger('metric_investigations');
    $table->bigInteger('metric_investigations_unique');
    $table->bigInteger('metric_requests');
    $table->bigInteger('metric_requests_unique');

    $table->foreign('context_id')->references('journal_id')->on('journals');
    $table->foreign('submission_id')->references('submission_id')->on('submissions');
    $table->foreign('institution_id')->references('institution_id')->on('institutions');
});

bozana added a commit to bozana/pkp-lib that referenced this issue Aug 19, 2022
bozana added a commit to bozana/ojs that referenced this issue Aug 19, 2022
bozana added a commit to bozana/omp that referenced this issue Aug 19, 2022
bozana added a commit to bozana/ops that referenced this issue Aug 19, 2022
bozana added a commit to bozana/pkp-lib that referenced this issue Aug 22, 2022
bozana added a commit to bozana/pkp-lib that referenced this issue Aug 22, 2022
bozana added a commit to bozana/ojs that referenced this issue Aug 22, 2022
bozana added a commit to bozana/omp that referenced this issue Aug 22, 2022
bozana added a commit to bozana/ops that referenced this issue Aug 22, 2022
bozana added a commit that referenced this issue Aug 22, 2022
#6782 do not use installation migrations in the upgrade
bozana added a commit to pkp/ojs that referenced this issue Aug 22, 2022
pkp/pkp-lib#6782 do not use installation migrations in the upgrade
bozana added a commit to pkp/ops that referenced this issue Aug 22, 2022
pkp/pkp-lib#6782 do not use installation migrations in the upgrade
bozana added a commit to pkp/omp that referenced this issue Aug 22, 2022
pkp/pkp-lib#6782 do not use installation migrations in the upgrade
bozana closed this as completed Aug 22, 2022
Repository owner moved this from Todo to Done in Statistics Aug 22, 2022