[GIP] New analytics module #11

Open
jeanpommier opened this issue Jan 2, 2025 · 0 comments
Labels
2 - In review This proposal is currently reviewed / discussed by the community GIP

jeanpommier commented Jan 2, 2025

Who ?

Jean Pommier ([email protected]) with funding from geo2france and MEL.

Target Module

This is mostly an independent module. Its main interaction with the existing modules will be to collect usage data.

As a first step, it should mostly affect:

  • SP / gateway will need to be configured to write the access logs, which the new analytics module will use to gather usage information
  • The applicative PostgreSQL database, possibly: we plan to use TimescaleDB (a PostgreSQL extension) to store the usage data. Two options will be proposed:
    • store the data in the applicative PostgreSQL database, which requires installing the TimescaleDB extension
    • store the data in a separate PostgreSQL/TimescaleDB database

In the long term, it will probably affect all app modules (all the modules that we want to track), because access logs alone are not a perfect solution. We will support implementing specific functionality in each app to log meaningful operations, and we are considering leveraging OpenTelemetry for this.

Why ?

  • The current analytics module is limited to OGC requests.
  • The current analytics module won't be supported by the gateway
  • Visualisation is hardcoded (custom graphs cannot be added)

Since the current module is not supported with the Gateway (and won't be), replacing it has already been discussed, including at geOcom 2023 and 2024.

What ?

@MaelREBOUX and @jeanpommier have done some exploratory work. This GIP is the continuation of this work.

The aim of this GIP is to provide a new analytics component that not only replaces the old module but also extends it: it should cover all apps and offer configurable dashboards.

How ?

I see 3 quite independent parts in this module:

  • collect the usage data:
    • A generic functionality will be provided by parsing the access logs of the SP/Gateway. I'd like this generic feature to be easy to customize, because every platform manager may have custom apps to support. Consequently, it needs to be accessible to platform managers, and Python code seems a good candidate. Some exploratory work has already been done at https://github.com/georchestra/analytics: a Python module that can run as a cron task, on a regular basis, to process the access logs.
    • Since access logs will not capture all the interesting information, an additional capability will be provided so that each app can push its own specific usage data, a priori relying on OpenTelemetry. This will require some dedicated code in each app that enables it.
    • OpenTelemetry will also be used to provide a file-less way to get the access logs (some platforms won't accept using file-based access logs)
  • Store the usage data: we had some discussions about the best storage solution. TimescaleDB seems to stand out as an interesting option
    • it is a PostgreSQL extension. We are already familiar with PostgreSQL, and we can even use the applicative database of geOrchestra, provided we install the extension
    • TimescaleDB can manage a very high number of records thanks to some dedicated functionalities: automatic partitioning of the tables, aggregation, retention time and time buckets. A working configuration will be provided, but any platform manager will be able to adjust it if necessary, which provides a lot of flexibility
  • Analyze the usage data: we will use the new/incoming dataviz module, Superset (see [GIP] Integrate Apache Superset (dashboarding software) into geOrchestra #10). Here too, a base dashboard configuration will be provided for standard visualization of the usage data, but any platform admin will be able to customize it and add their own graphs.
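
To make the log-parsing step more concrete, here is a minimal sketch in Python. It is not the code from the georchestra/analytics repository. The log layout (user name in the third field, roles in a trailing quoted field), the `parse_line` function and the field names of the returned record are all assumptions made for illustration; the actual SP/Gateway access log format may well differ.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit, parse_qs

# Hypothetical log layout, loosely based on the common log format, with the
# user name in the third field and the roles in a trailing quoted field.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "(?P<roles>[^"]*)"'
)


def parse_line(line):
    """Turn one access log line into a usage record, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    url = urlsplit(match["url"])
    # OGC parameter names are case-insensitive, so normalise the keys.
    params = {key.lower(): values[0] for key, values in parse_qs(url.query).items()}
    path_parts = [part for part in url.path.split("/") if part]
    return {
        "ts": datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z"),
        "user": None if match["user"] == "-" else match["user"],
        "roles": [r for r in match["roles"].split(",") if r and r != "-"],
        # Assumption: the first path segment identifies the app (e.g. "geoserver").
        "app": path_parts[0] if path_parts else "",
        "status": int(match["status"]),
        "ogc_service": params.get("service", "").upper() or None,
        "ogc_request": params.get("request"),
        "layers": params.get("layers"),
    }


SAMPLE = (
    '192.0.2.1 - jdoe [02/Jan/2025:10:15:30 +0100] '
    '"GET /geoserver/wms?SERVICE=WMS&REQUEST=GetMap&LAYERS=roads HTTP/1.1" '
    '200 5120 "ROLE_USER,ROLE_ADMIN"'
)
record = parse_line(SAMPLE)
```

Keeping the pattern and the record layout in plain Python is what would let a platform manager adapt the parsing to custom apps, for example by adding another rule for a non-OGC application.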

Any potential pitfalls and ways to circumvent them ?

  • Some platforms won't accept using file-based access logs. We will also provide the possibility to expose the access logs via OpenTelemetry, at least for the gateway.
  • Storing usage data might lead to a bloated database. TimescaleDB provides features such as aggregation and retention time that should limit this, provided we configure them properly. It is a compromise between how much information we want to retain and how much space is acceptable to use. It will probably also be better to use a dedicated DB instead of the applicative DB, to avoid bloating the app DB, to compartmentalize, and to simplify backups. We will also try to filter the data coming into the DB and store only what contains relevant information.
  • SP and Gateway don't provide all the necessary information in their access logs: user name and roles are missing. There is already a pending PR solving that for the gateway, and I have checked that we can get similar results for the SP.
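
As an illustration of the aggregation and retention trade-off mentioned above, the following sketch builds the SQL statements that a setup script might send to TimescaleDB. The helper `timescale_setup_sql`, the table name `usage_events`, its `ts` and `app` columns, and the default bucket and retention values are hypothetical. `create_hypertable`, `time_bucket`, `add_retention_policy` and the `timescaledb.continuous` view option are existing TimescaleDB features, but the exact statements should be checked against the TimescaleDB version in use.

```python
def timescale_setup_sql(table="usage_events", time_column="ts",
                        bucket="1 hour", retention="90 days"):
    """Return SQL statements for a hypothetical usage table.

    The values are interpolated directly into the SQL text, so this helper is
    only suitable for trusted, admin-provided configuration values.
    """
    view = f"{table}_by_bucket"
    return [
        "CREATE EXTENSION IF NOT EXISTS timescaledb;",
        # Turn the plain table into a hypertable partitioned on the time column.
        f"SELECT create_hypertable('{table}', '{time_column}');",
        # Continuous aggregate: hits per app and per time bucket.
        f"CREATE MATERIALIZED VIEW {view} WITH (timescaledb.continuous) AS\n"
        f"SELECT time_bucket('{bucket}', {time_column}) AS bucket,\n"
        f"       app, count(*) AS hits\n"
        f"FROM {table}\n"
        f"GROUP BY bucket, app;",
        # Drop raw rows older than the retention interval.
        f"SELECT add_retention_policy('{table}', INTERVAL '{retention}');",
    ]


statements = timescale_setup_sql(retention="30 days")
```

With such a setup, the raw records can be dropped after a short period while the much smaller aggregated view remains available for the dashboards. A refresh policy for the continuous aggregate would also be needed in practice and is left out here.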

When ?

I plan to release a first version by April 2025. It should be at least iso-functional with the old analytics module (support for OGC usage stats).

State of the vote:

PSC members vote
Fabrice Phung
François Van Der Biest
Pierre Mauduit
Landry Breuil
Stéphane Mével-Viannay
Maël Reboux
Pierre Jégo
Jean Pommier
Catherine Piton-Morales
@jeanpommier jeanpommier added GIP 1 - Pending The author is working on the GIP proposal labels Jan 2, 2025
@jeanpommier jeanpommier changed the title [GIP] DRAFT: New analytics module [GIP] New analytics module Jan 2, 2025
@jeanpommier jeanpommier added 2 - In review This proposal is currently reviewed / discussed by the community and removed 1 - Pending The author is working on the GIP proposal labels Jan 2, 2025