docs: add Freshdesk data pipeline and catalog docs
Add docs for the new Freshdesk dataset.
patheard committed Jan 20, 2025
1 parent 6d3aea7 commit d197fb6
Showing 2 changed files with 75 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/data/pipelines/README.md
@@ -1,6 +1,7 @@
# Data pipelines
Provides details on the automated processes that import and transform the data sources that are part of the data lake.

- [Platform / Support / Freshdesk](./platform/support/freshdesk.md)
- [Operations / AWS / Cost and Usage Report](./operations/aws/cost-and-usage-report.md)

:page_facing_up: When adding a new pipeline, please use [template.md](./template.md) as a starting point.
74 changes: 74 additions & 0 deletions docs/data/pipelines/platform/support/freshdesk.md
@@ -0,0 +1,74 @@
# Platform / Support / Freshdesk
## Description
The Freshdesk dataset provides information on user support tickets in [Parquet format](https://parquet.apache.org/). All user-entered information and personally identifiable information (PII) have been removed from the dataset. The data is partitioned by month and updated daily.

This data pipeline creates the Glue data catalog table `platform_support_freshdesk` in the `platform_support_production` database. It can be queried in Superset as follows:

```sql
SELECT
*
FROM
"platform_support_production"."platform_support_freshdesk"
LIMIT 10;
```
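
Because the data is partitioned by month, restricting a query to a single partition generally reduces the amount of data scanned. The following is a minimal sketch of building such a query in Python. It assumes the Hive-style `month=YYYY-MM` prefix in the S3 layout is exposed as a string partition column named `month`, which this page does not confirm; `build_month_query` is a hypothetical helper, not part of the data lake code.

```python
import re

# Fully qualified table name, quoted as in the Superset example above.
TABLE = '"platform_support_production"."platform_support_freshdesk"'


def build_month_query(month: str, limit: int = 10) -> str:
    """Return a SQL query restricted to one monthly partition.

    Assumption: the partition column is named "month" and holds
    YYYY-MM strings.
    """
    # Only interpolate a well-formed YYYY-MM value into the SQL text.
    if not re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", month):
        raise ValueError(f"expected YYYY-MM, got {month!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"SELECT * FROM {TABLE} WHERE \"month\" = '{month}' LIMIT {int(limit)};"


print(build_month_query("2025-01"))
```

The generated statement can be pasted into Superset in place of the unfiltered query.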

---

[:information_source: View the data catalog](../../../catalog/platform/support/freshdesk.md)

## Data pipeline
A high-level view is shown below, with more details about each step following the diagram.

```mermaid
graph TD
%% Source Systems
Lambda["`**Lambda (Daily)**<br/>Export Freshdesk tickets`"]
%% Storage
RawS3["`**S3 Bucket (Raw)**<br/>cds-data-lake-raw-production`"]
TransS3["`**S3 Bucket (Transformed)**<br/>cds-data-lake-transformed-production`"]
%% Processing
Crawlers["Crawlers (Monthly)"]
CatalogRaw["`**Data Catalog (Raw)**<br/>platform_support_freshdesk_raw`"]
CatalogTransformed["`**Data Catalog (Transformed)**<br/>platform_support_freshdesk`"]
ETL["ETL Job (Daily)"]
%% Flow
subgraph datalake[Data Lake account]
Lambda --> |YYYY-MM-DD.json|RawS3
RawS3 --> Crawlers
Crawlers --> |Updates Schema|CatalogRaw
CatalogRaw --> ETL
ETL --> TransS3
ETL --> CatalogTransformed
end
```

### Source data
The [Freshdesk data export](https://github.com/cds-snc/data-lake/tree/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/export/platform/support/freshdesk) is run in the `DataLake-Production` account by the [`platform-support-freshdesk-export` Lambda function](https://github.com/cds-snc/data-lake/tree/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/export/platform/support/freshdesk), which is triggered each day. On each run, it saves any tickets updated during the previous day to the Raw S3 bucket in a `YYYY-MM-DD.json` file:

```
cds-data-lake-raw-production/platform/support/freshdesk/month=YYYY-MM/YYYY-MM-DD.json
```
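
The path pattern above can be sketched in Python as follows. This is only an illustration of the naming scheme: `raw_export_path` is a hypothetical helper, and the sketch does not assume whether the date in the file name is the run date or the previous day.

```python
from datetime import date

RAW_BUCKET = "cds-data-lake-raw-production"
RAW_PREFIX = "platform/support/freshdesk"


def raw_export_path(export_date: date) -> str:
    """Build the bucket-qualified path of the export file for one date."""
    month = export_date.strftime("%Y-%m")  # partition value, YYYY-MM
    day = export_date.isoformat()          # file name stem, YYYY-MM-DD
    return f"{RAW_BUCKET}/{RAW_PREFIX}/month={month}/{day}.json"


print(raw_export_path(date(2025, 1, 20)))
# cds-data-lake-raw-production/platform/support/freshdesk/month=2025-01/2025-01-20.json
```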

### Crawlers
On the first of each month, an AWS Glue crawler runs in the `DataLake-Production` AWS account to identify schema changes and update the Glue data catalog:

- [Platform / Support / Freshdesk](https://github.com/cds-snc/data-lake/blob/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/glue/crawlers.tf#L49-L79)

This crawler creates and manages the following data catalog table in the [`platform_support_production_raw` database](https://github.com/cds-snc/data-lake/blob/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/glue/databases.tf#L11-L14):

- `platform_support_freshdesk`: Freshdesk ticket data with no PII or user-entered information.
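
Since the crawler runs only on the first of each month, a schema change in the raw export would likely not be reflected in the raw catalog table until the next run. A minimal sketch for working out that date is shown below; `next_crawler_run` is a hypothetical helper, and the time of day of the run is not covered because this page does not state it.

```python
from datetime import date


def next_crawler_run(today: date) -> date:
    """Return the next first-of-month date on or after `today`."""
    if today.day == 1:
        return today
    # After December, roll over to January of the following year.
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


print(next_crawler_run(date(2025, 1, 20)))  # 2025-02-01
```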

### Extract, Transform and Load (ETL) Jobs

Each day, the [`Platform / Support / Freshdesk` Glue ETL job](https://github.com/cds-snc/data-lake/blob/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/glue/etl.tf#L39-L108) runs, updating existing tickets and adding new ones. The resulting data is saved in the data lake's Transformed S3 bucket, `cds-data-lake-transformed-production`:

```
cds-data-lake-transformed-production/platform/support/freshdesk/month=YYYY-MM/*.parquet
```
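
The "update existing, add new" behaviour described above is an upsert. The sketch below illustrates the idea in plain Python and is not the Glue job's actual implementation. The field names `id`, `status` and `updated_at` are assumptions made for illustration, as is the rule that the most recently updated row wins.

```python
from typing import Dict, Iterable, List

Ticket = Dict[str, str]


def upsert_tickets(existing: Iterable[Ticket], incoming: Iterable[Ticket]) -> List[Ticket]:
    """Merge tickets by `id`, keeping the row with the latest `updated_at`."""
    merged: Dict[str, Ticket] = {}
    for ticket in list(existing) + list(incoming):
        current = merged.get(ticket["id"])
        # ISO 8601 timestamps in one format and timezone sort correctly as strings.
        if current is None or ticket["updated_at"] >= current["updated_at"]:
            merged[ticket["id"]] = ticket
    return list(merged.values())


existing = [
    {"id": "1", "status": "Open", "updated_at": "2025-01-18T10:00:00Z"},
    {"id": "2", "status": "Open", "updated_at": "2025-01-18T11:00:00Z"},
]
incoming = [
    {"id": "1", "status": "Closed", "updated_at": "2025-01-19T09:00:00Z"},
    {"id": "3", "status": "Open", "updated_at": "2025-01-19T12:00:00Z"},
]

result = upsert_tickets(existing, incoming)
print([(t["id"], t["status"]) for t in result])
```

In this example ticket `1` is replaced by its newer version, ticket `2` is kept unchanged, and ticket `3` is added.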

Additionally, a data catalog table is created in the [`platform_support_production` database](https://github.com/cds-snc/data-lake/blob/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/glue/databases.tf#L6-L9):

- `platform_support_freshdesk`: Freshdesk ticket data with no PII or user-entered information.
