diff --git a/docs/data/pipelines/README.md b/docs/data/pipelines/README.md
index a8c82f9..5492bd3 100644
--- a/docs/data/pipelines/README.md
+++ b/docs/data/pipelines/README.md
@@ -1,6 +1,7 @@
 # Data pipelines
 Provides details on the automated processes that import and transform the data sources that are part of the data lake.
 
+- [Platform / Support / Freshdesk](./platform/support/freshdesk.md)
 - [Operations / AWS / Cost and Usage Report](./operations/aws/cost-and-usage-report.md)
 
 :page_facing_up: When adding a new pipeline, please use [template.md](./template.md) as a starting point.
\ No newline at end of file
diff --git a/docs/data/pipelines/platform/support/freshdesk.md b/docs/data/pipelines/platform/support/freshdesk.md
new file mode 100644
index 0000000..b074aba
--- /dev/null
+++ b/docs/data/pipelines/platform/support/freshdesk.md
@@ -0,0 +1,74 @@
+# Platform / Support / Freshdesk
+## Description
+The Freshdesk dataset provides information on user support tickets in [Parquet format](https://parquet.apache.org/). All user-entered information and personally identifiable information (PII) has been removed from the dataset. The data is partitioned by month and updated daily.
+
+This data pipeline creates the Glue data catalog table `platform_support_freshdesk` in the `platform_support_production` database. It can be queried in Superset as follows:
+
+```sql
+SELECT
+    *
+FROM
+    "platform_support_production"."platform_support_freshdesk"
+LIMIT 10;
+```
+
+---
+
+[:information_source: View the data catalog](../../../catalog/platform/support/freshdesk.md)
+
+## Data pipeline
+A high-level view is shown below, with more details about each step following the diagram.
+
+```mermaid
+graph TD
+    %% Source Systems
+    Lambda["`**Lambda (Daily)**
Export Freshdesk tickets`"]
+
+    %% Storage
+    RawS3["`**S3 Bucket (Raw)**
cds-data-lake-raw-production`"]
+    TransS3["`**S3 Bucket (Transformed)**
cds-data-lake-transformed-production`"]
+
+    %% Processing
+    Crawlers["Crawlers (Monthly)"]
+    CatalogRaw["`**Data Catalog (Raw)**
platform_support_freshdesk_raw`"]
+    CatalogTransformed["`**Data Catalog (Transformed)**
platform_support_freshdesk`"]
+    ETL["ETL Job (Daily)"]
+
+    %% Flow
+    subgraph datalake[Data Lake account]
+        Lambda --> |YYYY-MM-DD.json|RawS3
+        RawS3 --> Crawlers
+        Crawlers --> |Updates Schema|CatalogRaw
+        CatalogRaw --> ETL
+        ETL --> TransS3
+        ETL --> CatalogTransformed
+    end
+```
+
+### Source data
+The [Freshdesk data export](https://github.com/cds-snc/data-lake/tree/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/export/platform/support/freshdesk) is run in the `DataLake-Production` account by the [`platform-support-freshdesk-export` Lambda function](https://github.com/cds-snc/data-lake/tree/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/export/platform/support/freshdesk), which is triggered each day. On each run, it saves any tickets updated during the previous day to the Raw S3 bucket in a `YYYY-MM-DD.json` file:
+
+```
+cds-data-lake-raw-production/platform/support/freshdesk/month=YYYY-MM/YYYY-MM-DD.json
+```
+
+### Crawlers
+On the first of each month, an AWS Glue crawler runs in the `DataLake-Production` AWS account to identify schema changes and update the Glue data catalog:
+
+- [Platform / Support / Freshdesk](https://github.com/cds-snc/data-lake/blob/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/glue/crawlers.tf#L49-L79)
+
+This crawler creates and manages the following data catalog table in the [`platform_support_production_raw` database](https://github.com/cds-snc/data-lake/blob/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/glue/databases.tf#L11-L14):
+
+- `platform_support_freshdesk`: Freshdesk ticket data with no PII or user-entered information.
+
+### Extract, Transform and Load (ETL) Jobs
+
+Each day, the [`Platform / Support / Freshdesk` Glue ETL job](https://github.com/cds-snc/data-lake/blob/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/glue/etl.tf#L39-L108) runs, updating existing tickets and adding new ones.
The resulting data is saved in the data lake's Transformed `cds-data-lake-transformed-production` S3 bucket:
+
+```
+cds-data-lake-transformed-production/platform/support/freshdesk/month=YYYY-MM/*.parquet
+```
+
+Additionally, a data catalog table is created in the [`platform_support_production` database](https://github.com/cds-snc/data-lake/blob/6d3aea78d5d5a47d318ca66d37f0d4af6972fca4/terragrunt/aws/glue/databases.tf#L6-L9):
+
+- `platform_support_freshdesk`: Freshdesk ticket data with no PII or user-entered information.