docs: add data catalog and fix links
patheard committed Jan 21, 2025
1 parent d197fb6 commit af1ab2a
Showing 6 changed files with 83 additions and 4 deletions.
1 change: 1 addition & 0 deletions docs/data/catalog/README.md
@@ -1,6 +1,7 @@
# Data catalog
Provides an inventory that contains metadata, descriptions, and technical details about all data sources within the data lake.

- [Platform / Support / Freshdesk](./platform/support/freshdesk.md)
- [Operations / AWS / Cost and Usage Report](./operations/aws/cost-and-usage-report.md)

:page_facing_up: When adding a new pipeline, please use [template.md](./template.md) as a starting point.
2 changes: 1 addition & 1 deletion docs/data/catalog/operations/aws/cost-and-usage-report.md
@@ -4,7 +4,7 @@ Dataset describing how much was spent on Amazon Web Services (AWS) by CDS.

Each row describes the cost of using a particular AWS service (i.e., a line item) within a billing period.

-This dataset is represented in [Superset](https://superset.cdssandbox.xyz/) as the Physical dataset [`cost_usage_report_by_account`](https://superset.cdssandbox.xyz/explore/?datasource_type=table&datasource_id=68). All of the Virtual datasets in the "Operations / AWS / Cost and Usage" group are derived from it.
+This dataset is represented in [Superset](https://superset.cds-snc.ca/) as the Physical dataset [`cost_usage_report_by_account`](https://superset.cds-snc.ca/explore/?datasource_type=table&datasource_id=68). All of the Virtual datasets in the "Operations / AWS / Cost and Usage" group are derived from it.

**Keywords:** AWS, Amazon, cost, usage, fees

@@ -3,7 +3,7 @@ This query will return the first ten rows of the "AWS Cost and Usage Report"
dataset. To run it, open the SQL Lab in Superset and cut and paste the whole
query into the query window.
-SQL Lab: https://superset.cdssandbox.xyz/sqllab/
+SQL Lab: https://superset.cds-snc.ca/sqllab/
The example dataset is provided as a query instead of a CSV to limit
visibility to only those with Superset access.
@@ -13,4 +13,4 @@ SELECT
*
FROM
operations_aws_production.cost_usage_report_by_account
-LIMIT 10;
+LIMIT 10;
16 changes: 16 additions & 0 deletions docs/data/catalog/platform/support/examples/freshdesk.sql
@@ -0,0 +1,16 @@
/*
This query will return the first ten rows of the "Freshdesk"
dataset. To run it, open the SQL Lab in Superset and cut and paste the whole
query into the query window.
SQL Lab: https://superset.cds-snc.ca/sqllab/
The example dataset is provided as a query instead of a CSV to limit
visibility to only those with Superset access.
*/

SELECT
*
FROM
platform_support_production.platform_support_freshdesk
LIMIT 10;
62 changes: 62 additions & 0 deletions docs/data/catalog/platform/support/freshdesk.md
@@ -0,0 +1,62 @@
# Platform / Support / Freshdesk

Dataset of Freshdesk support tickets raised by users of CDS products.

Each row represents one Freshdesk ticket. Personally identifiable information (PII) and user-entered content are excluded.

This dataset is represented in [Superset](https://superset.cds-snc.ca/) as the Physical dataset `platform_support_freshdesk`.

**Keywords:** Platform, Freshdesk, support, tickets

---

[:information_source: View the data pipeline](../../../pipelines/platform/support/freshdesk.md)

## Provenance

This dataset is extracted daily using the Freshdesk API. Each day, the extract process downloads all tickets that were created or updated during the previous day. These tickets are then merged with the existing tickets in the dataset.

More documentation on the pipeline can be found [here](../../../pipelines/platform/support/freshdesk.md).

* **Updated:** Daily
* **Steward:** Platform Core Services
* **Contact:** [Pat Heard](mailto:[email protected])
* **Location:** s3://cds-data-lake-transformed-production/platform/support/freshdesk/month=YYYY-MM/*.parquet
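
The daily merge described in this section can be pictured as an "upsert keyed on ticket `id`". The query below is only a sketch of that idea, not the pipeline's actual implementation: the table names `existing_tickets` and `daily_extract` are hypothetical, and it assumes that the row with the latest `updated_at` should win when a ticket appears in both.

```sql
-- Hypothetical tables: existing_tickets (current dataset) and
-- daily_extract (tickets created or updated the previous day).
WITH combined AS (
    SELECT * FROM existing_tickets
    UNION ALL
    SELECT * FROM daily_extract
),
ranked AS (
    SELECT
        combined.*,
        -- Rank the versions of each ticket, newest first.
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_rank
    FROM combined
)
-- Keep only the newest version of each ticket.
-- The helper column row_rank is included in the output.
SELECT *
FROM ranked
WHERE row_rank = 1;
```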

## Fields

Almost all fields are sourced directly from Freshdesk's [Tickets](https://developers.freshdesk.com/api/#tickets), [Contacts](https://developers.freshdesk.com/api/#contacts), and [Conversations](https://developers.freshdesk.com/api/#conversations) API endpoints.

A [query to return example data](examples/freshdesk.sql) has also been provided.

Here's a descriptive list of the Freshdesk ticket fields:

* `id` (bigint) - Unique identifier for each support ticket.
* `status` (bigint) - Numerical code representing the ticket's current status.
* `status_label` (string) - Human-readable label for the ticket status (e.g., "Open", "Pending", "Resolved").
* `priority` (bigint) - Numerical code indicating the ticket's priority level.
* `priority_label` (string) - Human-readable label for the priority level (e.g., "Low", "Medium", "High", "Urgent").
* `source` (bigint) - Numerical code indicating how the ticket was created.
* `source_label` (string) - Human-readable label for the ticket source (e.g., "Email", "Phone", "Portal", "Chat").
* `created_at` (timestamp) - Date and time when the ticket was initially created.
* `updated_at` (timestamp) - Date and time of the most recent update to the ticket.
* `due_by` (timestamp) - Deadline for ticket resolution based on support policies.
* `fr_due_by` (timestamp) - First response due time based on support policies.
* `is_escalated` (boolean) - Indicates whether the ticket has been escalated to a higher support tier.
* `tags` (array<string>) - List of labels or categories assigned to the ticket for classification.
* `spam` (boolean) - Indicates whether the ticket has been marked as spam.
* `requester_email_suffix` (string) - Domain portion of the requester's email address. For users outside the Government of Canada, the value is `external`.
* `type` (string) - Classification of the ticket type.
* `product_id` (bigint) - Unique identifier for the product associated with the ticket.
* `product_name` (string) - Name of the product associated with the ticket.
* `conversations_total_count` (bigint) - Total number of messages in the ticket thread.
* `conversations_reply_count` (bigint) - Number of replies to and from the user in the ticket thread.
* `conversations_note_count` (bigint) - Number of internal notes from the support team added to the ticket.
* `language` (string) - Primary language used in the ticket communication.
* `province_or_territory` (string) - Canadian province or territory where the ticket originated.
* `organization` (string) - Government of Canada department or Crown corporation associated with the ticket requester.
* `month` (string) - Month of ticket creation, used as a partition key for data organization.
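
As a rough illustration of how these fields combine, the sketch below counts non-spam tickets by status and priority for a single month partition. It reuses the table name from the example query and assumes that `month` values follow the `YYYY-MM` form suggested by the storage location; the specific month shown is arbitrary.

```sql
SELECT
    status_label,
    priority_label,
    COUNT(*) AS ticket_count
FROM
    platform_support_production.platform_support_freshdesk
WHERE
    month = '2025-01'  -- assumed partition value format (YYYY-MM)
    AND NOT spam       -- exclude tickets marked as spam
GROUP BY
    status_label,
    priority_label
ORDER BY
    ticket_count DESC;
```

Filtering on `month` is likely to limit how much data is scanned, since the field is described as the partition key.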

## Notes

The `language`, `province_or_territory` and `organization` fields are custom fields managed by the Platform Support team. As such, they will not always have a value populated in the dataset.
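
Because these custom fields can be empty, a grouping query may want to bucket missing values explicitly. The following is a minimal sketch. Whether a missing value is stored as `NULL` or as an empty string is not confirmed here, so the query handles both, and the `'Unknown'` label is purely illustrative.

```sql
SELECT
    -- Treat empty strings and NULLs the same way.
    COALESCE(NULLIF(organization, ''), 'Unknown') AS organization_label,
    COUNT(*) AS ticket_count
FROM
    platform_support_production.platform_support_freshdesk
GROUP BY
    COALESCE(NULLIF(organization, ''), 'Unknown')
ORDER BY
    ticket_count DESC;
```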
2 changes: 1 addition & 1 deletion docs/data/catalog/template.md
@@ -20,7 +20,7 @@ Briefly describe where the dataset comes from using words. If from a database, i

## Fields

-[Link to the first 10-20 rows of the table as CSV](http://www.example.com/dataset.csv). If the head of the table is not representative (e.g., missing data) or sensitive (contains PII), more appropriate rows may be selected instead. Alternatively, a [SQL query](http://www.example.com/dataset.sql) may be provided that returns appropriate example data. This query must be complete, so much that it can be run directly in [Superset's SQL Lab](https://superset.cdssandbox.xyz/sqllab/) without modification.
+[Link to the first 10-20 rows of the table as CSV](http://www.example.com/dataset.csv). If the head of the table is not representative (e.g., missing data) or sensitive (contains PII), more appropriate rows may be selected instead. Alternatively, a [SQL query](http://www.example.com/dataset.sql) may be provided that returns appropriate example data. This query must be complete, so much that it can be run directly in [Superset's SQL Lab](https://superset.cds-snc.ca/sqllab/) without modification.

A bulleted list of field names must be included, alongside a brief description of each field. Boolean descriptions can simply be the question answered by the boolean. The data type of a field and the unit of measurement should be included as well. The names of data types are dictated by the storage format. For example, the Parquet storage format commonly includes booleans, dates, floats, integers, strings, times, and timestamps. There is no need to include the integer or float width unless the circumstances are exceptional.
