diff --git a/docs/data/pipelines/operations/aws/cost-and-usage-report.md b/docs/data/pipelines/operations/aws/cost-and-usage-report.md
new file mode 100644
index 0000000..98d3f52
--- /dev/null
+++ b/docs/data/pipelines/operations/aws/cost-and-usage-report.md
@@ -0,0 +1,77 @@
+# Operations / AWS / Cost and Usage Report
+## Description
+The AWS [Cost and Usage Report (CUR) 2.0](https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html) provides detailed billing data exports in [Parquet format](https://parquet.apache.org/). It contains line items for all AWS service usage, including resource tags, pricing, and cost allocation data. The data is partitioned by time period and account ID, and updated daily.
+
+## Data pipeline
+A high-level view of the data pipeline is shown below:
+
+```mermaid
+graph LR
+    %% Source Systems
+    CUR[AWS CUR 2.0 Export]
+    Lambda[Account Tags Lambda]
+
+    %% Storage
+    OrgS3[Organization S3 Bucket<br>cds-cost-usage-report]
+    RawS3[Raw S3 Bucket<br>cds-data-lake-raw-production]
+    TransS3[Transformed S3 Bucket<br>cds-data-lake-transformed-production]
+
+    %% Processing
+    Crawlers["Crawlers (Monthly)"]
+    CatalogRaw["Data Catalog (Raw)<br>operations_aws_production_raw"]
+    CatalogTransformed["Data Catalog (Transformed)<br>operations_aws_production"]
+    ETL["ETL Job (Daily)"]
+
+    %% Flow
+    subgraph org[Organization]
+        CUR --> OrgS3
+        Lambda
+    end
+
+    Lambda --> |account-tags.json|RawS3
+    OrgS3 --> |S3 Replication|RawS3
+
+    subgraph datalake[Data Lake]
+        RawS3 --> Crawlers
+        Crawlers --> |Updates Schema|CatalogRaw
+        CatalogRaw --> ETL
+        ETL --> TransS3
+        ETL --> CatalogTransformed
+    end
+```
+
+### Source data
+The CUR data export is configured in our AWS Organization account and written daily to the [`cds-cost-usage-report` S3 bucket in that account](https://github.com/cds-snc/cds-aws-lz/blob/8785287379159de892c255ec4d40afffee2810c1/terragrunt/org_account/cost_usage_report/s3.tf#L4-L27). An S3 replication rule on this bucket sends the export to the [data lake's Raw `cds-data-lake-raw-production` S3 bucket](https://github.com/cds-snc/cds-aws-lz/blob/8785287379159de892c255ec4d40afffee2810c1/terragrunt/org_account/cost_usage_report/s3.tf#L15-L24):
+
+```
+cds-data-lake-raw-production/operations/aws/cost-usage-report/data/BILLING_PERIOD=YYYY-MM/*.parquet
+```
+
+Additionally, [the `billing_extract_tags` Lambda function](https://github.com/cds-snc/cds-aws-lz/blob/8785287379159de892c255ec4d40afffee2810c1/terragrunt/org_account/cost_usage_report/lambda.tf) runs each day in the AWS Organization account to retrieve all member account business unit tags and save them to the data lake's Raw bucket as well:
+
+```
+cds-data-lake-raw-production/operations/aws/organization/account-tags.json
+```
+
+### Crawlers
+On the first of each month, AWS Glue crawlers run in the `DataLake-Production` AWS account to identify schema changes and update the Glue Data Catalog:
+
+- [Operations / AWS / Cost and Usage Report](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/crawlers.tf#L24-L49)
+- [Operations / AWS / Organization / Account Tags](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/crawlers.tf#L54-L80)
+
+These create and manage the following data catalog tables in the [`operations_aws_production_raw` database](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/databases.tf#L6-L9):
+
+- `account_tags_organization`: AWS Organization member accounts and their tag data.
+- `cost_usage_report_data`: CUR 2.0 export data.
+- `cost_usage_report_metadata`: CUR 2.0 export metadata, which gives details on the columns and export date.
+
+### ETL Jobs
+
+Each day, the [`Operations / AWS / Cost and Usage Report` Glue ETL job](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/etl/operations/aws/cost-and-usage-report.json) runs and joins the CUR 2.0 data with the AWS account tag data. The resulting data is saved in the data lake's Transformed `cds-data-lake-transformed-production` S3 bucket:
+```
+cds-data-lake-transformed-production/operations/aws/cost-usage-report/data/billing_period=YYYY-MM/*.parquet
+```
+
+Additionally, a data catalog table is created in the [`operations_aws_production` database](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/databases.tf#L1-L4):
+
+- `cost_usage_report_by_account`: CUR 2.0 export data joined with the AWS account tags. This table is made available to users for analysis.
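
The join that the ETL job performs between CUR line items and account tags can be pictured with a small, self-contained sketch. It is illustrative only and is not the Glue job's actual code: the field names `line_item_usage_account_id`, `line_item_unblended_cost`, `account_id`, and `business_unit`, the sample values, and the helpers `join_cost_with_tags` and `cost_by_business_unit` are all assumptions made for this example.

```python
from collections import defaultdict


def join_cost_with_tags(line_items, account_tags, default_unit="untagged"):
    """Left-join line items to account tags on the account ID (hypothetical helper)."""
    # Index the tag records by account ID for constant-time lookups.
    tags_by_account = {tag["account_id"]: tag for tag in account_tags}
    joined = []
    for item in line_items:
        tags = tags_by_account.get(item["line_item_usage_account_id"], {})
        # Keep every line item; fall back to a default when no tags exist.
        joined.append({**item, "business_unit": tags.get("business_unit", default_unit)})
    return joined


def cost_by_business_unit(joined_rows):
    """Sum the (assumed) unblended cost column per business unit."""
    totals = defaultdict(float)
    for row in joined_rows:
        totals[row["business_unit"]] += row["line_item_unblended_cost"]
    return dict(totals)


# Made-up sample data; real rows would come from the raw catalog tables.
line_items = [
    {"line_item_usage_account_id": "111111111111", "line_item_unblended_cost": 12.5},
    {"line_item_usage_account_id": "222222222222", "line_item_unblended_cost": 3.0},
    {"line_item_usage_account_id": "111111111111", "line_item_unblended_cost": 0.5},
    {"line_item_usage_account_id": "333333333333", "line_item_unblended_cost": 1.25},
]
account_tags = [
    {"account_id": "111111111111", "business_unit": "Platform"},
    {"account_id": "222222222222", "business_unit": "Product"},
]

joined = join_cost_with_tags(line_items, account_tags)
totals = cost_by_business_unit(joined)
print(totals)
```

A left join is used in this sketch so that line items from accounts without tags are kept rather than dropped. Whether the actual job behaves the same way is not stated here and should be checked against the linked job definition.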