generated from cds-snc/project-template
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: Cost and Usage Report data pipeline
Add documentation on the AWS CUR 2.0 data pipeline.
- Loading branch information
Showing
1 changed file
with
77 additions
and
0 deletions.
There are no files selected for viewing
77 changes: 77 additions & 0 deletions
77
docs/data/pipelines/operations/aws/cost-and-usage-report.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
# Operations / AWS / Cost and Usage Report | ||
## Description | ||
The AWS [Cost and Usage Report (CUR) 2.0](https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html) provides detailed billing data exports in [Parquet format](https://parquet.apache.org/). It contains line items for all AWS services usage with resource tags, pricing, and cost allocation data. The data is partitioned by time period and account ID, and updated daily. | ||
|
||
## Data pipeline | ||
A high level view of the data pipeline is shown below: | ||
|
||
```mermaid | ||
graph LR | ||
%% Source Systems | ||
CUR[AWS CUR 2.0 Export] | ||
Lambda[Account Tags Lambda] | ||
%% Storage | ||
OrgS3[Organization S3 Bucket<br/>cds-cost-usage-report] | ||
RawS3[Raw S3 Bucket<br/>cds-data-lake-raw-production] | ||
TransS3[Transformed S3 Bucket<br/>cds-data-lake-transformed-production] | ||
%% Processing | ||
Crawlers["Crawlers (Monthly)"] | ||
CatalogRaw["Data Catalog (Raw)<br/>operations_aws_production_raw"] | ||
CatalogTransformed["Data Catalog (Transformed)<br/>operations_aws_production"] | ||
ETL["ETL Job (Daily)"] | ||
%% Flow | ||
subgraph org[Organization] | ||
CUR --> OrgS3 | ||
Lambda | ||
end | ||
Lambda --> |account-tags.json|RawS3 | ||
OrgS3 --> |S3 Replication|RawS3 | ||
subgraph datalake[Data Lake] | ||
RawS3 --> Crawlers | ||
Crawlers --> |Updates Schema|CatalogRaw | ||
CatalogRaw --> ETL | ||
ETL --> TransS3 | ||
ETL --> CatalogTransformed | ||
end | ||
``` | ||
|
||
### Source data | ||
The CUR data export is configured in our AWS Organization account and written daily to the [`cds-cost-usage-report` S3 bucket in that account](https://github.com/cds-snc/cds-aws-lz/blob/8785287379159de892c255ec4d40afffee2810c1/terragrunt/org_account/cost_usage_report/s3.tf#L4-L27). An S3 replication rule on this bucket sends the export to the [data lake's Raw `cds-data-lake-raw-production` S3 bucket](https://github.com/cds-snc/cds-aws-lz/blob/8785287379159de892c255ec4d40afffee2810c1/terragrunt/org_account/cost_usage_report/s3.tf#L15-L24): | ||
|
||
``` | ||
cds-data-lake-raw-production/operations/aws/cost-usage-report/data/BILLING_PERIOD=YYYY-MM/*.parquet | ||
``` | ||
|
||
Additionally, [the `billing_extract_tags` Lambda function](https://github.com/cds-snc/cds-aws-lz/blob/8785287379159de892c255ec4d40afffee2810c1/terragrunt/org_account/cost_usage_report/lambda.tf) runs each day in the AWS Organization account to retrieve all member account business unit tags and save them to the data lake's Raw bucket as well: | ||
|
||
``` | ||
cds-data-lake-raw-production/operations/aws/organization/account-tags.json | ||
``` | ||
|
||
### Crawlers | ||
On the first of each month, AWS Glue crawlers run in the `DataLake-Production` AWS account to identify schema changes and update the Glue data catalogue: | ||
|
||
- [Operations / AWS / Cost and Usage Report](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/crawlers.tf#L24-L49) | ||
- [Operations / AWS / Organization / Account Tags](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/crawlers.tf#L54-L80) | ||
|
||
These create and manage the following data catalog tables in the [`operations_aws_production_raw` database](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/databases.tf#L6-L9): | ||
|
||
- `account_tags_organization`: AWS Organization member accounts and their tag data. | ||
- `cost_usage_report_data`: CUR 2.0 export data. | ||
- `cost_usage_report_metadata`: CUR 2.0 export metadata, which gives details on the columns and export date. | ||
|
||
### ETL Jobs | ||
|
||
Each day, the [`Operations / AWS / Cost and Usage Report` Glue ETL job](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/etl/operations/aws/cost-and-usage-report.json) runs and joins the CUR 2.0 data with the AWS account tag data. The resulting data is saved in the data lake's Transformed `cds-data-lake-transformed-production` S3 bucket: | ||
``` | ||
cds-data-lake-transformed-production/operations/aws/cost-usage-report/data/billing_period=YYYY-MM/*.parquet | ||
``` | ||
|
||
Additionally, a data catalog table is created in the [`operations_aws_production` database](https://github.com/cds-snc/data-lake/blob/468142031c7bdd1a2720def7d5ebb4e07fff4bef/terragrunt/aws/glue/databases.tf#L1-L4): | ||
|
||
- `cost_usage_report_by_account`: CUR 2.0 export data joined with the AWS account tags. This table is made available to users for analysis. |