Roll in updates from template, clean up documentation, add example query and associated contract test
brabster committed Jan 6, 2024
1 parent cebdb0d commit d4842ae
Showing 9 changed files with 152 additions and 76 deletions.
42 changes: 42 additions & 0 deletions .dev_scripts/init_and_update.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
#!/bin/bash

set -euo pipefail

export PIP_REQUIRE_VIRTUALENV=true # have pip abort if we try to install outside a venv (exported so child pip processes see it)
PROJECT_DIR=$(dirname "$0")/.. # project root (parent of the script's directory)
VENV_PATH=${PROJECT_DIR}/.venv
IS_RUNNING_IN_VENV="$(python -c 'import sys; print(sys.prefix != sys.base_prefix)')"

if [ "${IS_RUNNING_IN_VENV}" == 'False' ]; then
echo 'Not in virtualenv, setting up';
python -m venv "${VENV_PATH}"
source "${VENV_PATH}/bin/activate"
fi

echo "install or upgrade system packages"
pip install --upgrade pip setuptools

echo "install safety for vulnerability check; it prints its own messages about noncommercial use"
pip install --upgrade safety

echo "install or upgrade project-specific dependencies"
pip install -U -r ${PROJECT_DIR}/requirements.txt

echo "install or upgrade dbt dependencies"
dbt deps

echo "check for vulnerabilities"
safety check

echo "load user environment, if present"
ENV_PATH=${PROJECT_DIR}/.env
if [ -f "${ENV_PATH}" ]; then
source ${ENV_PATH}
echo "check dbt setup"
dbt debug
else
echo "Unable to check dbt setup until .env file is set up and suitable data warehouse credentials are available"
fi
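The script sources a user `.env` file if one is present, but the file's contents are not part of this commit. Purely as an illustrative assumption - the variable names below are hypothetical and do not appear in this repository - such a file might export warehouse settings like:

```shell
# Hypothetical .env sketch - variable names are assumptions for
# illustration only; they are not taken from this repository.
export DBT_GCP_PROJECT=my-gcp-project-id   # GCP project ID, not the project name
export GOOGLE_APPLICATION_CREDENTIALS="${HOME}/.config/gcloud/sa-key.json"
```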



3 changes: 2 additions & 1 deletion .github/actions/dbt_build/action.yml
@@ -17,7 +17,8 @@ runs:
dbt clean
dbt deps
dbt debug
dbt build
dbt build --exclude contracts
dbt test -s contracts
dbt docs generate
- name: upload target artifacts
uses: actions/upload-artifact@v3
51 changes: 13 additions & 38 deletions .vscode/tasks.json
@@ -1,40 +1,15 @@
{
// See https://go.microsoft.com/fwlink/?LinkId=733558
// for the documentation about the tasks.json format
"version": "2.0.0",
"tasks": [
{
"label": "ensure_pip_version",
"type": "shell",
"command": "pip install --upgrade pip"
},
{
"label": "ensure_python_deps_updated",
"type": "shell",
"command": "pip install -U -r ${workspaceFolder}/requirements.txt"
},
{
"label": "load_user_env",
"type": "shell",
"command": ". ${workspaceFolder}/.env"
},
{
"label": "ensure_dbt_packages_updated",
"type": "shell",
"command": "dbt",
"args": ["deps", "--upgrade"],
"dependsOn": ["ensure_python_deps_updated", "load_user_env"]
},
{
"label": "ensure_updated",
"dependsOn": [
"ensure_pip_version",
"ensure_python_deps_updated",
"ensure_dbt_packages_updated"
],
"runOptions": {
"runOn": "folderOpen"
}
// See https://go.microsoft.com/fwlink/?LinkId=733558
// for the documentation about the tasks.json format
"version": "2.0.0",
"tasks": [
{
"label": "init_and_update",
"type": "shell",
"command": "${workspaceFolder}/.dev_scripts/init_and_update.sh",
"runOptions": {
"runOn": "folderOpen"
}
]
}
}
]
}
82 changes: 57 additions & 25 deletions README.md
@@ -1,33 +1,74 @@
Investigating downloads of vulnerable Python packages from PyPI.

DBT Documentation on [GitHub Pages](https://brabster.github.io/pypi_vulnerabilities).
# Supporters

Thanks to [Equal Experts](https://equalexperts.com) for supporting this work.
<a href="https://equalexperts.com">
<img alt="Equal Experts logo"
src="https://www.equalexperts.com/wp-content/themes/equalexperts/assets/logos/colour/equal-experts-logo-colour.png"
style="height:75px">
</a>

dbt docs automatically published on deployment at https://brabster.github.io/dbt_bigquery_template/
# Generated Resources

# Pre-Reqs
- DBT Documentation on [GitHub Pages](https://brabster.github.io/pypi_vulnerabilities).
- Public Dataset on BigQuery US Location: `pypi-vulnerabilities.pypi_vulnerabilities_us`

# Timeframe

I'm performing this initial analysis on package downloads made on a specific date, 2023-11-05. There are a few reasons for that:

- The PyPI downloads dataset is big - days in late 2023 are on the order of 250GB each. At $5/TB scanned, that's roughly $1.25 to scan one day of the full dataset.
- The Safety public dataset is updated monthly, so I can use the 2023-10-01 update to be sure that any vulnerabilities I'm considering have been in the public domain and accessible via tooling for at least a month.
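The cost figure in the first bullet can be sanity-checked with quick shell arithmetic (the sizes and prices are the approximate figures from the text):

```shell
# Sanity-check: 250 GB/day at $5 per TB scanned.
# Work in GB and cents to stay in shell integer arithmetic.
gb_per_day=250
price_per_tb_cents=500                                     # $5.00/TB
cost_cents=$(( gb_per_day * price_per_tb_cents / 1000 ))   # GB -> TB
echo "scan cost: \$$(( cost_cents / 100 )).$(( cost_cents % 100 )) per day"   # → scan cost: $1.25 per day
```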

I can get an idea of what's going on and figure out how to solve the problems that need solving with a relatively small snapshot dataset, so I copy just the columns I need for one day, with minimal processing, to a new table and work from that.
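A one-day snapshot copy like the one described above might look something like the following sketch. The destination table name is a placeholder and the column choices are illustrative assumptions based on the public `bigquery-public-data.pypi.file_downloads` schema - this commit does not show the repository's actual snapshot model.

```sql
-- Illustrative sketch only: the destination name is a placeholder and
-- the selected columns are assumptions, not this repository's model.
CREATE TABLE `my-project.my_dataset.file_downloads_20231105` AS
SELECT
    timestamp,
    file.project AS package,
    file.version AS package_version
FROM `bigquery-public-data.pypi.file_downloads`
WHERE DATE(timestamp) = '2023-11-05'
```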

# Example Query

## Top Ten Packages by Number of Vulnerable Downloads

This query bills 94MB.

```sql
SELECT
package,
downloads_with_known_vulnerabilities,
downloads_without_known_vulnerabilities,
proportion_vulnerable_downloads
FROM `pypi-vulnerabilities.pypi_vulnerabilities_us.vulnerable_downloads_by_package`
ORDER BY downloads_with_known_vulnerabilities DESC
LIMIT 10
```

![Results table for top ten packages by vulnerable download count query](./.docs/assets/top_ten_packages_by_vuln_download_count.png)

# Contributing

See [CONTRIBUTORS.md](CONTRIBUTORS.md) for guidance.

## Pre-Reqs

- Python == 3.11 (see https://docs.getdbt.com/faqs/Core/install-python-compatibility)
- [RECOMMENDED] VSCode to use built-in tasks
- Access to GCP Project enabled for BigQuery
- [RECOMMENDED] set environment variable `PIP_REQUIRE_VIRTUALENV=true`
- Prevents accidentally installing to your system Python installation (if you have permissions to do so)

# Setup
## Setup Local

This sets up the local software without any need for data warehouse credentials.

A VSCode task triggers a shell script [.dev_scripts/init_and_update.sh](.dev_scripts/init_and_update.sh)
which should take care of setting up a virtualenv if necessary, then installing/updating software and running a vulnerability scan.

> Note - the vulnerability scan is performed using [safety](https://pypi.org/project/safety/), which is *not free for commercial use* and has limitations on the freshness and completeness of its vulnerability database.

> Note - on first open, automated tasks that update your local dependency versions will fail because your virtualenv does not yet exist. My attempts to automate this step haven't proven reliable, so bootstrapping instructions are provided below. Once set up, dependency updates run automatically when you open this directory in VSCode.

If you are unable to run a bash script and need to translate the setup to another tool, the script itself documents the steps involved in a full setup.

## Connect to Data Warehouse

Set up credentials and environment and test connectivity.

- open the terminal
  - `Terminal` - `New Terminal`
- create virtualenv and install dependencies
  - use [VSCode options](https://code.visualstudio.com/docs/python/environments#_creating-environments)
    - `Python - Create Environment`, accept defaults
  - OR manually
    - `python -m venv .venv` (other parts of the project assume the venv is in `.venv`, so find/replace if you change that)
    - `source .venv/bin/activate` (`source .venv/Scripts/activate` on Windows/Git-Bash)
    - VSCode command `Python - Select Interpreter`
    - install dependencies: `pip install -U -r requirements.txt`
- update `.env` with appropriate values
  - note: project ID, not project name (using the name manifests as a 404 error)
- `. .env` to update the values in use in the terminal
@@ -38,15 +79,6 @@ dbt docs automatically published on deployment at https://brabster.github.io/dbt
- `dbt debug` should now succeed and list settings/versions
  - if `dbt` is not found, you may need to activate your venv at the terminal as described earlier
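Taken together, the manual path through the steps above amounts to the following (Linux/macOS commands; the activation path differs on Windows/Git-Bash as noted):

```shell
python -m venv .venv                 # the project assumes the venv lives in .venv
source .venv/bin/activate            # .venv/Scripts/activate on Windows/Git-Bash
pip install -U -r requirements.txt   # install/update project dependencies
. .env                               # load warehouse credentials into this shell
dbt debug                            # should succeed and list settings/versions
```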

# Timeframe

I'm performing this initial analysis on package downloads made on a specific date, 2023-11-05. There are a few reasons for that:

- The PyPI downloads dataset is big - days in late 2023 are on the order of 250GB each. At $5/TB scanned, that's roughly $1.25 to scan one day of the full dataset.
- The Safety public dataset is updated monthly, so I can use the 2023-10-01 update to be sure that any vulnerabilities I'm considering have been in the public domain and accessible via tooling for at least a month.

I can get an idea of what's going on and figure out how to solve the problems that need solving with a relatively small snapshot dataset, so I copy just the columns I need for one day, with minimal processing, to a new table and work from that.

# Obtaining Safety DB in BigQuery

I use the public database used by the [safety] package as a reference for which PyPI packages have known vulnerabilities.
25 changes: 13 additions & 12 deletions macros/ensure_target_dataset_exists.sql
@@ -4,19 +4,20 @@
{% set dataset_name = target.schema %}
{% set dataset_location = target.location %}

{{ print("Ensuring dataset " ~ project_id ~ "." ~ dataset_name ~ " exists in location " ~ dataset_location ) }}
{% if execute %}

{% set create_dataset_query %}
CREATE SCHEMA IF NOT EXISTS `{{ project_id }}`.`{{ dataset_name }}`
OPTIONS (
description = 'Exploring vulnerable PyPI downloads. Managed by https://github.com/brabster/pypi_vulnerabilities',
location = '{{ dataset_location }}',
labels = [('data_classification', 'public')]
)
{% endset %}
{% do log("Ensuring dataset " ~ project_id ~ "." ~ dataset_name ~ " exists in location " ~ dataset_location ) %}

{% if execute %}
{% set results = run_query(create_dataset_query) %}
{% set create_dataset_query %}
CREATE SCHEMA IF NOT EXISTS `{{ project_id }}`.`{{ dataset_name }}`
OPTIONS (
description = 'Exploring vulnerable PyPI downloads. Managed by https://github.com/brabster/pypi_vulnerabilities',
location = '{{ dataset_location }}',
labels = [('data_classification', 'public')]
)
{% endset %}

{% set results = run_query(create_dataset_query) %}
{% endif %}

{% endmacro %}
{% endmacro %}
5 changes: 5 additions & 0 deletions macros/ensure_target_dataset_exists.yml
@@ -0,0 +1,5 @@
version: 2

macros:
- name: ensure_target_dataset_exists
description: Creates the specified dataset if it does not exist and the executor has permission
5 changes: 5 additions & 0 deletions macros/ensure_udfs.yml
@@ -0,0 +1,5 @@
version: 2

macros:
- name: ensure_udfs
description: Creates UDFs specified in the macro. Does not clean up any UDFs that are removed.
@@ -0,0 +1,15 @@
WITH test AS (
SELECT
package,
downloads_with_known_vulnerabilities,
downloads_without_known_vulnerabilities,
proportion_vulnerable_downloads
FROM `pypi-vulnerabilities.pypi_vulnerabilities_us.vulnerable_downloads_by_package`
ORDER BY downloads_with_known_vulnerabilities DESC
LIMIT 10
)

SELECT
COUNT(1) row_count
FROM test
HAVING row_count != 10
