Merge branch 'phmsa-company-transform' of https://github.com/catalyst-cooperative/pudl into phmsa-company-transform
e-belfer committed Jan 7, 2025
2 parents 875da71 + 62a821c commit 06ae880
Showing 19 changed files with 6,082 additions and 6,062 deletions.
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -29,7 +29,7 @@ repos:
# Formatters: hooks that re-write Python & documentation files
####################################################################################
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.8.1
rev: v0.8.6
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
@@ -60,7 +60,7 @@ repos:

# Check Github Actions
- repo: https://github.com/rhysd/actionlint
rev: v1.7.4
rev: v1.7.6
hooks:
- id: actionlint

24 changes: 16 additions & 8 deletions docs/data_access.rst
@@ -8,10 +8,13 @@ PUDL data, so if you have a suggestion, please `open a GitHub issue
<https://github.com/catalyst-cooperative/pudl/issues>`__. If you have a question, you
can `create a GitHub discussion <https://github.com/orgs/catalyst-cooperative/discussions/new?category=help-me>`__.

PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working with
tables with the ``out_`` prefix, as these tables contain the most complete and easiest
to work with data. For more information about the different types
of tables, read through :ref:`PUDL's naming conventions <asset-naming>`.
PUDL's primary data output is the ``pudl.sqlite`` database. All the tables are also
distributed as individual `Apache Parquet <https://parquet.apache.org/docs/>`__ files,
which are more space-efficient, have richer data types, and are better suited to
distributed and large-scale data analysis.
We recommend working with tables with the ``out_`` prefix, as these tables contain
the most complete and easiest-to-use data. For more information about the
different types of tables, read through :ref:`PUDL's naming conventions <asset-naming>`.

.. _access-modes:

@@ -106,8 +109,14 @@ resulting outputs pass all of the data validation tests we've defined, the outpu
automatically uploaded to the `AWS Open Data Registry
<https://registry.opendata.aws/catalyst-cooperative-pudl/>`__, and used to deploy a new
version of Datasette (see above). These nightly build outputs can be accessed using the
AWS CLI, or programmatically via the S3 API. They can also be downloaded directly over
HTTPS using the following links:
AWS CLI, or programmatically via the S3 API.

If you don't want to mess with the API
or CLI, you can also download the data directly over HTTPS. The download links for
each table's Parquet file can be found in
the :doc:`PUDL data dictionary page </data_dictionaries/pudl_db>`.
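
As a rough illustration, the per-table Parquet links appear to follow a single pattern: the nightly bucket URL, then the table name, then a ``.parquet`` suffix. The sketch below builds such a link. ``nightly_parquet_url`` is a hypothetical helper, not part of PUDL, and the pattern is inferred from the links shown in this documentation rather than from a documented guarantee.

```python
# Base URL of the nightly build outputs, copied from the download links
# in this documentation.
NIGHTLY_BASE_URL = "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly"


def nightly_parquet_url(table_name: str) -> str:
    """Build the HTTPS download link for one table's Parquet file.

    Hypothetical helper: assumes every table is published as
    ``<table_name>.parquet`` directly under the nightly prefix.
    """
    return f"{NIGHTLY_BASE_URL}/{table_name}.parquet"


if __name__ == "__main__":
    # Table name taken from the hourly tables listed in this documentation.
    print(nightly_parquet_url("core_eia930__hourly_interchange"))
```

The resulting URL can be opened in a browser or passed to a Parquet reader that accepts HTTPS URLs, such as ``pandas.read_parquet``.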

These are the download links for the PUDL and raw FERC SQLite databases:

Fully Processed SQLite Databases
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -119,8 +128,7 @@ Hourly Tables as Parquet
^^^^^^^^^^^^^^^^^^^^^^^^

Hourly time series take up a lot of space in SQLite and can be slow to query in bulk,
so we have moved to publishing all our hourly tables using the compressed, columnar
`Apache Parquet <https://parquet.apache.org/docs/>`__ file format.
so our hourly tables are distributed only as Parquet files:

* `EIA-930 BA Hourly Interchange <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_eia930__hourly_interchange.parquet>`__
* `EIA-930 BA Hourly Net Generation by Energy Source <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_eia930__hourly_net_generation_by_energy_source.parquet>`__
4 changes: 4 additions & 0 deletions docs/dev/naming_conventions.rst
@@ -186,6 +186,10 @@ quantities are actually different.
* Regardless of what label utilities are given in the original data source
(e.g. ``operator`` in EIA or ``respondent`` in FERC) we refer to them as
``utilities`` in PUDL.
* Add verb prefixes (e.g. ``is_{x}``, ``has_{x}``, or ``served_{x}``)
to the names of boolean columns to highlight their binary nature. (Not all
columns in the PUDL database follow this standard yet, but we'd like them to
going forward.)
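
One way to see which boolean columns already follow this convention is a simple prefix check. The sketch below is illustrative only: ``BOOLEAN_PREFIXES`` and ``follows_boolean_convention`` are hypothetical names, the column names are invented, and the prefix list just reuses the three examples given above, which are not claimed to be exhaustive.

```python
# Assumed prefix list: only the three examples named in the convention.
BOOLEAN_PREFIXES = ("is_", "has_", "served_")


def follows_boolean_convention(column_name: str) -> bool:
    """Return True if the name is a verb prefix followed by at least one character."""
    return any(
        column_name.startswith(prefix) and len(column_name) > len(prefix)
        for prefix in BOOLEAN_PREFIXES
    )


if __name__ == "__main__":
    # Invented column names, used purely for illustration.
    for name in ["is_retired", "has_storage", "served_retail", "retired"]:
        print(name, follows_boolean_convention(name))
```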

Naming Conventions in Code
--------------------------
8 changes: 7 additions & 1 deletion docs/release_notes.rst
@@ -70,11 +70,17 @@ EPA CEMS
~~~~~~~~
* Added 2024 Q3 of CEMS data. See :issue:`3943` and :pr:`3948`.

FERC to EIA Record Linkage
Record Linkage
^^^^^^^^^^^^^^^^^^^^^^^^^^
* Updated the ``splink`` FERC to EIA development notebook to be compatible with
the latest version of ``splink``. This notebook is not run in production but
is helpful for visualizing model weights and what is happening under the hood.
* Made the ``pudl.analysis.record_linkage.name_cleaner`` company name cleaning
module more efficient by removing all uses of ``.apply`` and using
``pd.Series.replace`` instead, so that the regex replacement rules are vectorized.
Also removed some of the allowed replacement rules to make the cleaner simpler and
more effective. The module now runs approximately 3x faster when cleaning a
string Series.
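
For readers unfamiliar with the technique, the sketch below shows the general idea of applying regex rules to a whole Series with ``pd.Series.replace`` rather than calling a Python function on each element with ``.apply``. It is not the actual ``name_cleaner`` code: the rules in ``RULES`` and the ``clean_names`` function are invented for illustration.

```python
import pandas as pd

# Hypothetical replacement rules, applied in order. These are NOT the
# rules used by PUDL's name cleaner.
RULES = {
    r"[^\w\s]": " ",                   # punctuation -> space
    r"\b(incorporated|inc)\b": "inc",  # normalize a legal suffix
    r"\s+": " ",                       # collapse runs of whitespace
}


def clean_names(names: pd.Series) -> pd.Series:
    """Apply each regex rule to the whole Series at once."""
    cleaned = names.str.lower()
    for pattern, replacement in RULES.items():
        # regex=True makes pandas treat the pattern as a regular expression
        # and substitute within every string in the Series.
        cleaned = cleaned.replace(pattern, replacement, regex=True)
    return cleaned.str.strip()


if __name__ == "__main__":
    print(clean_names(pd.Series(["Acme, Inc.", "ACME  Incorporated"])).tolist())
```

Both example names reduce to the same cleaned string, which is the property a record linkage step relies on.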

.. _release-v2024.10.0:

7 changes: 5 additions & 2 deletions docs/templates/resource.rst.jinja
@@ -13,11 +13,14 @@
**This table has no primary key.**
{%- endif %}

**Access methods:**

{% if resource.create_database_schema -%}
`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__
* `Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__
{% else -%}
This table is not published to Datasette.
* This table is not published to Datasette.
{%- endif %}
* `Download this table as a Parquet file. <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{{ resource.name }}.parquet>`__

.. list-table::
:widths: auto