CONTRIBUTING.rst


Contributions are welcome and are greatly appreciated! Every little bit helps, and credit will always be given.

Report bugs through Apache Jira.

Please report relevant information and preferably code that exhibits the problem.

Look through the JIRA issues for bugs. Anything is open to whoever wants to implement it.

Look through the Apache JIRA for features.

Any unassigned "Improvement" issue is open to whoever wants to implement it.

We've created the operators, hooks, macros and executors we needed, but we've made sure that this part of Airflow is extensible. New operators, hooks, macros and executors are very welcomed!

Airflow could always use better documentation, whether as part of the official Airflow docs, in docstrings, docs/*.rst or even on the web as blog posts or articles.

The best way to send feedback is to open an issue on Apache JIRA.

If you are proposing a new feature:

  • Explain in detail how it would work.
  • Keep the scope as narrow as possible to make it easier to implement.
  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

The latest API documentation is usually available here.

To generate a local version:

  1. Set up an Airflow development environment.
  2. Install the doc extra.
pip install -e '.[doc]'
  3. Generate and serve the documentation as follows:
cd docs
./build.sh
./start_doc_server.sh

Before you submit a pull request (PR) from your forked repo, check that it meets these guidelines:

  • Include tests, either as doctests, unit tests, or both, in your pull request.

    The Airflow repo uses Travis CI to run the tests and codecov to track coverage. You can set up both for free on your fork (see the Travis CI Testing Framework section below). This helps you make sure that your PR does not break the build and that it increases coverage.

  • Rebase your fork, squash commits, and resolve all conflicts.

  • When merging PRs, wherever possible try to use Squash and Merge instead of Rebase and Merge.

  • Make sure every pull request has an associated JIRA ticket. The JIRA link should also be added to the PR description.

  • Preface your commit's subject & PR title with [AIRFLOW-XXX] COMMIT_MSG where XXX is the JIRA number. For example: [AIRFLOW-5574] Fix Google Analytics script loading. We compose Airflow release notes from all commit titles in a release. By placing the JIRA number in the commit title and hence in the release notes, we let Airflow users look into JIRA and GitHub PRs for more details about a particular change.

  • Add an Apache License header to all new files.

    If you have pre-commit hooks enabled, they automatically add license headers during commit.

  • If your pull request adds functionality, make sure to update the docs as part of the same PR. A docstring is often sufficient. Make sure to follow the Sphinx-compatible standards.

  • Make sure the pull request works for Python 3.5, 3.6 and 3.7.

  • Run tests locally before opening a PR.

    As Airflow grows as a project, we try to enforce a more consistent style and follow the Python community guidelines. We currently enforce most of PEP8 and a few other linting rules described in the Running static code checks section.

  • Adhere to guidelines for commit messages described in this article. This makes the lives of those who come after you a lot easier.
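The commit title convention from the guidelines above can be checked mechanically. The following is only a minimal sketch, assuming that XXX is always a numeric JIRA issue number followed by a space and a non-empty message; the helper name ``jira_id_from_subject`` is hypothetical and is not part of the Airflow tooling.

```python
import re

# Pattern for the "[AIRFLOW-XXX] COMMIT_MSG" convention.
# Assumption: XXX is a numeric JIRA issue number, followed by a space
# and a non-empty commit message.
COMMIT_SUBJECT_RE = re.compile(r"^\[AIRFLOW-(\d+)\] \S.*$")


def jira_id_from_subject(subject):
    """Return the JIRA key (e.g. "AIRFLOW-5574"), or None if the subject does not match."""
    match = COMMIT_SUBJECT_RE.match(subject)
    return "AIRFLOW-" + match.group(1) if match else None


print(jira_id_from_subject("[AIRFLOW-5574] Fix Google Analytics script loading"))  # AIRFLOW-5574
print(jira_id_from_subject("Fix Google Analytics script loading"))  # None
```

A check like this could run over a commit subject or a PR title before you push, so that a missing JIRA number is caught early.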

There are two environments, available on Linux and macOS, that you can use to develop Apache Airflow:

  • Local virtualenv
  • Breeze environment

The table below summarizes differences between the two environments:

========================= =============================== ====================================
Property                  Local virtualenv                Breeze environment
========================= =============================== ====================================
Test coverage             (-) unit tests only             (+) integration and unit tests
Setup                     (+) automated with breeze cmd   (+) automated with breeze cmd
Installation difficulty   (-) depends on the OS setup     (+) works whenever Docker works
Team synchronization      (-) difficult to achieve        (+) reproducible within team
Reproducing CI failures   (-) not possible in many cases  (+) fully reproducible
Ability to update         (-) requires manual updates     (+) automated update via breeze cmd
Disk space and CPU usage  (+) relatively lightweight      (-) uses GBs of disk and many CPUs
IDE integration           (+) straightforward             (-) via remote debugging only
========================= =============================== ====================================

Typically, we recommend using both of these environments, depending on your needs.

All details about using and running the local virtualenv environment for Airflow can be found in LOCAL_VIRTUALENV.rst.

Benefits:

  • Packages are installed locally. No container environment is required.
  • You can benefit from local debugging within your IDE.
  • With the virtualenv in your IDE, you can benefit from autocompletion and running tests directly from the IDE.

Limitations:

  • You have to keep your dependencies and local environment consistent with other development environments that you have on your local machine.

  • You cannot run tests that require external components, such as mysql, postgres database, hadoop, mongo, cassandra, redis, etc.

    The tests in Airflow are a mixture of unit and integration tests, and some of them require these components to be set up. The local virtualenv supports only real unit tests. Technically, you can configure and install the dependencies for integration tests on your own, but this is usually complex. Instead, we recommend using the Breeze development environment, which has all required packages pre-installed.

  • You need to make sure that your local environment is consistent with other developer environments. This often leads to a "works for me" syndrome. The Breeze container-based solution provides a reproducible environment that is consistent with other developers.

Possible extensions:

  • You are STRONGLY encouraged to also install and use pre-commit hooks for your local virtualenv development environment. Pre-commit hooks can speed up your development cycle a lot.

All details about using and running Airflow Breeze can be found in BREEZE.rst.

The Airflow Breeze solution is intended to ease your local development as "It's a Breeze to develop Airflow".

Benefits:

  • Breeze is a complete environment that includes external components, such as mysql database, hadoop, mongo, cassandra, redis, etc., required by some of the Airflow tests. Breeze provides a preconfigured Docker Compose environment where all these services are available and can be used by tests automatically.
  • The Breeze environment is almost the same as the one used in Travis CI automated builds. So, if the tests pass in your Breeze environment, they should pass in Travis CI as well.

Limitations:

  • The Breeze environment takes significant space in your local Docker cache. There are separate environments for different Python and Airflow versions, and each of the images takes around 3GB in total.
  • Though the Airflow Breeze setup is automated, it takes time. The Breeze environment uses pre-built images from DockerHub, and it takes time to download and extract those images. Building the environment for a particular Python version takes less than 10 minutes.
  • The Breeze environment runs in the background and takes up resources, such as disk space and CPU. You can stop the environment manually after you use it, or even use a bare environment to decrease resource usage.

NOTE: Breeze CI images are not supposed to be used in production environments. They are optimized for repeatability of tests, maintainability and speed of building rather than production performance. The production images are not yet officially published.

We are in the process of fixing code flagged by pylint checks for the whole Airflow project. This is a huge task, so we take an incremental approach. Currently most of the code is excluded from pylint checks via scripts/ci/pylint_todo.txt. We have an open JIRA issue, AIRFLOW-4364, which has a number of sub-tasks, one for each of the modules that should be made compatible. Fixing problems identified by pylint is straightforward, if time-consuming, so if you are a first-time contributor to Airflow, you can choose one of the sub-tasks as your first issue to fix.

To fix a pylint issue, do the following:

  1. Remove module/modules from the scripts/ci/pylint_todo.txt.

  2. Run scripts/ci/ci_pylint_main.sh and scripts/ci/ci_pylint_tests.sh.

  3. Fix all the issues reported by pylint.

  4. Re-run scripts/ci/ci_pylint_main.sh and scripts/ci/ci_pylint_tests.sh.

  5. If you see "success", submit a PR following Pull Request guidelines.
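The first step of the procedure above can be sketched in a few lines. This is only an illustration, assuming that scripts/ci/pylint_todo.txt lists one module path per line; the helper name ``remove_from_todo`` and the example paths are hypothetical.

```python
def remove_from_todo(todo_lines, fixed_modules):
    """Return todo_lines without the entries named in fixed_modules.

    Assumption: each entry is a module path on its own line.
    """
    fixed = {module.strip() for module in fixed_modules}
    return [line for line in todo_lines if line.strip() not in fixed]


# Hypothetical file content and module names, for illustration only.
todo = ["./airflow/example_a.py", "./airflow/example_b.py"]
print(remove_from_todo(todo, ["./airflow/example_a.py"]))  # ['./airflow/example_b.py']
```

Once the entry is gone, the pylint scripts from step 2 start checking that module.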

These are guidelines for fixing errors reported by pylint:

  • Fix the errors rather than disable pylint checks. Often you can easily refactor the code (IntelliJ/PyCharm might be helpful when extracting methods in complex code or moving methods around).
  • If disabling a particular problem, make sure to disable only that error by using the symbolic name of the error as reported by pylint.
from airflow import *  # pylint: disable=wildcard-import
  • If there is a single line where you need to disable a particular error, consider adding a comment to the line that causes the problem. For example:
def MakeSummary(pcoll, metric_fn, metric_keys):  # pylint: disable=invalid-name
  • For multiple lines/block of code, to disable an error, you can surround the block with pylint: disable/pylint: enable comment lines. For example:
# pylint: disable=too-few-public-methods
class LoginForm(Form):
    """Form for the user"""
    username = StringField('Username', [InputRequired()])
    password = PasswordField('Password', [InputRequired()])
# pylint: enable=too-few-public-methods

Pre-commit hooks help speed up your local development cycle, either in the local virtualenv or Breeze, and place less burden on the CI infrastructure. Consider installing the pre-commit hooks as a necessary prerequisite.

The pre-commit hooks only check the files you are currently working on, which makes them fast. Yet, these checks use exactly the same environment as the CI tests use. So, you can be sure your modifications will also work for CI if they pass pre-commit hooks.

We have integrated the fantastic pre-commit framework in our development workflow. To install and use it, you need Python 3.6 locally.

It is best to use pre-commit hooks when you have your local virtualenv for Airflow activated, since then pre-commit hooks and other dependencies are automatically installed. You can also install the pre-commit hooks manually using pip install.

The pre-commit hooks require the Docker Engine to be configured as the static checks are executed in the Docker environment. You should build the images locally before installing pre-commit checks as described in BREEZE.rst. In case you do not have your local images built, the pre-commit hooks fail and provide instructions on what needs to be done.

The pre-commit hooks use several external linters. Each of the checks installs its own environment, so you do not need to install most of them yourself, but some checks require locally installed binaries that must be present before pre-commit is run. On Linux, you typically install them with sudo apt install; on macOS, with brew install.

The current list of prerequisites:

  • xmllint: on Linux, install via sudo apt install libxml2-utils; on macOS, install via brew install xmllint
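To see quickly whether the prerequisites listed above are available on your machine, you can look them up on the PATH. This is a minimal sketch using only the Python standard library; the helper name ``missing_binaries`` is hypothetical and is not part of the Airflow tooling.

```python
import shutil


def missing_binaries(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# An empty list means every prerequisite was found.
print(missing_binaries(["xmllint"]))
```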

To turn on pre-commit checks for commit operations in git, enter:

pre-commit install

To install the checks also for pre-push operations, enter:

pre-commit install -t pre-push

For details on advanced usage of the install method, use:

pre-commit install --help

Before running the pre-commit hooks, you must first build the Docker images as described in BREEZE.rst.

Sometimes your image is outdated and needs to be rebuilt because some dependencies have changed. In such cases, the Docker-based pre-commit hooks inform you that you should rebuild the image.

In Airflow, we have the following checks (the checks marked with a star in the Breeze column require the Breeze image, described in BREEZE.rst, to be built locally):

================================= ============================================================== ======
Hooks                             Description                                                    Breeze
================================= ============================================================== ======
base-operator                     Checks that BaseOperator is imported properly.
build                             Builds image for check-apache-license, mypy, pylint, flake8.   *
check-apache-license              Checks compatibility with Apache License requirements.         *
check-executables-have-shebangs   Checks that executables have a shebang.
check-hooks-apply                 Checks which hooks are applicable to the repository.
check-merge-conflict              Checks if a merge conflict is committed.
check-xml                         Checks XML files with xmllint.
consistent-pylint                 Consistent usage of pylint enable/disable with space.
debug-statements                  Detects accidentally committed debug statements.
detect-private-key                Detects if a private key is added to the repository.
doctoc                            Refreshes the table of contents for md files.
end-of-file-fixer                 Makes sure that there is an empty line at the end.
flake8                            Runs flake8.                                                   *
forbid-tabs                       Fails if tabs are used in the project.
insert-license                    Adds licenses for most file types.
isort                             Sorts imports in Python files.
lint-dockerfile                   Lints a Dockerfile.
mixed-line-ending                 Detects if mixed line ending is used (\r vs. \r\n).
mypy                              Runs mypy.                                                     *
pydevd                            Checks for accidentally committed pydevd statements.
pylint                            Runs pylint for main code.                                     *
pylint-tests                      Runs pylint for tests.                                         *
python-no-log-warn                Checks that there are no deprecated log warn calls.
rst-backticks                     Checks if RST files use double backticks for code.
setup-order                       Checks the order of dependencies in setup.py.
shellcheck                        Checks shell files with shellcheck.
update-breeze-file                Updates the output of the breeze command in BREEZE.rst.
yamllint                          Checks yaml files with yamllint.
================================= ============================================================== ======

After installation, pre-commit hooks are run automatically when you commit the code. But you can run pre-commit hooks manually as needed.

  • Run all checks on your staged files by using:
pre-commit run
  • Run only mypy check on your staged files by using:
pre-commit run mypy
  • Run only mypy checks on all files by using:
pre-commit run mypy --all-files
  • Run all checks on all files by using:
pre-commit run --all-files
  • Skip one or more of the checks by specifying a comma-separated list of checks to skip in the SKIP variable:
SKIP=pylint,mypy pre-commit run --all-files
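If you drive the hooks from a script, the command variants listed above map onto an argument list plus the SKIP environment variable. The following is a minimal sketch; the helper name ``pre_commit_command`` is hypothetical, and only the ``pre-commit run`` options and the SKIP variable shown above are used.

```python
import os


def pre_commit_command(hook=None, all_files=False, skip=()):
    """Build the argument list and environment for a `pre-commit run` call."""
    argv = ["pre-commit", "run"]
    if hook:
        argv.append(hook)  # run a single check, e.g. "mypy"
    if all_files:
        argv.append("--all-files")  # check all files, not only staged ones
    env = dict(os.environ)
    if skip:
        env["SKIP"] = ",".join(skip)  # comma-separated list of checks to skip
    return argv, env


argv, env = pre_commit_command(all_files=True, skip=["pylint", "mypy"])
print(argv, env["SKIP"])
# To execute the command: subprocess.run(argv, env=env, check=True)
```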

You can always skip running the checks by passing the --no-verify flag to the git commit command.

To check other usage types of the pre-commit framework, see the pre-commit website.

When you implement core features or DAGs, you might need to import some of the core objects or modules. Since Apache Airflow can be used both as an application (by internal classes) and as a library (by DAGs), there are different ways to import those core objects and packages.

Airflow imports some of the core objects directly into the 'airflow' package so that they can be used from there.

Use the following criteria to choose the import path:

  • If you work on a core feature inside Apache Airflow, you should import the objects directly from the package where the object is defined. This minimises the risk of cyclic imports.
  • If you import the objects from any of the 'providers' classes, you should import the objects from 'airflow' or 'airflow.models'. This is very important for back-porting operators/hooks/sensors to Airflow 1.10.* (AIP-21).
  • If you import objects from within a DAG you write, you should import them from the 'airflow' or 'airflow.models' package, where a stable location of such imports is important.

These checks are enforced for the most important and most frequently used objects via pre-commit hooks, as described below.

The BaseOperator should be imported:

  • as from airflow.models import BaseOperator in external DAGs/operators
  • as from airflow.models.baseoperator import BaseOperator in Airflow core to avoid cyclic imports
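The rule for Airflow core files can be illustrated with a small scanner. This is only a sketch of the idea and not the actual base-operator pre-commit hook; the names ``BASE_OPERATOR_IMPORT``, ``CORE_MODULE`` and ``risky_core_imports`` are hypothetical.

```python
import re

# Matches "from airflow<...> import ... BaseOperator" and captures the module path.
BASE_OPERATOR_IMPORT = re.compile(
    r"^\s*from\s+(airflow[\w.]*)\s+import\s+.*\bBaseOperator\b"
)
# The module that core code should import BaseOperator from.
CORE_MODULE = "airflow.models.baseoperator"


def risky_core_imports(source):
    """Return lines that import BaseOperator from a module other than CORE_MODULE."""
    risky = []
    for line in source.splitlines():
        match = BASE_OPERATOR_IMPORT.match(line)
        if match and match.group(1) != CORE_MODULE:
            risky.append(line.strip())
    return risky


core_source = """\
from airflow.models import BaseOperator
from airflow.models.baseoperator import BaseOperator
"""
print(risky_core_imports(core_source))  # ['from airflow.models import BaseOperator']
```

In an external DAG or operator the first form is the expected one, so a scanner like this only makes sense for files inside Airflow core.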

The Airflow test suite runs on the Travis CI framework because running all of the tests locally requires significant setup. You can set up Travis CI in your fork of Airflow by following the Travis CI Getting Started guide.

There are two different options available for running Travis CI, and they are set up on GitHub as separate components:

  • Travis CI GitHub App (new version)
  • Travis CI GitHub Services (legacy version)

NOTE: The apache/airflow project is still using the legacy version.

Travis CI GitHub Services version uses an Authorized OAuth App.

  1. Once installed, configure the Travis CI Authorized OAuth App at Travis CI OAuth APP.
  2. If you are a GitHub admin, click the Grant button next to your organization; otherwise, click the Request button. For the Travis CI Authorized OAuth App, you may have to grant access to the forked ORGANIZATION/airflow repo even though it is public.
  3. Access Travis CI for your fork at https://travis-ci.org/ORGANIZATION/airflow.

If you need to create a new project in Travis CI, use travis-ci.com for both private repos and open source.

The travis-ci.org site for open source projects is now legacy and you should not use it.

More information:

When developing features, you may need to persist information to the metadata database. Airflow uses Alembic to handle all schema changes. Alembic must be installed on your development machine before you continue with the migration.

# starting at the root of the project
$ pwd
~/airflow
# change to the airflow directory
$ cd airflow
$ alembic revision -m "add new field to db"
   Generating ~/airflow/airflow/migrations/versions/12341123_add_new_field_to_db.py

airflow/www/ contains all npm-managed, front-end assets. Flask-Appbuilder itself comes bundled with jQuery and bootstrap. While they may be phased out over time, these packages are currently not managed with npm.

Make sure you are using recent versions of node and npm. No problems have been found with node>=8.11.3 and npm>=6.1.3.

Make sure npm is available in your environment.

To install it on macOS:

  1. Run the following commands (taken from this source):
brew install node --without-npm
echo prefix=~/.npm-packages >> ~/.npmrc
curl -L https://www.npmjs.com/install.sh | sh
  2. Add ~/.npm-packages/bin to your PATH so that commands you install globally are usable.

  3. Set up your .bashrc file and then source ~/.bashrc to reflect the change.

    For example:

export PATH="$HOME/.npm-packages/bin:$PATH"

You can also follow the `general npm installation instructions <https://docs.npmjs.com/downloading-and-installing-node-js-and-npm>`__.

  4. Install third party libraries defined in package.json by running the following commands within the airflow/www/ directory:
# from the root of the repository, move to where our JS package.json lives
cd airflow/www/
# run npm install to fetch all the dependencies
npm install

These commands install the libraries in a new node_modules/ folder within www/.

Should you add or upgrade an npm package, which involves changing package.json, you'll need to re-run npm install and push the newly generated package-lock.json file so that we get a reproducible build.

To parse and generate bundled files for Airflow, run either of the following commands:

# Compiles the production / optimized js & css
npm run prod

# Starts a web server that manages and updates your assets as you modify them
npm run dev

We try to enforce a more consistent style and follow the JS community guidelines.

Once you add or modify any JavaScript code in the project, please make sure it follows the guidelines defined in the Airbnb JavaScript Style Guide.

Apache Airflow uses ESLint as a tool for identifying and reporting on patterns in JavaScript. To use it, run any of the following commands:

# Check JS code in .js and .html files, and report any errors/warnings
npm run lint

# Check JS code in .js and .html files, report any errors/warnings and fix them if possible
npm run lint:fix