A collection of Python scripts and code snippets to get alternative ESG data from sources like Google Trends or Google results.
The goal is to give sustainable finance projects a shared codebase and a starting point for data sourcing, analysis, or modelling. We plan to upload Python scripts that source data from:
- Google Trends
- Google results
- Yahoo! Finance
The project is in the scope of the towards sustainable finance initiative. We always welcome anyone interested in joining forces and look forward to your message.
- [ ] check visuals
- [ ] revise README
- [ ] create GitHub Actions
- [ ] run and revise mkdocs
Open Anaconda Prompt as admin and run:
git clone https://github.com/philippschmalen/ESG-data-codebase.git
# create venv from file
conda env create -f conda.yaml
# init pre-commit
pre-commit install
pre-commit autoupdate
# test-run
pre-commit
Configure settings.yaml. Mine looks like:
dir:
  raw: 'data/raw'
  interim: 'data/interim'
  processed: 'data/processed'
  external: 'data/external'
query:
  google_results:
    user_agent: {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
    base_url: "https://www.google.com/search?q="
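As a rough illustration of how these settings might be consumed, the sketch below builds a search URL and request headers from the `query.google_results` values. The `settings` dict is a hard-coded mirror of the YAML above (in the project it would presumably be loaded from settings.yaml), and `build_search_request` is a hypothetical helper, not a function from this codebase.

```python
from urllib.parse import quote_plus

# Hard-coded mirror of the query section in settings.yaml (assumption:
# in the project these values would be read from the YAML file).
settings = {
    "query": {
        "google_results": {
            "user_agent": {
                "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
            },
            "base_url": "https://www.google.com/search?q=",
        }
    }
}


def build_search_request(keyword, config):
    """Hypothetical helper: return (url, headers) for one keyword."""
    google = config["query"]["google_results"]
    # quote_plus escapes spaces and special characters for the query string.
    url = google["base_url"] + quote_plus(keyword)
    return url, google["user_agent"]


url, headers = build_search_request("sustainable finance", settings)
print(url)
```

The returned pair could then be passed to an HTTP client of your choice; no request is sent here.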
To get analysis-ready data from Google Trends, use get_interest_over_time(). It takes a list of keywords and stores each query result as a CSV in filepath. It has built-in error handling and is designed to be fail-safe: for example, it increases the timeout between queries if one fails due to a rate limit. Even after the maximum number of retries, no data is lost, because the unsuccessful keywords are stored in a CSV.
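The retry behaviour described above can be sketched roughly as follows. This is not the actual implementation of get_interest_over_time(): `query_with_retries`, its parameters, and the `fetch` callable (a stand-in for the real Google Trends request) are all assumptions made for illustration, and doubling the wait time is just one possible back-off strategy.

```python
import csv
import time


def query_with_retries(keywords, fetch, failed_path, max_retries=3,
                       base_wait=1.0, sleep=time.sleep):
    """Query each keyword, back off on errors, and log keywords that never succeed.

    `fetch` is any callable that takes a keyword and returns a result
    (hypothetical stand-in for the actual Google Trends request).
    """
    results, failed = {}, []
    for keyword in keywords:
        wait = base_wait
        for _ in range(max_retries):
            try:
                results[keyword] = fetch(keyword)
                break
            except Exception:
                # Assume a rate limit: pause, then double the wait time.
                sleep(wait)
                wait *= 2
        else:
            # All retries used up: remember the keyword instead of losing it.
            failed.append(keyword)

    # Store unsuccessful keywords so they can be re-queried later.
    with open(failed_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["keyword"])
        writer.writerows([kw] for kw in failed)
    return results, failed
```

Passing `sleep` as a parameter keeps the sketch easy to try out without actually waiting.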
Here is the official documentation powered by mkdocs and mkdocstrings.
The project benefits from previous work of the repositories:
- https://github.com/philippschmalen/ESG-trending-topics-radar
- https://github.com/philippschmalen/ESG-with-googletrends
- https://github.com/philippschmalen/ESG-topics-Google-count
- https://github.com/philippschmalen/etl_spark_airflow_emr
├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- Files built with mkdocs and mkdocstrings. Use Google style docstrings.
│
├── streamlit          <- Streamlit apps to interact with data. Naming convention according to TDS process:
│                         the process step and a short `-` delimited step description, e.g.
│                         `0-exploration`, `1-preprocessing`, `2-feature_engineering`, `3-modelling`.
│
├── notebooks          <- Jupyter notebooks to interact with data. The same conventions as for ./streamlit apply.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── conda_env.yaml     <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `conda env export --no-builds > conda_env.yaml`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
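Since the data folders in the layout above match the `dir` entries in settings.yaml, a small helper can create them on a fresh clone. The sketch below is an illustration only: `ensure_data_dirs` and `DATA_SUBDIRS` are hypothetical names, not part of the codebase.

```python
from pathlib import Path

# Subfolders of ./data as shown in the project layout.
DATA_SUBDIRS = ("raw", "interim", "processed", "external")


def ensure_data_dirs(project_root):
    """Hypothetical helper: create data/<subdir> folders if they are missing."""
    created = []
    for name in DATA_SUBDIRS:
        path = Path(project_root) / "data" / name
        # parents=True also creates ./data; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```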
This section covers everything needed to further develop and maintain the project.
- Build: conda env create -f conda_env.yaml
- Export to yaml: python conda_env_export.py (a wrapper around conda env export that combines the results with and without --no-builds, to avoid common pitfalls of the default conda export)
- Update current conda env: conda env update -f conda_env.yaml (the optional flag --prune removes libraries that are installed in the current env but not listed in the yaml)
Docstrings follow the Google style: https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
A common cause of broken rendering is a trailing space after a section headline: `Args: ` (with a space after the colon) will not work, but `Args:` will.
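For reference, a minimal Google style docstring might look like the sketch below. The function `count_keywords` is a made-up example, not part of the codebase; note that the `Args:` and `Returns:` headlines have no trailing space.

```python
def count_keywords(keywords, min_length=1):
    """Count the keywords that have at least `min_length` characters.

    Args:
        keywords (list of str): Search terms to check.
        min_length (int): Minimum number of characters a keyword needs.

    Returns:
        int: Number of keywords meeting the length requirement.
    """
    return sum(1 for keyword in keywords if len(keyword) >= min_length)
```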
Workflow to update documentation:
# conda activate esg_codebase
mkdocs serve # live-reloading for editing
mkdocs build # build static site
mkdocs gh-deploy # deploy to github pages
# in project root, run with conda env activated
mkdocs new .
# created mkdocs.yml and ./docs
Configure mkdocs.yml to work with the mkdocstrings package.
site_name: ESG data codebase
theme:
  name: "material" # theme works with mkdocstrings
plugins:
  - search
  - mkdocstrings:
      default_handler: python
      watch:
        - src/data # enable auto-reload
Add Python scripts to index.md so that they appear in the code reference:
# Code reference
## Google results count
::: src.data.gresults_extract
---
## Google Trends
::: src.data.gtrends_extract
Project based on the cookiecutter data science project template. #cookiecutterdatascience