Clean .git history #317

pz-max · 2022-03-28T10:38:53Z

Just realised that our .git history (hidden folder) is quite large (270 MB). It can be shrunk by clearing the outputs of all historic .ipynb files.

The last command from here does the job:

git filter-branch --tree-filter "python3 -m nbconvert --ClearOutputPreprocessor.enabled=True --inplace *.ipynb **/*.ipynb || true"
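To judge whether a rewrite like this actually helps, the size of the .git folder can be measured before and after. A minimal sketch, assuming a POSIX `du` and git 2.13 or newer; the function name `git_dir_kib` is made up for illustration:

```shell
#!/usr/bin/env bash
# Print the on-disk size of a repository's .git directory in KiB.
# Usage: git_dir_kib [repo-dir]   (defaults to the current directory)
git_dir_kib() {
  local repo="${1:-.}"
  # --absolute-git-dir resolves the real location of .git for the given repo.
  du -sk "$(git -C "$repo" rev-parse --absolute-git-dir)" | cut -f1
}
```

Calling `git_dir_kib .` once before and once after the rewrite gives two numbers to compare. Note that old objects may linger until they are garbage-collected, so the number may not drop immediately.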

pz-max · 2022-03-28T10:40:40Z

One could test that now for our hackathon repo pypsa-meets-earth/pypsa-africa-hackathon#14

pz-max · 2022-03-28T18:26:22Z

This guide covers removing additional big files, e.g. files above 1 MB:
https://netdevops.me/2021/remove-binaries-and-big-files-from-git-repo/
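Before removing anything, it may help to see which files in the history are actually big. The following is a sketch using only plain git plumbing plus awk; it is not taken from the linked guide, and the function name `list_big_blobs` and the 1,000,000-byte default are assumptions for illustration:

```shell
#!/usr/bin/env bash
# List blobs anywhere in the history that are at least <min-bytes> large,
# biggest first, as "<size-in-bytes><TAB><path>".
# Usage: list_big_blobs <repo-dir> [min-bytes]
list_big_blobs() {
  local repo="$1" min_bytes="${2:-1000000}"
  # rev-list emits "<sha> <path>" for every object reachable from any ref;
  # cat-file looks up type and size, and %(rest) carries the path through.
  git -C "$repo" rev-list --objects --all \
    | git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | awk -v min="$min_bytes" '
        $1 == "blob" && $2 + 0 >= min + 0 {
          size = $2
          sub(/^[^ ]+ [^ ]+ /, "")   # drop type and size, keep the path
          print size "\t" $0
        }' \
    | sort -rn
}
```

Files that were deleted long ago still show up, because the listing walks the whole history rather than the current checkout.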

pz-max · 2022-03-28T18:27:12Z

List of file types that could be cleaned:
*.json *.geojson *.pickle *.pbf *.shp *.gpkg *.zip *.tif *.nc

pz-max · 2022-03-28T19:53:18Z

git-filter-repo seems to be the new tool to do the job. Docs can be found here.

Note: Actions with git filter-repo can be really destructive. Always make a copy of the original, perform dry runs and check that they worked. After cleaning the repository, NO old clone is allowed to push; otherwise the histories will be mixed up, resulting in a mess.

Open PRs should only be merged from a new clone that contains the changes; ideally, the PR is repeated on a new clone.

pz-max · 2022-03-28T20:47:56Z

Great tip in the Step-by-Step guide (called DISCUSSION here):

Create a mirror clone of PyPSA-Africa: git clone --mirror <repo-name>
Run git filter-repo --analyze
Do all cleanup actions first with --dry-run, then without:

empty all historic .ipynb,
keep only paths that exist in the current pypsa-africa version,
remove big files

Push the changes to a newly created dummy repository (not to pypsa-africa, which will remain dirty).
Once the dummy works, push the clean repository to the new pypsa-earth repository.
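The steps above can be sketched as one script. This is only an illustration under assumptions: the mirror directory name, the `run` helper, and the `PLAN_ONLY` switch are made up here, and clearing notebook outputs is left out because it needs a custom filter. With `PLAN_ONLY=1` (the default) the commands are printed rather than executed, which is separate from filter-repo's own `--dry-run`.

```shell
#!/usr/bin/env bash
# Print the command when PLAN_ONLY=1 (default), otherwise execute it.
run() {
  if [ "${PLAN_ONLY:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Usage: clean_history <source-url> <dummy-url> <absolute-path-to-paths-file>
clean_history() {
  local source_url="$1" dummy_url="$2" paths_file="$3"
  local mirror="cleanup-mirror.git"   # hypothetical directory name

  # 1. Work on a mirror clone, never on a working copy.
  run git clone --mirror "$source_url" "$mirror"
  # 2. Produce the size reports.
  run git -C "$mirror" filter-repo --analyze
  # 3. Rehearse with --dry-run, then repeat for real: keep only the listed
  #    paths and strip blobs bigger than 1M.
  run git -C "$mirror" filter-repo --dry-run \
    --paths-from-file "$paths_file" --strip-blobs-bigger-than 1M
  run git -C "$mirror" filter-repo \
    --paths-from-file "$paths_file" --strip-blobs-bigger-than 1M
  # 4. Push branches and tags to the dummy repository, not to the original.
  run git -C "$mirror" push "$dummy_url" --all
  run git -C "$mirror" push "$dummy_url" --tags
}
```

The paths file should be given as an absolute path, because `git -C` changes the working directory before filter-repo opens it. Once the printed plan looks right, running with `PLAN_ONLY=0` executes it.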

PyPSA-Africa needs to remain with dirty history at the beginning.
@davide-f

davide-f · 2022-05-31T00:25:10Z

I'm wondering whether we may rebase the old commits by squashing them: intermediate files that are created and deleted in the squashed commits may disappear from the history (hopefully).
Probably, if we do that for the initial few hundred commits, we may automatically solve many issues.
It may be worth trying on a fork/branch

pz-max · 2022-08-11T08:50:53Z

Some more info:

Nice story about how repos can explode in size & how GitLFS can help: https://www.practicaldatascience.org/html/exercises/Exercise_git_2.html#Git-LFS . I think binaries (images) in changing Jupyter notebooks are the big problem. With every change, a new image will be stored.
GitLFS will only store the latest image, create a pointer & move the old ones in the git history to git lfs (1 GB for free, 50 GB for 5€/month). This requires users to have git lfs installed: 'conda install -c conda-forge git-lfs'. Maybe a good solution in case we want to keep the outputs 👍🏽
A note on general options to deal with Jupyter notebooks: https://stackoverflow.com/a/61157923
Gitlab documentation on reducing repo size. https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html
Blog article on how to clean a whole repo (2020) https://mg.readthedocs.io/git-jupyter.html

Whatever we go for, we should experiment first in a fork or a dummy repo

mnm-matin · 2022-09-05T15:30:48Z

#449 shows that even if we remove all the history and keep the latest code we are still close to 100 MB. It could be possible/essential to clean the current state of the notebook folder and images folder before sanitizing the history.

davide-f · 2022-09-17T20:34:20Z

I've tested on my fork at https://github.com/davide-f/pypsa-africa (branch main)

When downloading the repository, the size is now about 50MB, see image below.

What has been done:

First, git filter-repo --analyze was used to generate a report showing which folders take up the most storage. The output was in {parent}\pypsa-africa\.git\filter-repo.
By analyzing the output, there were binaries in the old folders data_exploration, data/clean, data/osm. The entire content of those folders has been completely removed from the git history, using git filter-repo --path {dir} --invert-paths. The size of the pypsa-africa folder decreased to about 100MB
A large number of changes and much of the data storage were attributable to the folder notebooks. Therefore, the entire history has been rewritten with the aim of stripping the stored output data from all notebooks. That has been done using this procedure; the procedure was also found on another website (my feeling is that this procedure has been copied from the other source). This is the last step, which led the total size of the repo to fall to 50MB

Note: To keep the notebooks empty, I've proposed to also add a pre-commit rule

Note2: I am not sure why the badge in this branch is not showing the size I'd expect; it may be due to some time delay or not sure why...

pz-max · 2022-09-17T22:39:32Z

I could reduce the repo size to 5.9 MB while keeping all the .py history.

First install filter-repo:
pip install git-filter-repo
Then just add this .txt to the folder above pypsa-africa (generated by find . -type f > paths_i_want.txt and some manual labor):
paths_i_want.txt
Finally, execute the following in the pypsa-africa repo:
git filter-repo --paths-from-file ../paths_i_want.txt

Design note:

The .txt lists paths that exist in the current repo.
Filtering with the git filter-repo --paths-from-file command only keeps the history of the files that exist right now, under their existing paths (any folder changes etc. are not tracked).
However, to avoid losing some Python code history, the .txt was extended by the glob: *.py
As can be seen in the .txt, I removed the existing paths with /images (they are not used anywhere in the documentation) and /notebooks (because we aim to move them to a separate repository)
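A paths file along these lines could also be generated from the tracked files instead of with find. This is a sketch, not the attached .txt: the function name `write_paths_file` is made up, and the exclusion pattern is an assumption that drops any path containing an images/ or notebooks/ folder.

```shell
#!/usr/bin/env bash
# Write a list for `git filter-repo --paths-from-file`:
# every currently tracked path except those under images/ or notebooks/,
# plus a glob entry that keeps the history of all Python files.
# Usage: write_paths_file <repo-dir> <output-file>
write_paths_file() {
  local repo="$1" out="$2"
  # ls-files prints repo-relative paths without a leading "./".
  git -C "$repo" -c core.quotePath=false ls-files \
    | grep -v -E '(^|/)(images|notebooks)/' > "$out"
  # Lines starting with "glob:" are treated as glob patterns by filter-repo.
  echo 'glob:*.py' >> "$out"
}
```

Using git ls-files avoids picking up untracked or ignored files, so less manual cleanup of the list should be needed.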

davide-f · 2022-09-17T23:52:21Z

Update: I locally did what you did as well max: I removed the images folder and the corresponding history and the size is now 38MB

Very interesting approach; however, you removed all the notebooks.
Personally, I think that having the notebooks inside the repo should be the way to go. Keeping them in a parallel repo wouldn't be as comfortable to use as having them in the same one

Update2: I added what you did on my repo, same link as before but (a) keeping the notebooks and (b) fixing the missing images in the doc/image to match the images of the api_reference.
Images of the validation notebooks have been moved to a newly created images in the validation folder.
Moreover, the notebooks coded as OLD_* have been completely removed as well.
The total size is 21.3MB

Removing the OLD_* notebooks saves only 0.5MB and removes around 100-200 commits. Not sure if it is worth it. However, those notebooks are no longer needed.
Result: https://github.com/davide-f/pypsa-africa/graphs/contributors

pz-max · 2022-09-18T17:44:58Z

Proposal B: Keep jupyter notebooks outside of the PyPSA-Earth repository & add clear documentation on use.

Context & problem:
Jupyter notebooks are used to:

explore the model inputs and outputs
visualise results in paper quality
perform validation
demonstrate certain features

In my opinion, Jupyter notebooks are only useful if they are precompiled, so that the user knows what images or results to expect. Otherwise the user/developer will waste time debugging code that was not necessary in the first place. The problem is that not all notebooks, or perhaps none at all, should be hosted in the pypsa-earth repository: hosting them there is what bloated the repository.

Design idea:
Let's create a new repository where we add the compiled notebooks. There, every user can push their new validation and demonstration notebooks without causing many problems, and occasionally executing destructive commands such as filter-repo is not too bad. The new repository can be designed in such a way that it integrates smoothly with the PyPSA-Earth repository. For instance, the PyPSA-Earth get-started documentation could say:
a) create a new folder: mkdir "pypsa-earth-project"
b) move into the new folder: cd pypsa-earth-project
c) git clone pypsa-earth
d) git clone jupyter-repo
Thereby, each compiled notebook will be linked by commands to read the results in pypsa-earth-project/pypsa-earth/results. This allows easy and smooth usage of any Jupyter notebook as before, while clearly keeping all notebooks away from the code and having them all in one central place.
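As a sketch of what steps a) to d) could look like in the get-started docs: the function name `setup_project` and the local folder name `notebooks` are hypothetical, and the two repository URLs are passed in as arguments because the notebook repository does not exist yet.

```shell
#!/usr/bin/env bash
# Create the project folder and clone code and notebooks side by side.
# Usage: setup_project <project-dir> <code-repo-url> <notebook-repo-url>
setup_project() {
  local project_dir="$1" code_repo="$2" notebook_repo="$3"
  mkdir -p "$project_dir"
  # The model code goes to <project-dir>/pypsa-earth ...
  git clone -q "$code_repo" "$project_dir/pypsa-earth"
  # ... and the notebooks next to it, so that from inside the notebook
  # folder the results are reachable as ../pypsa-earth/results.
  git clone -q "$notebook_repo" "$project_dir/notebooks"
}
```

Because both clones share one parent folder, the notebooks can use a fixed relative path to the results and no per-user configuration should be needed.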

A side benefit: the PyPSA-Earth repository will shrink from 360 MB to 6-25 MB at the current version.

Maybe another strong argument for this option: adding Jupyter notebooks later, if really needed, is not destructive, while removing Jupyter notebooks could be destructive (requiring filter-repo, and everyone would need to work on a new fork/clone).

pz-max · 2022-09-23T17:18:12Z

We decided to go for Proposal B.
The action plan for project rebranding and cleaning is given here: #460

pz-max · 2022-09-26T12:50:07Z

We decided to go for Proposal B. The action plan for project rebranding and cleaning is given here: #460

MISSION COMPLETED. From 360MB to 1.8MB.

pz-max mentioned this issue Aug 11, 2022

Reduce size of jupyter notebooks #227

Closed

mnm-matin mentioned this issue Sep 5, 2022

update clone instructions in README.md to use a shallow clone #449

Closed

pz-max closed this as completed Sep 26, 2022

FabianHofmann pushed a commit that referenced this issue Aug 16, 2024

Merge pull request #317 from pypsa-meets-earth/adaptations_to_industr…

b7da641

…y_demand Adaptations to industry demand
