Clean .git history #317

Closed
pz-max opened this issue Mar 28, 2022 · 14 comments

@pz-max
Member

pz-max commented Mar 28, 2022

Just realised that our .git history (hidden files) is quite large (270MB). One can shrink it by clearing the outputs of all historic .ipynb files.

The last command from here does the job:

git filter-branch --tree-filter "python3 -m nbconvert --ClearOutputPreprocessor.enabled=True --inplace *.ipynb **/*.ipynb || true"
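
To illustrate what "clearing outputs" amounts to, here is a minimal Python sketch that edits the notebook JSON directly with the standard library. It is not how nbconvert works internally; the helper names `clear_outputs` and `clear_file` are made up for illustration, and the code assumes nbformat 4 notebooks with a top-level `cells` list.

```python
import json
from pathlib import Path


def clear_outputs(notebook: dict) -> dict:
    """Remove outputs and execution counts from all code cells.

    Assumes the nbformat 4 layout (a top-level "cells" list).
    The dict is modified in place and returned for convenience.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_file(path: Path) -> None:
    """Rewrite one .ipynb file in place without its outputs."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    cleaned = clear_outputs(notebook)
    path.write_text(json.dumps(cleaned, indent=1) + "\n", encoding="utf-8")
```

Since the figures and tables embedded in code-cell outputs are usually what makes a notebook large, dropping them should account for most of the saving.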
@pz-max
Member Author

pz-max commented Mar 28, 2022

One could test that now for our hackathon repo pypsa-meets-earth/pypsa-africa-hackathon#14

@pz-max
Member Author

pz-max commented Mar 28, 2022

This is for removing additional big files, e.g. files above 1M:
https://netdevops.me/2021/remove-binaries-and-big-files-from-git-repo/

@pz-max
Member Author

pz-max commented Mar 28, 2022

List of file types that could be cleaned:
*.json *.geojson *.pickle *.pbf *.shp *.gpkg *.zip *.tif *.nc
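
A quick way to see which of these files are actually large could look like the sketch below. It only inspects the current working tree, not the git history, and the function name `find_large_files` and the default 1,000,000-byte threshold are assumptions for illustration.

```python
from fnmatch import fnmatch
from pathlib import Path

# Patterns copied from the list above.
PATTERNS = ["*.json", "*.geojson", "*.pickle", "*.pbf", "*.shp",
            "*.gpkg", "*.zip", "*.tif", "*.nc"]


def find_large_files(root, patterns=PATTERNS, min_bytes=1_000_000):
    """Return (size, relative path) pairs for matching files, largest first."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # Skip git's own storage and anything that is not a regular file.
        if ".git" in rel.parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size >= min_bytes and any(fnmatch(path.name, p) for p in patterns):
            hits.append((size, rel.as_posix()))
    return sorted(hits, reverse=True)
```

For files that were deleted long ago but still live in the history, the report from git filter-repo --analyze is the better source.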

@pz-max
Member Author

pz-max commented Mar 28, 2022

git-filter-repo seems to be the new tool to do the job. Docs can be found here.

Note: Actions with git filter-repo can be really destructive. Always make a copy of the original, perform dry-runs and check that it worked. After cleaning the repository, NO old clone is allowed to push, otherwise the histories will be mixed up, resulting in a mess.

  • PRs should only be merged from a new clone that contains the changes, or ideally be repeated on a new clone.

@pz-max
Member Author

pz-max commented Mar 28, 2022

Great tip in the Step-by-Step guide (called DISCUSSION here):

  • We create a mirror clone of PyPSA-Africa: git clone <repo-name> --mirror
  • Run git filter-repo --analyze
  • We do all cleanup actions first with --dry-run, then without:
    1. empty all historic .ipynb,
    2. keep only paths that exist in the current pypsa-africa version,
    3. remove big files
  • Push the changes to a newly created dummy repository (not to pypsa-africa, which will remain dirty)
  • Once the dummy works, we push the clean repository to the new pypsa-earth version
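
The workflow above can be sketched as a small Python helper that only builds the commands and runs nothing. This is an illustration under assumptions: the function name `cleanup_plan`, the mirror folder name and the paths file are hypothetical, and it covers only cleanup actions 2 and 3 (emptying notebooks is not a plain filter-repo flag).

```python
import shlex


def cleanup_plan(repo_url, dummy_url, paths_file,
                 mirror_dir="repo-mirror.git", max_blob="1M", dry_run=True):
    """Return the workflow as a list of argument lists; nothing is executed."""
    filter_cmd = [
        "git", "-C", mirror_dir, "filter-repo",
        "--paths-from-file", paths_file,        # action 2: keep listed paths
        "--strip-blobs-bigger-than", max_blob,  # action 3: drop big blobs
    ]
    if dry_run:
        filter_cmd.append("--dry-run")
    plan = [
        ["git", "clone", "--mirror", repo_url, mirror_dir],
        ["git", "-C", mirror_dir, "filter-repo", "--analyze"],
        filter_cmd,
    ]
    if not dry_run:
        # Push to the throwaway repository, never back to the original.
        plan.append(["git", "-C", mirror_dir, "push", "--mirror", dummy_url])
    return plan


# Print the dry-run variant so the commands can be reviewed before use.
for command in cleanup_plan("<repo-url>", "<dummy-repo-url>", "<paths-file>"):
    print(shlex.join(command))
```

Keeping the push out of the dry-run plan makes it harder to push a half-checked rewrite by accident.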

PyPSA-Africa needs to remain with dirty history at the beginning.
@davide-f

@davide-f
Member

davide-f commented May 31, 2022

I'm wondering whether we may rebase the old commits by squashing them: intermediate files that are created and deleted in the squashed commits may disappear from the history (hopefully).
Probably, if we do that for the initial few hundred commits, we may automatically solve many issues.
It may be worth trying on a fork/branch

@pz-max
Member Author

pz-max commented Aug 11, 2022

Some more info:

Whatever we go for, we should experiment first in a fork or a dummy repo

@mnm-matin
Member

#449 shows that even if we remove all the history and keep the latest code we are still close to 100 MB. It could be possible/essential to clean the current state of the notebook folder and images folder before sanitizing the history.

@davide-f
Member

davide-f commented Sep 17, 2022

I've tested on my fork at https://github.com/davide-f/pypsa-africa (branch main)

When downloading the repository, the size is now about 50MB, see image below.
image

What has been done:

  • First, using git filter-repo --analyze, I generated a report with the sizes of the folders that store the most data. The output was in {parent}\pypsa-africa\.git\filter-repo.
  • The output showed that there were binaries in the old folders data_exploration, data/clean, data/osm. The entire content of those folders has been removed from the git history using git filter-repo --path {dir} --invert-paths. The size of the pypsa-africa folder decreased to about 100MB
  • A large number of changes and much of the stored data were in the folder notebooks. Therefore, the entire history has been rewritten to strip the stored data from all notebooks. That has been done using this procedure; the procedure was also found on another website (my feeling is that this procedure has been copied from the other source). This is the last step, which led the total size of the repo to fall to 50MB
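
To compare sizes before and after each step, one could measure the clone folder with a few lines of Python. This is only a rough sketch; the helper name `dir_size_mb` is made up, and it counts 1 MB as 10**6 bytes, which may differ slightly from what a file manager reports.

```python
import os


def dir_size_mb(path):
    """Sum the sizes of all regular files below path, in MB (10**6 bytes)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Symlinks are skipped so linked files are not counted twice.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total / 1e6
```

Calling it on the repository folder and on its .git subfolder separately shows how much of the size is history rather than checked-out files.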

Note: To keep the notebooks empty, I've proposed to also add a pre-commit rule
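
As an illustration of what such a rule could check, here is a minimal Python sketch of a script that fails when a notebook still contains outputs. It is not the rule proposed above; the function names are hypothetical, and it assumes nbformat 4 notebooks and that the hook receives the file names as arguments.

```python
import json


def cells_with_outputs(notebook: dict) -> list:
    """Return the indices of code cells that still carry outputs."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]


def main(paths) -> int:
    """Return 1 if any of the given notebooks has outputs, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            dirty = cells_with_outputs(json.load(handle))
        if dirty:
            print(f"{path}: outputs found in cells {dirty}")
            failed = True
    return 1 if failed else 0
```

A hook entry point would then call sys.exit(main(sys.argv[1:])), so that a non-zero exit code blocks the commit.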

Note2: I am not sure why the badge in this branch is not showing the size I'd expect; it may be due to some delay in the update.

@pz-max
Member Author

pz-max commented Sep 17, 2022

I could reduce the repo size to 5.9 MB while keeping all the .py history.

  1. First install filter-repo:
    pip install git-filter-repo

  2. Then just add this .txt to the folder above pypsa-africa (generated by find . -type f > paths_i_want.txt and some manual labor):
    paths_i_want.txt

  3. Finally, execute the following in the pypsa-africa repo:
    git filter-repo --paths-from-file ../paths_i_want.txt


Design note:

  • The .txt lists the paths that exist in the current repo.
  • Filtering with the step 3 command keeps only the history of the files that exist right now, under their existing paths (any folder changes etc. are not tracked).
  • However, to not lose some Python code history, the .txt was extended by the glob: *.py
  • As can be seen in the .txt, I removed the existing paths with /images (they are not used anywhere in the documentation) and /notebooks (because we aim to move them to a separate repository)
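
The manual labor behind the .txt could be partly automated along the lines of the sketch below. It is an assumption-laden illustration, not the script that produced the attached file: the function name `build_keep_list` is hypothetical, and it relies on git filter-repo reading plain lines as literal paths and lines prefixed with glob: as glob patterns.

```python
from pathlib import Path


def build_keep_list(root, excluded_dirs=("images", "notebooks"),
                    extra_globs=("*.py",)):
    """Return lines for a --paths-from-file list: current files plus globs."""
    root = Path(root)
    lines = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or ".git" in rel.parts:
            continue
        # Drop every file that lives below an excluded folder name.
        if any(part in excluded_dirs for part in rel.parts[:-1]):
            continue
        lines.append(rel.as_posix())
    # Plain lines are literal paths; the "glob:" prefix marks a pattern.
    lines.extend(f"glob:{pattern}" for pattern in extra_globs)
    return lines
```

Writing the result with "\n".join(lines) into the .txt above the repository gives a file that can be passed to --paths-from-file.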

@davide-f
Member

davide-f commented Sep 17, 2022

Update: I locally did what you did as well max: I removed the images folder and the corresponding history and the size is now 38MB

Very interesting approach; however, you removed all the notebooks.
Personally, I think that having the notebooks inside the repo should be the way to go. Keeping them in a parallel repo wouldn't be as comfortable to use as having them in the same one.

Update2: I added what you did on my repo, same link as before but (a) keeping the notebooks and (b) fixing the missing images in the doc/image to match the images of the api_reference.
Images of the validation notebooks have been moved to a newly created images in the validation folder.
Moreover, the notebooks coded as OLD_* have been completely removed as well.
The total size is 21.3MB

Removing the OLD_* notebooks removes only 0.5MB and around 100-200 commits. Not sure if it is worth it. However, those notebooks are no longer needed.
Result: https://github.com/davide-f/pypsa-africa/graphs/contributors

@pz-max
Member Author

pz-max commented Sep 18, 2022

Proposal B: Keep jupyter notebooks outside of the PyPSA-Earth repository & add clear documentation on use.

Context & problem:
Jupyter notebooks are used to:

  • explore the model inputs and outputs
  • visualise results in paper quality
  • perform validation
  • demonstrate certain features

In my opinion, Jupyter notebooks are only useful if they are precompiled, so that the user knows what images or results to expect. Otherwise the user/developer will waste time debugging code that was not necessary in the first place. The problem is that not all notebooks, or perhaps none at all, should be hosted in the pypsa-earth repository. Hosting them there is what bloated our repository.

Design idea:
Let's create a new repository where we add the compiled notebooks. There, every user can push their new validation and demonstration Jupyter notebooks without causing many problems. Occasionally executing destructive commands such as filter-repo there is not too bad. The new repository can be designed in such a way that it integrates smoothly with the PyPSA-Earth repository. For instance, we could write in the PyPSA-Earth get-started documentation:
a) create a new folder: mkdir "pypsa-earth-project"
b) move into the new folder: cd pypsa-earth-project
c) git clone pypsa-earth
d) git clone jupyter-repo
Thereby, each compiled notebook will be linked by commands to read the results in pypsa-earth-project/pypsa-earth/results. This allows any Jupyter notebook to be used as easily as before, while keeping all notebooks clearly away from the code and in one central place.
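
How a notebook in the second repository might locate the results could be sketched as follows. The helper name `find_results_dir` is hypothetical, and the folder names simply follow the layout from steps a) to d); a real implementation may prefer a config file instead.

```python
from pathlib import Path


def find_results_dir(start=None, code_repo="pypsa-earth"):
    """Walk upwards from start until a <code_repo>/results folder is found."""
    here = Path(start or Path.cwd()).resolve()
    for folder in [here, *here.parents]:
        candidate = folder / code_repo / "results"
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"no {code_repo}/results found above {here}")
```

Because the search walks upwards, the same call should work from any subfolder of the notebook repository as long as both repositories sit side by side in the project folder.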

A side benefit: the PyPSA-Earth repository will shrink from 360MB to 6-25MB at the current version.

Maybe another strong argument for this option: adding Jupyter notebooks later, if they are really needed, is not destructive, while removing Jupyter notebooks could be destructive (requiring filter-repo, and everyone needs to work on a new fork/clone).

@pz-max
Member Author

pz-max commented Sep 23, 2022

We decided to go for Proposal B.
The action plan for project rebranding and cleaning is given here: #460

@pz-max
Member Author

pz-max commented Sep 26, 2022

MISSION COMPLETED. From 360MB to 1.8MB.
image

@pz-max pz-max closed this as completed Sep 26, 2022
FabianHofmann pushed a commit that referenced this issue Aug 16, 2024