Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General Trilinos maintenance: useless spaces are making up to 2 MB #12398

Closed
romintomasetti opened this issue Oct 12, 2023 · 8 comments
Closed
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. type: enhancement Issue is an enhancement, not a bug

Comments

@romintomasetti
Copy link
Contributor

romintomasetti commented Oct 12, 2023

Enhancement

Trilinos is full of:

  • trailing spaces
  • lines filled up with only spaces

On today's develop, running the Python script attached (see below) leads to:

INFO:root:> Parsed 51715 files
INFO:root:> There are 1435701 errors of type Types.TrailingSpace, total weight 1810923
INFO:root:> There are 46797 errors of type Types.OnlySpace, total weight 201760
INFO:root:> There are 480 errors of type Types.ParseError, total weight 480

That is, each time someone makes a shallow clone of Trilinos (thinking of the CI mostly), it gets almost 2 megabytes of...spaces (I don't know what happens for the 480 files for which I got a UnicodeDecodeError and I guess it deserves a new issue on its own).

This:

  • increases the load on the network ☎️
  • makes some people shiver 😨
  • is frankly not environmental friendly at all 🥬

Actions:

  1. I agree that removing trailing spaces all at once is not desirable as it breaks the history of the modified lines. However, I think it would be a great addition to add something along commonTools/test/utilities/check-mpi-comm-world-usage.py (@JacobDomagala) that ensures no new trailing space is added.
  2. Make a single PR that, for Trilinos packages, removes lines whose content is only spaces.
    • This would not modify the meaning of the files, but would already save somewhere around 201760 characters (I guess a bit less than that, see next bullet point).
    • Of course, files that are snapshots of other repos should not be touched (Google Test, Kokkos and others)
    • I haven't made a thorough study to understand what are the sources of the spaces but sure there are many in files maintained by Trilinos packages so it's worth a try.

To quote someone that phrased it much better then me

Trailing whitespace has nothing to do with the quality of the code because the code is the same with or without it.

Trailing whitespace is just garbage. It makes source files bigger and makes comparing changes more difficult. This is especially annoying when you have to do a code review or merge, but one version of the file has different spaces than the other. Keeping trailing spaces trimmed is therefore considered as part of basic hygiene together with consistent indentation, consistent new-lines and so on.

Last pro: useless tokens clearly have an (slight) impact on the compiler's required time to parse Trilinos source code. Removing the trailing spaces might induce very tiny gains with that respect.

trailing_white_space.py.txt

@romintomasetti romintomasetti added the type: enhancement Issue is an enhancement, not a bug label Oct 12, 2023
@cwpearson
Copy link
Contributor

cwpearson commented Oct 17, 2023

This will replace any line that is only spaces with an empty line:

shopt -s globstar
sed -i -e 's/^[[:space:]]*$//g' packages/PACKAGE/**/*.cpp packages/PACKAGE/**/*.hpp

This will remove any trailing spaces on a line (including a line that is only spaces, note missing ^ in regex)

shopt -s globstar
sed -i -e 's/[[:space:]]*$//g' packages/PACKAGE/**/*.cpp packages/PACKAGE/**/*.hpp

e.g. for Tpetra:

develop...cwpearson:Trilinos:cleanup/remove-trailing-spaces

@GrahamBenHarper
Copy link
Contributor

@romintomasetti @cwpearson I have done some work related to this previously in MueLu with #11258 using a sed command inside a find command. I also opened #12058, and the MueLu team decided that it would be an uphill battle to try to maintain standards occasionally without also having some sort of Github action to enforce standards on a PR level.

Developing such a Github action is a work in progress still, but it's only one package. Enforcing it across all of Trilinos is a different story entirely and would require mass adoption from developers and leadership.

@romintomasetti
Copy link
Contributor Author

Hi @cwpearson and @GrahamBenHarper !

Thanks for sharing your thoughts.

I think there are 2 points w.r.t. trailing spaces:

  1. New commits should not introduce trailing spaces. Trailing spaces is a bad style and I see no valid reason why in 2023 any package would be allowed exemption from this rule. Any decent IDE has capabilities for helping with trailing spaces BTW.
  2. As I said, changing existing lines of code to enforce any sort of style rule is the best way to lose the history of the line. I think it's best applying stylistic changes only when the line is touched by a new development.

Based on these 2 points, I would propose that a Github action is setup for banning new trailing spaces in any Trilinos package. See below for a simple script that does that.

This means:

  1. No package maintainer is forced to make a PR to delete current trailing spaces.
  2. All package developers must comply to the no-more-trailing-spaces rule. No exception to that.

@GrahamBenHarper Enforcing clang-tidy at the Trilinos level would be easy, because you can opt-in for a few basic rules (for instance Trilinos could decide that only llvm-namespace-comment must be enforced globally, see also https://clang.llvm.org/extra/clang-tidy/checks/list.html for a complete list of supported checks). I agree that enforcing clang-format is a different story, because the checks are not opt-in and could lead to significant changes that sometimes aren't pretty. And, as I mentioned, significant stylistic changes of existing code without any new development should not be allowed in my opinion.

Here is a quickly drafted script. I used Python instead of any bash-based solution because of the additional flexibility to filter files you want to ignore for instance. I could run in python:latest Docker image in which you install GitPython. Use the script without any argument to look at all tracked files, or with --tracked-changed-only --commit-start upstream/develop --commit-end HEAD to compare your local branch to whatever upstream branch you want.

import argparse
import logging
import os
import pathlib
import typing

import git
import typeguard

@typeguard.typechecked
def parse_args() -> argparse.Namespace:
    """Parse CLI arguments."""
    parser = argparse.ArgumentParser()

    parser.add_argument("--directory", help = "Directory wherein checks are performed (recursively)", required = False, default = os.getcwd())

    group = parser.add_argument_group(title = 'Tracked changed only', description = "Only look at Git tracked files that changed between 2 commits.")
    group.add_argument("--tracked-changed-only", action = "store_true", help = "Enable looking at tracked changed files only.")
    group.add_argument("--commit-start", type = str, required = False, help = "If only tracked changed files must be considered, this is the start commit.")
    group.add_argument("--commit-end"  , type = str, required = False, help = "If only tracked changed files must be considered, this is the end commit.")

    args = parser.parse_args()

    if args.tracked_changed_only and (args.commit_start is None or args.commit_end is None):
        parser.error("Start and end commits are required.")

    return args

@typeguard.typechecked
def files_to_look_at(args : argparse.Namespace) -> typing.Iterator[pathlib.Path]:
    """
    Generator that either:
        - only keeps files that are tracked and modified between 2 Git commits are kept.
        - keeps all tracked files
    """
    logging.info(f"Listing tracked files to look at based on {args}.")

    repo = git.Repo(path = args.directory)

    if args.tracked_changed_only:
        
        commit_start = repo.commit(args.commit_start)
        commit_end   = repo.commit(args.commit_end)

        changed_files = repo.git.diff('--name-only', commit_start, commit_end).splitlines()

        logging.debug(
            "Filtering any file matching generator with files changed between "
            + f"'{args.commit_start}' ({commit_start}) and '{args.commit_end}' ({commit_end}):\n"
            + str(changed_files)
        )

        for changed in changed_files:
            yield pathlib.Path(changed).absolute()
    else:
        for tracked in repo.commit().tree.traverse():
            path = args.directory / pathlib.Path(tracked.path)
            if path.is_file():
                yield path

@typeguard.typechecked
def filter_what_you_want_to_ignore(files : typing.Iterable[pathlib.Path]) -> typing.Iterator[pathlib.Path]:
    """
    Add any filtering rule.
    """
    logging.info("Filtering to keep only files we want.")

    # As an example, let's filter any PNG or YAML file.
    for file in files:
        if not any(file.suffix ==  suffix for suffix in ['.png', '.yaml']):
            yield file

@typeguard.typechecked
def check_trailing_space(files : typing.Iterable[pathlib.Path]) -> None:
    """
    Check for any trailing space in provided files.
    """
    logging.info(f"Checking for trailing spaces.")

    has_error = False
    for file in files:
        with open(file, 'r') as f:
            for counter, line in enumerate(f):
                if line.endswith(' \n'):
                    has_error = True
                    logging.error(f"Trailing space at {file}:{counter+1}")

    if has_error:
        raise RuntimeError("At least one trailing space detected.")

if __name__ == "__main__":

    logging.basicConfig(level = logging.INFO)

    args = parse_args()

    check_trailing_space(files = filter_what_you_want_to_ignore(files = files_to_look_at(args = args)))

@ccober6
Copy link
Contributor

ccober6 commented Oct 18, 2023

I would like to add this topic to the TUG's Developers Day Open Discussion, and get additional input and see if we can get consensus.

@romintomasetti
Copy link
Contributor Author

@ccober6 It seems the topic was not addressed during TUG's Open Discussion session... What's the way forward? 😄

@cgcgcg
Copy link
Contributor

cgcgcg commented Nov 3, 2023

@romintomasetti We will put together a clang-format approach at the MueLu level (github action, pre-commit hook, and some type of bot). Once we feel like we have a reasonable approach other packages can opt in. We will have one massive reformatting commit that can be ignored in git blame.

Copy link

github-actions bot commented Nov 3, 2024

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Nov 3, 2024
Copy link

github-actions bot commented Dec 4, 2024

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Dec 4, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. type: enhancement Issue is an enhancement, not a bug
Projects
None yet
Development

No branches or pull requests

5 participants