Git cleanup

I use Git a lot, in my daily job as well as for this blog. When using it, I often rebase locally before pushing, to have a clean and readable history.

A sample workflow

For my blog, the branching model looks like the following:

o---o---o---o master
             \
              \---o---o feature/newposts

master

As expected, this branch is the production site.

feature/newposts

This branch is dedicated to new posts. There’s one post per commit.

Also, to speed up rendering, it contains only a handful of the latest posts. The first commit after master removes most of the older ones.

To publish a new post, I cherry-pick its commit from feature/newposts to the master branch. Also, when I make changes to master, I rebase feature/newposts onto master, to have the latest updates.
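The publish-then-rebase cycle described above can be sketched as follows. This is a simplified illustration in a throwaway repository, not the actual blog setup: the file names, commit messages and identity are made up, and the commit that removes older posts is left out. It relies on git rebase skipping commits whose changes are already present on the target branch, which is why the cherry-picked post no longer appears on feature/newposts afterwards.

```shell
set -e
# Throwaway repository, so the sketch never touches a real one.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

# master: the production site.
echo "site" > index.txt
git add index.txt
git commit -q -m "Initial site"

# feature/newposts: one post per commit.
git checkout -q -b feature/newposts
echo "draft 1" > post1.txt
git add post1.txt
git commit -q -m "Post 1"
echo "draft 2" > post2.txt
git add post2.txt
git commit -q -m "Post 2"

# Publish the first post: cherry-pick its commit onto master.
git checkout -q master
git cherry-pick feature/newposts~1 >/dev/null

# Change master, then rebase feature/newposts on top of it.
echo "new layout" >> index.txt
git commit -q -am "Update layout"
git rebase -q master feature/newposts

# Count the commits on feature/newposts that are not on master yet.
pending=$(git rev-list --count master..feature/newposts)
echo "posts still pending: $pending"
```

After the rebase, only the unpublished post remains ahead of master.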

The impact of rebasing

Things start to get interesting when I rebase interactively on master.

  1. Initial state
    A---B---C---D master
                 \
                  \---a---b feature/newposts
  2. Rebase interactively on master
    A---B---C---D
         \       \
          \       \---a---b feature/newposts
           \
            \---C'---D' master
  3. Rebase feature/newposts onto master
    A---B---C---D
         \
          \--C'---D' master
                   \
                    \---a---b feature/newposts

See commits C and D in the last diagram? They are no longer referenced by any branch, so git log does not display them. Still, they can be displayed via git reflog.

Likewise, those commits are not displayed in GUIs such as SourceTree.
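To see this behaviour without setting up an interactive rebase, here is a minimal sketch that rewrites a commit with git commit --amend instead. The amend is only a stand-in: like a rebase, it replaces a commit with a new one and leaves the old one behind. The repository, file name and identity are hypothetical.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo "C" > file.txt
git add file.txt
git commit -q -m "C"
echo "D" >> file.txt
git commit -q -am "D"

old=$(git rev-parse HEAD)    # the original D

# Rewrite D into D'; the amend stands in for the interactive rebase.
git commit -q --amend -m "D, reworded"
new=$(git rev-parse HEAD)    # D'

# git log walks back from the branch tip, so the old commit is absent...
in_log=$(git log --format=%H | grep -c "$old" || true)
# ...but the reflog of HEAD still records it.
in_reflog=$(git reflog --format=%H | grep -c "$old" || true)

echo "old commit found in log: $in_log, in reflog: $in_reflog"
```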

Dangling and unreachable commits

Time for some definitions:

unreachable object

An object which is not reachable from a branch, tag, or any other reference.

dangling object

An unreachable object which is not reachable even from other unreachable objects; a dangling object has no references to it from any reference or object in the repository.

— Git glossary
https://git-scm.com/docs/gitglossary/

Using those definitions, commits C and D in the above diagrams are unreachable, because no reference leads to either of them. Moreover, commit D is also dangling, because no other object references it, while commit C is not, because D points to it.

To list those dangling and unreachable objects, one can use the git fsck command:

git-fsck - Verifies the connectivity and validity of the objects in the database

— git-fsck
https://git-scm.com/docs/git-fsck

For example, to display unreachable commits:

git fsck --unreachable

If an expected commit is not displayed, perhaps it is still referenced by a reflog. In that case, there’s an option to ignore reflog references:

git fsck --unreachable --no-reflogs

The same command can be used to list dangling commits only: replace --unreachable with --dangling.
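The difference between unreachable and dangling can be checked with a small experiment. In this sketch, commits C and D are orphaned by deleting a scratch branch rather than by rebasing; that is a simplification, but the resulting objects are in the same situation. The branch name, file name and identity are hypothetical.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo "A" > file.txt
git add file.txt
git commit -q -m "A"

# Commits C and D live on a scratch branch that is then deleted.
git checkout -q -b scratch
echo "C" >> file.txt
git commit -q -am "C"
echo "D" >> file.txt
git commit -q -am "D"
git checkout -q master
git branch -q -D scratch

# Reflogs count as references by default: nothing is reported.
with_reflogs=$(git fsck --unreachable 2>/dev/null | grep -c "^unreachable commit" || true)
# Ignoring reflogs, both C and D are unreachable...
unreachable=$(git fsck --unreachable --no-reflogs 2>/dev/null | grep -c "^unreachable commit" || true)
# ...but only D is dangling, since D itself still points to C.
dangling=$(git fsck --dangling --no-reflogs 2>/dev/null | grep -c "^dangling commit" || true)

echo "with reflogs: $with_reflogs, unreachable: $unreachable, dangling: $dangling"
```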

Cleanup proper

Git is quite efficient at storing text. And yet, there’s no point in storing either reflogs or unreachable commits past a certain point.

There’s a garbage collector in Git. It may run automatically as part of some commands. You know the GC has run when you see output like the following:

Counting objects: 9451, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4657/4657), done.
Writing objects: 100% (9451/9451), done.
Total 9451 (delta 3843), reused 8900 (delta 3584)

It’s also possible to run it explicitly:

git gc

Calling the GC removes unreachable objects; by default, only those older than two weeks are pruned.

Besides removing unreachable objects, the GC also compresses file revisions.

However, remember that most unused objects are still referenced by reflogs. Thus, they are not considered unreachable, and therefore neither are they garbage collected. The question now is how to expire reflogs to make objects unreachable?
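A quick way to convince yourself that reflogs keep objects alive is the sketch below. It rewrites a commit with git commit --amend (a stand-in for a rebase), then runs the GC with --prune=now, which asks it to prune unreachable objects regardless of their age. The repository, file name and identity are hypothetical.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo "C" > file.txt
git add file.txt
git commit -q -m "C"
echo "D" >> file.txt
git commit -q -am "D"

old=$(git rev-parse HEAD)    # the commit about to be replaced
git commit -q --amend -m "D, reworded"

# Collect garbage, pruning unreachable objects of any age.
git gc -q --prune=now

# The old commit is still there: the reflogs reference it.
if git cat-file -e "$old^{commit}" 2>/dev/null; then
  survived=yes
else
  survived=no
fi
echo "old commit survived gc: $survived"
```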

Reflogs expiry

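To expire reflogs, run the command below. The --all option processes the reflogs of all references; without it, or an explicit reference, nothing gets expired.

git reflog expire --all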

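Reflog entries are split into two categories, standard and unreachable. Unreachable entries are those that cannot be reached from the current tip of the branch: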

                                   Standard           Unreachable
Expired after (by default, days)   90                 30
Command parameter                  --expire=<time>    --expire-unreachable=<time>

For example, to expire reflog entries older than two weeks instead of the default 90 days, use:

git reflog expire --expire=2.weeks.ago --all

Once reflogs have been expired, the relevant commits truly become unreachable, and can finally be removed by the garbage collector.
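Putting both steps together, the sketch below expires every reflog entry and then runs the GC. It is deliberately aggressive: --expire=now drops all entries whatever their age, and --prune=now removes unreachable objects immediately, so it is meant for a throwaway repository rather than real work. The commit is orphaned with git commit --amend as a stand-in for a rebase; the file name and identity are hypothetical.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo "C" > file.txt
git add file.txt
git commit -q -m "C"
echo "D" >> file.txt
git commit -q -am "D"

old=$(git rev-parse HEAD)    # the commit about to be replaced
git commit -q --amend -m "D, reworded"

# Step 1: expire every reflog entry, for all references.
git reflog expire --expire=now --all

# Step 2: collect garbage, pruning unreachable objects of any age.
git gc -q --prune=now

# The old commit is now gone for good.
if git cat-file -e "$old^{commit}" 2>/dev/null; then
  removed=no
else
  removed=yes
fi
echo "old commit removed: $removed"
```

With less drastic values, such as --expire=2.weeks.ago and the default pruning delay, the same two steps apply while keeping a safety net for recent work.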

Conclusion

This post has looked into how commits reference each other in Git, and how they can be cleaned up. In most cases, however, the default automated cleanup should be enough. Remember that by removing reflogs and commits, you make it harder for yourself to recover from mistakes.

Nicolas Fränkel

Nicolas Fränkel is a Software Architect with 15 years of experience consulting for many different customers, in a wide range of contexts (such as telecoms, banking, insurance, large retail and the public sector). He usually works on Java/Java EE and Spring technologies, but has narrower interests like Software Quality, Build Processes and Rich Internet Applications. He currently works for a leading eCommerce solution vendor. He also doubles as a teacher in universities and higher education schools and as a trainer, and triples as a book author.
