Tales from the commit log

A piece of software never exists in isolation or without context. The problem to solve and the reasoning and experience of its programmers shape the implementation and are important information for future developers. Even well-written code cannot convey all of this by itself. Therefore, further means of communicating such information are required. Code comments and external documentation are the obvious choices. However, with Git we also have the chance to design the commit log in such a way that it deliberately tells a story about the implementation that is not visible from the code itself.

Why code and change sets need to communicate

Programming is a form of human communication, mostly with other humans; incidentally also with the computer by instructing it to execute a function for us. Therefore, when implementing something we need to consider carefully how to communicate what is done and how, similar to when we talk to each other about a manual task using natural language. We need to negotiate why and how we do things to ensure that everyone understands what to do and in which way. Otherwise, misinterpretation, confusion, and conflicts are inevitable.

With current software development techniques such as pull/merge requests, the saying that code is read much more often than it is written or changed is probably more valid than ever. Reading is only fun, and only delivers the required insights, if what you are reading is well written and not a convoluted mess of illogical mysteries. Similarly, reviewing a pull request only works well if the proposed changes are presented in a way that is understandable.

Clean and self-documenting code is an important technique to ensure a reasonable reading experience. However, the code itself mostly explains what is realized and how. The really important pieces of information usually hide behind "why" questions. Why did I use this design over the more obvious one? Why this algorithm and not the other one? Why is this special case needed to fulfill our business requirements? Why do we need to make this change at all? The answers to these questions are the pieces of information whose omission will likely cause confusion sooner or later. Bugs might be introduced in future refactorings if business requirements and their special cases are not known to someone reworking the code. New features might break the intended design if it wasn’t clearly presented. Finally, a pull request review is much more productive and also more pleasant for the reviewer if requirements, design choices, and motivations are known.

Code comments can answer many of these questions if done properly, and much has been written about how to create useful comments (e.g. [Atwood2006], [McConnell2004] chapter 32, [Ousterhout2018]). Good and concise recommendations are:

Comments augment the code by providing information at a different level of detail. Some comments provide information at a lower, more detailed, level than the code; these comments add precision by clarifying the exact meaning of the code. Other comments provide information at a higher, more abstract, level than the code; these comments offer intuition, such as the reasoning behind the code, or a simpler and more abstract way of thinking about the code.
— [Ousterhout2018]

or, more compactly:

Comments should say things about the code that the code can’t say about itself—at the summary level or the intent level.
— [McConnell2004], p. 817

So, the essence is always that we need further ways of explaining the things that the code itself cannot tell. While comments can explain the structure of the code at a certain point in time quite well, they are actually pretty bad at explaining the evolution of a piece of software. Having a long change list at the beginning of each source file is probably not a good idea. What if that file vanishes or is split up at some point in time? This is where the commit log of version control systems such as Git comes into play as an important tool for communication.

Recording history vs. telling a story

The commit log in version control systems is the primary means of documenting the evolution of a software project. Before Git, systems such as CVS or SVN didn’t have real options to manipulate the commit log after the fact. Whenever a commit was formed, it was more or less set in stone. Anything that needed to be reworked or fixed afterwards was a new commit and iteration was not possible. Therefore, one could easily end up with a commit log that looks like the following:

Implement cool new feature, also refactor foo
Fix compilation issue introduced in previous commit
Fix typos in new code comments and fix old bug in tape ejection
Continue work
Change feature implementation from switch case to state pattern
Address review comments
TEMP will continue tomorrow
Finalize second feature

Such a commit log documents the editing history. This history definitely contains valuable information that goes beyond what the source code shows. However, it also contains a lot of noise that hides the essential messages such as the implemented features and their design decisions. In a pull request review, such a commit log does not give the reviewer a concise picture of what has been done. When digging out old commits for debugging or understanding, a half-baked commit with the message "TEMP will continue tomorrow" will leave the reader puzzled at best.

Some parts of this commit log could have been better even when using VCSs without rebasing features. For instance, TEMP commits could have been avoided altogether, and some commits could have been made more atomic and given better commit messages. But humans still make errors and need means to correct and improve. At the very least, feedback from a code review could never have been incorporated into existing commits after the fact. Therefore, even with great editing care, when rebasing isn’t available or used, the commit log will continue to be a record of editing history. As such, it will never be an optimal tool for deliberate communication with other programmers.

Because Git (like some more esoteric VCSs) supports interactive rebasing, we are now in the fortunate situation that commits are no longer immutable and can be changed once new insights have been gained. Therefore, we can now use the commit log to tell a story about our implementation and the reasoning behind it, similar to how code comments can be used. That way, the commit log becomes a tool that can be used consciously to document things the code itself cannot tell. Being a series of sequential code states, the commit log is particularly well suited for describing the evolution of software and of the features and bug fixes it contains.

Going back to the previous example, the commit log could easily have been rewritten to this more concise and descriptive version:

Decouple foo by introducing the observer pattern
Fix null pointer issue when ejecting tape
Implement cool new feature
Implement second feature

This gives a much cleaner and less cluttered view of what has actually been done. A pull request reviewer easily gets a high-level view of what a pull request really contains. With well-structured commits, the review of a PR of considerable size also becomes tractable because commits can be reviewed in isolation. The commit message body of each commit is the place where important why questions are answered. That way, a reviewer can understand the motivations and overarching decisions before looking at the first line of code. Confusion and misunderstanding are less likely in this situation and feedback becomes more valuable because the context for the proposed changes is clear. Also, for someone digging up an old commit, such a history gives more insight because each commit tells a cohesive story and can be understood in isolation.

Properties of good commits

When is a commit log suitable to aid in the typical reading tasks for code? What are the properties of good commits to achieve this? To my mind, there are two important aspects.

Create cohesive commits

The first important aspect is to decide what forms a good commit. To my mind, the principles that lead to a good software architecture also apply to commits: cohesion, separation of concerns, or the single responsibility principle [Martin2014]. A commit is comprehensible if it does only one thing. That one thing can easily be described in the commit message and the commit can be understood without distractions and side tracks. Thus, each commit should be a change set with high cohesion and a single responsibility. If multiple things need to be done, apply separation of concerns and split up the work into multiple commits, one for each concern. This leads to more but smaller commits. Apart from usually being more comprehensible, such commits are also nicer targets for cherry-picking or reverts with less potential for conflicts. Good commits and the ability to split work into small units also align with the ideas of agile software development, where many small increments are favored. It takes some practice to learn how to separate work into distinct commits. A good hint that work should be (or should have been) split is if the subject of the commit message is hard to find because the single purpose of the commit is not clear or important side tracks exist.
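As a minimal sketch of splitting mixed work into cohesive commits, assume that two concerns were edited in the same session but happen to live in different files. The repository, file names, and contents below are hypothetical; only the commit subjects are borrowed from the example log above.

```shell
#!/bin/sh
set -eu

# Throwaway repository; all file names and contents are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "foo v1" > foo.txt
echo "tape v1" > tape.txt
git add foo.txt tape.txt
git commit -q -m "Add foo and tape"

# One editing session touched two unrelated concerns.
echo "foo v2, decoupled" > foo.txt
echo "tape v2, null check added" > tape.txt

# Stage and commit each concern on its own instead of staging everything.
git add foo.txt
git commit -q -m "Decouple foo by introducing the observer pattern"
git add tape.txt
git commit -q -m "Fix null pointer issue when ejecting tape"

# Show the two new commits together with the files each one changed.
git log --format="%s" --name-only -2
```

When both concerns are interleaved within a single file, git add -p offers the same separation at the level of individual hunks.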

Explain what has been done and WHY

Apart from forming small and cohesive commits, the most important tool in each commit for communicating is the commit message. A lot has been written about how to create good commit messages. A good overview of different references is given in the blog post How to Write a Git Commit Message. Many of the references put a lot of emphasis on the style of the message. A consistent and readable commit style ensures that the actual content can easily be discovered, especially when working with the Git command line. However, what is even more important is the content of the message. We want to use commit messages to guide a reader through the code and its evolution. Therefore, the commit message must clearly describe what was done and why. The what can often already be covered by the commit subject line, maybe with a short additional paragraph. If not, the commit is probably too big. What should take up more space is answering the why questions. Try to imagine what a future reader might want to know about why something was done, and why in exactly this way, and provide these answers upfront in the commit message. It’s much cheaper to take a few minutes to do so than to spend hours debugging because the important information is lost or impossible to find later on. Not answering any why question is the most common flaw in commit messages that I have observed over the years.
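To illustrate the split between what and why, here is a sketch of how the first commit of the rewritten log above might be worded. The subject line comes from that example; everything in the message body (the modules involved, the alternative considered) is invented purely for illustration.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the example can be run safely anywhere.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# The body below is hypothetical. --allow-empty keeps the focus on the
# message; a real commit would carry the actual change.
git commit -q --allow-empty -F - <<'EOF'
Decouple foo by introducing the observer pattern

foo called the tape and display modules directly, so every new
consumer of its state required a change to foo itself.

Let interested modules register as observers instead. The upcoming
feature needs foo's state changes as well and can now subscribe
without touching foo.

A plain callback list was considered, but observers need to be
removable at runtime, which the pattern covers explicitly.
EOF

# The subject answers "what"; the body answers "why" and "why this way".
git log -1 --format="%s%n%n%b"
```

The subject line alone is what one-line log views show, so it carries the what; the body is where a future reader finds the reasoning.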

Creating a clean commit history

The key to achieving a commit history that can be read like a book is iteration. When things don’t look good, you need to know the tools to improve the situation. For Git, these are interactive rebasing and amending. Unfortunately, these tools need a bit of practice before they can be used fluently. I recommend the respective section on rewriting history in the Git book or the tutorial Beginner’s Guide to Interactive Rebasing as a good starting point for learning interactive rebasing. Just as when learning test-driven development, take some time to experiment with the tools on a toy project.

So, whenever you propose some changes via a PR, first take some time to review your commit history before submitting, and clean it up in case something looks awkward or doesn’t answer the why questions. Moreover, once you receive feedback, most of the time the requested changes belong to one of the commits you proposed. So, instead of creating a new commit "address all PR comments", which doesn’t form a cohesive unit, add the requested changes to the existing commits where possible. Of course, some review feedback can best be addressed with a new commit, because the resulting changes form a cohesive unit. But small cleanup tasks in particular don’t deserve their own commits, and the commit log can be made more concise and descriptive by fixing up the original commits. Sometimes it is even a good idea to completely rebuild the commits in the PR if larger changes become necessary. Don’t be afraid to do this.
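One way to fold review feedback into an existing commit is a fixup commit combined with an autosquash rebase. The following is a minimal sketch under the assumption that the feedback concerns the tape fix from the example log; the repository, branch name feature, and file contents are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway repository; file names and contents are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# Base commit standing in for the current state of main.
echo "base" > README
git add README
git commit -q -m "Add README"
git branch -M main

# Two cohesive commits on a feature branch.
git checkout -q -b feature
echo "tape fix" > tape.txt
git add tape.txt
git commit -q -m "Fix null pointer issue when ejecting tape"
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Implement cool new feature"

# Review feedback concerns the tape fix (HEAD~1), so record it as a
# fixup of that commit instead of a new "Address review comments" commit.
echo "tape fix, reviewed" > tape.txt
git add tape.txt
git commit -q --fixup HEAD~1

# Fold the fixup into its target. GIT_SEQUENCE_EDITOR=true accepts the
# generated todo list unchanged; omit it to inspect the list in an editor.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash main

# The branch again consists of the two original, now updated, commits.
git log --format=%s main..feature
```

If the feedback concerns the most recent commit only, git commit --amend achieves the same result without a rebase.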

Finally, in case your feature branch gets outdated, try to rebase it on main instead of merging. Linear commit histories are much easier to understand than many interleaved merge commits. Only if rebasing becomes too complicated, resort to a merge commit instead.
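Updating an outdated feature branch by rebasing could look roughly like the sketch below. It uses a purely local throwaway repository with hypothetical file names; with a shared repository, one would typically fetch first and rebase onto the remote-tracking branch instead (for example origin/main, assuming the remote is called origin).

```shell
#!/bin/sh
set -eu

# Throwaway repository; file names and contents are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "base" > README
git add README
git commit -q -m "Add README"
git branch -M main

# Work on a feature branch...
git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Implement cool new feature"

# ...while main moves on in the meantime.
git checkout -q main
echo "other" > other.txt
git add other.txt
git commit -q -m "Unrelated change on main"

# Replay the feature commits on top of the current main.
git checkout -q feature
git rebase -q main

# The history stays linear: no merge commit is created.
git log --format=%s feature
```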

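In any case, if something goes wrong while rebasing, nothing is lost. While a rebase is still in progress, git rebase --abort restores the branch to its state before the rebase started. Even after a rebase has completed, the previous commits remain reachable through git reflog for some time, so the old branch tip can be restored from there.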
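The two recovery paths can be sketched as follows. Note that git rebase --abort only applies while a rebase is still in progress; once it has completed, the branch reflog is the place to look. The repository, branch name feature, and file contents are hypothetical, and the conflict is constructed deliberately so that the rebase stops.

```shell
#!/bin/sh
set -eu

# Throwaway repository; file names and contents are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "v1" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git branch -M main

# feature and main both change the same line, so a rebase will conflict.
git checkout -q -b feature
echo "feature version" > notes.txt
git commit -q -am "Reword notes on feature"
before=$(git rev-parse feature)

git checkout -q main
echo "main version" > notes.txt
git commit -q -am "Reword notes on main"

# Case 1: the rebase stops at the conflict; --abort restores the branch.
git checkout -q feature
git rebase main >/dev/null 2>&1 || true
git rebase --abort
test "$(git rev-parse feature)" = "$before"

# Case 2: the rebase is completed, and only then do we regret it.
git rebase main >/dev/null 2>&1 || true
echo "merged version" > notes.txt
git add notes.txt
GIT_EDITOR=true git rebase --continue >/dev/null

# The branch reflog still records the tip from before the rebase.
git reset -q --hard "feature@{1}"
test "$(git rev-parse feature)" = "$before"
```

Running git reflog show feature lists the recorded tips, which helps to pick the right entry when more than one rewrite has happened.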

Caution:

You should not change history on any permanent branch such as main or v2.0. All things discussed here are a tool for cleaning up proposed changes. Once they have been accepted and become part of the mainline, commits should be treated as immutable. Recovering a local Git clone from a changed upstream history is possible, but causes a lot of trouble, and users will not be happy if this happens.

Bibliography

[Atwood2006] Atwood, Jeff. “Code Tells You How, Comments Tell You Why.” Coding Horror (blog), December 18, 2006. https://blog.codinghorror.com/code-tells-you-how-comments-tell-you-why/.
[Martin2014] Martin, Robert C. Agile Software Development, Principles, Patterns, and Practices. 1st ed. Pearson New International Edition. Harlow: Pearson, 2014.
[McConnell2004] McConnell, Steve. Code Complete: A Practical Handbook of Software Construction. 2nd ed. Redmond, WA: Microsoft Press, 2004.
[Ousterhout2018] Ousterhout, John K. A Philosophy of Software Design. 1st ed. Palo Alto, CA: Yaknyam Press, 2018.
