Immutable Snapshots – One of Git's Core Concepts 🚀

For anyone new to Git one of its core concepts, that's easily overlooked and sometimes misunderstood, is the fact that all commits are immutable snapshots of the entire project! This fundamental truth is also what sets Git apart in comparison to other source code management systems (SCMs), such as Subversion.

For a long time, I believed Git used a similar delta mechanism for storing changes as other SCMs, since I hadn't fully comprehended what immutable snapshots really meant. However, I was fortunate enough to get the concept explained to me one time by a Git guru, and afterward, my proficiency skyrocketed!

Now, it's time for me to give you the same treat, so that you can level up with Git just like I did! 🤠

Immutability

Before we dig into the concept of immutable snapshots in Git, let's make sure we have a good understanding of what immutability really means. When I started working as a developer, I myself, had never heard the term immutable, let alone the concept of immutable snapshot or immutable object; this despite the fact that I'd spent five years studying at university for a computer science degree.

If you're a bit like me and don't have English as your first language here comes the Google definition of immutable.

https://www.google.com/search?q=immutable&oq=immutable

In short, immutable means: unchanging over time or unable to change. If we then look at a more specific description of what an immutable object is related to software engineering, Wikipedia gives us the following explanation.

Immutable object: In object-oriented and functional programming, an immutable object is an object whose state cannot be modified after it is created. This is in contrast to a mutable object, which can be modified after it is created. Wikipedia

Commits – Immutable snapshots in Git

So, what does immutable snapshots refer to? In Git, all commits are immutable snapshots of your project (ignored files excluded) at a specific point in time. This means that each and every commit contains a unique representation of your entire project, not just the modified or added files (deltas), at the time the commit was created. Apart from the actual files, each commit is also infused with relevant metadata; all of which is immutable!

Inside a commit you find:

  • References to the actual project files (or blobs as Git calls them internally) the way they looked at the time of commit.
  • A commit message (e.g. info about what was done)
  • Author, including timestamp (e.g. developer who wrote the patch and when)
  • Committer, including timestamp (e.g. developer who applied the commit and when)
  • Reference to parent commit (if not the initial commit), or commits (if merge commit)
  • SHA-ID (Calculated checksum, also known as hash)
Simplified illustration of a linear history containing three immutable commits. Metadata on the left hand side, and actual file content on the right. Latest commit is C2, with a parent reference to C1, and ultimately C0.

Since the commit (or commit object as it is formally called) is immutable in its entirety, trying to modify any of its content isn't possible. Commits can never be tampered with or modified once they are created! Accepting this fact will help you tremendously when working with other aspects of Git, particularly related to undoing changes at different stages of development.

This all sounds crazy!

Hang on you might say, if the entire project is recorded in every single commit, doesn't that mean that the size of the repository will grow exponentially – eventually eating up all disk space? And also, whenever I inspect any of my commits in my history I only see the changes I've made, shouldn't all project files be listed in that case? And what if I make a mistake, can it not be corrected or undone?

Let's address the first two questions right now, and save the final one for a coming post – rewriting history deserves a post of its own!

How Git stores data

As lightweight file management and immutability are at Git's core, Git doesn't store the actual files inside the commit but rather references to them. Hence, only new or modified files that get committed are added to the internal database, the rest are just kept as is. Let's look at the example from earlier again.

In the previous example, we see that in commit C0 two files were initially added, in C1 only the .gitignore file was updated (hence, a new version of that file got stored internally), and in C2 the file readme.md was added. So, instead of having a total of 7 files stored in the repository, we end up with just 4 files recorded; two versions of .gitignore and one version each of index.html and readme.md.

Simplified illustration of how Git stores files by reference through each commit, only adding new files to the internal database when a file is added or updated.

Disclaimer: Above illustration is a simplified conceptualization, but it works perfectly as a mental model when thinking about how immutable commits and their files interplay (particularly if you're a sole dev working on your own repo).

For a deep dive into how the internal storage is actually constructed, I suggest digging into the below piece on the matter! It'll give you an even better understanding of why commits are immutable.

How Does Git Store Files?
In this post, I’ll show you how project files and folders are stored, and how they relate to the overarching commit history — all from a conceptual point of view.

The diff is what matters!

By comparing Git with other SCMs such as Subversion, it's now apparent that Git stores files in their entirety; rather than through small calculated delta diffs that need to be applied in sequence to reconstruct the actual file on disk.

Even though I somewhat knew this when I started working with Git, I still got lured into thinking that Git also stored delta diffs. Why is that? The main reason is that even though Git stores entire files, from a history perspective the diff (e.g. the actual changes) is what matters. As a developer, you are generally only interested in what changes were introduced in between commits, so Git always projects the actual diff by default for convenience (hiding everything else).

For example, $ git status only displays file changes between your Working Tree, Staging area, and HEAD. Likewise $ git diff --staged highlights the diff between HEAD and files currently staged for committing, not the actual content of what is about to be committed; don't let this fool you.

Conclusion

As already mentioned, the core fact to accept and understand is that:

Each commit is an immutable snapshot of the entire project at a given point in time, it can never be tampered with or modified once created!

You've hopefully now established a much better understanding of the above statement, and probably have the final question from earlier lingering in your head; if I make a mistake in Git, can it not be corrected or undone?

As promised, the answer to this question deserves a post of its own (which is currently under making). After all, Git could be described as "undo on steroids". So stay tuned!


Now, with the proposed simplified mental model in mind, of how commits and their respective files interplay, I hope you are able to look at your own history with fresh eyes and a better understanding.

😎 Thanks for reading and good luck improving your source code management skills!