For anyone new to Git one of its core concepts, that's easily overlooked and sometimes misunderstood, is the fact that all commits are immutable snapshots of the entire project! This fundamental truth is also what sets Git apart in comparison to other source code management systems (SCMs), such as Subversion.
For a long time I believed Git used a similar delta mechanism for storing changes as other SCMs, since I hadn't fully comprehended what immutable snapshots really meant. However, I was fortunate enough to get the concept explained to me one time by a Git guru, and afterwards my proficiency skyrocketed!
Now, it's time for me to give you the same treat, so that you can level up with Git just like I did! 🤠
Before we dig into the concept of immutable snapshots in Git, let's make sure we have a good understanding of what immutability really means. When I started working as a developer, I myself, had never heard the term immutable, let alone the concept of immutable snapshot or immutable object; this despite the fact that I'd spent five years studying at university for a computer science degree.
If you're a bit like me and don't have English as your first language here comes the Google definition of immutable.
In short, immutable means: unchanging over time or unable to change. If we then look at a more specific description of what an immutable object is related to software engineering, Wikipedia gives us the following explanation.
Immutable object: In object-oriented and functional programming, an immutable object is an object whose state cannot be modified after it is created. This is in contrast to a mutable object, which can be modified after it is created. Wikipedia
Commits – Immutable snapshots in Git
So, what does immutable snapshots refer to? In Git, all commits are immutable snapshots of your project (ignored files excluded) at a specific point in time. This means that each and every commit contains a unique representation of your entire project, not just the modified or added files (deltas), at the time the commit was created. Apart from the actual files, each commit is also infused with relevant metadata; all of which is immutable!
Inside a commit you find:
- References to the actual project files (or blobs as Git calls them internally) the way they looked at the time of commit.
- A commit message (e.g. info about what was done)
- Author, including timestamp (e.g. developer who wrote the patch and when)
- Committer, including timestamp (e.g. developer who applied the commit and when)
- Reference to parent commit (if not the initial commit), or commits (if merge commit)
- SHA-ID (Calculated checksum, also known as hash)
Since the commit (or commit object as it is formally called) is immutable in its entirety, trying to modify any of its content isn't possible. Commits can never be tampered with or modified once they are created! Accepting this fact will help you tremendously when working with other aspects of Git, particularly related to undoing changes at different stages of development.
This all sounds crazy!
Hang on you might say, if the entire project is recorded in every single commit, doesn't that mean that the size of the repository will grow exponentially – eventually eating up all disk space? And also, whenever I inspect any of my commits in my history I only see the changes I've made, shouldn't all project files be listed in that case? And what if I make a mistake, can it not be corrected or undone?
Let's address the first two questions right now, and save the final one for a coming post – rewriting history deserves a post of its own!
How Git stores data
As lightweight file management and immutability is at Git's core, Git doesn't store the actual files inside the commit, but rather references to them. Hence, only new or modified files that get committed are added to the internal database, the rest are just kept as is. Let's look at the example from earlier again.
In the previous example, we see that in commit C0 two files where initially added, in C1 only the .gitignore file was updated (hence, a new version of that file got stored internally), and in C2 the file readme.md was added. So, instead of having a total of 7 files stored in the repository, we end up with just 4 files recorded; two versions of .gitignore and one version each of index.html and readme.md.
Disclaimer: Above illustration is a slightly simplified conceptualisation, but it works perfectly as a mental model when thinking about how immutable commits and their files interplay (I resort to it almost on a daily basis). For a deep dive in how the storage is actually constructed I suggest digging into this section of the Pro Git book (although, it's probably a bit over the top depending on what you are looking for).
The diff is what matters!
By comparing Git with other SCMs such as Subversion, it's now apparent that Git stores files in their entirety; rather than through small calculated delta diffs that need to be applied in sequence to reconstruct the actual file on disk.
Even though I somewhat knew this when I started working with Git, I still got lured into thinking that Git also stored delta diffs. Why is that? The main reason being that even though Git stores entire files, from a history perspective the diff (e.g. the actual changes) is what matters. As a developer you are generally only interested in what changes were introduced in between commits, so Git always projects the actual diff by default for convenience (hiding everything else).
$ git status only displays file changes between your Working Tree, Staging area, and HEAD. Likewise
$ git diff --staged highlights the diff between HEAD and files currently staged for committing, not the actual content of what is about to be committed; don't let this fool you.
As already mentioned, the core fact to accept and understand is that:
Each commit is an immutable snapshot of the entire project at a given point in time, it can never be tampered with or modified once created!
You've hopefully now established a much better understanding of the above statement, and probably have the final question from earlier lingering in your head; if I make a mistake in Git, can it not be corrected or undone?
As promised, the answer to this question deserves a post of its own (which is currently under making). After all, Git could be described as "undo on steroids". So stay tuned!
Now, with the proposed simplified mental model in mind, of how commits and their respective files interplay, I hope you are able to look at your own history with fresh eyes and better understanding.
😎 Thanks for reading and good luck improving your source code management skills!