As software developers, we use a lot of tools. Many of them come with an intuitive User Interface which abstracts the internal complexity of the domain they are operating on. This allows us to go fast and learn those tools by using them.
For better or worse, git is not one of those tools. The abstraction git's User Interface gives you is very leaky, so to become better with git you must spend (some) time learning how it works internally.
In this Understanding Git series, we will cover git's internals (we will not go into git's source code, don't worry), and the first thing on that list is git's heart and soul: the data model.
It's all about .git
We'll start by initializing a git repository:
git init
Git tells us it has created a .git directory in our project's directory, so let's take a quick peek into it:
$ tree .git/
.git/
├── HEAD
├── config
├── description
├── hooks
│ ├── applypatch-msg.sample
│ ├── commit-msg.sample
│ ├── post-update.sample
│ ├── pre-applypatch.sample
│ ├── pre-commit.sample
│ ├── pre-push.sample
│ ├── pre-rebase.sample
│ ├── pre-receive.sample
│ ├── prepare-commit-msg.sample
│ └── update.sample
├── info
│ └── exclude
├── objects
│ ├── info
│ └── pack
└── refs
    ├── heads
    └── tags

8 directories, 14 files
Some of these files and directories may sound familiar (particularly HEAD), but for now, focus on the .git/objects directory. Right now it's empty, but we will change that in a moment.
Let's add an index.php file:
touch index.php
fill it with some content:
<?php
echo "Hello World";
and a README.md file:
touch README.md
with some content as well:
# Description
This is my hello world project
Now let’s stage and commit them:
git add .
git commit -m "Initial Commit"
OK, nothing special here, adding and committing — we’ve all “been there, done that”.
If we look back at the .git directory, we can see that the .git/objects directory now contains some files and subdirectories (bear in mind they will have different names on your computer!):
├── objects
│ ├── 5d
│ │ └── 92c127156d3d86b70ae41c73973434bf4bf341
│ ├── a6
│ │ └── dbf05551541dc86b7a49212b62cfe1e9bb14f2
│ ├── cf
│ │ └── 59e02c3d2a2413e2da9e535d3c116af1077906
│ ├── f8
│ │ └── 9e64bdfcc08a8b371ee76a74775cfe096655ce
│ ├── info
│ └── pack
Every object in git has a so-called checksum header (the unique identifier of an object) and the first two characters of that checksum are used as a directory name while the rest is used as a file (object) name. Let's look at what these objects are.
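To make the naming convention concrete, here is a minimal Python sketch that maps a checksum to the location of its file. The function name `loose_object_path` is made up for this illustration, and the sketch assumes the 40-character checksums shown in the listing above.

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str) -> PurePosixPath:
    """Return where an object with this checksum lives, relative to the project root."""
    if len(object_id) != 40:
        raise ValueError("expected a 40-character hexadecimal checksum")
    # The first two characters name the directory, the remaining 38 name the file.
    return PurePosixPath(".git/objects") / object_id[:2] / object_id[2:]


print(loose_object_path("a6dbf05551541dc86b7a49212b62cfe1e9bb14f2"))
# .git/objects/a6/dbf05551541dc86b7a49212b62cfe1e9bb14f2
```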
Blobs, trees, and ...
The first kind of object that git creates when we commit some file(s) are blob objects. Git uses them to represent the content of files. In our case there are two of them, one for each file we committed.
They contain the full content of our files, so you can think of them as snapshots of our files (at the time of the commit). To generate the checksum header, git takes the content of an object, prefixes it with a short header holding the object's type and size, and feeds the result to a hashing function (SHA-1 by default). The output is the checksum header. Because it depends only on the content, it also serves as a unique identifier of an object.
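If you want to see the hashing step for yourself, the following Python sketch reproduces it for blobs. It assumes the default SHA-1 object format, and the helper name `blob_id` is ours, not something git exposes. The result can be compared with what `git hash-object <file>` prints for the same content.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the checksum git assigns to a blob with this content (SHA-1 format)."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Note that the file name is not part of the input, so two files with identical content end up as one and the same blob.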
The next kind of object git creates is the tree object. Tree objects are used to represent the project's folder structure, and in our simple example git needs only one of them. It contains a list of all files in our project with pointers to their blob objects.
Lastly, git creates a commit object. It contains some metadata (author, time, ...) and a pointer to its tree object.
The content of our .git/objects directory should make more sense now:
├── objects
│ ├── 5d
│ │ └── 92c127156d3d86b70ae41c73973434bf4bf341
│ ├── a6
│ │ └── dbf05551541dc86b7a49212b62cfe1e9bb14f2
│ ├── cf
│ │ └── 59e02c3d2a2413e2da9e535d3c116af1077906
│ ├── f8
│ │ └── 9e64bdfcc08a8b371ee76a74775cfe096655ce
│ ├── info
│ └── pack
Using git log, we can see our commit history:
commit a6dbf05551541dc86b7a49212b62cfe1e9bb14f2
Author: zspajich <zspajich@gmail.com>
Date: Tue Jan 23 13:31:43 2018 +0100

    Initial Commit
Knowing the naming convention we mentioned earlier, we can locate this commit object in .git/objects:
├── objects
│ ├── a6
│ │ └── dbf05551541dc86b7a49212b62cfe1e9bb14f2
To display its content we can't simply use the cat command, since these are not plain text files (git stores them zlib-compressed), but git has a cat-file command we can use instead:
git cat-file commit a6dbf05551541dc86b7a49212b62cfe1e9bb14f2
Now we see the content of the commit object:
tree f89e64bdfcc08a8b371ee76a74775cfe096655ce
author zspajich <zspajich@gmail.com> 1516710703 +0100
committer zspajich <zspajich@gmail.com> 1516710703 +0100

Initial Commit
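For the curious, here is a rough Python sketch of what cat-file has to undo before it can show us that text. It assumes the object is stored as a loose file: zlib-compressed, with a `<type> <size>` header and a NUL byte in front of the body. The function name `parse_loose_object` is made up, and instead of reading a real file from .git/objects the example builds a stand-in object in memory from the commit text above.

```python
import zlib
from typing import Tuple


def parse_loose_object(raw: bytes) -> Tuple[str, bytes]:
    """Decompress a loose object and split it into (type, body)."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), body


# Stand-in for the bytes of a file under .git/objects (assumption: built in
# memory here so the example runs without a repository).
body = (
    b"tree f89e64bdfcc08a8b371ee76a74775cfe096655ce\n"
    b"author zspajich <zspajich@gmail.com> 1516710703 +0100\n"
    b"committer zspajich <zspajich@gmail.com> 1516710703 +0100\n"
    b"\n"
    b"Initial Commit\n"
)
raw = zlib.compress(b"commit " + str(len(body)).encode("ascii") + b"\0" + body)

kind, content = parse_loose_object(raw)
print(kind)                                # commit
print(content.decode().splitlines()[0])    # tree f89e64bdfcc08a8b371ee76a74775cfe096655ce
```

To try it on a real repository, read the bytes of one of the files under .git/objects in binary mode and pass them to the function.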
The first line is the pointer to a tree object, and to examine its content we can use the git ls-tree command:
git ls-tree f89e64bdfcc08a8b371ee76a74775cfe096655ce
As expected, it contains a list of our files with pointers to their blob objects:
100644 blob cf59e02c3d2a2413e2da9e535d3c116af1077906 README.md
100644 blob 5d92c127156d3d86b70ae41c73973434bf4bf341 index.php
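The listing above is a pretty-printed view. Inside the tree object, each entry is stored as the mode, a space, the file name, a NUL byte, and then the 20 raw bytes of the object's checksum. The Python sketch below builds a tree that way and hashes it like any other object. The name `tree_id` is ours, and the sorting is a simplification: real git compares sub-directory names as if they ended in a slash, which does not matter for a flat project like this one.

```python
import hashlib
from typing import Iterable, Tuple


def tree_id(entries: Iterable[Tuple[str, str, str]]) -> str:
    """Hash a tree built from (mode, name, hex checksum) entries (SHA-1 format)."""
    body = b""
    # Simplification: plain byte-wise sort by name (fine when there are no sub-directories).
    for mode, name, hex_id in sorted(entries, key=lambda entry: entry[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries always hashes to the same well-known value.
print(tree_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# The two entries from the listing above; compare the result with the tree
# pointer stored in the commit object.
print(tree_id([
    ("100644", "README.md", "cf59e02c3d2a2413e2da9e535d3c116af1077906"),
    ("100644", "index.php", "5d92c127156d3d86b70ae41c73973434bf4bf341"),
]))
```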
Let's look into the blob object representing index.php using the cat-file command:
git cat-file blob 5d92c127156d3d86b70ae41c73973434bf4bf341
Sure enough, it has the same content as our index.php file:
<?php
echo "Hello World";
There you go. Now you know what happens when committing files.
Let's see what happens if we now edit (add some code magic) and commit our index.php file:
Git creates another blob object (a new snapshot) to represent the new content of index.php. As for README.md, since it didn't change, git can reuse the existing blob for it (we'll see it in a moment).
When creating the tree object, git updates the pointer for index.php to its new blob, while the README.md pointer stays the same.
As before, the commit object has a pointer to the tree object, but it also has a pointer to its parent commit object, because every commit except the first one has at least one parent.
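Those parent pointers are what turn a pile of commits into a history. As a toy illustration, the Python sketch below models commits as small records and follows the parent pointers back to the first commit. The ids (`c1`, `t1`, ...) and the second commit message are invented, and each commit has at most one parent here, which leaves out merge commits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    tree: str               # pointer to the commit's tree object
    parent: Optional[str]   # pointer to the parent commit, None for the first commit
    message: str


def history(commits: Dict[str, Commit], start: str) -> List[str]:
    """Collect commit messages from `start` back to the first commit."""
    messages = []
    current: Optional[str] = start
    while current is not None:
        commit = commits[current]
        messages.append(commit.message)
        current = commit.parent
    return messages


# Hypothetical ids standing in for real checksums.
commits = {
    "c1": Commit(tree="t1", parent=None, message="Initial Commit"),
    "c2": Commit(tree="t2", parent="c1", message="Add some code magic"),
}
print(history(commits, "c2"))  # ['Add some code magic', 'Initial Commit']
```

This is roughly the walk git log performs when it prints the newest commit first.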
Now that we know how git handles file adding and editing, the only thing remaining is file deletion. What if we delete our index.php file?
It's rather simple: git deletes the file entry (filename with a pointer to its blob object) from the tree object. In other words, our commit's tree object no longer has a pointer to a blob object representing index.php (but that blob object is still there in .git/objects).
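The three cases (add, edit, delete) can be replayed with a tiny in-memory model. This is only a sketch under simplifying assumptions: a Python dict stands in for .git/objects, a tree is reduced to a mapping from file name to blob checksum, and the names `objects`, `store_blob` and `snapshot`, as well as the edited file content, are invented for the illustration.

```python
import hashlib
from typing import Dict

objects: Dict[str, bytes] = {}  # stand-in for the .git/objects directory


def store_blob(content: bytes) -> str:
    """Store content under its checksum and return that checksum."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    object_id = hashlib.sha1(header + content).hexdigest()
    objects[object_id] = content  # storing identical content again changes nothing
    return object_id


def snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Model a tree as a mapping from file name to blob checksum."""
    return {name: store_blob(content) for name, content in files.items()}


readme = b"# Description\nThis is my hello world project\n"
first = snapshot({"index.php": b'<?php\necho "Hello World";\n', "README.md": readme})
second = snapshot({"index.php": b'<?php\necho "Hello Magic";\n', "README.md": readme})
third = snapshot({"README.md": readme})  # index.php deleted

print(first["README.md"] == second["README.md"])  # True: the unchanged file reuses its blob
print(first["index.php"] == second["index.php"])  # False: the edit produced a new blob
print("index.php" in third)                       # False: the entry is gone from the tree
print(second["index.php"] in objects)             # True: the blob itself is still stored
print(len(objects))                               # 3
```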
Nested folders
In our real-life projects, the folder structure is much more elaborate than in this simple example. As said, tree objects represent the folder structure of your project, and in the same way folders can be nested, tree objects can also be nested (point to other tree objects). For example:
Here, our project base folder has one README.md file and one sub-directory app, which has two files (app.php and app_dev.php).
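The same layout can be modelled with nested dictionaries to show how a full file list falls out of nested trees. In this Python sketch a nested dict plays the role of a tree object that points to another tree, a plain string plays the role of a pointer to a blob, and the blob ids are placeholders rather than real checksums.

```python
from typing import Iterator, Tuple


def walk(tree: dict, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, blob id) for every file reachable from a tree."""
    for name, entry in sorted(tree.items()):
        path = prefix + name
        if isinstance(entry, dict):
            # The entry points to another tree: descend into the sub-directory.
            yield from walk(entry, path + "/")
        else:
            # The entry points to a blob: this is a file.
            yield path, entry


# Placeholder ids instead of real checksums.
project = {
    "README.md": "blob-readme",
    "app": {
        "app.php": "blob-app",
        "app_dev.php": "blob-app-dev",
    },
}

for path, blob in walk(project):
    print(path, blob)
# README.md blob-readme
# app/app.php blob-app
# app/app_dev.php blob-app-dev
```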
So there you have it - git's data model. Blob objects represent the content of files, tree objects represent the folder structure of the project, while commit objects contain metadata and have pointers to their tree and to their parents.
In the next post, we'll take a look at branching - what branches are and why having a bunch of them is very cheap in git.