Git Internals: What Really Happens When You Commit
To most developers, git commit is a black box: you stage changes, type a message, and a hash appears. But under that hash sits a small, elegant content-addressed database. Understanding it turns Git from a set of memorized incantations into a system you can reason about.
The Object Model: Four Kinds of Objects
Everything Git stores durably lives in the object database, usually under .git/objects. There are exactly four object types, and they compose into the entire history of your repository.
- Blob: the raw contents of a file. Just bytes; no filename, no permissions, no history. Two identical files anywhere in your project produce the same blob.
- Tree: a directory listing. It maps names to blobs (files) and other trees (subdirectories), recording the mode (e.g. 100644 for a normal file, 100755 for an executable, 040000 for a subtree) and the object hash of each entry.
- Commit: a snapshot pointer. It references exactly one top-level tree, zero or more parent commits, author/committer metadata with timestamps, and the commit message.
- Tag: an annotated tag object pointing at another object (usually a commit), with its own message and tagger. This is distinct from a lightweight tag, which is just a ref.
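Every object is stored the same way: Git prepends a header of the form <type> <size>\0 to the content (where <size> is the content length in bytes), computes a hash over the result, and writes the zlib-compressed bytes to a file named after that hash. For a loose object, the first two hex characters of the hash become a subdirectory of .git/objects and the remaining characters the filename.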
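The storage recipe is short enough to sketch in Python. This is a minimal illustration for SHA-1 repositories, not Git's actual implementation; the function names encode_object and loose_object_path are made up for this example.

```python
import hashlib
import zlib


def encode_object(obj_type: str, content: bytes):
    """Return (hex object id, zlib-compressed bytes) for one object."""
    # Header is "<type> <size>\0", where size is the content length in bytes.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    store = header + content
    object_id = hashlib.sha1(store).hexdigest()  # assumes a SHA-1 repository
    return object_id, zlib.compress(store)


def loose_object_path(object_id: str) -> str:
    """Loose objects live at .git/objects/<first 2 hex chars>/<remaining chars>."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


object_id, compressed = encode_object("blob", b"hello\n")
print(object_id)                     # ce013625030ba8dba906f756967f9e9ca394464a
print(loose_object_path(object_id))  # .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
# Decompressing recovers the header and the content unchanged.
print(zlib.decompress(compressed))   # b'blob 6\x00hello\n'
```

The compressed bytes may differ from what Git writes (compression level is a setting), but the object id depends only on the uncompressed header and content.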
Content Addressing with SHA
The hash is the address. Git computes it from the object's content, so the name and the data are inseparable. Historically this is SHA-1 (a 40-character hex string); Git also supports SHA-256 repositories (64 hex characters) for installations that need stronger collision resistance.
You can reproduce the calculation yourself. For a blob containing the text hello followed by a newline:
$ printf 'hello\n' | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a
That is just the SHA-1 of the bytes blob 6\0hello\n. Because the address derives from the content, three properties fall out for free:
- Deduplication — identical content is stored once, no matter how many commits or paths reference it.
- Integrity — if a stored byte is corrupted, its recomputed hash no longer matches its filename, and Git detects the damage.
- Cheap comparison — comparing two trees, files, or whole snapshots is just comparing hashes.
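Deduplication and integrity checking can be seen in a toy content-addressed store. The ObjectStore class below is a hypothetical in-memory stand-in, assuming blob objects and SHA-1 only; it is not how Git lays objects out on disk.

```python
import hashlib


def blob_key(content: bytes) -> str:
    # Same recipe Git uses for blobs: hash the header plus the content.
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


class ObjectStore:
    """Toy in-memory store keyed by content hash (illustrative only)."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, content: bytes) -> str:
        key = blob_key(content)
        # Identical content always lands on the same key: deduplication.
        self.objects[key] = content
        return key

    def verify(self, key: str) -> bool:
        # Integrity: recompute the hash and compare it with the name.
        return blob_key(self.objects[key]) == key


store = ObjectStore()
a = store.put(b"same bytes\n")
b = store.put(b"same bytes\n")      # a second "file" with identical content
print(a == b, len(store.objects))   # True 1

store.objects[a] = b"same bytez\n"  # simulate a corrupted byte
print(store.verify(a))              # False
```

Cheap comparison falls out of the same property: deciding whether two pieces of content are equal only requires comparing their keys.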
The Index: Where Staging Lives
The staging area is not a vague concept; it is a concrete binary file, .git/index. It is a flat, sorted list of every tracked path along with the blob hash, mode, and cached stat information (size, mtime) Git uses to detect changes quickly.
When you run git add file.txt, Git does two things: it writes a blob for the file's current contents into the object database, and it records that path-to-blob mapping in the index. The working tree, the index, and HEAD are three distinct states, which is exactly what git status reports the differences between.
$ git add README.md
$ git ls-files --stage
100644 8b9c2d... 0 README.md
The index already holds blob hashes before you ever commit. This is why staging can capture a specific version of a file even if you keep editing it afterward.
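A toy model makes the point concrete. The sketch below assumes a dictionary in place of the binary .git/index and leaves out the cached stat data; the add function is a hypothetical imitation of the two effects of git add, not Git's code.

```python
import hashlib

objects = {}  # stand-in for .git/objects: blob id -> content
index = {}    # stand-in for .git/index: path -> (mode, blob id)


def blob_id(content: bytes) -> str:
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


def add(path: str, content: bytes, mode: str = "100644") -> None:
    """Mimic `git add`: write a blob, then record path -> blob in the index."""
    oid = blob_id(content)
    objects[oid] = content
    index[path] = (mode, oid)


working_tree = {"README.md": b"# My Project\n"}
add("README.md", working_tree["README.md"])
staged = index["README.md"][1]

# Keep editing after staging: the index still names the earlier blob.
working_tree["README.md"] = b"# My Project\n\nMore text.\n"
print(index["README.md"][1] == staged)               # True
print(blob_id(working_tree["README.md"]) == staged)  # False: the file now shows as modified
```

The second comparison is, in spirit, what git status does for the working tree: a path whose current content no longer hashes to the staged blob id is reported as modified but not staged.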
How a Commit Snapshots Trees
This is the crux. A commit does not store a diff against its parent. It stores a complete snapshot, expressed as a single tree object. When you commit, Git turns the flat index into a hierarchy of tree objects (one per directory), writes any that don't already exist, and creates a commit object pointing at the root tree.
Let's open a real commit with the cat-file plumbing command:
$ git cat-file -p HEAD
tree a4e3f1c8b2d6e0f9a1c5b7d3e2f4a6c8b0d1e3f5
parent 9f1c0a2b3d4e5f60718293a4b5c6d7e8f90a1b2c
author 1719500000 +0000
committer 1719500000 +0000
Add README and initial layout
Follow the tree to see the directory it describes:
$ git cat-file -p a4e3f1c
100644 blob c4f5a6... .gitignore
100644 blob 8b9c2d... README.md
040000 tree 1d2e3f... src
cat-file can also report any object's type (-t) and print a blob's actual file content (-p):
$ git cat-file -t a4e3f1c
tree
$ git cat-file -p 8b9c2d
# My Project
...
Here is the key efficiency: if a commit changes only one file directly in src/, the only new objects besides the commit itself are the blob for that file, the src tree, and the root tree. Every other directory's tree object is identical to the previous commit's, so Git reuses it by hash. A "full snapshot" is cheap precisely because unchanged subtrees are shared, not copied.
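The sharing can be checked with a small sketch that builds tree objects from a flat index and counts what is new after a one-file change. It assumes SHA-1, regular files with mode 100644 only, and an in-memory dictionary as the object database; write_object and write_tree are illustrative names, not Git functions.

```python
import hashlib


def write_object(store: dict, obj_type: str, content: bytes) -> str:
    data = b"%s %d\x00" % (obj_type.encode(), len(content)) + content
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data  # same content, same key: nothing is duplicated
    return oid


def write_tree(store: dict, index: dict) -> str:
    """Build one tree per directory from a flat {path: blob id} mapping."""
    files, dirs = {}, {}
    for path, oid in index.items():
        name, _, rest = path.partition("/")
        if rest:
            dirs.setdefault(name, {})[rest] = oid  # entry belongs to a subdirectory
        else:
            files[name] = oid
    entries = [(name, "100644", name, oid) for name, oid in files.items()]
    # Subtrees are written first (bottom-up). Inside a tree object the subtree
    # mode is written as "40000", and directory names sort as if they ended in "/".
    entries += [(name + "/", "40000", name, write_tree(store, sub))
                for name, sub in dirs.items()]
    entries.sort(key=lambda e: e[0].encode())
    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(oid)
        for _, mode, name, oid in entries
    )
    return write_object(store, "tree", body)


store = {}
index = {
    ".gitignore": write_object(store, "blob", b"*.pyc\n"),
    "README.md": write_object(store, "blob", b"# My Project\n"),
    "docs/guide.md": write_object(store, "blob", b"guide\n"),
    "src/main.py": write_object(store, "blob", b"print('v1')\n"),
}
root1 = write_tree(store, index)
before = set(store)

index["src/main.py"] = write_object(store, "blob", b"print('v2')\n")
root2 = write_tree(store, index)
new = set(store) - before
print(len(new))  # 3: the new blob, the new src tree, the new root tree
```

The docs tree is rebuilt during the second call but hashes to the same id as before, so the store gains nothing for it.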
Refs and HEAD: Naming the Snapshots
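A commit hash is unfriendly to type, so Git layers human-readable names on top. In its simplest (loose) form, a ref is a file under .git/refs whose contents are an object hash; for a branch that is always a commit hash. Git may also consolidate refs into the single file .git/packed-refs, but the idea is the same. A branch is nothing more than that.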
$ cat .git/refs/heads/main
9f1c0a2b3d4e5f60718293a4b5c6d7e8f90a1b2c
HEAD is a special ref that usually points at another ref rather than directly at a commit:
$ cat .git/HEAD
ref: refs/heads/main
This indirection is the whole trick. When you commit on main, Git creates the commit object, then rewrites refs/heads/main to hold the new hash. Because HEAD points at the branch, it follows along automatically. When HEAD contains a raw hash instead of a ref: line, you are in "detached HEAD" state.
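The indirection is easy to imitate with plain files. The sketch below builds a fake .git directory in a temporary folder and handles loose refs only (a real resolver would also consult packed-refs); resolve_head and update_current_branch are hypothetical helper names.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_head(git_dir: Path) -> Tuple[Optional[str], str]:
    """Return (branch ref, commit hash); the ref is None when HEAD is detached."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        return ref, (git_dir / ref).read_text().strip()
    return None, head  # HEAD holds a raw hash: detached HEAD


def update_current_branch(git_dir: Path, new_hash: str) -> None:
    """What a commit does to refs: rewrite the branch file, leave HEAD alone."""
    ref, _ = resolve_head(git_dir)
    target = git_dir / ref if ref else git_dir / "HEAD"
    target.write_text(new_hash + "\n")


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    (git_dir / "refs" / "heads" / "main").write_text(
        "9f1c0a2b3d4e5f60718293a4b5c6d7e8f90a1b2c\n")

    print(resolve_head(git_dir))
    update_current_branch(git_dir, "1" * 40)  # placeholder hash for a new commit
    print(resolve_head(git_dir))              # the branch now names the new hash
    print((git_dir / "HEAD").read_text().strip())  # still "ref: refs/heads/main"
```

Only the branch file changes; HEAD keeps pointing at the branch and so resolves to the new commit without being rewritten.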
Why Branching Is Cheap
Creating a branch creates a single small file containing a 40-character hash (64 characters in a SHA-256 repository). There is no copying of files, no duplication of history, no snapshot to materialize.
$ git branch feature # writes .git/refs/heads/feature with HEAD's hash
$ git switch feature # rewrites .git/HEAD to point at the new ref
Commits form a directed acyclic graph through their parent links. A branch is just a movable label naming one node in that graph; a true (non-fast-forward) merge finds a common ancestor and creates a commit with two parents. Since branches are pointers and history is shared immutable objects, you can have hundreds of branches at essentially zero storage cost. The same model is why operations like git log, diffing two branches, or checking out an old commit are fast: they walk the graph and compare hashes first, reading file contents only where the hashes differ.
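A toy graph shows how little a branch or a merge involves. The commit ids and the breadth-first merge_base below are simplifications invented for illustration; Git's real merge-base selection is more careful about choosing the best common ancestor.

```python
from collections import deque

# Hypothetical commit graph: commit id -> list of parent ids.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],  # main continues from B
    "D": ["B"],  # feature branches off B
    "E": ["D"],
}
branches = {"main": "C", "feature": "E"}  # a branch is just name -> commit id


def ancestors(commit: str) -> set:
    """All commits reachable from `commit` through parent links, itself included."""
    seen, queue = set(), deque([commit])
    while queue:
        c = queue.popleft()
        if c not in seen:
            seen.add(c)
            queue.extend(parents[c])
    return seen


def merge_base(a: str, b: str):
    """First ancestor of b (breadth-first) that is also reachable from a."""
    reachable_from_a = ancestors(a)
    seen, queue = set(), deque([b])
    while queue:
        c = queue.popleft()
        if c in reachable_from_a:
            return c
        if c not in seen:
            seen.add(c)
            queue.extend(parents[c])
    return None


print(merge_base(branches["main"], branches["feature"]))  # B

# Creating a branch copies one id; nothing in `parents` changes.
branches["hotfix"] = branches["main"]

# A merge commit is a node with two parents, after which the branch label moves.
parents["M"] = [branches["main"], branches["feature"]]
branches["main"] = "M"
print(sorted(ancestors("M")))  # ['A', 'B', 'C', 'D', 'E', 'M']
```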
Putting It Together
A single git commit performs a precise sequence:
- Read the index (already populated with blob hashes from git add).
- Build tree objects bottom-up from the index, writing any that are new.
- Create a commit object referencing the root tree and the current HEAD commit as parent.
- Update the branch ref that HEAD points to, so it now names the new commit.
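The sequence above can be sketched end to end in a few dozen lines. To stay short, this version assumes SHA-1, an in-memory object store, and an index holding only top-level regular files (so a single tree suffices; nested paths would need one tree per directory). The author identity and timestamp are placeholders, and all function names are hypothetical.

```python
import hashlib

objects = {}              # object id -> header + content
refs = {}                 # ref name -> commit id
HEAD = "refs/heads/main"  # what .git/HEAD names after "ref: "


def put(obj_type: str, content: bytes) -> str:
    data = b"%s %d\x00" % (obj_type.encode(), len(content)) + content
    oid = hashlib.sha1(data).hexdigest()
    objects.setdefault(oid, data)  # existing objects are reused, never rewritten
    return oid


def write_flat_tree(index: dict) -> str:
    # Assumption: top-level files only, all with mode 100644.
    body = b"".join(
        b"100644 " + name.encode() + b"\x00" + bytes.fromhex(oid)
        for name, oid in sorted(index.items(), key=lambda kv: kv[0].encode())
    )
    return put("tree", body)


def commit(index: dict, message: str,
           ident: str = "Example Author <author@example.com>",  # placeholder identity
           when: int = 1719500000) -> str:
    root = write_flat_tree(index)        # steps 1-2: read the index, build the tree
    lines = [f"tree {root}"]
    parent = refs.get(HEAD)              # None for the very first commit
    if parent:
        lines.append(f"parent {parent}")
    lines += [f"author {ident} {when} +0000",
              f"committer {ident} {when} +0000", "", message, ""]
    oid = put("commit", "\n".join(lines).encode())  # step 3: the commit object
    refs[HEAD] = oid                     # step 4: move the branch ref
    return oid


index = {"README.md": put("blob", b"# My Project\n")}  # what `git add` left behind
first = commit(index, "Add README")
index[".gitignore"] = put("blob", b"*.pyc\n")
second = commit(index, "Add ignore file")

print(refs[HEAD] == second)                           # True
print(f"parent {first}" in objects[second].decode())  # True
```

Nothing in the sketch ever modifies an existing object; the only mutation is the assignment to refs[HEAD].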
Practical Takeaway
Git is a content-addressed key-value store with a thin layer of mutable pointers on top. The objects (blobs, trees, commits, tags) are immutable and shared by hash; only refs move. Once you internalize that, the everyday commands stop being magic: reset moves the current branch ref, checkout/switch repoints HEAD, and a "lost" commit is usually still in the object database, recoverable through git reflog until its reflog entries expire and garbage collection prunes it. When something looks broken, reach for git cat-file -p and git ls-files --stage and read the objects directly. The data model is small enough to hold in your head, and that understanding pays off every time history gets complicated.