
Git, Slowly and Painfully

Mar 16 2018, 3:25 AM


Chapter 0: Git is a Computer Program

Git is a computer program. It is a very good computer program, and worthwhile to understand.

This guide can help you understand Git. Other guides can also help you understand Git. This table can help you understand how other guides compare to this guide.

|           | Other Guides               | This Guide |
| You Learn | How to do things with Git. | Git is a computer program which does exactly the things you tell it to do. |

Chapter 1: Git is a Hash Function

A function is a small part of a computer program. Functions can accept input and produce output. For example, the strtolower() function takes some text as input and produces the same text in lowercase as output:

strtolower("TURTLE")   -> "turtle"
strtolower("JALApeño") -> "jalape?o"

This guide has a lot of PHP example code. PHP is a very good programming language for writing computer programs. Unfortunately, Git is not written in PHP. Instead, Git is written in a mixture of Perl and shell script wrapping a single large Awk expression. Git is still a very good computer program.

A hash function is a special type of function which takes some text as input and produces a long, unique piece of random-looking output. The output isn't really random: the same input always produces exactly the same output. However, different input produces completely different output, even if the input is only a little bit different:

sha1("turtle")  -> "75105193bfdd0db68cd7b988dda79744a9baea41"
sha1("turtles") -> "e465b6f3d264569e1d1cedc84635cb003e6e8ead"

Some popular hash functions include SHA1, SHA256, and Git.

The output of a particular hash function always has the same length. For example, SHA1 always outputs 40 hexadecimal characters ("0" through "9", plus "a" through "f"), no matter how long or short the input is. Git also always outputs 40 hexadecimal characters.
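The fixed-length claim is easy to check from a shell. This is a minimal sketch that assumes the sha1sum utility (GNU coreutils) is installed; sha1_hex is a helper name made up for this example.

```shell
# sha1_hex is a hypothetical helper: print the SHA1 of its argument as hex.
# Assumes sha1sum (GNU coreutils) is available.
sha1_hex() {
    # printf '%s' avoids the trailing newline that echo would add.
    printf '%s' "$1" | sha1sum | cut -d' ' -f1
}

# Short, slightly longer, and empty input: print each hash with its length.
for input in "turtle" "turtles" ""; do
    hash=$(sha1_hex "$input")
    echo "${#hash} $hash"
done
```

Every line of output should start with 40, whatever the input was.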

Here is how to use the Git hash function:

$ echo -n turtle | git hash-object --stdin
8ae3728bf4106a8b57eaa4b6ad4641a34dbf1a6c

Here's something interesting:

sha1("blob 6\0turtle") -> "8ae3728bf4106a8b57eaa4b6ad4641a34dbf1a6c"

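That's the same hash. The Git hash function is almost exactly the same as the SHA1 hash function: Git just puts the word "blob", the length of the input in bytes, and a zero byte in front of the input before hashing it.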
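The comparison above can be reproduced without Git at all. The sketch below builds the "blob <length>\0<content>" input by hand and feeds it to plain SHA1. It assumes sha1sum is installed, and git_blob_id is a hypothetical helper name, not a Git command.

```shell
# git_blob_id is a hypothetical helper, not part of Git.
# It hashes "blob <byte length>\0<content>" with plain SHA1.
# Assumes sha1sum (GNU coreutils) is available.
git_blob_id() {
    # Count bytes, not characters: the header stores a byte length.
    # The arithmetic expansion strips padding some wc versions add.
    size=$(( $(printf '%s' "$1" | wc -c) ))
    # \000 in the printf format emits the zero byte after the header.
    printf 'blob %d\000%s' "$size" "$1" | sha1sum | cut -d' ' -f1
}

git_blob_id turtle
```

If the numbers above hold, this prints the same hash that git hash-object gave for "turtle".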

Git is a hash function, but it's not a very good hash function. You're better off using SHA256.

Chapter 2: Git is a Key-Value Store

A key-value store is a program that lets you take some data (called a "value"), give it a name (called a "key"), and store it. Later, you can get the data back if you remember what you named it.

Popular key-value stores include memcache, MySQL, and Git.

In MySQL, you can store some data like this:

INSERT INTO storage (k, v) VALUES ("shopping-list", "milk");

If you remember the name you used as a key, you can read the data back later:

mysql> SELECT * FROM storage WHERE k = 'shopping-list';
| k             | v    |
| shopping-list | milk |
1 row in set (0.00 sec)

A key-value store will usually let you choose the name you want to give your data. Git doesn't let you choose the name. It automatically assigns your data a name by using the Git hash function to name it.

This is important: it's okay for Git to do this because every different piece of data you might want to store will always have a different hash, so the name Git assigns to each piece of data is always unique to that data. You can always use the name to retrieve exactly the same data you stored.

Here's how to write some data to the Git key-value store:

$ echo milk | git hash-object -w --stdin
7c95f9a9836102046c3f2e33b88e3bf498a00ea5

That hash is the new key that Git assigned to your data. Here's how to get your data back:

$ git cat-file -p 7c95f9a9836102046c3f2e33b88e3bf498a00ea5
milk
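The idea that the store picks the key for you can be sketched without Git. The toy below is not how Git is implemented: toy_put, toy_get, and TOY_STORE are hypothetical names, and each value is simply saved in a file named after its SHA1. It assumes sha1sum and mktemp are available.

```shell
# A toy content-addressed key-value store. NOT Git's implementation:
# toy_put, toy_get, and TOY_STORE are hypothetical names for this sketch.
# Assumes sha1sum and mktemp are available.
TOY_STORE=$(mktemp -d)

# Read a value from stdin, store it, and print the key chosen for it.
toy_put() {
    tmp=$(mktemp "$TOY_STORE/tmp.XXXXXX")
    cat > "$tmp"
    # The key is the SHA1 of the value itself; the caller gets no say.
    key=$(sha1sum < "$tmp" | cut -d' ' -f1)
    mv "$tmp" "$TOY_STORE/$key"
    echo "$key"
}

# Print the value stored under the given key.
toy_get() {
    cat "$TOY_STORE/$1"
}

key=$(echo milk | toy_put)
toy_get "$key"   # prints: milk
```

Storing the same value twice gives the same key both times. The keys here are plain SHA1, so they will not match the keys Git assigns, because Git adds its header before hashing.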

Git is a key-value store, but it's not a very good key-value store. You're better off using MySQL.

Chapter 3: Git is a File-Backed Storage System

When you put some data into Git, it is stored in files on disk.

If we create an empty Git storage directory, we'll see it doesn't take up much disk space.

$ git init
$ du -sh .git/
 60K	.git

Now let's write about 100MB of data into the Git key-value store:

$ head -c100000000 /dev/urandom | git hash-object -w --stdin
$ du -sh .git/
 96M	.git

When we wrote the data into Git, it wrote about 100MB of files to disk. The files with our data are somewhere in the .git/ directory.

In this case, Git wrote everything into a file named .git/objects/f7/52715dc02a0b6d9f62cc3115aad4aeaadddb69:

$ du -h .git/objects/f7/52715dc02a0b6d9f62cc3115aad4aeaadddb69
 95M	.git/objects/f7/52715dc02a0b6d9f62cc3115aad4aeaadddb69

Git does not always put your data there, but it always puts it in files somewhere on disk.
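When Git does store an object as its own file like this, the path follows from the hash: the first two hexadecimal characters name a directory and the remaining 38 name the file. A minimal sketch, where loose_object_path is a hypothetical helper and not a Git command:

```shell
# loose_object_path is a hypothetical helper, not part of Git.
# It maps a 40-character object hash to the matching path under .git/.
loose_object_path() {
    hash=$1
    dir=$(printf '%s' "$hash" | cut -c1-2)    # first two hex characters
    file=$(printf '%s' "$hash" | cut -c3-)    # remaining 38 characters
    echo ".git/objects/$dir/$file"
}

loose_object_path f752715dc02a0b6d9f62cc3115aad4aeaadddb69
# prints: .git/objects/f7/52715dc02a0b6d9f62cc3115aad4aeaadddb69
```

The helper only builds a path. It does not check that the file exists, and it says nothing about objects Git has stored somewhere else.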

Event Timeline

epriestley triaged this task as Wishlist priority. Mar 16 2018, 3:25 AM
epriestley created this task.