# About YouTube Ids

This document explains the secret behind YouTube ids.

## Requirements

It's recommended that the reader understands Base64 and binary to some degree and is able to convert numbers from one base to another.

## General Format of the Id

A YouTube video id is always a string of 11 characters. The character set for this is:

- All uppercase letters (`26`)
- All lowercase letters (`26`)
- All digits (`10`)
- The symbols "_" and "-" (`2`)

It can be described by the regular expression `^[\w\-]{11}$`

This gives us `26+26+10+2=64` symbols. Because of that, the encoding is known as Base64. Normal Base64 uses `+` and `/`, but these are not safe for URL usage, so `+` is replaced with `-` and `/` with `_`. Apart from that, the encoding is identical.

## How Base64 Works

Base64 works by mapping 8-bit binary into a smaller space. A byte holds 8 bits, while a Base64 character holds only 6. The least common multiple of 8 and 6 is `3*8=4*6=24`. This means we can map 3 bytes of 8 bits each evenly onto 4 Base64 characters of 6 bits each:

![Base256 to Base64 Mapping](data/yt_mapping.png)

### Multiples of 3

Because of these characteristics, the source data is processed in groups of 3 bytes. This becomes problematic if the source data length is not a multiple of 3. To solve this problem, the following is done:

1. Pad the last chunk to the right with zero bytes to be 3 bytes long.
2. Convert the chunk into the 4 Base64 characters.
3. Replace the rightmost characters with `=`. Replace as many characters as we added bytes to the source.

This means the end of a Base64 string can only have 3 possible variations:

- `xxxx`: The source length is an exact multiple of 3
- `xxx=`: The source had to be padded with 1 byte
- `xx==`: The source had to be padded with 2 bytes

The variant `x===` doesn't exist because it would indicate that 3 bytes of padding were added. The source is only processed in blocks of 3 bytes, so such a block would contain no source data at all.
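
The three padding steps above can be sketched in Python. This is only an illustration, not code from the original page: `ALPHABET` and `encode_chunk` are hypothetical names, and the alphabet is the URL-safe one described earlier.

```python
# URL-safe Base64 alphabet: A-Z, a-z, 0-9, then "-" and "_".
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_"
)


def encode_chunk(chunk: bytes) -> str:
    """Encode one chunk of 1 to 3 bytes into 4 Base64 characters."""
    if not 1 <= len(chunk) <= 3:
        raise ValueError("a chunk holds 1 to 3 bytes")
    added = 3 - len(chunk)
    # Step 1: pad to the right with zero bytes up to 3 bytes.
    padded = chunk + b"\x00" * added
    # Step 2: read the 24 bits as four 6-bit values.
    bits = int.from_bytes(padded, "big")
    chars = [ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
    # Step 3: replace one character with "=" per padding byte added.
    if added:
        chars[4 - added:] = ["="] * added
    return "".join(chars)


for data in (b"\x01", b"\x01\x02", b"\x01\x02\x03"):
    print(len(data), encode_chunk(data))
# 1 AQ==
# 2 AQI=
# 3 AQID
```

The output shows the three possible endings: two `=` for one source byte, one `=` for two source bytes, and none for a full chunk.
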
#### Decoding

Decoding the last chunk works similarly:

1. All `=` are replaced with any valid Base64 character
2. The 4 characters are decoded into 3 bytes
3. At the end of the decoded bytes, as many bytes are discarded as there were `=`

## How 11 Symbols are Possible

You might realize that 11 is not a valid length for Base64 because only multiples of 4 are allowed. Because the length of Base64 has to be a multiple of 4, it's unnecessary to have any `=` at the end at all. A decoder can simply add one or two back if the Base64 string is too short.

## Deconstructing the Id

If we take the id `QH2-TGUlwu4`, we know we have to extend it to a multiple of 4 characters. This results in `QH2-TGUlwu4=`. If we split this up into its chunks, we get `QH2-`, `TGUl` and `wu4=`. Because each 4-character Base64 chunk represents 3 8-bit bytes, we know that the 8-bit data is at most 9 bytes long. Because of the single `=` at the end, we subtract one byte, leaving us with 8. 8 is a nice number in this regard because it means that the internal id of a YouTube video is very likely a 64-bit integer.

### The Last Chunk

The last chunk of the id (in this case `wu4=`) contains the data of two 8-bit bytes (because we subtract 1 for the padding). Two bytes require `2*8=16` bits of data, but the 3 characters provide `3*6=18` bits. This means the last two bits encoded in the Base64 chunk are not evaluated:

![Last segment of an 8-byte base64 decoding](data/yt_mapping_last.png)

The reason is that replacing a Base64 character with `=` only replaces 6 bits, not 8. Replacing a single Base64 character leaves us with two excess bits. A decoder will silently discard excess bits, so we can set them to whatever we want. This gives us four possible bit combinations. If we go through all of them, we get these ids:

- `QH2-TGUlwu4=`
- `QH2-TGUlwu5=`
- `QH2-TGUlwu6=`
- `QH2-TGUlwu7=`

#### Redirection limitation

YouTube used to redirect you to the id with the excess bits set to zero.
Since the new design that was forced upon us, they no longer do this. Now you just get an error message.

## Conclusion

- YouTube video ids are Base64 encoded using a URL-safe variant
- The `=` in ids is stripped because it's unnecessary
- The id decodes to an 8-byte (64-bit) integer
- We have two bits we can freely define without altering the real id
- Each and every video id has four Base64 representations
- YouTube used to redirect us to the id with the excess bits unset; since the new design, the other ids produce an error instead
- Because YouTube forces the two excess bits to zero, 3 out of 4 possible Base64 ids are invalid

### The Redirection

*Text in this chapter no longer applies. Since the new design, YT no longer redirects.*

YouTube happily redirects from `00000000001` to `00000000000`. The redirection happens for non-existing ids too, which indicates that they use the 64-bit integer as the database lookup key and not the Base64 characters. They always decode the Base64 number and redirect if needed before processing it. It's interesting, though, that they bother to redirect for non-existing ids. It looks like the first thing they do is decode the id and immediately re-encode the result to check whether it matches the original.

### Number of Videos

Because of the 64-bit integer, we will *only* be able to upload `18'446'744'073'709'551'616` videos before running out of ids.

## Better Regex

Because 3 out of 4 ids are invalid, a video id can't end with certain characters. Only these are allowed (in ascending order):

`A E I M Q U Y c g k o s w 0 4 8`

This gives us a stricter regular expression that still matches all possibly valid video ids:

`^[\w\-]{10}[AEIMQUYcgkosw048]=?$`

## Demo

[You can steal this section here](data/youtube_id_demo.html)

The demo takes a video id and shows whether it's a base id and what other ids are used for the same video. It will not check if the id is valid. It runs completely in your browser and doesn't even decode the id.
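
The same idea can be sketched in Python. This is not the code behind the demo page. `ALPHABET`, `STRICT_ID`, `is_base_id`, `aliases` and `decode_id` are hypothetical names chosen for illustration, and reading the 64 bits as a big-endian integer is an assumption, since the document only concludes that the id is "very likely" a 64-bit integer.

```python
import re

# URL-safe Base64 alphabet: A-Z, a-z, 0-9, then "-" and "_".
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_"
)

# Stricter pattern from the "Better Regex" section, without the optional "=".
STRICT_ID = re.compile(r"[A-Za-z0-9_-]{10}[AEIMQUYcgkosw048]")


def is_base_id(video_id: str) -> bool:
    """True if the id has the two excess bits set to zero."""
    return STRICT_ID.fullmatch(video_id) is not None


def aliases(video_id: str) -> list[str]:
    """Return the four ids that differ only in the excess bits, base id first."""
    last = ALPHABET.index(video_id[-1])
    base = last & ~0b11  # clear the two excess bits
    return [video_id[:-1] + ALPHABET[base | extra] for extra in range(4)]


def decode_id(video_id: str) -> int:
    """Decode an 11-character id to a 64-bit integer (big-endian is assumed)."""
    if len(video_id) != 11:
        raise ValueError("a video id has 11 characters")
    bits = 0
    for ch in video_id:
        bits = (bits << 6) | ALPHABET.index(ch)  # raises ValueError if invalid
    return bits >> 2  # 11 * 6 = 66 bits; drop the two excess bits


print(aliases("QH2-TGUlwu6"))
# ['QH2-TGUlwu4', 'QH2-TGUlwu5', 'QH2-TGUlwu6', 'QH2-TGUlwu7']
print(is_base_id("QH2-TGUlwu4"), is_base_id("QH2-TGUlwu6"))
# True False
```

Like the demo, `aliases` only looks at the last character and never decodes the id. `decode_id` is included only to show that all four aliases map to the same 64-bit value.
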