16th February 2025
Back last year, I wrote a piece about technical interview questions. Although the title mentions Laravel, most of the questions are applicable regardless of the stack - assuming it's for a web dev position. Since then I've added another question to my list: can you describe hashing, encrypting and encoding, giving examples of where it would be appropriate (or not) to use each of them?
These three terms all describe processes for changing data from one form into another, but in very different ways. Unfortunately, they're often misunderstood, even by people who work with computers and who write the software we use every day. Every few months a company appears in the news because they didn't understand the difference, used the wrong one, and exposed customer information. By understanding the basics of what each process does and where it can be used, we can avoid ending up in the news.
Hashing a file is like taking its fingerprint. The fingerprint identifies the file and if you hash the same file in the future, it will always give you the same fingerprint. There are many different ways (algorithms) of hashing a file. Some are faster, some are slower, and they can produce hashes of different lengths, so we can pick a different one depending on what we are using the hashes for.
An MD5 hash looks something like this: d1477051e96160180da7aa8704a98458
Each algorithm results in an output of a consistent length, regardless of how much data you put into it. Hashing a single letter or the complete works of Shakespeare will both result in a hash of the same length. Hashing algorithms are defined so that changing even a single character of the input data will result in a very different result.
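To make the fixed-length and "small change, big difference" properties concrete, here is a minimal sketch in Python using the standard `hashlib` module. The helper name `md5_hex` and the sample inputs are my own choices for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of some bytes as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# A single letter and a much longer input both give a 32-character hash.
short_hash = md5_hex(b"a")
long_hash = md5_hex(b"To be, or not to be, that is the question. " * 1000)
print(len(short_hash), short_hash)
print(len(long_hash), long_hash)

# Changing only the last character produces a completely different hash.
print(md5_hex(b"hello world"))
print(md5_hex(b"hello worle"))
```

Swapping `hashlib.md5` for `hashlib.sha256` would give a longer hash, but the same behaviour.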
A common use for hashing is comparing files. Hashing a file is deterministic, meaning it always gives the same result, so we can use it to see if two files are the same. Imagine you've got two copies of that important PowerPoint presentation you were preparing for work, one on your laptop and the other on a USB stick. They're both helpfully named Presentation_Final_FINAL_v2.ppt but the modified date on each of the files is different. How can you find out if they are the same? You could spend ages reading through each of the slides and comparing the contents, or you could hash each of the files and see if they produce the same result. If they produce different hashes it won't tell you what has changed, but if they are the same you will know very quickly that they are identical. One of the fastest algorithms for hashing a file is MD5 and it is often used for this sort of file comparison.
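The comparison described above could be sketched like this in Python. The function names `file_hash` and `same_contents` are hypothetical, and reading the file in chunks is just one reasonable way to avoid loading a large presentation into memory all at once.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path, algorithm: str = "md5") -> str:
    """Hash a file's contents in chunks and return the hex digest."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read 64 KiB at a time until the file is exhausted.
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def same_contents(first: Path, second: Path) -> bool:
    """Return True if both files produce the same hash."""
    return file_hash(first) == file_hash(second)
```

Calling `same_contents(Path("laptop/Presentation_Final_FINAL_v2.ppt"), Path("usb/Presentation_Final_FINAL_v2.ppt"))` (both paths made up for the example) would answer the question without opening either file in PowerPoint. Note that only the contents are hashed, so the differing modified dates don't affect the result.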
So why might we want a hashing algorithm to be slower? Another common use for hashing is for storing and comparing passwords. Unlike encrypting and encoding, hashing is one-way - this simply means you cannot recreate a file from its hash and it explains why passwords are stored as hashes. When you try to log in to a website with your username and password, the website will hash the password you enter and compare it with the hash that was generated when you first registered. This is how they can confirm you have supplied the correct password without knowing what the password is.
If a hacker manages to obtain the database of a website, they will have a list of all the usernames and password hashes, but because hashes are one-way they cannot reverse the hashes to get the passwords. What they can do is hash a list of common passwords (don't use common passwords folks!) using the same hashing algorithm and compare the result they get with what is in the database. If the passwords were hashed with a fast hashing algorithm (like MD5), then they can generate billions of hashes a second, potentially recovering some of the passwords in the database and giving them the ability to log in as those users. So when we hash passwords to be stored in a database, we explicitly choose a slower hashing algorithm. These algorithms still take mere fractions of a second to run, so a user logging into the site won't notice, but they slow the rate at which a hacker can generate hashes from billions a second down to thousands, making it much less attractive to try to "brute force".
Although it's not possible to regenerate the input data from a hash, fast algorithms make it cheap to hash huge numbers of candidate passwords in advance and store the results, alongside the input data, in what is known as a "rainbow table". In its simplest form, a rainbow table is a list of passwords alongside the corresponding hash of each password (real rainbow tables use a more compact structure, but the idea is the same). An attacker can then search the list of hashes for any that match one from the database and see what password generated the hash.
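The simple list-of-hashes idea can be sketched in a few lines of Python. This is a plain precomputed lookup table rather than a true rainbow table, and the password list and the names `lookup` and `recover` are made up for the example.

```python
import hashlib

# A tiny, made-up list of common passwords; real lists contain millions.
common_passwords = ["123456", "password", "qwerty", "letmein"]

# Hash every candidate once, up front, and index the passwords by their hash.
lookup = {hashlib.md5(p.encode("utf-8")).hexdigest(): p for p in common_passwords}


def recover(leaked_hash: str):
    """Return the password that produced leaked_hash, or None if unknown."""
    return lookup.get(leaked_hash)
```

Nothing is being reversed here: a hash is only "recovered" if the password was already in the list, which is why common passwords are such a bad idea.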
One protection against rainbow tables is "salting". This simply means adding a random string (the salt) to the input data before passing it through the hashing algorithm. The salt is different for each user and is stored alongside the hash so it can be reused when checking a password. That way, even if multiple people use the same password, the resulting hashes are different, rendering precomputed rainbow tables useless.
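Putting the slow algorithm and the salt together, a minimal sketch in Python might look like the following. It assumes PBKDF2 from the standard library as the deliberately slow algorithm; the iteration count and the function names are illustrative assumptions, and in a real application you would normally rely on your framework's password hashing rather than writing this yourself.

```python
import hashlib
import hmac
import secrets

# Assumption: a high iteration count is what makes each hash slow to compute.
ITERATIONS = 600_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) for storage; the salt is random per password."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the supplied password with the stored salt and compare."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(digest, expected)
```

Hashing the same password twice with `hash_password` gives two different digests because the salts differ, yet `verify_password` still succeeds for each pair of salt and digest.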
Another factor in how strong or reliable a hashing algorithm is, is the likelihood of what are called "hash collisions". This is where two different files result in the same hash (they have the same fingerprint). Normally, if you change only a single character in a text file, you will get wildly different results, but methods have been found for weaker algorithms, like MD5, that let attackers deliberately craft two different files that produce the same hash.
Encodings are like alphabets. The English language is encoded using an alphabet of 26 letters. Translate a sentence from English to Russian and it will instead use the Cyrillic alphabet, which has 33 characters. Translate it to Hebrew and it will use the 22 characters of the Hebrew alphabet. The information contained in the sentence hasn't changed, only how it is encoded.
If you only speak English and need to communicate with someone who only speaks Russian, you need to translate everything you want to say into Russian. In the same way, you can change the encoding of computer data in order to be able to communicate with systems that understand those encodings.
Computers store everything as binary. Like the language translation, they can store and represent all the same information, but they do it using an alphabet with only two characters: 1 and 0. In the same way, Morse code represents the English alphabet using only two characters: a dot and a dash. Transmitting something in binary, or Morse code, results in a greater number of "characters" being transmitted, but the ways in which they can be communicated are simplified, just like how you can send a message using Morse code by simply flashing your torch on and off.
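As a small illustration of the same information written in a two-character alphabet, here is a Python sketch that converts text to a string of 1s and 0s and back again. The function names are hypothetical, and it assumes the text is stored as UTF-8 bytes.

```python
def to_binary(text: str) -> str:
    """Encode text as space-separated groups of eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def from_binary(bits: str) -> str:
    """Decode space-separated groups of binary digits back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("utf-8")


print(to_binary("Hi"))  # 01001000 01101001
print(from_binary("01001000 01101001"))  # Hi
```

Two letters become sixteen binary digits, which is the "greater number of characters" trade-off, but nothing is lost in either direction.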
A very common encoding is Base64. As the name suggests, it's made up of an alphabet of 64 characters. This allows binary data to be encoded and transmitted on channels that are designed to support text. An example of this is embedding images into a web-page. Most of the time, we include an image by referencing a copy of the file that is stored somewhere else e.g.
<img src="location/of/the/image-file.jpg" alt="A remotely referenced image file" />
but it's also possible to embed the image directly into the HTML file by encoding it with Base64 e.g.
<img src="...kJafg==" alt="An image, encoded as Base64, embedded in the file" />
Encoding something is not secure and treating it as secure is one of the most common mistakes people make when handling data. They encode a piece of text as Base64 and assume that just because they cannot read it, nobody can. If you can't read Russian, then translating a sentence to Russian will hide its meaning from you, but anyone that understands Russian can still read it. In the same way, changing the encoding of a piece of data on a computer may make it unreadable to you, but it can still be understood, as long as you have the right alphabet.
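To see how little protection encoding offers, here is a short Python sketch using the standard `base64` module. The "secret" value is made up for the example.

```python
import base64

# A made-up value that someone might wrongly try to "hide" with Base64.
secret = "my-api-password"

# Encoding makes it look unreadable...
encoded = base64.b64encode(secret.encode("utf-8")).decode("ascii")
print(encoded)

# ...but anyone can decode it again, no key or password required.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```

The decode step needs nothing except knowing that the data is Base64, which is exactly the "right alphabet" problem described above.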
Encrypting a file is a means of securing the content so only the holders of a key (like a password) can read it. Once a file has been encrypted you can transmit it in the open, where anybody might get hold of it, knowing that only people with a copy of the key can read the message locked inside. Like encoding, encrypting a file retains all the contents of the original file, but now a key is required to make it readable.
Encryption comes in two flavours: symmetric and asymmetric. With symmetric encryption, the same key is used to encrypt and decrypt a message. This simplifies things a little, but it means you need to find a secure way to share the key with the person you want to communicate with, because if anybody else gets hold of the key, they can also use it to read your messages.
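The "same key in both directions" idea can be shown with a toy example in Python. This sketch uses a one-time pad (XOR with a random key as long as the message), which I've picked only because it fits in a few lines of standard library code. It is an assumption for illustration, not what real systems use: in practice you would reach for a vetted library and an algorithm such as AES, and a one-time pad key must never be reused.

```python
import secrets


def generate_key(length: int) -> bytes:
    """Create a random key with the same length as the message."""
    return secrets.token_bytes(length)


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the matching key byte.

    Running this once encrypts; running it again with the same key decrypts.
    """
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"Meet me at noon"
key = generate_key(len(message))

ciphertext = xor_bytes(message, key)    # encrypt with the shared key
recovered = xor_bytes(ciphertext, key)  # decrypt with the same key
print(ciphertext.hex())
print(recovered)
```

Anyone holding `key` can read the message and anyone without it cannot, which is why sharing the key securely is the hard part.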
Asymmetric encryption works using key pairs, which sounds more complicated, but brings a couple of advantages. One of the keys is the private key and, as the name suggests, it shouldn't be shared with anyone. The other key is the public key, which can be shared freely. Any message that is encrypted with the private key can only be decrypted with the public key and, vice-versa, any message encrypted using the public key can only be decrypted using the private key.
This gives two abilities. Firstly, if someone wants to send you a secret message, they encrypt it using your public key (which you share freely). Then, only the private key can be used to decrypt it. So even if someone else picks up the encrypted message, they cannot use the public key to decrypt it.
The second ability is proof of identity, which really means proof that you hold the private key of a key pair. You encrypt some agreed message using the private key, then anyone that has a copy of the public key can verify you hold the private key by simply decrypting the message using the related public key.
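Both abilities can be demonstrated with a toy key pair in Python. This is textbook RSA with tiny primes and no padding, chosen purely as an assumption for illustration. It is trivially breakable, and real key pairs should always be generated and used through a proper cryptography library. It needs Python 3.8 or later for `pow` with a negative exponent.

```python
# Toy key pair built from two tiny primes (real keys use enormous primes).
p, q = 61, 53
n = p * q                # modulus, shared by both keys
phi = (p - 1) * (q - 1)
e = 17                   # public exponent: (e, n) is the public key
d = pow(e, -1, phi)      # private exponent: (d, n) is the private key


def apply_key(value: int, exponent: int) -> int:
    """Transform a number (smaller than n) using one half of the key pair."""
    return pow(value, exponent, n)


message = 65  # a message represented as a small number for this sketch

# Ability 1: encrypt with the public key, decrypt with the private key.
ciphertext = apply_key(message, e)
decrypted = apply_key(ciphertext, d)

# Ability 2: transform with the private key, verify with the public key.
signature = apply_key(message, d)
verified = apply_key(signature, e)

print(decrypted == message, verified == message)
```

Applying the public key to `ciphertext` a second time does not give back the message, which is why picking up the encrypted message and the public key isn't enough to read it.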
Encrypting a message is the only way that it can be shared securely. If you need to secure a message or piece of data, then it cannot be hashed or encoded, but must be encrypted.
Hopefully, this note has given you a basic understanding of what hashing, encryption and encoding are, their differences, where they can be used and, most importantly, where they shouldn't be used. There's obviously a lot more to each of these processes than what I've covered here, so I'd encourage you to go digging to find out more. To get you started there are a couple of links at the bottom of this page that you may find interesting. I look forward to not reading about you in the news soon!