Notions of security: Perfect Secrecy

The definition of “security” may seem intuitive to most people (basically, other people can’t see your communications without your say-so), but formally defining a complete and rigorous definition of security is actually very difficult.

It’s also important to note that there isn’t a single, all-encompassing definition of security. There are many, and choosing to meet a particular definition of security has its benefits and drawbacks compared to other security schemes. Let’s explore one such definition: the notion of perfect secrecy.

Perfect secrecy

In layman’s terms, an encryption scheme is perfectly secret if an eavesdropper cannot gain any additional information about the original encrypted message simply by observing an eavesdropped ciphertext.

This is simple enough to think about, but vague notions like “additional information” need to be defined formally in order to have a definition that can be met.

Formally, the definition of perfect secrecy is as follows:

An encryption is considered perfectly secret if and only if for every probability distribution over the message space, for every m that is an element of the message space and every c that is an element of the ciphertext space, the following holds true for the randomly selected message M and randomly selected ciphertext C:

Pr[M = m | C = c] = Pr[M = m]

Or: being given a particular ciphertext does not change the probability of what the original message was (a.k.a. it doesn’t give any additional information on the original message).

This seems like a pretty good definition. However, it turns out that just because an encryption scheme meets the security notion of perfect secrecy doesn’t mean that it’s necessarily a secure scheme for all purposes. Let’s take the one-time pad as an example.

The One-Time Pad

The one-time pad encryption scheme is defined as the 3 algorithms, Gen, Enc, and Dec:

Gen generates a random uniform key k string of bits of length n
Enc takes a message m of length n bits and outputs a ciphertext c, which is the result of m XOR k (the bitwise XOR of the message and key)
Dec takes a ciphertext c of length n bits and outputs a message m, which is the result of c XOR k (the bitwise XOR of the message and key)

For example, let’s say Alice wants to securely encrypt and send the string “hi” to her friend Bob.

The pair generate and secretly share a key (perhaps written on a piece of paper, exchanged in a coffee shop). Let’s say the key is 10101010 10101010. Alice takes her message, “hi”, and converts it to binary (01101000 01101001). Alice runs Enc on her message, XORing it with the key.

    m   01101000 01101001
XOR k   10101010 10101010
-------------------------
    c   11000010 11000011

Alice sends the resulting ciphertext 11000010 11000011 to Bob. This ciphertext gets intercepted by an eavesdropper, but he can’t make head nor tail of it.

Bob receives the ciphertext and runs Dec on it. XORing it with the key again reverses the XOR performed by Enc.

    c   11000010 11000011
XOR k   10101010 10101010
-------------------------
m       01101000 01101001

Proving perfect secrecy of the One-Time Pad

The one-time pad is a scheme that meets the definition of perfect secrecy but has notable weaknesses. These weaknesses were exploited by the United States’ Venona project during the Cold War to crack Soviet messages encrypted using the one-time pad.

Weaknesses of the One-Time Pad

The first, most glaring weakness is one you may have noticed about key generation: the key in the one-time pad scheme must be the same length as the message you want to encrypt. This is impractical because you have to know the exact message you want to send before you generate and share the key; and if you’re going through the effort of secretly sharing a key, why not just share the message in the first place?

Perhaps if you happen to have a need to send many messages of the same length, sharing the key is practical because it can be used for each of the messages. However, this exposes another weakness of the one-time pad.

Note that the scheme is called the one-time pad. This is because, despite fulfilling the definition of perfect secrecy, the scheme is not secure when the same key is used more than once!

Because it is so difficult to produce and share a new key for each message, there is a common risk of communicators reusing keys with this scheme. Such was the case of the Soviet Union during the Cold War.

Breaking the Multi-Time Pad

When a the same key is used multiple times (i.e. multi-time pad), the attacker is able to glean information about the key and/or original messages from examining the ciphertexts in concert in a number of ways.

Intuitive Guessing

By knowing some factor related to the message being encrypted that is independent of the key and actual knowledge of the message itself, many intuitive guesses can be made about certain parts of the message and used to solve for the corresponding portion of the key.

As one example, let’s say that an attacker can be reasonably sure that the messages being encrypted are English sentences (e.g. domestic communications in the U.S. between businesses, where English would be the primary language). Sentences end in some punctuation (., !, ?). If multiple messages are encrypted using the same key (all messages and keys being the same length by necessity) then the resulting ciphertext endings may look something like this:

...2944A801E
...388569A1E
...28F409A1E
...5954C9D0F
...1954C831E
...4DD4C9D1E
...09305A71E

6 of these 7 ciphertexts end in 1E, which is the hexadecimal equivalent of 0001 1110. Simply by knowing with reasonable confidence that these are grammar-aware English sentences, it can be guessed that 1E correspond to the period, and 0F corresponds to another punctuation character.

Let’s try this guess. We XOR 1E with the ASCII representation of the . character (2E)

    0001 1110	ciphertext (1E)
XOR 0010 1110	our guess (.)
-----------------------------
    0011 0000	hypothesis-key

And then, we XOR the hypothesis-key with 0F:

    0011 0000	hypothesis-key
XOR 0000 1111	ciphertext (0F)
-----------------------------
    0011 1111	result message (3F)

3F is the character ?; our guess succeeds, so we have easily gleaned a bit of information just from a little, easily inferred or guessed knowledge about the message.

There are many other scenarios in which we can use factors independent of the encryption scheme to obtain some portion of the key or the messages. For example, if we know that the ciphertexts we have are encryptions of emails, if we see the first 4 or 5 characters are often the same among the ciphertexts we can guess that these correspond to a “Dear “, “Hello “, or “Hi “ beginning address.

Programmatic Guessing

Guessing at the beginning and ends of the ciphertexts can only get an attacker so far. Usually you can get the keys for the first few and last characters, but because the one-time pad scheme has a key that differs for each index, gaining this information doesn’t generally assist in obtaining the rest of the key/messages.

Another tactic that can be employed is to programmatically guess common words in the message language for each offset across each ciphertext. For example, “the” is a common English word and is pretty much guaranteed to be in any sufficiently-long English message. If we guess “the” for every offset from 0 to l-3 of the ciphertexts and for each ciphertext, we can observe the resulting 3-character messages produced and observe, intuitively, if they make sense (e.g. the string “can” makes sense, but “pzl” does not).

The attack code:

# Verifier function. Returns false if any of the characters in the possible solution are not English language characters.
def verify_results(results):
    for i in results:
        if i < 32 or i > 122: 
            return False
        else:
            if i < 48:
                if i != 32 and i != 33 and i != 44 and i != 45 and i != 46: return False
            elif i > 57 and i < 65:
                if i != 63 and i != 58 and i != 59: return False
            elif i > 90 and i < 97:
                return False
    return True	

# List of ciphertexts encrypted with the same key
c = [ ... ]

# Put ciphertexts into byte arrays
for i,e in enumerate(c):
    c[i] = bytearray.fromhex(e)

# ASCII representation of the word, "the"
the_ascii = [116, 104, 101]

# Try for each offset
for i in range(0, len(c)-3):
    
	# Try for each ciphertext
    for n in range(0, 7):
        crib_word = [the_ascii[0]^c[n][i], the_ascii[1]^c[n][i+1], the_ascii[2]^c[n][i+2]]       
        ciphers = []
        legit = True
        
        for j in range(0, 7):
            results = [crib_word[0]^c[j][i], crib_word[1]^c[j][i+1], crib_word[2]^c[j][i+2]]
            
            if verify_results(results):
                ciphers.append(results)
            else:
                # If the verification fails, end
                legit = False
                break
                
        if legit:
            print "Results for offset", i, "on ciphertext", n+1, ":"
            for x in ciphers:
                print chr(x[0]), chr(x[1]), chr(x[2])

One thing to note is that though the generation of results is programmatic, processing needs to be done manually. For n ciphertexts of length l, we would produce n(l-2) result sets, with each set containing n-1 results to process (the ones not being guessed for “the” on), for a total of n(n-1)(l-2) elements to process. O(n^2) elements grows very quickly and will quickly outstrip a person’s ability to manually process, so we introduce the verify_results function to clean the results of any guesses that produce unprintable or non-English characters, dramatically reducing the number of results needed to be processed.

For example, let’s run the code on the following set of ciphertexts:

c = ["BB3A65F6F0034FA957F6A767699CE7FABA855AFB4F2B520AEAD612944A801E",
     "BA7F24F2A35357A05CB8A16762C5A6AAAC924AE6447F0608A3D11388569A1E",
     "A67261BBB30651BA5CF6BA297ED0E7B4E9894AA95E300247F0C0028F409A1E",
     "A57261F5F0004BA74CF4AA2979D9A6B7AC854DA95E305203EC8515954C9D0F",
     "BB3A70F3B91D48E84DF0AB702ECFEEB5BC8C5DA94C301E0BECD241954C831E",
     "A6726DE8F01A50E849EDBC6C7C9CF2B2A88E19FD423E0647ECCB04DD4C9D1E",
     "BC7570BBBF1D46E85AF9AA6C7A9CEFA9E9825CFD5E3A0047F7CD009305A71E"]

The output:

Results for offset 6 on ciphertext 2 :
l a n
t h e
r r e
h o u
k   t
s   p
e   c
Results for offset 8 on ciphertext 5 :
n n i
e   o
e n t
u l d
t h e
p u r
c a d
Results for offset 11 on ciphertext 4 :
: x  
: s y
t o l
t h e
- ? s
1 m  
1 k  
Results for offset 18 on ciphertext 1 :
t h e
d u n
d : t
c : t
s : f
7 n h
r n t
Results for offset 19 on ciphertext 7 :
r y t
o r  
  h o
  h o
  z o
t t a
t h e
Results for offset 23 on ciphertext 3 :
9 r s
; ; t
t h e
0 t  
8 t w
t t n
t o h
Results for offset 26 on ciphertext 3 :
d s o
e o s
t h e
c r i
7 r i
r : i
v t

The 7 31-character ciphertexts would normally produce (7)(6)(31) = 1,302 results, but our verify function cleaned the results down to only 7 results; pretty useful.

Observing the results, there are 3 results that make any sense intuitively for English sentences:

Results for offset 6 on ciphertext 2 :
l a n
t h e
r r e
h o u
k   t
s   p
e   c
Results for offset 8 on ciphertext 5 :
n n i
e   o
e n t
u l d
t h e
p u r
c a d
Results for offset 19 on ciphertext 7 :
r y t
o r  
  h o
  h o
  z o
t t a
t h e

We cannot know for certain that these guesses are actually correct, but we can assume that they are and use these to explore the rest of the messages. Note that offsets 6 and 8 overlap on the 8th index, and the two results agree with one another, so although it doesn’t confirm anything, it suggests that this guess is a good one.

The resulting message guesses would be:

    0  1  2  3  4  5  6  7  8  9 10 11  ...   19 20 21   ...
c1                    l  a  n  n  i            r  y  t  
c2                    t  h  e  _  o            o  r  _
c3                    r  r  e  n  t            _  h  o
c4                    h  o  u  l  d            _  h  o
c5                    k  _  t  h  e            _  z  o
c6                    s  _  p  u  r            t  t  a
c7                    e  _  c  a  d            t  h  e

Thanks to this tactic, we now have more intuitive guesses to make. The result for c4’s indices 5-10 look like they should almost definitely be “should”. If we try guessing s for c4 on index 5:

    0  1  2  3  4  5  6  7  8  9 10 11  ...   19 20 21   ...
c1                 p  l  a  n  n  i            r  y  t  
c2                 _  t  h  e  _  o            o  r  _
c3                 u  r  r  e  n  t            _  h  o
c4                 s  h  o  u  l  d            _  h  o
c5                 n  k  _  t  h  e            _  z  o
c6                 i  s  _  p  u  r            t  t  a
c7                 n  e  _  c  a  d            t  h  e

Now we have more guesses: c3 for index 4 is probably c (“current”), and c1 for index 11 is probably g (“planning”). This illustrates the cascading effect of intuitive guessing: making intuitive guesses will often give way to more intuitive guesses. In fact, I was able to solve this entire ciphertext set just by chaining intuitive guesses:

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
c1  I  _  a  m  _  p  l  a  n  n  i  n  g  _  a  _  s  e  c  r  e  t  _  m  i  s  s  i  o  n  .
c2  H  e  _  i  s  _  t  h  e  _  o  n  l  y  _  p  e  r  s  o  n  _  t  o  _  t  r  u  s  t  .
c3  T  h  e  _  c  u  r  r  e  n  t  _  p  l  a  n  _  i  s  _  t  o  p  _  s  e  c  r  e  t  .
c4  W  h  e  n  _  s  h  o  u  l  d  _  w  e  _  m  e  e  t  _  t  o  _  d  o  _  t  h  i  s  ?
c5  I  _  t  h  i  n  k  _  t  h  e  y  _  s  h  o  u  l  d  _  f  o  l  l  o  w  _  h  i  m  .
c6  T  h  i  s  _  i  s  _  p  u  r  e  r  _  t  h  a  n  _  t  h  a  t  _  o  n  e  _  i  s  .
c7  N  o  t  _  o  n  e  _  c  a  d  e  t  _  i  s  _  b  e  t  t  e  r  _  t  h  a  n  _  1  .

Note that when indices 19-21 were reached, they were found to not make any sense in the context of the rest of the messages and were discarded.

Breaking the reused one-time pad is pretty simple this way, but this is still a mostly manual method. As the number of ciphertexts and the lenght of the messages it quickly becomes an unreasonable amount of work for a human to do manually. That’s why it’s useful to introduce a completely automated method of attack.

Frequency Analysis

If you’ve ever played Scrabble, you know that some letters are worth more points than others. This is because these letters are less common in the English language. The English letter frequency distribution looks something like this.

If you take the sum of the squares of each of the 26 letters, the result comes out to approximately 0.065. Having this target value, we now have an optimization problem where we can try and approach the letter distribution of our own guess to as close to the English language (that is, to have a sum of squares of letter frequency as close to 0.065) as possible - and computers excel at optimization.

So let’s once again set the scenario as an attacker having a set of ciphertexts all encrypted using the same key. Each index of these ciphertexts are encrypted using the same key-character. If we try decrypting all of the characters in this one index with the same key, it will produce a frequency distribution of the lowercase English characters. We can repeat this for each of the possible keys (256 possible 8-bit keys) and select the key that produces the sum of squares value closest to 0.065. Repeat this for each index of the ciphertexts, and we have a complete key which we can use to decode the messages.

Frequency analysis has the advantage that it can be automated by a computer (it would in fact be very difficult for a human to do manually), making it useful for large sets of ciphertexts.

However, this method has some weaknesses. First of all, it needs a sufficiently large number of ciphertexts to accurately recreate letter distributions. For example, our 7 ciphertext example is not nearly enough letters to be consistent with the English letter distribution; it would need a significant amount, say 100 or more. Additionally the method isn’t case-aware because it doesn’t differentiate between upper- and lower-case letters, whereas most English language sentences will contain a mix of upper- and lower-case characters. It also doesn’t account for punctuation and non-letter characters. Both of these are not necessarily big problems, however, if we choose to analyze only on lower-case characters. English text is primarily lower-case letters, and if we simply remove non-lowercase letters from our analysis we will still have a distribution that should roughly be consistent with the English letter distribution.

Summary

Defining a comprehensive and useful notion of security is quite difficult. As we’ve illustrated here with the concept of perfect secrecy, even definitions that seem very strong intuitively and theoretically often have their own practical failings when set to work in the real world.

It’s important to choose a notion of security that is strong enough to keep it secure from modern computational attacks, but also weak enough to allow practical flexibility in real-world use. This balance is different depending on the participating parties’ needs and capabilities - a casual online game may need a weak and flexible encryption scheme (it needs efficiency for the fast network communications and only minimal security), whereas emails between national security leadership may need a very strong scheme (very difficult for even the most persistent attackers, and being a government/security issue there are lots of resources and protocol available to conform to the needs of the scheme). It is important that parties choose encryption methods that best cater to their needs.