11 Hashing, Merkle-Damgard transform, SHA, HMAC
-----------------------------------------------

Hash functions serve a related but slightly different purpose to that
of MACs. Unlike those (and private key encryption) they are not
immediately useful on their own but rather are building blocks and
indeed are "a workhorse of cryptography" like e.g. solving systems of
linear equations is to engineering.

Intuitively, a hash function H() transforms a long string m into a
short digest ("hash") t=H(m) such that it is extremely difficult to
produce two texts x =/= x' with H(x)=H(x'). Such pair x,x' is known as
a *collision*.

As such, the notion cannot really be made sense of in the context of
computational complexity for once we have x=/=x' with H(x)=H(x') then
we can just hardwire x,x' into a program. In order to get around this,
we introduce as before randomly chosen keys and security parameters.
This will allow us to formulate meaningful security goals and
meaningful security proofs for various constructions involving hash
functions. 

Definition: A *hash function* consists of a (probabilistic polytime)
key generation algorithm Gen(n) as before and a polytime function H
taking as input a key s ("seed") and m:{0,1}^* then
H_s(m):{0,1}^{l(n)} where l(n) is a fixed, and typically simple,
length function, e.g. l(n)=n.

There may also be another length function l'(n)>l(n) and
H_s:{0,1}^{l'(n)} -> {0,1}^{l(n)}. In that case, we have a
fixed-length hash function.

Definition: A hash function Pi=(Gen,H) as above is *collision resistant* if
for all probabilistic polytime adversaries A one has
 
            Pr[Hash-coll_{A,Pi}(n)=1] = negl(n)

for some negligible function negl(n). Here, Hash-coll_{A,Pi}(n) is the
following experiment:

1) s <-Gen(n) /* s = "seed" */
2) A is given s and then outputs x,x'
3) The result of the experiment is 1 if x=/=x' and H_s(x)=H_s(x') and
0 otherwise.

Thus, the adversary must provide a collision given a key s and the
probability that it succeeds in doing so should be negligible. In
contrast, in the case of MACs the adversary does not get access to the
key.


Weaker notions of security
- - - - - - - - - - - - - -

In this context one may ask why being able to produce a collision
should be considered dangerous. Firstly, if a hash function is
collision resistant then it follows that it is also not possible to
forge a message x' =/=x such that H(x)=H(x') given x. So collision
resistance is certainly a "good enough" notion. However, one could
instead ask for the weaker property of "second preimage resistance":
it should be difficult to, given x, produce x'=/=x such that
H(x)=(x').

A typical application of hash functions is message integrity: If we
send someone a message x via some untrusted channel (which can be
modified by an adversary) and the hash value t=H(x) via a trusted
channel then the receiver can check that the message has not been
tampered with by checking that t=H(x') with x' being the received
(possibly modified) message.

While being able to produce an arbitrary collision may be harmless,
being able to find a collision in a given set of texts of size ~
2^{l(n)} can be very dangerous and is not covered by "second preimage
resistance": Consider the following scenario: we produce a letter L,
have its hash signed by a responsible person, and then send it to some
recipient. Now, it is very easy to produce functions
f,g:{0,1}^{l(n)}->{0,1}^* such that for each z the string f(z) is,
say, a different rejection letter, and g(z) is a different acceptance
letter. E.g. by introducing various switches dear/esteemed, Sir/Madam
or pleased/delighted etc. Or, if one is dealing with images rather
than letters by slightly altering intensity values. Then if, one can
produce z, z' such that H(f(z))=H(g(z')) one can have the hash
t=H(f(z)) signed and communicate along with g(z'), for which the hash
is also valid due to the collision. While this still requires
collisions of a special kind it is sufficiently close to arbitrary
collisions and for this reason one considers collision resistance as
the "gold standard" for hash functions.


Birthday paradox
- - - - - - - - -

Let us see, how many steps a brute force attack requires. In order to
do that, we (simplifyingly) assume that H_s is a random function. Let
us write H(x)=H_s(x). So we assume that H is drawn uniformly at random
from {0,1}^*->{0,1}^{l} where l=l(n). 

Clearly, if we fix x then the likelihood for x' drawn uniformly at
random to satisfy H(x)=H(x') is 2^{-l(n)} and thus, we need to try out
ca N/2 values (N=2^l) to get a 50% success probability. This also
holds, if x has been drawn at random. Things are different, if we
randomly choose a set of values S and then *search* for a collision
within S. Of course, if |S|>N then we find a collision with certainty,
but perhaps surprisingly |S|>=2*sqrt(N) ensures that the likelihood of
a collision within S is above 50%.

So, to find a collision one can choose a set S of size 2*sqrt(N), sort
it according to hash values and then search (in linear time) for the
desired collision. If one does not find one, repeat with new data.

Thus, for example if l(n)=64 then trying out 2^63 values will (with
probability >=50%) lead to a collision with a fixed left hand side,
but trying out 2^33 values will suffice to get an arbitrary collision
using the birthday paradox. While 2^63 steps can only be carried out
with special hardware resources, carrying out ~2^33 steps is trivial on
today's computers. One must bear this in mind when trying to estimate
"bits of security" of a hash function. Even assuming that the hash
function does not contain systematic security holes we will only get
one half of the hash size (size of the digest) many bits of security.


Improved birthday attack
- - - - - - - - - - - - -  

The implementation of the "birthday attack" sketched above takes not
only roughly O(sqrt(N)) steps but also O(sqrt(N)) memory which is a
more expensive resource. Indeed, storing 2^33 thirty-two byte words
requires 64TB. We now describe a version of the birthday attack that
runs in constant space.

If we choose x0:{0,1}^l uniformly at random then, assuming as we did,
that H is a random function then

S={x0,H(x0),H(H(x0)),H(H(H(x0))),...,H^{2*sqrt(N)-1}(x0)}

is also random and thus contains a collision with >=50%
likelihood. Let us write xi:=H^i(x0). Suppose that i,j are such that
xi=xj, but x_{i-1} =/= x_{j-1}, so x_{i-1},x_{j-1} form a
collision. Note that xi=H(x_{i-1}). Assuming i<j and writing j=i+d we
then also have x_i=x_{i+kd} for all k>=0 and thus x_{i+z}=x_{i+kd+z}
for all k>=0 and z>=0. If we choose k such that kd>=i then
x_{kd}=x_{2kd} follows with z=kd-i.

This suggests the following
algorithm for finding a collision with high likelihood:

x0 <- {0,1}^l /* uniformly at random */
x = x0;y = x0; 
for (r=1..4*sqrt(N)) {
   x = H(x); /* x=H^r(x0) */
   y = H(H(y)); /* y=H^{2r}(x0) */
   if (x==y) break;}
if (x!=y) start over with new x0; else
y=x;x=x0; /* y=H^{kd}(x0) */
if (x==x0) start over with new x0; else /* very unlikely */ 
for (i=0..r) do {
   x = H(x); /* x=H^i(x0) */
   y = H(y); /* y=H^{i+kd}(x0) */
   if (x==y) break; /* must happen before loop runs out */}
return H^{i-1}(x0),H^{i+kd-1}(x0); /* can optimize recomputation */


Merkle-Damgard Transform
- - - - - - - - - - - - -

The Merkle-Damgard Transform constructs a collision-resistant hash
function (Gen,H) from a given *fixed length* collision-resistant hash
function (Gen,h) with length 2*l(n), i.e. the h-hash is half as long
as the input. The key generation remains unchanged and H is as follows.

1) Given key s and input x:{0,1}^* divide message into blocks x1,x2,...,x_B
of length l=l(n) possibly padding the last block with zeroes. Set x_{B+1}=L. 

2) Set z0=0...0 /* l zeroes */ 

3) for i=1..B+1 do zi = h_s(z_{i-1} || x_i)

4) Output z_{B+1}


  x1           x2              x_B          L  
0 |  _____     |  _____         |  _____    |  _____
| |_|     |    |_|     |        |_|     |   |_|     |
|___| h_s |______| h_s |_...._____| h_s |_____| h_s |-----> H_s(x)
    |_____|      |_____|          |_____|     |_____|


Theorem: If (Gen,h) is a collision-resistant fixed length hash
function of length l'(n)=2*l(n) then (Gen,H) is a collision-resistant
hash function.

Proof: We first show that any collision in H_s yields a collision in
h_s.  Suppose that H_s(x)=H_s(x') and write x_1..x_B,L for the blocks
and length of x and x'_1..x'_B' L' for x'. Also write z_1..z_{B+1},
z'_1..z'_B' for the intermediate results. We have H_s(x)=h_s(z_B||L)
and H_s(x')=h_s(z'_B'||L'), so, if L=/=L' we already have produced the
desired collision in h_s. This also holds, if L=L' and z_B =/= z'_B
(notice that B=B' in this case). If L=L' and z_B=z'_B then, since
z_B=h_s(z_{B-1}||x_B) we get a collision unless x_B=x'_B. Continuing
in this way, we eventually find our collision since x_i =/= x'_i for
at least one i.

Now, suppose that A is an adversary against H_s. We construct an
adversary A' against h_s as follows. If A outputs x,x' we check
whether H_s(x)=H_s(x') and if so, construct the corresponding
collision h_s(y)=h_s(y') in h_s as described above and output
y,y'. Otherwise, we output 0,0. The likelihood for succeeding, i.e.,
for the output to be a h_s-collision is negligible by assumption and
thus the probability that A output a collision is negligible as
well. Notice that the computations we carried out as A' were clearly
in polynomial time. QED

The well-known and standardized cryptographic hash functions MD5,
SHA-1, SHA-2 use the Merkle-Damgard transform. The underlying
fixed-length hash function (called compression function) in this
context is built using the heuristic principles that were described in
Section 7 (Practical block ciphers. Principles: s/p- and Feistel
networks).  We remark here that MD5 and SHA-1 are no longer consider
sufficiently secure whereas SHA-2 is.

The latest standard SHA-3 which is not yet widely used but might
eventually supersede SHA-2 does not use Merkle-Damgard but instead
relies on the so-called *sponge construction*, see Wikipedia. 


MACs from hash functions
- - - - - - - - - - - - - 

A natural question is whether one can use hash functions also for the
purpose of authentication, i.e. MAC. First, we note that naive
attempts (which according to KL have been actually used in practice!)


do not work. Consider for example the definition Mac_k(x) = H_s(k||x)
where s is some public seed. If H_s was constructed with the
Merkle-Damgard transform from h_s then we can easily forge MACs as
follows: if t is the MAC of x then t'=h_s(t||L') is the
MAC of (x || |x|) (glossing over the padding).

Suppose that we are given a *fixed-length* MAC (Gen,Mac) capable of
authenticating messages of length l(n). Suppose furthermore, that we
have a collision-resistant hash function (Gen',H) which produces
hashes of length l(n). We can then build a MAC (Gen'',Mac'') for
arbitrary length messages as follows:

Gen''(n): k<-Gen(n); s<-Gen'(n);return (k,s) /* use some encoding of pairs */
Mac''_{(k,s)}(m): return Mac_k(H_s(m))

Theorem: Assuming that (Gen,Mac) is secure and (Gen',H) is
collision-resistant then (Gen'',Mac'') is secure.

Proof: Suppose that an adversary A asks questions Q and then succeeds
in forging a MAC t=Mac''_{(k,s)}(x), i.e., x is not in Q. If H_s(x) is
different from all the H_s(y) for y:Q then A has succeeded in forging
a MAC for Mac_k. Namely then t=Mac_k(H_s(x)) and H_s(x) was not
previously asked. If, however, H_s(x) is contained in {H_s(y) | y:Q}
then A has produced a collision. Since both events occur with
negligible probability the probability that A can forge a MAC is also
negligible. QED 

Suppose that H_s() above has been constructed using the Merkle-Damgard
construction from the fixed-length hash function h_s(). For seed s
define a fixed-length MAC by

                    Mac_k(m) = h_s(k||m)

We now make the assumption that h_s() is such that this is a secure
fixed-length MAC. Notice that the abovedescribed extension attack on
MACs of this sort does not apply due to the fixed length.

If we now apply the above hash-and-MAC construction then it turns out
that the arbitrary-length MAC obtained satisfies "essentially"

Mac''_{(k,s)}(m) = H_s(k||H_s(m))

because the last H_s degenerates to h_s due to the shortness of
H_s(m). The quoted "essentially" refers to the fact that for this
equation to hold exactly one needs to assume some conventions about
how exactly the padding in the Merkle-Damgard construction is
performed. We want to ignore these details here and refer to the
literature.


The HMAC construction
- - - - - - - - - - -

The widely used and standardised HMAC construction is essentially the
above except that another key is added to the message m prior to
invoking the inner hash function and that the two keys that are thus
needed are derived from a single one by XOR-ing with certain fixed
words ipad and opad:

HMAC_{(k,s)}(m) = H_s( (k+opad) || H_s((k+ipad) || m))

In practice, ipad and opad are chosen to be 00110110 and 01011100 each
repeated as often as necessary.

The additional inner keying with k+ipad allows one to weaken the
assumptions on the collision freeness of H_s(). The details of these
assumptions and their practical relevance are still subject to debate
(see a 2012 paper by Koblitz and Menezes); however, no practical
attack has been found to date and HMAC is considered secure in
practice.