11 Hashing, Merkle-Damgard transform, SHA, HMAC ----------------------------------------------- Hash functions serve a related but slightly different purpose to that of MACs. Unlike those (and private key encryption) they are not immediately useful on their own but rather are building blocks and indeed are "a workhorse of cryptography" like e.g. solving systems of linear equations is to engineering. Intuitively, a hash function H() transforms a long string m into a short digest ("hash") t=H(m) such that it is extremely difficult to produce two texts x =/= x' with H(x)=H(x'). Such pair x,x' is known as a *collision*. As such, the notion cannot really be made sense of in the context of computational complexity for once we have x=/=x' with H(x)=H(x') then we can just hardwire x,x' into a program. In order to get around this, we introduce as before randomly chosen keys and security parameters. This will allow us to formulate meaningful security goals and meaningful security proofs for various constructions involving hash functions. Definition: A *hash function* consists of a (probabilistic polytime) key generation algorithm Gen(n) as before and a polytime function H taking as input a key s ("seed") and m:{0,1}^* then H_s(m):{0,1}^{l(n)} where l(n) is a fixed, and typically simple, length function, e.g. l(n)=n. There may also be another length function l'(n)>l(n) and H_s:{0,1}^{l'(n)} -> {0,1}^{l(n)}. In that case, we have a fixed-length hash function. Definition: A hash function Pi=(Gen,H) as above is *collision resistant* if for all probabilistic polytime adversaries A one has Pr[Hash-coll_{A,Pi}(n)=1] = negl(n) for some negligible function negl(n). Here, Hash-coll_{A,Pi}(n) is the following experiment: 1) s <-Gen(n) /* s = "seed" */ 2) A is given s and then outputs x,x' 3) The result of the experiment is 1 if x=/=x' and H_s(x)=H_s(x') and 0 otherwise. Thus, the adversary must provide a collision given a key s and the probability that it succeeds in doing so should be negligible. In contrast, in the case of MACs the adversary does not get access to the key. Weaker notions of security - - - - - - - - - - - - - - In this context one may ask why being able to produce a collision should be considered dangerous. Firstly, if a hash function is collision resistant then it follows that it is also not possible to forge a message x' =/=x such that H(x)=H(x') given x. So collision resistance is certainly a "good enough" notion. However, one could instead ask for the weaker property of "second preimage resistance": it should be difficult to, given x, produce x'=/=x such that H(x)=(x'). A typical application of hash functions is message integrity: If we send someone a message x via some untrusted channel (which can be modified by an adversary) and the hash value t=H(x) via a trusted channel then the receiver can check that the message has not been tampered with by checking that t=H(x') with x' being the received (possibly modified) message. While being able to produce an arbitrary collision may be harmless, being able to find a collision in a given set of texts of size ~ 2^{l(n)} can be very dangerous and is not covered by "second preimage resistance": Consider the following scenario: we produce a letter L, have its hash signed by a responsible person, and then send it to some recipient. Now, it is very easy to produce functions f,g:{0,1}^{l(n)}->{0,1}^* such that for each z the string f(z) is, say, a different rejection letter, and g(z) is a different acceptance letter. E.g. by introducing various switches dear/esteemed, Sir/Madam or pleased/delighted etc. Or, if one is dealing with images rather than letters by slightly altering intensity values. Then if, one can produce z, z' such that H(f(z))=H(g(z')) one can have the hash t=H(f(z)) signed and communicate along with g(z'), for which the hash is also valid due to the collision. While this still requires collisions of a special kind it is sufficiently close to arbitrary collisions and for this reason one considers collision resistance as the "gold standard" for hash functions. Birthday paradox - - - - - - - - - Let us see, how many steps a brute force attack requires. In order to do that, we (simplifyingly) assume that H_s is a random function. Let us write H(x)=H_s(x). So we assume that H is drawn uniformly at random from {0,1}^*->{0,1}^{l} where l=l(n). Clearly, if we fix x then the likelihood for x' drawn uniformly at random to satisfy H(x)=H(x') is 2^{-l(n)} and thus, we need to try out ca N/2 values (N=2^l) to get a 50% success probability. This also holds, if x has been drawn at random. Things are different, if we randomly choose a set of values S and then *search* for a collision within S. Of course, if |S|>N then we find a collision with certainty, but perhaps surprisingly |S|>=2*sqrt(N) ensures that the likelihood of a collision within S is above 50%. So, to find a collision one can choose a set S of size 2*sqrt(N), sort it according to hash values and then search (in linear time) for the desired collision. If one does not find one, repeat with new data. Thus, for example if l(n)=64 then trying out 2^63 values will (with probability >=50%) lead to a collision with a fixed left hand side, but trying out 2^33 values will suffice to get an arbitrary collision using the birthday paradox. While 2^63 steps can only be carried out with special hardware resources, carrying out ~2^33 steps is trivial on today's computers. One must bear this in mind when trying to estimate "bits of security" of a hash function. Even assuming that the hash function does not contain systematic security holes we will only get one half of the hash size (size of the digest) many bits of security. Improved birthday attack - - - - - - - - - - - - - The implementation of the "birthday attack" sketched above takes not only roughly O(sqrt(N)) steps but also O(sqrt(N)) memory which is a more expensive resource. Indeed, storing 2^33 thirty-two byte words requires 64TB. We now describe a version of the birthday attack that runs in constant space. If we choose x0:{0,1}^l uniformly at random then, assuming as we did, that H is a random function then S={x0,H(x0),H(H(x0)),H(H(H(x0))),...,H^{2*sqrt(N)-1}(x0)} is also random and thus contains a collision with >=50% likelihood. Let us write xi:=H^i(x0). Suppose that i,j are such that xi=xj, but x_{i-1} =/= x_{j-1}, so x_{i-1},x_{j-1} form a collision. Note that xi=H(x_{i-1}). Assuming i=0 and thus x_{i+z}=x_{i+kd+z} for all k>=0 and z>=0. If we choose k such that kd>=i then x_{kd}=x_{2kd} follows with z=kd-i. This suggests the following algorithm for finding a collision with high likelihood: x0 <- {0,1}^l /* uniformly at random */ x = x0;y = x0; for (r=1..4*sqrt(N)) { x = H(x); /* x=H^r(x0) */ y = H(H(y)); /* y=H^{2r}(x0) */ if (x==y) break;} if (x!=y) start over with new x0; else y=x;x=x0; /* y=H^{kd}(x0) */ if (x==x0) start over with new x0; else /* very unlikely */ for (i=0..r) do { x = H(x); /* x=H^i(x0) */ y = H(y); /* y=H^{i+kd}(x0) */ if (x==y) break; /* must happen before loop runs out */} return H^{i-1}(x0),H^{i+kd-1}(x0); /* can optimize recomputation */ Merkle-Damgard Transform - - - - - - - - - - - - - The Merkle-Damgard Transform constructs a collision-resistant hash function (Gen,H) from a given *fixed length* collision-resistant hash function (Gen,h) with length 2*l(n), i.e. the h-hash is half as long as the input. The key generation remains unchanged and H is as follows. 1) Given key s and input x:{0,1}^* divide message into blocks x1,x2,...,x_B of length l=l(n) possibly padding the last block with zeroes. Set x_{B+1}=L. 2) Set z0=0...0 /* l zeroes */ 3) for i=1..B+1 do zi = h_s(z_{i-1} || x_i) 4) Output z_{B+1} x1 x2 x_B L 0 | _____ | _____ | _____ | _____ | |_| | |_| | |_| | |_| | |___| h_s |______| h_s |_...._____| h_s |_____| h_s |-----> H_s(x) |_____| |_____| |_____| |_____| Theorem: If (Gen,h) is a collision-resistant fixed length hash function of length l'(n)=2*l(n) then (Gen,H) is a collision-resistant hash function. Proof: We first show that any collision in H_s yields a collision in h_s. Suppose that H_s(x)=H_s(x') and write x_1..x_B,L for the blocks and length of x and x'_1..x'_B' L' for x'. Also write z_1..z_{B+1}, z'_1..z'_B' for the intermediate results. We have H_s(x)=h_s(z_B||L) and H_s(x')=h_s(z'_B'||L'), so, if L=/=L' we already have produced the desired collision in h_s. This also holds, if L=L' and z_B =/= z'_B (notice that B=B' in this case). If L=L' and z_B=z'_B then, since z_B=h_s(z_{B-1}||x_B) we get a collision unless x_B=x'_B. Continuing in this way, we eventually find our collision since x_i =/= x'_i for at least one i. Now, suppose that A is an adversary against H_s. We construct an adversary A' against h_s as follows. If A outputs x,x' we check whether H_s(x)=H_s(x') and if so, construct the corresponding collision h_s(y)=h_s(y') in h_s as described above and output y,y'. Otherwise, we output 0,0. The likelihood for succeeding, i.e., for the output to be a h_s-collision is negligible by assumption and thus the probability that A output a collision is negligible as well. Notice that the computations we carried out as A' were clearly in polynomial time. QED The well-known and standardized cryptographic hash functions MD5, SHA-1, SHA-2 use the Merkle-Damgard transform. The underlying fixed-length hash function (called compression function) in this context is built using the heuristic principles that were described in Section 7 (Practical block ciphers. Principles: s/p- and Feistel networks). We remark here that MD5 and SHA-1 are no longer consider sufficiently secure whereas SHA-2 is. The latest standard SHA-3 which is not yet widely used but might eventually supersede SHA-2 does not use Merkle-Damgard but instead relies on the so-called *sponge construction*, see Wikipedia. MACs from hash functions - - - - - - - - - - - - - A natural question is whether one can use hash functions also for the purpose of authentication, i.e. MAC. First, we note that naive attempts (which according to KL have been actually used in practice!) do not work. Consider for example the definition Mac_k(x) = H_s(k||x) where s is some public seed. If H_s was constructed with the Merkle-Damgard transform from h_s then we can easily forge MACs as follows: if t is the MAC of x then t'=h_s(t||L') is the MAC of (x || |x|) (glossing over the padding). Suppose that we are given a *fixed-length* MAC (Gen,Mac) capable of authenticating messages of length l(n). Suppose furthermore, that we have a collision-resistant hash function (Gen',H) which produces hashes of length l(n). We can then build a MAC (Gen'',Mac'') for arbitrary length messages as follows: Gen''(n): k<-Gen(n); s<-Gen'(n);return (k,s) /* use some encoding of pairs */ Mac''_{(k,s)}(m): return Mac_k(H_s(m)) Theorem: Assuming that (Gen,Mac) is secure and (Gen',H) is collision-resistant then (Gen'',Mac'') is secure. Proof: Suppose that an adversary A asks questions Q and then succeeds in forging a MAC t=Mac''_{(k,s)}(x), i.e., x is not in Q. If H_s(x) is different from all the H_s(y) for y:Q then A has succeeded in forging a MAC for Mac_k. Namely then t=Mac_k(H_s(x)) and H_s(x) was not previously asked. If, however, H_s(x) is contained in {H_s(y) | y:Q} then A has produced a collision. Since both events occur with negligible probability the probability that A can forge a MAC is also negligible. QED Suppose that H_s() above has been constructed using the Merkle-Damgard construction from the fixed-length hash function h_s(). For seed s define a fixed-length MAC by Mac_k(m) = h_s(k||m) We now make the assumption that h_s() is such that this is a secure fixed-length MAC. Notice that the abovedescribed extension attack on MACs of this sort does not apply due to the fixed length. If we now apply the above hash-and-MAC construction then it turns out that the arbitrary-length MAC obtained satisfies "essentially" Mac''_{(k,s)}(m) = H_s(k||H_s(m)) because the last H_s degenerates to h_s due to the shortness of H_s(m). The quoted "essentially" refers to the fact that for this equation to hold exactly one needs to assume some conventions about how exactly the padding in the Merkle-Damgard construction is performed. We want to ignore these details here and refer to the literature. The HMAC construction - - - - - - - - - - - The widely used and standardised HMAC construction is essentially the above except that another key is added to the message m prior to invoking the inner hash function and that the two keys that are thus needed are derived from a single one by XOR-ing with certain fixed words ipad and opad: HMAC_{(k,s)}(m) = H_s( (k+opad) || H_s((k+ipad) || m)) In practice, ipad and opad are chosen to be 00110110 and 01011100 each repeated as often as necessary. The additional inner keying with k+ipad allows one to weaken the assumptions on the collision freeness of H_s(). The details of these assumptions and their practical relevance are still subject to debate (see a 2012 paper by Koblitz and Menezes); however, no practical attack has been found to date and HMAC is considered secure in practice.