diff options
author | somov <somov@yandex-team.ru> | 2022-02-10 16:45:49 +0300 |
---|---|---|
committer | Daniil Cherednik <dcherednik@yandex-team.ru> | 2022-02-10 16:45:49 +0300 |
commit | 7489e4682331202b9c7d863c0898eb83d7b12c2b (patch) | |
tree | 9142afc54d335ea52910662635b898e79e192e49 /contrib/libs/highwayhash/README.md | |
parent | a5950576e397b1909261050b8c7da16db58f10b1 (diff) | |
download | ydb-7489e4682331202b9c7d863c0898eb83d7b12c2b.tar.gz |
Restoring authorship annotation for <somov@yandex-team.ru>. Commit 2 of 2.
Diffstat (limited to 'contrib/libs/highwayhash/README.md')
-rw-r--r-- | contrib/libs/highwayhash/README.md | 700 |
1 files changed, 350 insertions, 350 deletions
diff --git a/contrib/libs/highwayhash/README.md b/contrib/libs/highwayhash/README.md index 4d120be2a2..2d311c66c1 100644 --- a/contrib/libs/highwayhash/README.md +++ b/contrib/libs/highwayhash/README.md @@ -1,350 +1,350 @@ -Strong (well-distributed and unpredictable) hashes: - -* Portable implementation of - [SipHash](https://www.131002.net/siphash/siphash.pdf) -* HighwayHash, a 5x faster SIMD hash with [security - claims](https://arxiv.org/abs/1612.06257) - -## Quick Start - -To build on a Linux or Mac platform, simply run `make`. For Windows, we provide -a Visual Studio 2015 project in the `msvc` subdirectory. - -Run `benchmark` for speed measurements. `sip_hash_test` and `highwayhash_test` -ensure the implementations return known-good values for a given set of inputs. - -64-bit SipHash for any CPU: - - #include "highwayhash/sip_hash.h" - using namespace highwayhash; - const HH_U64 key2[2] HH_ALIGNAS(16) = {1234, 5678}; - char in[8] = {1}; - return SipHash(key2, in, 8); - -64, 128 or 256 bit HighwayHash for the CPU determined by compiler flags: - - #include "highwayhash/highwayhash.h" - using namespace highwayhash; - const HHKey key HH_ALIGNAS(32) = {1, 2, 3, 4}; - char in[8] = {1}; - HHResult64 result; // or HHResult128 or HHResult256 - HHStateT<HH_TARGET> state(key); - HighwayHashT(&state, in, 8, &result); - -64, 128 or 256 bit HighwayHash for the CPU on which we're currently running: - - #include "highwayhash/highwayhash_target.h" - #include "highwayhash/instruction_sets.h" - using namespace highwayhash; - const HHKey key HH_ALIGNAS(32) = {1, 2, 3, 4}; - char in[8] = {1}; - HHResult64 result; // or HHResult128 or HHResult256 - InstructionSets::Run<HighwayHash>(key, in, 8, &result); - -C-callable 64-bit HighwayHash for the CPU on which we're currently running: - - #include "highwayhash/c_bindings.h" - const uint64_t key[4] = {1, 2, 3, 4}; - char in[8] = {1}; - return HighwayHash64(key, in, 8); - -## Introduction - -Hash functions are widely used, so it is desirable to increase their speed and -security. This package provides two 'strong' (well-distributed and -unpredictable) hash functions: a faster version of SipHash, and an even faster -algorithm we call HighwayHash. - -SipHash is a fast but 'cryptographically strong' pseudo-random function by -Aumasson and Bernstein [https://www.131002.net/siphash/siphash.pdf]. - -HighwayHash is a new way of mixing inputs which may inspire new -cryptographically strong hashes. Large inputs are processed at a rate of 0.24 -cycles per byte, and latency remains low even for small inputs. HighwayHash is -faster than SipHash for all input sizes, with 5 times higher throughput at 1 -KiB. We discuss design choices and provide statistical analysis and preliminary -cryptanalysis in https://arxiv.org/abs/1612.06257. - -## Applications - -Unlike prior strong hashes, these functions are fast enough to be recommended -as safer replacements for weak hashes in many applications. The additional CPU -cost appears affordable, based on profiling data indicating C++ hash functions -account for less than 0.25% of CPU usage. - -Hash-based selection of random subsets is useful for A/B experiments and similar -applications. Such random generators are idempotent (repeatable and -deterministic), which is helpful for parallel algorithms and testing. To avoid -bias, it is important that the hash function be unpredictable and -indistinguishable from a uniform random generator. We have verified the bit -distribution and avalanche properties of SipHash and HighwayHash. - -64-bit hashes are also useful for authenticating short-lived messages such as -network/RPC packets. This requires that the hash function withstand -differential, length extension and other attacks. We have published a formal -security analysis for HighwayHash. New cryptanalysis tools may still need to be -developed for further analysis. - -Strong hashes are also important parts of methods for protecting hash tables -against unacceptable worst-case behavior and denial of service attacks -(see "hash flooding" below). - -## SipHash - -Our SipHash implementation is a fast and portable drop-in replacement for -the reference C code. Outputs are identical for the given test cases (messages -between 0 and 63 bytes). - -Interestingly, it is about twice as fast as a SIMD implementation using SSE4.1 -(https://goo.gl/80GBSD). This is presumably due to the lack of SIMD bit rotate -instructions. - -SipHash13 is a faster but weaker variant with one mixing round per update and -three during finalization. - -We also provide a data-parallel 'tree hash' variant that enables efficient SIMD -while retaining safety guarantees. This is about twice as fast as SipHash, but -does not return the same results. - -## HighwayHash - -We have devised a new way of mixing inputs with AVX2 multiply and permute -instructions. The multiplications are 32x32 -> 64 bits and therefore infeasible -to reverse. Permuting equalizes the distribution of the resulting bytes. - -The internal state occupies four 256-bit AVX2 registers. Due to limitations of -the instruction set, the registers are partitioned into two 512-bit halves that -remain independent until the reduce phase. The algorithm outputs 64 bit digests -or up to 256 bits at no extra cost. - -In addition to high throughput, the algorithm is designed for low finalization -cost. The result is more than twice as fast as SipTreeHash. - -For older CPUs, we also provide an SSE4.1 version (80% as fast for large inputs -and 95% as fast for short inputs) and a portable version (10% as fast). - -Statistical analyses and preliminary cryptanalysis are given in -https://arxiv.org/abs/1612.06257. - -## Versioning and stability - -SipHash and HighwayHash 1.0 are 'fingerprint functions' whose input -> hash -mapping will not change. This is important for applications that write hashes to -persistent storage. - -HighwayHash has not yet reached 1.0 and may still change in the near future. We -will announce when it is frozen. - -## Speed measurements - -To measure the CPU cost of a hash function, we can either create an artificial -'microbenchmark' (easier to control, but probably not representative of the -actual runtime), or insert instrumentation directly into an application (risks -influencing the results through observer overhead). We provide novel variants of -both approaches that mitigate their respective disadvantages. - -profiler.h uses software write-combining to stream program traces to memory -with minimal overhead. These can be analyzed offline, or when memory is full, -to learn how much time was spent in each (possibly nested) zone. - -nanobenchmark.h enables cycle-accurate measurements of very short functions. -It uses CPU fences and robust statistics to minimize variability, and also -avoids unrealistic branch prediction effects. - -We compile the C++ implementations with a patched GCC 4.9 and run on a single -core of a Xeon E5-2690 v3 clocked at 2.6 GHz. CPU cost is measured as cycles per -byte for various input sizes: - -Algorithm | 8 | 31 | 32 | 63 | 64 | 1024 ----------------- | ----- | ---- | ---- | ---- | ---- | ---- -HighwayHashAVX2 | 7.34 | 1.81 | 1.71 | 1.04 | 0.95 | 0.24 -HighwayHashSSE41 | 8.00 | 2.11 | 1.75 | 1.13 | 0.96 | 0.30 -SipTreeHash | 16.51 | 4.57 | 4.09 | 2.22 | 2.29 | 0.57 -SipTreeHash13 | 12.33 | 3.47 | 3.06 | 1.68 | 1.63 | 0.33 -SipHash | 8.13 | 2.58 | 2.73 | 1.87 | 1.93 | 1.26 -SipHash13 | 6.96 | 2.09 | 2.12 | 1.32 | 1.33 | 0.68 - -SipTreeHash is slower than SipHash for small inputs because it processes blocks -of 32 bytes. AVX2 and SSE4.1 HighwayHash are faster than SipHash for all input -sizes due to their highly optimized handling of partial vectors. - -Note that previous measurements included the initialization of their input, -which dramatically increased timings especially for small inputs. - -## CPU requirements - -SipTreeHash[13] requires an AVX2-capable CPU (e.g. Haswell). HighwayHash -includes a dispatcher that chooses the best available (AVX2, SSE4.1 or portable) -implementation at runtime, as well as a directly callable function template that -can only run on the CPU for which it was built. SipHash[13] and -ScalarSipTreeHash[13] have no particular CPU requirements. - -Our implementations use custom AVX2 vector classes with overloaded operators -(e.g. `const V4x64U a = b + c`) for type-safety and improved readability vs. -compiler intrinsics (e.g. `const __m256i a = _mm256_add_epi64(b, c)`). - -We intend to port HighwayHash to other SIMD-capable platforms, especially ARM. - -Our instruction_sets dispatcher avoids running newer instructions on older CPUs -that do not support them. However, intrinsics, and therefore also any vector -classes that use them, require a compiler flag that also enables the compiler to -generate code for that CPU. This means the intrinsics must be placed in separate -translation units that are compiled with the required flags. It is important -that these source files and their headers not define any inline functions, -because that might break the one definition rule and cause crashes. - -To minimize dispatch overhead when hashes are computed often (e.g. in a loop), -we can inline the hash function into its caller using templates. The dispatch -overhead will only be paid once (e.g. before the loop). The template mechanism -also avoids duplicating code in each CPU-specific implementation. - -## Defending against hash flooding - -To mitigate hash flooding attacks, we need to take both the hash function and -the data structure into account. - -We wish to defend (web) services that utilize hash sets/maps against -denial-of-service attacks. Such data structures assign attacker-controlled -input messages `m` to a hash table bin `b` by computing the hash `H(s, m)` -using a hash function `H` seeded by `s`, and mapping it to a bin with some -narrowing function `b = R(h)`, discussed below. - -Attackers may attempt to trigger 'flooding' (excessive work in insertions or -lookups) by finding multiple `m` that map to the same bin. If the attacker has -local access, they can do far worse, so we assume the attacker can only issue -remote requests. If the attacker is able to send large numbers of requests, -they can already deny service, so we need only ensure the attacker's cost is -sufficiently large compared to the service's provisioning. - -If the hash function is 'weak', attackers can easily generate 'hash collisions' -(inputs mapping to the same hash values) that are independent of the seed. In -other words, certain input messages will cause collisions regardless of the seed -value. The author of SipHash has published C++ programs to generate such -'universal (key-independent) multicollisions' for CityHash and Murmur. Similar -'differential' attacks are likely possible for any hash function consisting only -of reversible operations (e.g. addition/multiplication/rotation) with a constant -operand. `n` requests with such inputs cause `n^2` work for an unprotected hash -table, which is unacceptable. - -By contrast, 'strong' hashes such as SipHash or HighwayHash require infeasible -attacker effort to find a hash collision (an expected 2^32 guesses of `m` per -the birthday paradox) or recover the seed (2^63 requests). These security claims -assume the seed is secret. It is reasonable to suppose `s` is initially unknown -to attackers, e.g. generated on startup or even per-connection. A timing attack -by Wool/Bar-Yosef recovers 13-bit seeds by testing all 8K possibilities using -millions of requests, which takes several days (even assuming unrealistic 150 us -round-trip times). It appears infeasible to recover 64-bit seeds in this way. - -However, attackers are only looking for multiple `m` mapping to the same bin -rather than identical hash values. We assume they know or are able to discover -the hash table size `p`. It is common to choose `p = 2^i` to enable an efficient -`R(h) := h & (p - 1)`, which simply retains the lower hash bits. It may be -easier for attackers to compute partial collisions where only the lower `i` bits -match. This can be prevented by choosing a prime `p` so that `R(h) := h % p` -incorporates all hash bits. The costly modulo operation can be avoided by -multiplying with the inverse (https://goo.gl/l7ASm8). An interesting alternative -suggested by Kyoung Jae Seo chooses a random subset of the `h` bits. Such an `R` -function can be computed in just 3 cycles using PEXT from the BMI2 instruction -set. This is expected to defend against SAT-solver attacks on the hash bits at a -slightly lower cost than the multiplicative inverse method, and still allows -power-of-two table sizes. - -Summary thus far: given a strong hash function and secret seed, it appears -infeasible for attackers to generate hash collisions because `s` and/or `R` are -unknown. However, they can still observe the timings of data structure -operations for various `m`. With typical table sizes of 2^10 to 2^17 entries, -attackers can detect some 'bin collisions' (inputs mapping to the same bin). -Although this will be costly for the attacker, they can then send many instances -of such inputs, so we need to limit the resulting work for our data structure. - -Hash tables with separate chaining typically store bin entries in a linked list, -so worst-case inputs lead to unacceptable linear-time lookup cost. We instead -seek optimal asymptotic worst-case complexity for each operation (insertion, -deletion and lookups), which is a constant factor times the logarithm of the -data structure size. This naturally leads to a tree-like data structure for each -bin. The Java8 HashMap only replaces its linked list with trees when needed. -This leads to additional cost and complexity for deciding whether a bin is a -list or tree. - -Our first proposal (suggested by Github user funny-falcon) avoids this overhead -by always storing one tree per bin. It may also be worthwhile to store the first -entry directly in the bin, which avoids allocating any tree nodes in the common -case where bins are sparsely populated. What kind of tree should be used? -Scapegoat and splay trees only offer amortized complexity guarantees, whereas -treaps require an entropy source and have higher constant factors in practice. -Self-balancing structures such as 2-3 or red-black trees require additional -bookkeeping information. We can hope to reduce rebalancing cost by realizing -that the output bits of strong `H` functions are uniformly distributed. When -using them as keys instead of the original message `m`, recent relaxed balancing -schemes such as left-leaning red-black or weak AVL trees may require fewer tree -rotations to maintain their invariants. Note that `H` already determines the -bin, so we should only use the remaining bits. 64-bit hashes are likely -sufficient for this purpose, and HighwayHash generates up to 256 bits. It seems -unlikely that attackers can craft inputs resulting in worst cases for both the -bin index and tree key without being able to generate hash collisions, which -would contradict the security claims of strong hashes. Even if they succeed, the -relaxed tree balancing still guarantees an upper bound on height and therefore -the worst-case operation cost. For the AVL variant, the constant factors are -slightly lower than for red-black trees. - -The second proposed approach uses augmented/de-amortized cuckoo hash tables -(https://goo.gl/PFwwkx). These guarantee worst-case `log n` bounds for all -operations, but only if the hash function is 'indistinguishable from random' -(uniformly distributed regardless of the input distribution), which is claimed -for SipHash and HighwayHash but certainly not for weak hashes. - -Both alternatives retain good average case performance and defend against -flooding by limiting the amount of extra work an attacker can cause. The first -approach guarantees an upper bound of `log n` additional work even if the hash -function is compromised. - -In summary, a strong hash function is not, by itself, sufficient to protect a -chained hash table from flooding attacks. However, strong hash functions are -important parts of two schemes for preventing denial of service. Using weak hash -functions can slightly accelerate the best-case and average-case performance of -a service, but at the risk of greatly reduced attack costs and worst-case -performance. - -## Third-party implementations / bindings - -Thanks to Damian Gryski for making us aware of these third-party -implementations or bindings. Please feel free to get in touch or -raise an issue and we'll add yours as well. - -By | Language | URL ---- | --- | --- -Damian Gryski | Go and SSE | https://github.com/dgryski/go-highway/ -Lovell Fuller | node.js bindings | https://github.com/lovell/highwayhash -Vinzent Steinberg | Rust bindings | https://github.com/vks/highwayhash-rs - -## Modules - -### Hashes - -* c_bindings.h declares C-callable versions of SipHash/HighwayHash. -* sip_hash.cc is the compatible implementation of SipHash, and also provides - the final reduction for sip_tree_hash. -* sip_tree_hash.cc is the faster but incompatible SIMD j-lanes tree hash. -* scalar_sip_tree_hash.cc is a non-SIMD version. -* state_helpers.h simplifies the implementation of the SipHash variants. -* highwayhash.h is our new, fast hash function. -* hh_avx2.h, hh_sse41.h and hh_portable.h are its various implementations. -* highwayhash_target.h chooses the best available implementation at runtime. - -### Infrastructure - -* arch_specific.h offers byte swapping and CPUID detection. -* compiler_specific.h defines some compiler-dependent language extensions. -* data_parallel.h provides a C++11 ThreadPool and PerThread (similar to - OpenMP). -* instruction_sets.h and targets.h enable efficient CPU-specific dispatching. -* nanobenchmark.h measures elapsed times with < 1 cycle variability. -* os_specific.h sets thread affinity and priority for benchmarking. -* profiler.h is a low-overhead, deterministic hierarchical profiler. -* tsc_timer.h obtains high-resolution timestamps without CPU reordering. -* vector256.h and vector128.h contain wrapper classes for AVX2 and SSE4.1. - -By Jan Wassenberg <jan.wassenberg@gmail.com> and Jyrki Alakuijala -<jyrki.alakuijala@gmail.com>, updated 2017-02-07 - -This is not an official Google product. +Strong (well-distributed and unpredictable) hashes: + +* Portable implementation of + [SipHash](https://www.131002.net/siphash/siphash.pdf) +* HighwayHash, a 5x faster SIMD hash with [security + claims](https://arxiv.org/abs/1612.06257) + +## Quick Start + +To build on a Linux or Mac platform, simply run `make`. For Windows, we provide +a Visual Studio 2015 project in the `msvc` subdirectory. + +Run `benchmark` for speed measurements. `sip_hash_test` and `highwayhash_test` +ensure the implementations return known-good values for a given set of inputs. + +64-bit SipHash for any CPU: + + #include "highwayhash/sip_hash.h" + using namespace highwayhash; + const HH_U64 key2[2] HH_ALIGNAS(16) = {1234, 5678}; + char in[8] = {1}; + return SipHash(key2, in, 8); + +64, 128 or 256 bit HighwayHash for the CPU determined by compiler flags: + + #include "highwayhash/highwayhash.h" + using namespace highwayhash; + const HHKey key HH_ALIGNAS(32) = {1, 2, 3, 4}; + char in[8] = {1}; + HHResult64 result; // or HHResult128 or HHResult256 + HHStateT<HH_TARGET> state(key); + HighwayHashT(&state, in, 8, &result); + +64, 128 or 256 bit HighwayHash for the CPU on which we're currently running: + + #include "highwayhash/highwayhash_target.h" + #include "highwayhash/instruction_sets.h" + using namespace highwayhash; + const HHKey key HH_ALIGNAS(32) = {1, 2, 3, 4}; + char in[8] = {1}; + HHResult64 result; // or HHResult128 or HHResult256 + InstructionSets::Run<HighwayHash>(key, in, 8, &result); + +C-callable 64-bit HighwayHash for the CPU on which we're currently running: + + #include "highwayhash/c_bindings.h" + const uint64_t key[4] = {1, 2, 3, 4}; + char in[8] = {1}; + return HighwayHash64(key, in, 8); + +## Introduction + +Hash functions are widely used, so it is desirable to increase their speed and +security. This package provides two 'strong' (well-distributed and +unpredictable) hash functions: a faster version of SipHash, and an even faster +algorithm we call HighwayHash. + +SipHash is a fast but 'cryptographically strong' pseudo-random function by +Aumasson and Bernstein [https://www.131002.net/siphash/siphash.pdf]. + +HighwayHash is a new way of mixing inputs which may inspire new +cryptographically strong hashes. Large inputs are processed at a rate of 0.24 +cycles per byte, and latency remains low even for small inputs. HighwayHash is +faster than SipHash for all input sizes, with 5 times higher throughput at 1 +KiB. We discuss design choices and provide statistical analysis and preliminary +cryptanalysis in https://arxiv.org/abs/1612.06257. + +## Applications + +Unlike prior strong hashes, these functions are fast enough to be recommended +as safer replacements for weak hashes in many applications. The additional CPU +cost appears affordable, based on profiling data indicating C++ hash functions +account for less than 0.25% of CPU usage. + +Hash-based selection of random subsets is useful for A/B experiments and similar +applications. Such random generators are idempotent (repeatable and +deterministic), which is helpful for parallel algorithms and testing. To avoid +bias, it is important that the hash function be unpredictable and +indistinguishable from a uniform random generator. We have verified the bit +distribution and avalanche properties of SipHash and HighwayHash. + +64-bit hashes are also useful for authenticating short-lived messages such as +network/RPC packets. This requires that the hash function withstand +differential, length extension and other attacks. We have published a formal +security analysis for HighwayHash. New cryptanalysis tools may still need to be +developed for further analysis. + +Strong hashes are also important parts of methods for protecting hash tables +against unacceptable worst-case behavior and denial of service attacks +(see "hash flooding" below). + +## SipHash + +Our SipHash implementation is a fast and portable drop-in replacement for +the reference C code. Outputs are identical for the given test cases (messages +between 0 and 63 bytes). + +Interestingly, it is about twice as fast as a SIMD implementation using SSE4.1 +(https://goo.gl/80GBSD). This is presumably due to the lack of SIMD bit rotate +instructions. + +SipHash13 is a faster but weaker variant with one mixing round per update and +three during finalization. + +We also provide a data-parallel 'tree hash' variant that enables efficient SIMD +while retaining safety guarantees. This is about twice as fast as SipHash, but +does not return the same results. + +## HighwayHash + +We have devised a new way of mixing inputs with AVX2 multiply and permute +instructions. The multiplications are 32x32 -> 64 bits and therefore infeasible +to reverse. Permuting equalizes the distribution of the resulting bytes. + +The internal state occupies four 256-bit AVX2 registers. Due to limitations of +the instruction set, the registers are partitioned into two 512-bit halves that +remain independent until the reduce phase. The algorithm outputs 64 bit digests +or up to 256 bits at no extra cost. + +In addition to high throughput, the algorithm is designed for low finalization +cost. The result is more than twice as fast as SipTreeHash. + +For older CPUs, we also provide an SSE4.1 version (80% as fast for large inputs +and 95% as fast for short inputs) and a portable version (10% as fast). + +Statistical analyses and preliminary cryptanalysis are given in +https://arxiv.org/abs/1612.06257. + +## Versioning and stability + +SipHash and HighwayHash 1.0 are 'fingerprint functions' whose input -> hash +mapping will not change. This is important for applications that write hashes to +persistent storage. + +HighwayHash has not yet reached 1.0 and may still change in the near future. We +will announce when it is frozen. + +## Speed measurements + +To measure the CPU cost of a hash function, we can either create an artificial +'microbenchmark' (easier to control, but probably not representative of the +actual runtime), or insert instrumentation directly into an application (risks +influencing the results through observer overhead). We provide novel variants of +both approaches that mitigate their respective disadvantages. + +profiler.h uses software write-combining to stream program traces to memory +with minimal overhead. These can be analyzed offline, or when memory is full, +to learn how much time was spent in each (possibly nested) zone. + +nanobenchmark.h enables cycle-accurate measurements of very short functions. +It uses CPU fences and robust statistics to minimize variability, and also +avoids unrealistic branch prediction effects. + +We compile the C++ implementations with a patched GCC 4.9 and run on a single +core of a Xeon E5-2690 v3 clocked at 2.6 GHz. CPU cost is measured as cycles per +byte for various input sizes: + +Algorithm | 8 | 31 | 32 | 63 | 64 | 1024 +---------------- | ----- | ---- | ---- | ---- | ---- | ---- +HighwayHashAVX2 | 7.34 | 1.81 | 1.71 | 1.04 | 0.95 | 0.24 +HighwayHashSSE41 | 8.00 | 2.11 | 1.75 | 1.13 | 0.96 | 0.30 +SipTreeHash | 16.51 | 4.57 | 4.09 | 2.22 | 2.29 | 0.57 +SipTreeHash13 | 12.33 | 3.47 | 3.06 | 1.68 | 1.63 | 0.33 +SipHash | 8.13 | 2.58 | 2.73 | 1.87 | 1.93 | 1.26 +SipHash13 | 6.96 | 2.09 | 2.12 | 1.32 | 1.33 | 0.68 + +SipTreeHash is slower than SipHash for small inputs because it processes blocks +of 32 bytes. AVX2 and SSE4.1 HighwayHash are faster than SipHash for all input +sizes due to their highly optimized handling of partial vectors. + +Note that previous measurements included the initialization of their input, +which dramatically increased timings especially for small inputs. + +## CPU requirements + +SipTreeHash[13] requires an AVX2-capable CPU (e.g. Haswell). HighwayHash +includes a dispatcher that chooses the best available (AVX2, SSE4.1 or portable) +implementation at runtime, as well as a directly callable function template that +can only run on the CPU for which it was built. SipHash[13] and +ScalarSipTreeHash[13] have no particular CPU requirements. + +Our implementations use custom AVX2 vector classes with overloaded operators +(e.g. `const V4x64U a = b + c`) for type-safety and improved readability vs. +compiler intrinsics (e.g. `const __m256i a = _mm256_add_epi64(b, c)`). + +We intend to port HighwayHash to other SIMD-capable platforms, especially ARM. + +Our instruction_sets dispatcher avoids running newer instructions on older CPUs +that do not support them. However, intrinsics, and therefore also any vector +classes that use them, require a compiler flag that also enables the compiler to +generate code for that CPU. This means the intrinsics must be placed in separate +translation units that are compiled with the required flags. It is important +that these source files and their headers not define any inline functions, +because that might break the one definition rule and cause crashes. + +To minimize dispatch overhead when hashes are computed often (e.g. in a loop), +we can inline the hash function into its caller using templates. The dispatch +overhead will only be paid once (e.g. before the loop). The template mechanism +also avoids duplicating code in each CPU-specific implementation. + +## Defending against hash flooding + +To mitigate hash flooding attacks, we need to take both the hash function and +the data structure into account. + +We wish to defend (web) services that utilize hash sets/maps against +denial-of-service attacks. Such data structures assign attacker-controlled +input messages `m` to a hash table bin `b` by computing the hash `H(s, m)` +using a hash function `H` seeded by `s`, and mapping it to a bin with some +narrowing function `b = R(h)`, discussed below. + +Attackers may attempt to trigger 'flooding' (excessive work in insertions or +lookups) by finding multiple `m` that map to the same bin. If the attacker has +local access, they can do far worse, so we assume the attacker can only issue +remote requests. If the attacker is able to send large numbers of requests, +they can already deny service, so we need only ensure the attacker's cost is +sufficiently large compared to the service's provisioning. + +If the hash function is 'weak', attackers can easily generate 'hash collisions' +(inputs mapping to the same hash values) that are independent of the seed. In +other words, certain input messages will cause collisions regardless of the seed +value. The author of SipHash has published C++ programs to generate such +'universal (key-independent) multicollisions' for CityHash and Murmur. Similar +'differential' attacks are likely possible for any hash function consisting only +of reversible operations (e.g. addition/multiplication/rotation) with a constant +operand. `n` requests with such inputs cause `n^2` work for an unprotected hash +table, which is unacceptable. + +By contrast, 'strong' hashes such as SipHash or HighwayHash require infeasible +attacker effort to find a hash collision (an expected 2^32 guesses of `m` per +the birthday paradox) or recover the seed (2^63 requests). These security claims +assume the seed is secret. It is reasonable to suppose `s` is initially unknown +to attackers, e.g. generated on startup or even per-connection. A timing attack +by Wool/Bar-Yosef recovers 13-bit seeds by testing all 8K possibilities using +millions of requests, which takes several days (even assuming unrealistic 150 us +round-trip times). It appears infeasible to recover 64-bit seeds in this way. + +However, attackers are only looking for multiple `m` mapping to the same bin +rather than identical hash values. We assume they know or are able to discover +the hash table size `p`. It is common to choose `p = 2^i` to enable an efficient +`R(h) := h & (p - 1)`, which simply retains the lower hash bits. It may be +easier for attackers to compute partial collisions where only the lower `i` bits +match. This can be prevented by choosing a prime `p` so that `R(h) := h % p` +incorporates all hash bits. The costly modulo operation can be avoided by +multiplying with the inverse (https://goo.gl/l7ASm8). An interesting alternative +suggested by Kyoung Jae Seo chooses a random subset of the `h` bits. Such an `R` +function can be computed in just 3 cycles using PEXT from the BMI2 instruction +set. This is expected to defend against SAT-solver attacks on the hash bits at a +slightly lower cost than the multiplicative inverse method, and still allows +power-of-two table sizes. + +Summary thus far: given a strong hash function and secret seed, it appears +infeasible for attackers to generate hash collisions because `s` and/or `R` are +unknown. However, they can still observe the timings of data structure +operations for various `m`. With typical table sizes of 2^10 to 2^17 entries, +attackers can detect some 'bin collisions' (inputs mapping to the same bin). +Although this will be costly for the attacker, they can then send many instances +of such inputs, so we need to limit the resulting work for our data structure. + +Hash tables with separate chaining typically store bin entries in a linked list, +so worst-case inputs lead to unacceptable linear-time lookup cost. We instead +seek optimal asymptotic worst-case complexity for each operation (insertion, +deletion and lookups), which is a constant factor times the logarithm of the +data structure size. This naturally leads to a tree-like data structure for each +bin. The Java8 HashMap only replaces its linked list with trees when needed. +This leads to additional cost and complexity for deciding whether a bin is a +list or tree. + +Our first proposal (suggested by Github user funny-falcon) avoids this overhead +by always storing one tree per bin. It may also be worthwhile to store the first +entry directly in the bin, which avoids allocating any tree nodes in the common +case where bins are sparsely populated. What kind of tree should be used? +Scapegoat and splay trees only offer amortized complexity guarantees, whereas +treaps require an entropy source and have higher constant factors in practice. +Self-balancing structures such as 2-3 or red-black trees require additional +bookkeeping information. We can hope to reduce rebalancing cost by realizing +that the output bits of strong `H` functions are uniformly distributed. When +using them as keys instead of the original message `m`, recent relaxed balancing +schemes such as left-leaning red-black or weak AVL trees may require fewer tree +rotations to maintain their invariants. Note that `H` already determines the +bin, so we should only use the remaining bits. 64-bit hashes are likely +sufficient for this purpose, and HighwayHash generates up to 256 bits. It seems +unlikely that attackers can craft inputs resulting in worst cases for both the +bin index and tree key without being able to generate hash collisions, which +would contradict the security claims of strong hashes. Even if they succeed, the +relaxed tree balancing still guarantees an upper bound on height and therefore +the worst-case operation cost. For the AVL variant, the constant factors are +slightly lower than for red-black trees. + +The second proposed approach uses augmented/de-amortized cuckoo hash tables +(https://goo.gl/PFwwkx). These guarantee worst-case `log n` bounds for all +operations, but only if the hash function is 'indistinguishable from random' +(uniformly distributed regardless of the input distribution), which is claimed +for SipHash and HighwayHash but certainly not for weak hashes. + +Both alternatives retain good average case performance and defend against +flooding by limiting the amount of extra work an attacker can cause. The first +approach guarantees an upper bound of `log n` additional work even if the hash +function is compromised. + +In summary, a strong hash function is not, by itself, sufficient to protect a +chained hash table from flooding attacks. However, strong hash functions are +important parts of two schemes for preventing denial of service. Using weak hash +functions can slightly accelerate the best-case and average-case performance of +a service, but at the risk of greatly reduced attack costs and worst-case +performance. + +## Third-party implementations / bindings + +Thanks to Damian Gryski for making us aware of these third-party +implementations or bindings. Please feel free to get in touch or +raise an issue and we'll add yours as well. + +By | Language | URL +--- | --- | --- +Damian Gryski | Go and SSE | https://github.com/dgryski/go-highway/ +Lovell Fuller | node.js bindings | https://github.com/lovell/highwayhash +Vinzent Steinberg | Rust bindings | https://github.com/vks/highwayhash-rs + +## Modules + +### Hashes + +* c_bindings.h declares C-callable versions of SipHash/HighwayHash. +* sip_hash.cc is the compatible implementation of SipHash, and also provides + the final reduction for sip_tree_hash. +* sip_tree_hash.cc is the faster but incompatible SIMD j-lanes tree hash. +* scalar_sip_tree_hash.cc is a non-SIMD version. +* state_helpers.h simplifies the implementation of the SipHash variants. +* highwayhash.h is our new, fast hash function. +* hh_avx2.h, hh_sse41.h and hh_portable.h are its various implementations. +* highwayhash_target.h chooses the best available implementation at runtime. + +### Infrastructure + +* arch_specific.h offers byte swapping and CPUID detection. +* compiler_specific.h defines some compiler-dependent language extensions. +* data_parallel.h provides a C++11 ThreadPool and PerThread (similar to + OpenMP). +* instruction_sets.h and targets.h enable efficient CPU-specific dispatching. +* nanobenchmark.h measures elapsed times with < 1 cycle variability. +* os_specific.h sets thread affinity and priority for benchmarking. +* profiler.h is a low-overhead, deterministic hierarchical profiler. +* tsc_timer.h obtains high-resolution timestamps without CPU reordering. +* vector256.h and vector128.h contain wrapper classes for AVX2 and SSE4.1. + +By Jan Wassenberg <jan.wassenberg@gmail.com> and Jyrki Alakuijala +<jyrki.alakuijala@gmail.com>, updated 2017-02-07 + +This is not an official Google product. |