.

hashing for dollars

Existing string munging on RX

Autokey

  • Used for CRIDs
  • Traditional key-based encryption
  • Jumbles text, retains symbol form and length
  • Very fast
  • Decryptable
  • Not particularly secure

Autokey

  • Input: all our words from loose using have lost their edge.
  • Key: secret
  • Output: kbu cjd hprjy qmud ilujd xpxkq xxsw hvng zgksj nqwa.

Autokey

  • Input: CRID EXAMPLE
  • Key: secret
  • Output: kbu cjd hprjy qmud ilujd xpxkq xxsw hvng zgksj nqwa.

MD5

  • Readily available
  • Non-decryptable
  • Very secure
  • Slow

MD5

  • Input: EXAMPLE HERE
  • Output: a68bf07db30dffa768179ecd86ab7adc

PubID/SiteID

  • Get a bunch of them
  • All different formats, lengths
  • Coming from different places
  • May include ones that a partner has seen already

First try: Autokey

Input Output
537122145-537226058 i2wu8sex4-unrnfnhhk
38944-104934 wezhx-0potoh
559903-112284 i4y26t-eu1rso
557793.AOL.com LLC-86591 dbw05g.AOL.bdf LLC-ewi20
8769-xapi:52856:se4JGbVW7HXU gdv2-tno7:y8yiz:r3oJGvVWsHXU

Business value

  • Uniform format
  • Easily understood
  • Judge content as coming from RX

Second try: MD5

Input Output
537122145-537226058 a67668e7c37a6b470ce17bb0aa2a329b
8769-xapi:52856:se4JGbVW7HXU 9a42e2dfddc2aa62558a7a10ce79dfb3

Third try: Siphash

Input Hex Output Decimal Output
537122145-537226058 42efc98c01b5b6d9 4823295329098381017
8769-xapi:52856:se4JGbVW7HXU d22e925f35323826 15145203534505588774

Siphash Performance

Algorithm Runtime
autokey 0.504 ms
md5 2.512 ms
siphash 9.464 ms

Google's HighwayHash

  • Faster (0.913 ms)
  • Cryptographic

Cryptographically Secure Hashes

  • MD5
  • SHA

Necessary for RX

  • Indexes our list
  • Are mostly unique
  • Fast
  • Short

Google's FarmHash

  • Based on CityHash
  • Based on MurmurHash
  • Not cryptographically secure
  • VERY fast

Farmhash Performance

Algorithm Runtime
autokey 0.504 ms
md5 2.512 ms
siphash 9.464 ms
farmhash 0.756 ms

Farmhash 32 bit Performance

Algorithm Runtime
autokey 0.504 ms
md5 2.512 ms
siphash 9.464 ms
farmhash64 0.756 ms
farmhash32 0.274 ms

Farmhash 32 bit output

Input Hex Output Decimal Output
537122145-537226058 8b8b9c75 2341182581
8769-xapi:52856:se4JGbVW7HXU b247f335 2991059765

Collisions

Collisions

algorithm 268k inputs 1mm inputs 6mm inputs
farmhash32 7 124 4171

Collisions

algorithm 268k inputs 1mm inputs 6mm inputs
farmhash32 7 124 4171
farmhash64 0 0 0

Figuring out bitsize

  • Run on 204,663 inputs
Digits Collisions Digits Collisions
1 204663 7 5486
2 204582 8 571
3 203772 9 53
4 195682 10 5
5 148710 11 1
6 45136 12 0

Statistical analysis

  • Treat hashes like random pickers
  • What's the chance you're going to pick the same one?

Statistical analysis

  • Treat hashes like random pickers
  • What's the chance you're going to pick the same one?
  • Start with the number of ways you can pick all of them:

\[{m \choose 2}\]

Choosing

\[{2 \choose 2} = 1\] \[{3 \choose 2} = 3\] \[{4 \choose 2} = 6\] \[{5 \choose 2} = 10\] \[{100 \choose 2} = 4950\]

Chance of the hash algorithm picking one way

\[\frac{1}{2^{b}}\]

Chance of the hash algorithm picking one way

\[\frac{1}{2^{b}}\]

\[\frac{1}{2^{32}} = 2.3 \times 10^{-10}\]

\[\frac{1}{2^{64}} = 5.4 \times 10^{-20}\]

Combining it all

\[{m \choose 2} \cdot \frac{1}{2^{b}}\]

Combining it all

Bits Collisions Predicted Bits Collisions Predicted
3.32 204663 \(2.0 \times 10^9\) 23.25 5486 2099
6.64 204582 \(2.0 \times 10^8\) 26.57 571 210
9.96 203772 \(2.1 \times 10^7\) 29.89 53 21
13.28 195682 \(2.1 \times 10^6\) 33.21 5 2.1
16.60 148710 210837 36.54 1 0.2
19.93 45136 20966 39.86 0 0.02

\[{204663 \choose 2} \cdot \frac{1}{2^{b}}\]

Master Equation

\[{m \choose 2} \cdot \frac{1}{2^{b}} = 1\]

Useful form 1

You know how many inputs (m). You how know many bits (b). How many collisions (c) should I get?

\[c = {m \choose 2} \cdot \frac{1}{2^{b}}\]

Useful form 2

You know how many inputs (m). You want the smallest number of bits (b) such that collisions should be rare.

\[b = \log_2(m^2 -m) - 1\]

Useful form 3

You know how many bits (b). You want to know the number of inputs (m) before you are likely to get collisions.

\[m = \frac{\big(\sqrt{2^{b+3} + 1} + 1\big)}{2}\]