Hashlib#

hashlib is a module in the Python standard library for computing hashes (message digests).

Basic examples#

Choose one of the hash algorithms available in hashlib, pass it the bytes to be hashed, and call the hexdigest method on the result.

import hashlib
hashlib.md5(b"hello").hexdigest()
'5d41402abc4b2a76b9719d911017c592'
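The same pattern applies to the other algorithms hashlib provides. The following is a minimal sketch using only standard-library calls (hashlib.sha256, hashlib.new, hashlib.algorithms_guaranteed and the update method); the input b"hello" is reused from the example above.

```python
import hashlib

data = b"hello"

# Algorithms guaranteed to be present on every platform.
print(sorted(hashlib.algorithms_guaranteed))

# The named constructor and the generic constructor give the same digest.
direct = hashlib.sha256(data).hexdigest()
by_name = hashlib.new("sha256", data).hexdigest()
print(direct == by_name)  # True

# Feeding the data in chunks with update() is equivalent to hashing it at once.
h = hashlib.md5()
h.update(b"hel")
h.update(b"lo")
print(h.hexdigest())  # 5d41402abc4b2a76b9719d911017c592, same as md5(b"hello")
```

Chunked updates are convenient when the data does not fit in memory, for example when hashing a large file piece by piece.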

To use hashlib with string variables, convert them to bytes using the str.encode method.

import hashlib
hashlib.md5("hello".encode()).hexdigest()
'5d41402abc4b2a76b9719d911017c592'

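To hash a numeric variable, convert it to bytes first. Note that calling bytes on a non-negative Python int n returns n zero bytes rather than an encoding of the number, so the example below hashes ten zero bytes. Distinct non-negative integers still produce distinct inputs, and the resulting object can be passed to hashlib.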

import hashlib
numeric_value = 10
hashlib.md5(bytes(numeric_value)).hexdigest()
'a63c90cc3684ad8b0a2176a6a8fe9005'
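It may be clearer to hash an explicit encoding of the number instead of the output of bytes. The sketch below shows two common options: the decimal text of the number, and a fixed-width big-endian encoding via int.to_bytes. The 8-byte width is an assumption chosen for illustration, not something the example above prescribes.

```python
import hashlib

numeric_value = 10

# bytes(n) on an int allocates n zero bytes; it does not encode the value.
print(bytes(numeric_value) == b"\x00" * 10)  # True

# Option 1: hash the decimal text of the number.
as_text = str(numeric_value).encode()
print(as_text)  # b'10'
text_digest = hashlib.md5(as_text).hexdigest()

# Option 2: hash a fixed-width binary encoding (8 bytes is an assumed width).
as_binary = numeric_value.to_bytes(8, byteorder="big")
print(as_binary)
binary_digest = hashlib.md5(as_binary).hexdigest()

# The inputs differ, so the digests differ as well.
print(text_digest != binary_digest)
```

Whichever encoding is chosen, it has to be used consistently, because each one yields a different digest for the same number.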

Sample split#

Hashes have a useful application in statistics: they let you generate independent and reproducible groupings. For the unique identifier of each record, compute a hash, convert it to an int, and take the remainder of dividing it by the number of groups; that remainder is the identifier of the group. This is particularly useful for A/B testing.

To convert the hash, pass the hexadecimal digest to the int function and specify base 16 as the second argument. Each unique identifier then maps to a practically unique number.

import hashlib
numeric_value = 10

# Named digest rather than hash, to avoid shadowing the built-in hash().
digest = hashlib.md5(bytes(numeric_value)).hexdigest()
print("hash", digest)
print("number", int(digest, 16))
hash a63c90cc3684ad8b0a2176a6a8fe9005
number 220966321958208791582980441802673131525
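The conversion above can be wrapped in a small helper. This is only a sketch: the name assign_group and its parameters are hypothetical, and identifiers are encoded as decimal text rather than with bytes so that any integer or string ID works.

```python
import hashlib


def assign_group(identifier, n_groups):
    """Map an identifier to a group index in range(n_groups) via its MD5 hash."""
    digest = hashlib.md5(str(identifier).encode()).hexdigest()
    return int(digest, 16) % n_groups


# The assignment is reproducible: the same ID always lands in the same group.
print(assign_group(10, 3) == assign_group(10, 3))  # True

# Every result is a valid group index.
groups = [assign_group(i, 3) for i in range(1000)]
print(set(groups) <= {0, 1, 2})  # True
```

Because the group depends only on the identifier, the assignment can be recomputed at any time without storing it.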

The following example shows how to divide the sample into three almost equal groups: the remainder of dividing the hash value by 3 determines the group.

import hashlib

import numpy as np
import pandas as pd

np.random.seed(10)
sample = np.random.normal(3, 10, 10000)
groups_number = 3

# Hash each value, convert the digest to an int, and take the remainder.
(pd.Series(
    map(
        lambda n: int(hashlib.md5(bytes(n)).hexdigest(), 16) % groups_number,
        sample
    )
).value_counts(normalize=True)*100).to_frame()
proportion
0 33.93
1 33.52
2 32.55

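The following example divides the sample into roughly 20/80 per cent proportions. The remainder of dividing the hash value by 100 is compared with a threshold of 20. Strictly speaking, the comparison > 20 sends the 21 remainders from 0 to 20 to the False group, so the expected split is 21/79; using >= 20 would give exactly 20/80. As the output shows, the observed split is close to 20/80.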

import hashlib

import numpy as np
import pandas as pd

np.random.seed(10)
sample = np.random.normal(3, 10, 10000)

# True when the remainder modulo 100 exceeds the threshold of 20.
(pd.Series(
    map(
        lambda n: (int(hashlib.md5(bytes(n)).hexdigest(), 16) % 100) > 20,
        sample
    )
).value_counts(normalize=True)*100).to_frame()
proportion
True 79.44
False 20.56
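The threshold logic can be checked without random data. In the sketch below, the helper name in_large_group, its small_share parameter and the use of >= are assumptions, chosen so that exactly 20 of the 100 possible remainders fall into the small group. Identifiers are encoded as decimal text.

```python
import hashlib


def in_large_group(identifier, small_share=20):
    """Return True for roughly (100 - small_share) per cent of identifiers."""
    digest = hashlib.md5(str(identifier).encode()).hexdigest()
    return int(digest, 16) % 100 >= small_share


# Of the 100 possible remainders, "> 20" keeps 79 and ">= 20" keeps 80.
kept_strict = sum(r > 20 for r in range(100))
kept_inclusive = sum(r >= 20 for r in range(100))
print(kept_strict)     # 79
print(kept_inclusive)  # 80

# Observed share on sequential IDs: close to 80 per cent, but not exactly 80.
share = sum(in_large_group(i) for i in range(10000)) / 10000
print(share)
```

The observed share only approximates the target because the hash spreads identifiers across remainders approximately, not perfectly, evenly.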