Hashlib#
The hashlib module from the Python standard library computes hashes (message digests) of byte data.
Basic examples#
Select one of the hash algorithms available in hashlib, pass it the bytes to be hashed, and call the hexdigest method on the result.
import hashlib
hashlib.md5(b"hello").hexdigest()
'5d41402abc4b2a76b9719d911017c592'
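As a minimal sketch of what "selecting an algorithm" can look like, the snippet below hashes the same bytes with two algorithms and also picks one by name with hashlib.new. The variable names are illustrative only, and the list printed by algorithms_guaranteed depends on the Python version.

```python
import hashlib

data = b"hello"

# Constructors for the guaranteed algorithms are exposed as module attributes.
md5_hex = hashlib.md5(data).hexdigest()
sha256_hex = hashlib.sha256(data).hexdigest()

# hashlib.new() selects the algorithm by name at run time.
sha256_by_name = hashlib.new("sha256", data).hexdigest()

# Names of the algorithms that every platform is expected to support.
print(sorted(hashlib.algorithms_guaranteed))
print("md5   ", md5_hex)
print("sha256", sha256_hex)
print("same digest via hashlib.new:", sha256_hex == sha256_by_name)
```

The hexdigest length differs between algorithms (32 hexadecimal characters for MD5, 64 for SHA-256), which is worth remembering when digests are stored.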
To use hashlib with string variables, convert them to bytes using the str.encode method.
import hashlib
hashlib.md5("hello".encode()).hexdigest()
'5d41402abc4b2a76b9719d911017c592'
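To apply hashlib to a numeric variable, first convert the number to bytes, for example with the bytes function, and pass the resulting object to hashlib. Note that bytes called on a non-negative integer n returns n zero bytes rather than an encoding of the value, so it fails for negative numbers and allocates a large buffer for large ones.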
import hashlib
numeric_value = 10
hashlib.md5(bytes(numeric_value)).hexdigest()
'a63c90cc3684ad8b0a2176a6a8fe9005'
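Because bytes applied to an integer yields a run of zero bytes, it may be safer to hash an explicit representation of the number. The sketch below shows two possible alternatives; the choice of decimal text or of an 8-byte signed big-endian layout is an assumption made for illustration, not something prescribed by hashlib.

```python
import hashlib

n = 10

# bytes(n) for an int n is n zero bytes, not an encoding of the value itself.
zero_filled = bytes(n)
print(zero_filled == b"\x00" * 10)

# Alternative 1: hash the decimal text of the number.
as_text = str(n).encode()

# Alternative 2: hash a fixed-width representation
# (8 bytes, big-endian, signed: an assumed layout for this sketch).
as_fixed = n.to_bytes(8, byteorder="big", signed=True)

digest_text = hashlib.md5(as_text).hexdigest()
digest_fixed = hashlib.md5(as_fixed).hexdigest()
print("text  ", digest_text)
print("fixed ", digest_fixed)
```

Whichever representation is chosen has to be used consistently, since each one produces a different digest for the same number.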
Sample split#
Hashes have a useful application in statistics: they let you generate an independent and reproducible grouping of records. For each record, hash its unique identifier, convert the hash to an int, and use the remainder of a division as the group identifier. This is particularly useful for A/B testing.
To do this, pass the hexadecimal digest to the int function and specify base 16 as the second argument. Each unique identifier then maps to its own fixed number.
import hashlib
numeric_value = 10
# Avoid the name "hash", which would shadow the built-in function.
hex_digest = hashlib.md5(bytes(numeric_value)).hexdigest()
print("hash", hex_digest)
print("number", int(hex_digest, 16))
hash a63c90cc3684ad8b0a2176a6a8fe9005
number 220966321958208791582980441802673131525
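The steps described above (hash the identifier, convert to int, take the remainder) can be wrapped in a small helper. This is a sketch using only the standard library; the function name assign_group and the "user-N" identifiers are hypothetical and chosen just for the example.

```python
import hashlib
from collections import Counter


def assign_group(identifier: str, n_groups: int) -> int:
    """Map an identifier to a group number in range(n_groups)."""
    digest = hashlib.md5(identifier.encode()).hexdigest()
    # Interpret the hexadecimal digest as an integer and take the remainder.
    return int(digest, 16) % n_groups


# Hypothetical record identifiers.
user_ids = [f"user-{i}" for i in range(9000)]
group_sizes = Counter(assign_group(uid, 3) for uid in user_ids)
print(sorted(group_sizes.items()))

# The same identifier always lands in the same group.
print(assign_group("user-42", 3) == assign_group("user-42", 3))
```

Since the group depends only on the identifier, the assignment can be recomputed at any time without storing it.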
The following example shows how to divide the sample into three almost equal groups: the remainder of dividing the hash number by 3 defines the final group.
import hashlib
import numpy as np
import pandas as pd

np.random.seed(10)
sample = np.random.normal(3, 10, 10000)
groups_number = 3

# Hash the raw bytes of each value; the remainder is the group id.
(pd.Series(
    map(
        lambda n: int(hashlib.md5(bytes(n)).hexdigest(), 16) % groups_number,
        sample
    )
).value_counts(normalize=True)*100).to_frame()
| | proportion |
|---|---|
| 0 | 33.93 |
| 1 | 33.52 |
| 2 | 32.55 |
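The following example divides the sample into groups of roughly 20 and 80 per cent. The hash number is reduced to the remainder of division by 100 (a value from 0 to 99) and compared with the threshold 20. Because the comparison is strict, remainders 0 to 20 (21 values) fall into the False group, so the expected split is 21/79; use >= 20 for an exact 20/80 expectation. As you can see, the sample is divided into groups of almost 20/80.
import hashlib
import numpy as np
import pandas as pd

np.random.seed(10)
sample = np.random.normal(3, 10, 10000)

# True for remainders 21..99, False for remainders 0..20.
(pd.Series(
    map(
        lambda n: (int(hashlib.md5(bytes(n)).hexdigest(), 16) % 100) > 20,
        sample
    )
).value_counts(normalize=True)*100).to_frame()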
| | proportion |
|---|---|
| True | 79.44 |
| False | 20.56 |
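For A/B testing, the threshold comparison can be written as a reusable function. The sketch below is one possible shape, not an established API: the name in_treatment, the percent parameter, and the optional salt (an assumed way to get a different grouping per experiment from the same identifiers) are all hypothetical.

```python
import hashlib
from collections import Counter


def in_treatment(identifier: str, percent: int, salt: str = "") -> bool:
    """Return True for roughly `percent` per cent of identifiers."""
    digest = hashlib.md5((salt + identifier).encode()).hexdigest()
    bucket = int(digest, 16) % 100  # a value from 0 to 99
    # Buckets 0..percent-1 are selected, i.e. exactly `percent` of 100 buckets.
    return bucket < percent


# Hypothetical record identifiers.
user_ids = [f"user-{i}" for i in range(10_000)]
counts = Counter(in_treatment(uid, 20) for uid in user_ids)
share = counts[True] / len(user_ids)
print(f"treatment share: {share:.2%}")

# A different salt gives another reproducible split of the same identifiers.
counts_other = Counter(in_treatment(uid, 20, salt="experiment-2") for uid in user_ids)
print(f"treatment share with salt: {counts_other[True] / len(user_ids):.2%}")
```

Comparing with `bucket < percent` keeps the expected share equal to the requested percentage, which avoids the off-by-one effect of a strict `> 20` comparison.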