TF-IDF#
TF-IDF (term frequency / inverse document frequency) is a method for extracting numerical features from text for use in machine learning models.
from math import log
from collections import Counter
from IPython.display import HTML
import latex2mathml.converter
# set of phrases that I'll be using as the running example on this page
phrases = [
'a penny saved is a penny earned',
'the quick brown fox jumps over the lazy dog',
'beauty is in the eye of the beholder',
'early to bed and early to rise makes a man healthy wealthy and wise',
'give credit where credit is due',
"if at first you don't succeed try try again",
'justice delayed is justice denied',
'keep your friends close and your enemies closer',
'no pain no gain',
'quickly come quickly go',
'united we stand divided we fall',
'when in rome do as the romans do'
]
TF - term frequency#
Term frequency is a metric computed for each word of a given text. It can be calculated using the formula:

$$tf(t, d) = \frac{n_t}{\sum_i n_i}$$
Where:
\(t\) - some word;
\(d\) - some text;
\(n_t\) - number of occurrences of word \(t\) in document \(d\);
\(\sum_i n_i\) - number of words in text \(d\).
In the following cell I calculate the term frequencies of the words for each phrase. The result is a table with an "Original phrase" column and a "Term frequency" column; the latter lists, for each word of the original phrase, its \(tf\) in the form <word> - <tf>.
So let’s take the logic of the first phrase - “a penny saved is a penny earned” - one step at a time:
Total count of words - \(\sum_i n_i = 7\);
You can find the word “a” twice in the phrase so - \(n_{'a'} = 2 \Rightarrow tf('a')=\frac{2}{7} \approx 0.29\);
You can find the word “penny” twice in the phrase so - \(n_{'penny'}=2 \Rightarrow tf('penny')= \frac{2}{7} \approx 0.29\);
All other words occur once, so \(tf\) for them can be computed as \(\frac{1}{7} \approx 0.14\).
html_table = "<tr><th>Original phrase</th><th>Terms frequency</th></tr>"
tf_dict = {}
for p in phrases:
words_in_phrase = dict(Counter(p.split()))
words_count = sum(words_in_phrase.values())
phrase_tfs = {word:number/words_count for word, number in words_in_phrase.items()}
tf_dict[p] = phrase_tfs
tf_dict
counts_line = "<br>".join(
[
key + " - " + str(round(value, 2))
for key, value in phrase_tfs.items()
]
)
html_table += f"<tr><td>{p}</td><td>{counts_line}</td></tr>"
HTML("<table>" + html_table + "</table>")
Original phrase | Term frequency |
---|---|
a penny saved is a penny earned | a - 0.29, penny - 0.29, saved - 0.14, is - 0.14, earned - 0.14 |
the quick brown fox jumps over the lazy dog | the - 0.22, quick - 0.11, brown - 0.11, fox - 0.11, jumps - 0.11, over - 0.11, lazy - 0.11, dog - 0.11 |
beauty is in the eye of the beholder | beauty - 0.12, is - 0.12, in - 0.12, the - 0.25, eye - 0.12, of - 0.12, beholder - 0.12 |
early to bed and early to rise makes a man healthy wealthy and wise | early - 0.14, to - 0.14, bed - 0.07, and - 0.14, rise - 0.07, makes - 0.07, a - 0.07, man - 0.07, healthy - 0.07, wealthy - 0.07, wise - 0.07 |
give credit where credit is due | give - 0.17, credit - 0.33, where - 0.17, is - 0.17, due - 0.17 |
if at first you don't succeed try try again | if - 0.11, at - 0.11, first - 0.11, you - 0.11, don't - 0.11, succeed - 0.11, try - 0.22, again - 0.11 |
justice delayed is justice denied | justice - 0.4, delayed - 0.2, is - 0.2, denied - 0.2 |
keep your friends close and your enemies closer | keep - 0.12, your - 0.25, friends - 0.12, close - 0.12, and - 0.12, enemies - 0.12, closer - 0.12 |
no pain no gain | no - 0.5, pain - 0.25, gain - 0.25 |
quickly come quickly go | quickly - 0.5, come - 0.25, go - 0.25 |
united we stand divided we fall | united - 0.17, we - 0.33, stand - 0.17, divided - 0.17, fall - 0.17 |
when in rome do as the romans do | when - 0.12, in - 0.12, rome - 0.12, do - 0.25, as - 0.12, the - 0.12, romans - 0.12 |
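As a quick sanity check on the first row, the same numbers can be reproduced with a minimal standalone sketch using only the standard library:

from collections import Counter

phrase = "a penny saved is a penny earned"
counts = Counter(phrase.split())  # {'a': 2, 'penny': 2, 'saved': 1, 'is': 1, 'earned': 1}
total = sum(counts.values())      # 7 words in total

tf = {word: n / total for word, n in counts.items()}
print({word: round(value, 2) for word, value in tf.items()})
# {'a': 0.29, 'penny': 0.29, 'saved': 0.14, 'is': 0.14, 'earned': 0.14}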
IDF - inverse document frequency#
Inverse document frequency is computed for each word with respect to the whole set of texts. It can be calculated using the formula:

$$idf(t, D) = \log \frac{\left| D \right|}{\left| \left\{ d_i \in D | t \in d_i \right\} \right|}$$
Where:
\(D\) - set of texts;
\(\left| A \right|\) - number of elements in the set \(A\);
\(\left| \left\{ d_i \in D | t \in d_i \right\} \right|\) - number of documents \(d_i\) from set \(D\) that contains word \(t\);
Note that the denominator of the formula contains the number of documents in which the word appears, not the total number of occurrences of the word across all documents. Since it always holds that \(\left| D \right| \geq \left| \left\{ d_i \in D | t \in d_i \right\} \right|\), we get \(idf(t,D) \geq 0\).
The following cell shows an example of calculating \(idf\) for the set of phrases. The result is a table that lists every word from the set, the number of documents that contain the word, and the word's \(idf\).
Take the word "the" for example: it occurs in 3 of the 12 texts, so its \(idf = \log(\frac{12}{3}) \approx 1.39\).
phrases_number = len(phrases)
# for each word, count the number of phrases that contain it
word_in_documents = Counter([w for p in phrases for w in set(p.split())])
words_idf = {}

# we need to transform directly to MathML here,
# because quarto doesn't recognise $$ patterns in
# output cells
math_jax_expression = latex2mathml.converter.convert(
    r"\left|\left\{d_i \in D | t \in d_i\right\}\right|"
)
html_table = (
    "<tr><th>Word</th>"
    f"<th>{math_jax_expression}</th>"
    "<th>Inverse document frequency</th></tr>"
)

for word, number in word_in_documents.items():
    # idf = log(total number of documents / documents containing the word)
    idf = log(phrases_number / number)
    words_idf[word] = idf
    html_table += (
        f"<tr><td>{word}</td>"
        f"<td>{number}</td>"
        f"<td>{round(idf, 2)}</td></tr>"
    )

HTML("<table>" + html_table + "</table>")
Word | Documents containing the word | Inverse document frequency |
---|---|---|
saved | 1 | 2.48 |
a | 2 | 1.79 |
penny | 1 | 2.48 |
earned | 1 | 2.48 |
is | 4 | 1.1 |
over | 1 | 2.48 |
jumps | 1 | 2.48 |
brown | 1 | 2.48 |
dog | 1 | 2.48 |
fox | 1 | 2.48 |
quick | 1 | 2.48 |
lazy | 1 | 2.48 |
the | 3 | 1.39 |
beauty | 1 | 2.48 |
beholder | 1 | 2.48 |
in | 2 | 1.79 |
of | 1 | 2.48 |
eye | 1 | 2.48 |
early | 1 | 2.48 |
makes | 1 | 2.48 |
man | 1 | 2.48 |
healthy | 1 | 2.48 |
wealthy | 1 | 2.48 |
rise | 1 | 2.48 |
bed | 1 | 2.48 |
to | 1 | 2.48 |
and | 2 | 1.79 |
wise | 1 | 2.48 |
where | 1 | 2.48 |
due | 1 | 2.48 |
credit | 1 | 2.48 |
give | 1 | 2.48 |
you | 1 | 2.48 |
again | 1 | 2.48 |
don't | 1 | 2.48 |
succeed | 1 | 2.48 |
if | 1 | 2.48 |
try | 1 | 2.48 |
first | 1 | 2.48 |
at | 1 | 2.48 |
denied | 1 | 2.48 |
delayed | 1 | 2.48 |
justice | 1 | 2.48 |
keep | 1 | 2.48 |
close | 1 | 2.48 |
your | 1 | 2.48 |
enemies | 1 | 2.48 |
closer | 1 | 2.48 |
friends | 1 | 2.48 |
pain | 1 | 2.48 |
gain | 1 | 2.48 |
no | 1 | 2.48 |
come | 1 | 2.48 |
quickly | 1 | 2.48 |
go | 1 | 2.48 |
fall | 1 | 2.48 |
united | 1 | 2.48 |
we | 1 | 2.48 |
divided | 1 | 2.48 |
stand | 1 | 2.48 |
romans | 1 | 2.48 |
as | 1 | 2.48 |
do | 1 | 2.48 |
when | 1 | 2.48 |
rome | 1 | 2.48 |
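As a quick check of the row for "the", here is a minimal sketch that recomputes its \(idf\) directly from the phrases list defined above:

from math import log

docs_with_the = sum(1 for p in phrases if "the" in p.split())
print(docs_with_the)                                # 3
print(round(log(len(phrases) / docs_with_the), 2))  # log(12 / 3) ≈ 1.39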
This metric reflects how widespread a word is across the whole collection: the more texts contain it, the lower its \(idf\). If a word is really common, there's a high probability that it's just an article, a preposition or something similar that carries little meaning on its own. Consider the extreme case: if a word \(t'\) occurs in every text, then \(\frac{\left| D \right|}{\left| \left\{ d_i \in D | t' \in d_i \right\} \right|} = 1 \Rightarrow \log \frac{\left| D \right|}{\left| \left\{ d_i \in D | t' \in d_i \right\} \right|} = 0\) - the presence of the word \(t'\) in no way helps to distinguish one text from another.
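To make the extreme case concrete, here is a tiny hypothetical corpus (the three documents below are made up for illustration) where the word "the" appears in every text, so its \(idf\) is exactly zero:

from math import log

toy_docs = ["the cat sleeps", "the dog barks", "the bird sings"]
docs_with_word = sum(1 for d in toy_docs if "the" in d.split())
print(log(len(toy_docs) / docs_with_word))  # log(3 / 3) = 0.0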
TF-IDF#
\(tf_{idf}\) is the final metric of the TF-IDF analysis and is calculated as the product of TF and IDF:

$$tf_{idf}(t, d, D) = tf(t, d) \cdot idf(t, D)$$

So for each word in each text from the set of texts we compute its own value \(tf_{idf}\). In the following cell I combine the results of the two previous sections to compute \(tf_{idf}\). For example, for the word "a" in the phrase "a penny saved is a penny earned": \(tf_{idf} = 0.29 \cdot 1.79 \approx 0.51\).
html_table = (
    f"""<tr>
    <th>Phrase</th>
    <th>{latex2mathml.converter.convert("tf_{idf}")}</th>
    </tr>"""
)

for phrase, tfs in tf_dict.items():
    # tf-idf of each word is its tf within the phrase times its idf
    phrase_tf_idf_line = "<br>".join([
        (
            word + " - " +
            str(round(words_idf[word] * tf, 2))
        )
        for word, tf in tfs.items()
    ])
    html_table += f"<tr><td>{phrase}</td><td>{phrase_tf_idf_line}</td></tr>"

HTML("<table>" + html_table + "</table>")
Phrase | \(tf_{idf}\) |
---|---|
a penny saved is a penny earned | a - 0.51, penny - 0.71, saved - 0.35, is - 0.16, earned - 0.35 |
the quick brown fox jumps over the lazy dog | the - 0.31, quick - 0.28, brown - 0.28, fox - 0.28, jumps - 0.28, over - 0.28, lazy - 0.28, dog - 0.28 |
beauty is in the eye of the beholder | beauty - 0.31, is - 0.14, in - 0.22, the - 0.35, eye - 0.31, of - 0.31, beholder - 0.31 |
early to bed and early to rise makes a man healthy wealthy and wise | early - 0.35, to - 0.35, bed - 0.18, and - 0.26, rise - 0.18, makes - 0.18, a - 0.13, man - 0.18, healthy - 0.18, wealthy - 0.18, wise - 0.18 |
give credit where credit is due | give - 0.41, credit - 0.83, where - 0.41, is - 0.18, due - 0.41 |
if at first you don't succeed try try again | if - 0.28, at - 0.28, first - 0.28, you - 0.28, don't - 0.28, succeed - 0.28, try - 0.55, again - 0.28 |
justice delayed is justice denied | justice - 0.99, delayed - 0.5, is - 0.22, denied - 0.5 |
keep your friends close and your enemies closer | keep - 0.31, your - 0.62, friends - 0.31, close - 0.31, and - 0.22, enemies - 0.31, closer - 0.31 |
no pain no gain | no - 1.24, pain - 0.62, gain - 0.62 |
quickly come quickly go | quickly - 1.24, come - 0.62, go - 0.62 |
united we stand divided we fall | united - 0.41, we - 0.83, stand - 0.41, divided - 0.41, fall - 0.41 |
when in rome do as the romans do | when - 0.31, in - 0.22, rome - 0.31, do - 0.62, as - 0.31, the - 0.17, romans - 0.31 |
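The number for "a" in the first phrase can also be reproduced from scratch, a minimal check based on the definitions above:

from math import log

tf_a = 2 / 7                   # "a" occurs twice among the 7 words of the phrase
idf_a = log(12 / 2)            # "a" appears in 2 of the 12 phrases
print(round(tf_a * idf_a, 2))  # 0.51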
For each text, you can then aggregate its per-word \(tf_{idf}\) scores into a fixed set of features; common aggregations are the maximum and the average, as sketched below.
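Here is a minimal sketch of such aggregations, reusing the tf_dict and words_idf dictionaries computed above:

for phrase, tfs in tf_dict.items():
    # recompute tf-idf per word, then aggregate over the phrase
    scores = [tf * words_idf[word] for word, tf in tfs.items()]
    maximum = max(scores)
    average = sum(scores) / len(scores)
    print(f"{phrase!r}: max={maximum:.2f}, avg={average:.2f}")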