Skip to main content
Ctrl+K
Knowledge base - Home
  • Knowlege

Data science

  • Gradient descent
  • Decision tree
  • Model selection
    • Regularization
  • Metrics
    • Cross entropy
    • Confusion matrix
    • Pearson correlation coefficient
    • CAP curve
    • GINI coeficient
    • Jaccard index
  • Ranking task
    • Collaborative filtering
      • Prediction functions
      • Selfmade memory based
    • Task generation
    • Metrics
      • precision@k
      • recall@k
      • AP@k (average precision)
  • NLP
    • Preprocessing
    • TF-IDF
    • Text CNN
    • Recurrent text gen
    • Recurrent translations
  • DL mechanisms
    • Recurrent

Math

  • Intorduction
  • Geometry
    • Triangle
    • Cone
  • Trigonometry
    • Identities
  • MatStat
    • Probability
      • Сonditional probability
    • Central limit theorem
    • Maximum likelihood
    • Statistical testing
      • Binomial test
  • Latex

Algorithms

  • Intro
  • Complexity
  • Binary search
  • Event sort
  • Data structures
    • Binary search trees
  • Dynamic programming
  • Quick sort
  • Specific tasks
    • Emergency task
    • Sinling islands
    • Number reversal

Docker

  • Overview
  • Containers
    • Run
    • Execute in
    • Check logs
  • Images
    • Build image
      • Build context
      • Ignore files
    • Dockerfile directives
      • Execute with start
  • Volumes
  • Networks
    • Default networks
    • Containers communication
    • Local network
  • Acesss to containers
  • Compose
    • Up/down
    • Service configuration

Git

  • Overview
  • File statuses
    • Commit
    • Remove from staging
  • Difference
  • Checkout
  • Branches
    • Merge
    • Remotes
  • Log
  • Previous commit ~
  • How git ignore works
  • Stash
  • Configuration
    • Variables

SQL

  • Intro
  • Select
    • JOIN
      • Join types
    • Sorting (ORDER BY)
    • Unique values (DISTINCT)
    • Empty values
    • Aggregation functions
      • Collect array
    • Group by
      • Conditions on aggregats (HAVING)
    • Window functions (OVER)
      • Functions
        • Values shifting (LAG/LEAD)
    • Expand array
    • Conditional (CASE)
    • Records range (OFFSET/LIMIT)
    • Performance
  • Operating tables
    • Tables properties
  • Data types
    • Date and time
      • Date components (DATE_PART)
      • Add period
    • JSON
  • Load/Modify
  • Versions issues
  • Python interaction
    • Python from container
    • Python on host
    • Create database

Layout

  • Terminology
  • HTML
    • Other
    • Text format
    • Tables
    • List tags
    • SVG
      • Text
      • Path
  • CSS
    • Other
    • Style application
    • Margin (out space)
    • Padding (in space)
    • Overflow
    • Elements position (display)
  • Forms

Linux

  • Intro
  • Bash
  • Manage input/output
  • Filesystem
    • Files and directories
    • Archiving
  • Variables
  • System information
  • Backup
  • Network
    • Curl
    • SSH
    • Netcat (nc)
    • VPN
  • Users & groups
    • Users manipulations
  • Package management
  • Path
  • Cron
  • GnuPG

Other

  • VS Code
  • Python with apache
  • Kubernetes
  • Redis
    • Zset
  • Nginx
    • Reverse proxy
      • Proxy pass
      • Cache
    • Variables
    • Server
    • Configuration
    • Logs
    • Connections management
  • Gitlab
  • Ansible
  • Java script
  • Go templates
  • .ipynb

Binary search

Contents

  • Formal description
  • Last entry
  • Search by answer

Binary search#

Binary search is an algorithm used to find a specific element within an ordered sequence. At each step, it divides the sequence into two equal sub-sequences and continuest searching in only one of them. This page provides an accurate description.

from typing import Any, Callable

Formal description#

Suppose you are processing the sequence: \(x_1, x_2, \ldots, x_{k-1}, x_k, x_{k+1}, \ldots, x_{n-1}, x_n\). The element you are trying to find is \(x_k\).

In this context, the fact that the sequence is ordered means that one can define a function \(f\) such that it takes the value 0 for all elements before \(x_k\) and 1 for all elements from \(x_k\) onward. More formally,

\[\begin{split} f(x_i) = \begin{cases} 0, & i \in \{1, \ldots, k-1\} \\[6pt] 1, & i \in \{k, \ldots, n\}. \end{cases} \end{split}\]

Define variables that indicate the beginning and the end of the interval beingconsidered: \(l_0 = 0, r_0 = n\).

At each step, the middle point of the interval is counted: \(p_j = \left\lfloor \frac{l_j + r_j}{2} \right\rfloor\). Depending on the value of \(f(x_{p_j})\), we decide whether to move \(l\) or \(r\):

\[\begin{split}r_j = \begin{cases} p_{j - 1}, f(x_{p_j}) = 1 \\ r_{j - 1}, f(x_{p_j}) = 0. \end{cases} \end{split}\]
\[\begin{split} l_j = \begin{cases} p_{j - 1}, f(x_{p_j}) = 0 \\ l_{j - 1}, f(x_{p_j}) = 1. \end{cases} \end{split}\]

Procedure continues until \(l_j + 1 < r_j\). The fact \(l_j + 1 = r_j\) garantees that \(x_{r_j} = x_k\)

Note: A very common mistake in binary search is to forget to add \(1\) in the stopping condition \(l_j + 1 < r_j\). Omitting it can lead to an infinite loop. At some point, you may reach the situation where \(l_j = k - 1\) and \(r_j = k\). Then:

\[ p_j = \left\lfloor \frac{l_j + r_j}{2} \right\rfloor = \left\lfloor \frac{(k - 1) + k}{2} \right\rfloor = \left\lfloor k - 1 + 0.5 \right\rfloor = k - 1. \]

So,

\[ f(x_{k - 1}) = 0 \quad \Longrightarrow \quad r_{j + 1} = r_j, \quad l_{j + 1} = p_j = k - 1 = l_j. \]

As a result, \(r_j = r_{j - 1}\) and \(l_j = l_{j - 1}\) — the search interval does not shrink, and the loop runs forever.


The following cell defines a generalised implemention of the binary search:

def first_entry(sequence: list[Any], fun: Callable[[int], int]) -> int:
    '''
    Implementation of the first entry binary search.

    Parameters
    ----------
    sequence: list[Any]
        Sequence of elements to be processed.
    fun: Callable
        Function that takes value True for appropriate side of the sequence.
    
    Returns
    -------
    The first index when fun takes value True.
    '''
    l, r = -1, len(sequence)
    while l + 1 < r:
        p = (l + r) // 2
        if fun(sequence[p]):
            l = p
        else:
            r = p
    return r

The following cell shows how the function implemented earlier used to find a specific number (12) in a list of integers.

first_entry(
    [1, 7, 9, 10, 12, 12, 16, 21],
    lambda x: x < 12
)
4

Last entry#

Binary search algorithm in different sources can be defined slightly different, but important is the one option - last occurance approach.

The previously defined algorithm, in the case where there are multiple elements with the same value, \(x_k = x_{k+1} = x_{k+2} = \ldots\), will return \(k\), which corresponds to the first occurrence of an element for which \(f(x) = 1\).

The widely accepted alternative last occurrence approach sugests that \(f\) is defined as:

\[\begin{split} f(x_i) = \begin{cases} 1, & i \in \{1, \ldots, k\} \\[6pt] 0, & i \in \{k + 1, \ldots, n\}. \end{cases} \end{split}\]

Accordingly, the following alterations are expected to be implemented in the mechanisms employed for the computation of interval borders and the midpoint of an interval:

\[ p_j = \left\lceil \frac{l_j + r_j}{2} \right\rceil \]
\[\begin{split}r_j = \begin{cases} p_{j - 1}, f(x_{p_j}) = 0 \\ r_{j - 1}, f(x_{p_j}) = 1. \end{cases} \end{split}\]
\[\begin{split} l_j = \begin{cases} p_{j - 1}, f(x_{p_j}) = 1 \\ l_{j - 1}, f(x_{p_j}) = 0. \end{cases} \end{split}\]

Such changes occur in the final steps of the algorithm:

  • When \(\frac{l_j + r_j}{2} = k + \varepsilon\), where \(\varepsilon \in (0, 1)\), we have \(p_j = \lceil k + \varepsilon \rceil = k + 1\). Since \(f(x_{k+1}) = 0\), the right boundary is moved to the \((k+1)\)-th position: \(r_j = k + 1\).

  • When \(\frac{l_j + r_j}{2} = k - 1 + \varepsilon\), where \(\varepsilon \in (0, 1)\), we have \(p_j = \lceil k - 1 + \varepsilon \rceil = k\). Since \(f(x_k) = 1\), the left boundary is moved to the \(k\)-th position: \(l_j = k\).

After all the steps, we have \(l_j = k\) and \(r_j = k + 1\). Even if there are multiple identical values, \(x_{k - l} = x_{k - l + 1} = \ldots = x_{k - 1} = x_k\), the algorithm will still point, by the final left boundary, to the \(k\)-th element — the first occurrence of the desired element.


The next code defines universal implementation of the last entry approach:

def last_entry(sequence: list[Any], fun: Callable[[int], int]):
    '''
    Implementation of the last entry binary search.

    Parameters
    ----------
    sequence: list[Any]
        Sequence of elements to be processed.
    fun: Callable
        Function that takes value True for appropriate side of the sequence.
    
    Returns
    -------
    The last index when fun takes value True.
    '''

    l, r = -1, len(sequence)
    while l + 1 < r:
        p = (l + r + 1) // 2
        if fun(sequence[p]):
            l = p
        else:
            r = p
    return l

Following cell shows, use of the function defined earlier to find index of the 12 value in the list:

last_entry(
    [5, 7, 9, 12, 12, 14, 16, 18, 21, 22],
    lambda x: x <= 12
)
4

It returns 4, which is the second entry of the value we are looking for.

Search by answer#

The binary search by answer is an important method for solving a set of tasks.

It is used for tasks that require finding the minimum or maximum value under certain conditions. The conditions must be able to be represented as a binary function that changes value once at a certain point.

More formally, for the binary function \(f(x)\) is required to find \(\min_{f(x)=True}{(x)}\).


The best way to demonstrate this approach is to show the tasks that can be solved using this approach.

Given a sequence of numbers ordered in ascending order: \(a_1, a_2, \ldots, a_n\). Required to find \(L\) non-overlapping \(c\)-elements subsets of numbers such that the maximum difference is minimal:

\[\Delta = \min_{\{i_1, i_2, \ldots, i_L\}}\left[\max_j \left( a_{i_j + c} - a_{i_j} \right)\right]\]

Here, \(i_j, j=\overline{1,n}\) is the first index of the \(j\)-th group.

The binary search solution involves searching for smallest \(\Delta\) that allows the formation of \(L\) non-overlapping \(c\)-element subsets.

The following cell realizes the function that checks whether it is possible to find such subsets for given \(\Delta\) (D).

def check_groups(lst: list[int], L: int, c: int, D: int) -> bool:
    """
    Check if it possible to form L groups of c elements
    """
    i = 0
    counter = 0
    lst_len = len(lst)
    while i < lst_len - c + 1:
        if lst[i + c - 1] - lst[i] <= D:
            counter += 1
            i += c
        else:
            i += 1
        if counter >= L:
            return True
    return False

Consider set of numbers \(\left\{1, 5, 15, 20, 20\right\}\) and parameters \(L=2\), \(c=2\). For \(\Delta=4\) it still possible to find subsets \(\left\{1, 5\right\}, \left\{20, 20\right\}\):

check_groups(
    lst=[1, 5, 15, 20, 20], L=2, c=2, D=4
)
True

However, for \(\Delta = 1\), the subset \(\left\{1, 5\right\}\) is incorrect - it’s impossible to find the corresponding subsets.

check_groups(
    lst=[1, 5, 15, 20, 20], L=2, c=2, D=1
)
False

Now, just binserach for the first \(\Delta\) that corresponds to the requirements:

def solution(lst: list[int], L: int, c: int) -> int:
    left, right = 0, len(lst)
    while left + 1 < right:
        D = (left + right) // 2
        if check_groups(lst=lst, L=L, c=c, D=D):
            right = D
        else:
            left = D
    return right
solution(lst=[1, 5, 15, 20, 20], L=2, c=2)
4

previous

Complexity

next

Event sort

Contents
  • Formal description
  • Last entry
  • Search by answer

By Fedor Kobak

© Copyright 2023.