A Common Sense Guide to Data Structures and Algorithms - Jay Wengrow

Note

This is one of the best beginner-level books on the topic that I’ve encountered; the sections on Big O and algorithmic performance in particular come to mind. I don’t know that I’d recommend rereads of the book, but I would most certainly recommend that anyone new to the topic read it end to end, with a pen in tow to make notes right in the book.

It’s that good.

Chapter 1 - Why Data Structures Matter

The Array

An array is a contiguous dataset. While we might say, in Python:

some_array = [10, 21, 2, 0, 8, 72, 8]
print(some_array[3])

Internally, the machine stores the length of the array, and the position of the first item.

Then, the computer can access the nth item by offsetting from the first. So to get the first item, you offset 0, to get the second, you offset 1.

This is why you get the 0-based indexing everywhere. Makes a lot of sense now that I think about it. I wonder how Visual Basic manages the 1-based index.

So for the computer, accessing an item in an array is a single step. If you ask for the nth item, all the program needs to do is offset the memory position by n and return the value.
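As a toy model of that offset arithmetic (the helper name and the 8-byte item size are my own illustration, not anything from the book):

```python
def item_address(base_address, index, item_size=8):
    """The nth item lives at base + n * item_size; index 0 is the first item."""
    return base_address + index * item_size


# If the array starts at address 1000, the first item (index 0) is at 1000
# and the fourth item (index 3) is at 1024. One multiplication, one addition:
# a single step, regardless of how large the array is.
```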

Note

If n >= the array’s length, then you have an out-of-bounds access: the program is trying to access memory that is not its own. It’s like trying to overdraw from your bank vault and reaching into the vault next to yours.

In a data structure, you perform the following operations:

  1. Read

  2. Search

  3. Insert

  4. Delete

How a data structure is organized determines the speed of these operations. This is why some data structures are suitable for some usages, while others are not.

Note

This book’s section on data structures and why Array indexing is O(1) is quite good. Do give it a reread.

For an array of size N:

  1. Getting item at index n: 1 step

  2. Setting item at index n: 1 step

  3. Appending 1 item to the end: 1 step

  4. Prepending 1 item to the beginning: N+1 steps

  5. Deleting an item in the list: N steps (1 step is the best case scenario, when you delete the last item)
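To see why deletion costs up to N steps: every element after the deleted one shifts left. `delete_at` is my own illustrative helper, not a real API:

```python
def delete_at(arr, index):
    """Delete arr[index] by shifting every later element one slot left."""
    for i in range(index, len(arr) - 1):
        arr[i] = arr[i + 1]  # one step per shifted element
    arr.pop()                # drop the now-duplicated last slot


# Deleting index 1 of a 5-element list shifts 3 elements, then drops the
# tail. Deleting the last item shifts nothing: the best case of 1 step.
```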

Tip

For algorithmic speed measurement, always be a pessimist. Consider only the worst case scenario.

The Set

A set is an array of unique values. There are no duplicates.

Note

This is a generalization, but we’ll get to that in time.

Most set operations are identical to arrays. However, insertion is where sets and arrays differ.

Appending or prepending an item into a set of N items will take at least N steps. That’s because a set needs to fulfil its prime objective: no duplicates. To insert an item i into a set of N items, you first need to search the entire set for an occurrence of i, and only when it doesn’t exist do you insert it into the set.
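A sketch of that search-then-insert behaviour, using a plain list as the backing store (`set_insert` is my own hypothetical helper):

```python
def set_insert(items, value):
    """Insert value only if absent: up to N comparison steps, then 1 append."""
    for existing in items:   # up to N steps of searching
        if existing == value:
            return False     # duplicate found; prime objective upheld
    items.append(value)      # the single insertion step
    return True
```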

Chapter 2 - Why Algorithms Matter

While data structures have their own performance attributes, so do algorithms in general.

Ordered Arrays

An ordered array is almost identical to the classic array, with one caveat: in an ordered array, the items are kept in sorted order.

This has its benefits, but let’s look at its performance.

  1. Read - 1 step

  2. Insertion - N + 2 steps

  3. Deletion - N steps

Insertion again takes N + 2 steps because the item needs to be inserted into the right location. You cannot just insert an item at the very end.
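Roughly, in Python (my own sketch; the actual shifting is delegated to `list.insert` here):

```python
def ordered_insert(arr, value):
    """Insert value so that arr stays sorted: search for the slot, then shift."""
    index = 0
    while index < len(arr) and arr[index] < value:
        index += 1               # comparisons to find the right location
    arr.insert(index, value)     # shifts everything after index one slot right
```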

Searching Ordered Arrays

While searching normal arrays is a matter of looking at each and every item, this is not necessarily so with an ordered array. Instead, we can use Binary Search.

Note

All source code is placed in its respective topic folder.

Binary Search halves the remaining search space each time it compares a value to the search value, so there will be \(log_2N\) steps involved. It narrows the search down by splitting it in two each time, hence the name.
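A minimal iterative Binary Search sketch (my own, not the book’s listing):

```python
def binary_search(arr, target):
    """Search a sorted array; return the index of target, or None."""
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1    # discard the lower half
        else:
            high = mid - 1   # discard the upper half
    return None
```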

Tip

To understand what \(log_2N\) means, or what \(log_MN\) means, read The Algorithm Design Manual’s chapter on Logarithms.

Q: For an ordered array with 100 elements, binary search takes 7 steps. How many steps does it take for 200 elements?

A: 8 steps

Tip

Every time the number of elements doubles, the steps involved in Binary Search increases by 1.

Tip

The Logarithm of N to the base M shows how many times N can be divided by M.

\(log_2N\) shows how many times N can be halved.

\(log_{10}N\) shows how many times N can be divided into tenths.
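That “how many times can N be divided” framing translates directly into code (my own sketch):

```python
def times_divided(n, base=2):
    """How many times n can be divided by base before reaching 1."""
    count = 0
    while n > 1:
        n //= base
        count += 1
    return count


# times_divided(8) is 3, matching log2(8); times_divided(100, 10) is 2,
# matching log10(100).
```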

Chapter 3 - O Yes! Big O Notation

Note

This book ignores most of the mathematical basis for Big O, and explains stuff rather simply. I recommend this section a lot. However, for a mathematical explanation, read The Algorithm Design Manual’s sections on Big O.

Big O measures how many steps an algorithm will take in relation to its input size.

The key question: if there are N data elements, how many steps will the algorithm take?

\(O(N)\) says that the answer to the key question is that the algorithm will take N steps.

Warning

If you have preconceived notions about Big O, please leave them at the door, for now.

Consider the following:

def do_something(arr):
    length = len(arr)
    mid = arr[length // 2]
    print(f"{length=}, {mid=}")

How many steps does this code example take?

The answer is 3.

What is that in Big O notation?

\(O(1)\).

The soul of Big O is what Big O is truly concerned about: how will an
algorithm’s performance change as the data increases?

Big O doesn’t care how many steps the algorithm actually takes. It cares about how the number of steps changes as the input data changes.

\(O(1)\) and \(O(3)\) mean the same to Big O. This algorithm has no relationship with the input data. It will not get slower or faster given more or less data.

\(O(N)\) is a different matter, however. It means that the algorithm has a linear relationship with the data.

def do_something(arr):
    length = len(arr)
    mid = arr[length // 2]
    print(f"{length=}, {mid=}")
    for element in arr:
        print(f"{element=}")
        print(f"{2*element=}")

Here, the algorithm takes \(2N+3\) steps. \(2N+3\) is a linear polynomial. Wait, I promised no mathematics: it means it is a statement that can be drawn as a straight line on a graph.

To Big O, the number next to \(N\) does not matter. Neither does the value 3 that is at the end.

In Big O terms, the above program has a runtime of \(O(N)\).

Big O is also a worst-case-scenario measure. For an algorithm, always consider the worst case scenario when denoting it in Big O terms. Linear search might take 1 step if your first item is the item you were looking for, but in an array of size N, if it’s the last item you’re looking for, you need N steps. Big O in this scenario is \(O(N)\), not \(O(1)\).

Big O concerns itself with Logarithms quite a lot. Logarithms are the inverse of exponents.

\(2^3\) is 8. Then, \(log_28\) is 3.

\(10^2\) is 100. Then, \(log_{10}100\) is 2.

An alternative way of thinking about logarithms is: how many times do we need to halve a number (or divide it into thirds, or tenths, depending on the base \(M\) of \(log_MN\)) to get 1?

\(O(log_2N)\), for brevity: \(O(logN)\), measures how many times we need to divide a dataset into halves to arrive at our result in the worst case scenario.

Chapter 4 - Speeding Up Your Code with Big O

Note

Read up on the implementation of Bubble Sort.

Bubble Sort has the following steps: comparisons, and swaps. Note how many times it loops through an array.

No, not twice. For each element, it loops through the entire array. In fact, it doesn’t know the array is sorted until it completes a full passthrough without making a single swap.
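My sketch of Bubble Sort, using the usual “stop when a full passthrough makes no swaps” formulation:

```python
def bubble_sort(arr):
    """Sort arr in place by repeatedly swapping adjacent out-of-order pairs."""
    unsorted_until = len(arr) - 1
    swapped = True
    while swapped:
        swapped = False
        for i in range(unsorted_until):
            if arr[i] > arr[i + 1]:                      # comparison
                arr[i], arr[i + 1] = arr[i + 1], arr[i]  # swap
                swapped = True
        unsorted_until -= 1   # the largest value has bubbled to the end
```

Each passthrough bubbles the largest remaining value to the end, and the array is only known to be sorted once a passthrough makes zero swaps.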

Welcome to \(O(N^2)\). You do not want to be here.

Any polynomial whose highest power is 2 is a quadratic polynomial. On a graph it is a parabola, but not one that ever comes back down to the ground like a thrown ball. No, this is an upwards-facing parabola. That means if you have 1 million data points, your algorithm will need on the order of \(1000000^2\) steps. A million squared.

Any algorithm wherein, for each element, you loop through the entire list (or portions of the list, with a worst-case scenario of looping through the entire list) is of this order. If you can avoid it, do not implement one of these.

\(O(N^3)\) exists. Just pray that you do not meet him or his cousins.

Chapter 5 - Optimizing Code With and Without Big O

Note

Read up on the implementation of Selection Sort.

In Selection Sort, you loop through the array to find the smallest item and swap it with the first item. This first item is, in fact, a cursor which moves ahead and marks the sorted part of the array.

For each item in the array, Selection Sort looks ahead at the rest of the items in the array, comparing them and seeing if there’s something smaller than this item. It then selects the smallest item, if it is not the current item, and swaps it with the current item.

This way, after the first passthrough, the smallest item is right at the front. The algorithm ensures that at the Nth passthrough, the first N items are in order at the front of the array.
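My sketch of Selection Sort along those lines:

```python
def selection_sort(arr):
    """Sort arr in place: each pass selects the smallest remaining item."""
    for i in range(len(arr) - 1):          # i is the cursor
        smallest = i
        for j in range(i + 1, len(arr)):   # look ahead at the rest
            if arr[j] < arr[smallest]:
                smallest = j
        if smallest != i:                  # at most one swap per pass
            arr[i], arr[smallest] = arr[smallest], arr[i]
```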

So let’s look at Selection Sort and Bubble Sort together.

Selection Sort has to look at the rest of the items for each item in the list.

Bubble Sort has to look at the entire list for each item in the list.

While this seems straightforward, and no one will fault you for saying Selection Sort is faster (about half the steps) than Bubble Sort, this does not matter to Big O.

But here’s the funny thing in the world of Big O Notation: Selection Sort
and Bubble Sort are described in exactly the same way.

Tip

Remember to leave your knowledge or preconceived notions of Big O at the door. All you need to do is check this:

Does the algorithm have nested loops?

The exact time does not matter.

Why?

Big O Notation ignores constants.

It is easy to say Selection Sort is a \(O(N^2/2)\) algorithm. It would seem common sense.

However, consider the fact that the length of the array could grow astronomically. If you have an infinitely large array, then it does not matter.

\(O(N^2/2)\) is just \(O(N^2)\).

\(O(N/2+50000)\) is just \(O(N)\).

\(O(100000)\) is just \(O(1)\).

Big O Categories

Big O ignores the constants because it uses categories.

Big O Notation only concerns itself with general categories of algorithm speeds.
If we were to compare two buildings, one of which is a single-family home and
one of which is a skyscraper, it becomes almost moot to mention how many floors
each one has, because the two buildings are so incredibly different in size and function.

This is the idea behind Big O. It categorizes algorithms into buckets wherein the bucket itself is a definitive measure of how quickly the algorithm grows with input data.

It won’t matter if your \(O(1)\) algorithm takes 10 minutes to run or if your \(O(N^2)\) takes 1 \({\mu}s\), at least not to Big O, because it doesn’t care how long each step takes, only how many steps there will be in relation to larger data set sizes.

Talking about \(O(2N)\) when compared to \(O(N^2)\) is like talking
about a two-story house compared to a skyscraper.

Chapter 6 - Optimizing for Optimistic Scenarios

[T]he worst-case scenario isn’t the only scenario worth considering.
Insertion Sort is a sorting algorithm that reveals the power of analyzing
scenarios beyond the worst-case.

Insertion Sort involves: removals, comparisons, shifts, and insertions.
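A sketch of Insertion Sort with those four operations called out in comments (my own, not the book’s listing):

```python
def insertion_sort(arr):
    """Sort arr in place by inserting each item into the sorted prefix."""
    for i in range(1, len(arr)):
        temp = arr[i]                      # removal: lift the value out
        j = i - 1
        while j >= 0 and arr[j] > temp:    # comparison
            arr[j + 1] = arr[j]            # shift right
            j -= 1
        arr[j + 1] = temp                  # insertion into the gap
```

On an already-sorted array the inner while loop never runs, which is exactly the \(O(N)\) best case discussed below.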

Big O Notation only takes into account the highest order of N when we have multiple
orders added together.

This means that:

\(O(N^2 + 2N)\) = \(O(N^2)\)

\(O(N^4 + N^3 + N^2)\) = \(O(N^4)\)

This is because as \(N\) increases, the higher order values become so much larger than the lower-order ones that it becomes trivial to account for them.

Note

Take a look at this chart to understand the severity with which these categories vary.

In the worst-case scenario, Selection Sort is faster than Insertion Sort. In the average case, however, they perform about the same.

Best, Average and Worst cases follow a Bell Curve, showing that the worst case is, just like the best case, an outlier.

Selection Sort: Best Case \(O(N^2/2)\); Average Case \(O(N^2/2)\); Worst Case \(O(N^2/2)\)

Insertion Sort: Best Case \(O(N)\); Average Case \(O(N^2/2)\); Worst Case \(O(N^2)\)

Being able to discern between best-, average- and worst-case scenarios is a
key skill in choosing the best algorithm for your needs, as well as taking
existing algorithms and optimizing them further to make them significantly
faster. [W]hile it’s good to be prepared for the worst case, average cases
are what happen most of the time.

Chapter 7 - Big O in Everyday Code

Note

I don’t have much to say about this chapter beyond that I really appreciate it being there. It nails the Big O Notation mindset into you, and so far, I’ve enjoyed it.

The problems’ solutions are fairly easy, and it’s a chapter you should read if you haven’t grokked Big O yet.

Tip

If an algorithm has 2 inputs of length \(N\) and \(M\) and it has a nested for loop that loops through them, then what would have been \(O(N \times M)\) can be treated as \(O(N^2)\), because in the worst-case scenario both arrays are the same size.

Chapter 8 - Blazing Fast Lookup with Hash Tables

A Hash Table is a data structure in which the indices are hashed and correlated with values in an array-like structure.

The example this book uses is alphabet-based hashing, using the product of the numeric value of each letter.

“dog” becomes \(4*15*7 = 420\), and the corresponding value associated with “dog” is stored in the 420th cell of the array that holds the values of the Hash Table.

This lookup mechanism is usually designed with the following assumptions:

  1. The hashing of a key always yields the same value. This way, there can be a 1:1 relationship between keys and their values.

  2. The hashing is non-random and doesn’t depend on execution time or machine. This enables the data structure to be written to disk, if need be.
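The book’s letter-product hash might look like this (my own sketch; it is a deliberately naive hash, since multiplication is commutative and anagrams collide):

```python
def letter_product_hash(word):
    """Multiply each letter's position in the alphabet (a=1 ... z=26)."""
    product = 1
    for ch in word.lower():
        product *= ord(ch) - ord("a") + 1
    return product


# "dog" hashes to 4 * 15 * 7 = 420 -- and so does "god", a collision.
```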

The efficiency of a hash table depends on three factors:

  1. How much data we’re storing in the hash table.

  2. How many cells are available in the hash table.

  3. Which hash function we’re using.

A good hash function is one that distributes its data across all
available cells. The more we can spread our data, the fewer collisions
we will have.
A good hash table strikes a balance of avoiding collisions while not consuming lots of memory.

To accomplish this, computer scientists have developed the following rule of thumb:
for every 7 data elements stored in the hash table, it should have 10 cells.

Hash tables have a lookup efficiency of \(O(1)\).

Hash tables are used in places where you’d usually use plenty of if-else-if-else statements. Or, you’d use them when you want to store information about a single entity together. This is most visible in the JSON data format.

Algorithm to determine whether two words are anagrams.
"""Algorithm to detect whether 2 words are an Anagram or not"""


def is_anagram(word_1: str, word_2: str) -> bool:
    """Checks whether 2 words is an anagram or not"""
    word_1_dict = {}
    for letter in word_1:
        word_1_dict[letter] = word_1_dict.get(letter, 0) + 1

    word_2_dict = {}
    for letter in word_2:
        word_2_dict[letter] = word_2_dict.get(letter, 0) + 1

    return word_1_dict == word_2_dict


def test_anagram():
    """Checks the anagram function"""
    assert is_anagram("mood", "doom")
    assert is_anagram("rot", "tor")
    assert is_anagram("something", "nothing") is False

Using a dictionary, Python’s implementation of a Hash Table, here, gives us an algorithm of \(O(N)\) order.

Exercises

Algorithm to yield the intersection of two arrays.
"""Algorithm to determine the intersection of two arrays"""


def intersect_arrays(arr_1: list, arr_2: list) -> list:
    """Returns the intersection of two arrays"""
    # Plain dicts preserve insertion order in Python 3.7+, so
    # collections.OrderedDict is unnecessary here.
    arr_1_hash = {}
    for item in arr_1:
        arr_1_hash[item] = arr_1_hash.get(item, 0) + 1

    arr_2_hash = {}
    for item in arr_2:
        arr_2_hash[item] = arr_2_hash.get(item, 0) + 1

    intersection = []
    for key in arr_1_hash:
        count_1 = arr_1_hash[key]
        count_2 = arr_2_hash.get(key, 0)
        # include items that are in arr_1, and arr_2
        occurs_in_both = min(count_1, count_2)
        intersection.extend([key]*occurs_in_both)
    return intersection


def test_intersect_arrays():
    """Tests that the above algorithm works"""
    assert intersect_arrays([1, 2, 3], [2]) == [2]
    assert intersect_arrays([1, 1, 2, 3], [1, 2, 3]) == [1, 2, 3]
    assert intersect_arrays([2, 1, 3], [1, 2, 2, 3]) == [2, 1, 3]

Todo

Do the other exercises.

Todo

Write an implementation of Hash Tables in Rust and Python.

Chapter 9 - Crafting Elegant Code with Stacks and Queues

Stacks

A stack is almost an array, except:

  1. Data can be inserted only at the end of a stack.

  2. Data can be deleted only from the end of a stack.

  3. Only the last element of a stack can be read.

The end of a stack is referred to as its top, and the beginning of the stack is referred to as its bottom.

Tip

In Python, a simple list can be used as a Stack, since lists support append and pop methods. However, it is better to write a wrapper around a regular list, since that way you can put checks in place so that users cannot insert an item wherever they want.
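A minimal wrapper along those lines (my own sketch):

```python
class Stack:
    """A list wrapper that exposes only top-of-stack operations."""

    def __init__(self):
        self._data = []

    def push(self, item):
        self._data.append(item)   # insert only at the end

    def pop(self):
        return self._data.pop()   # delete only from the end

    def peek(self):
        return self._data[-1]     # read only the last element

    def is_empty(self):
        return not self._data
```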

Stacks are useful when dealing with the order of things, especially when we care about what was the last item to be put into an array. Looking one step behind is a constant pattern when dealing with stacks.

Chapter 10 - Recursively Recurse with Recursion

Recursion has a lot to do with the stack. When a function calls itself, the calls pile up on what is known as the Call Stack. The return value of each call feeds into the call above it, and the stack eventually unwinds when the innermost call returns something that does not depend on another call.

Consider the factorial function:

"""Program to find the factorial of a number"""


def factorial(n: int) -> int:
    """Function to find the factorial of a number"""
    if n <= 1:
        return 1
    else:
        return n * factorial(n-1)


def test_factorial():
    import math
    for number in range(500):
        assert factorial(number) == math.factorial(
            number), "Could not calculate the factorial of {}".format(number)

Here, the function factorial will not immediately return anything.

For factorial(5), before it can return anything, it must invoke factorial(4).

For factorial(4), before it can return anything, it must invoke factorial(3).

For factorial(3), before it can return anything, it must invoke factorial(2).

For factorial(2), before it can return anything, it must invoke factorial(1).

Only factorial(1) returns immediately: it returns 1.

Notice that until the program reaches factorial(1), it must hold every instance of the function calls in memory. If the recursion goes deep enough, this becomes untenable: the machine can no longer keep track of the stack.

After each function call resolves, in the LIFO order, the call stack is popped, in that order, and the final result is returned to the user.

If the recursion call stack exceeds the memory the machine sets aside for it, there is a Stack Overflow, and the program shuts down.

Exercises

"""For an array containing both numbers and other arrays, write a recursive function that prints out all the numbers, and just the numbers"""


def print_nums(arr):
    for item in arr:
        if isinstance(item, int):
            print(item)
        else:
            print_nums(item)

Chapter 11 - Learning to Write in Recursive

Note

This chapter exists to drill in how you write recursively. Do all the exercises.

Find Directories
"""Recursive algorithm to find all directories within a given directory"""
import os


def get_directories(path=None):
    """Returns a list of directories and subdirectories"""
    if path is None:
        path = os.getcwd()
    contents = os.listdir(path)
    directories = []
    for item in contents:
        abs_path = os.path.join(path, item)
        if os.path.isdir(abs_path):
            directories.append(abs_path)
            directories.extend(get_directories(abs_path))
    return directories
Double Arrays in-place
"""Recursive function to double every item in an array"""


def double_array(array, index=0):
    """Doubles every item in an array"""
    if index >= len(array):  # base case; also handles an empty array
        return
    array[index] *= 2
    double_array(array, index + 1)


def test_double_array():
    import random
    input_array = [random.random() for _ in range(random.randint(0, 1000))]
    doubled_array = [x*2 for x in input_array]
    double_array(input_array)
    assert input_array == doubled_array, "Unable to double array"
Factorial Bottom-Up Recursion
"""Recursive function to get the nth factorial"""


def factorial(n):
    """Function to get the nth factorial"""
    if n <= 1:
        return 1
    else:
        return n*factorial(n-1)


def test_factorial():
    import random
    import math
    n = random.randint(1, 500)  # stay well below Python's default recursion limit

    assert factorial(n) == math.factorial(
        n), "unable to calculate the nth factorial"
Array Sum Top Down Approach
"""Recursive solution to array sum"""


def array_sum(arr):
    """Returns the sum of an array, recursively"""
    if not arr:  # base case: an empty array sums to 0
        return 0
    return arr[0] + array_sum(arr[1:])


def test_array_sum():
    import random
    input_array = [random.randint(0, 1000)
                   for _ in range(random.randint(1, 1000))]
    assert array_sum(input_array) == sum(input_array)
String Reversal (See EASY 0344. Reverse String also)
"""Recursive solution to reverse a string"""


def reverse_string(input_string):
    """Reverse a string recursively"""
    if len(input_string) <= 1:  # empty and one-character strings reverse to themselves
        return input_string
    return input_string[-1] + reverse_string(input_string[:-1])


def test_reverse_string():
    sample_strings = [
        "hello world",
        "peter piper picked a pack of pickled peppers",
        "the cake is a lie",
        "once upon a time"
    ]
    for st in sample_strings:
        assert reverse_string(st) == "".join(reversed(st))
Counting X
"""Recursive algorithm to count x in a string"""


def count_x(input_string):
    """Counts the occurrence of x in a string"""
    if not input_string:  # base case: the empty string contains no x
        return 0
    return (1 if input_string[0] == "x" else 0) + count_x(input_string[1:])


def test_count_x():
    """tests that the counting function works"""
    input_strings = [
        "x",
        "",
        "lmzxcvuljwsdf;laxlvjsdflweo asdflkxc;lvslweropupas xlx;lkvj;ljweorijxlj ;xlxx;lkj;lxx ",
        "something that doesn't have it"
    ]
    for string in input_strings:
        assert count_x(string) == string.count("x")
The Staircase Problem
"""Let’s say we have a staircase of N steps, and a person has the ability to climb
one, two, or three steps at a time. How many different possible “paths” can
someone take to reach the top? Write a function that will calculate this for N
steps.
"""


def number_of_paths(n):
    """Solves the staircase problem"""
    if n < 0:
        return 0
    elif n in (0, 1):
        return 1
    else:
        return (
            number_of_paths(n-1) + number_of_paths(n-2) + number_of_paths(n-3)
        )
Anagram Generation
"""Generate all anagrams of a set of letters"""


def anagrams_of(string: str):
    """Returns a list of anagrams of a string"""
    if len(string) <= 1:
        # base case: an empty or single-character string is its own only
        # anagram.
        return [string]
    # instantiate an empty placeholder
    collection = []
    # Get all the anagrams of the rest of the string, besides the first
    # character, until the end of the string.
    substring_anagrams = anagrams_of(string[1:])

    for substring_anagram in substring_anagrams:
        # now, while looping through this new list of sub-anagrams,
        for index in range(len(substring_anagram)+1):
            # loop through each character in this anagram, insert the first
            # character into a place dictated by the index of the sub-anagrams
            anagram = substring_anagram[:index] + \
                string[0] + substring_anagram[index:]
            # append to our collection
            collection.append(anagram)
    # return the result
    return collection


def test_anagrams_of():
    "tests anagrams_of"
    input_string = "abc"
    result = [
        "abc",
        "bac",
        "cab",
        "cba",
        "acb",
        "bca"
    ]
    anagrams = anagrams_of(input_string)
    for item in result:
        assert item in anagrams, "{} not found in anagrams of {}".format(
            item, input_string)

Examples

Total Number of Characters across all Strings in an Array
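A sketch of my solution:

```python
def character_count(strings):
    """Total number of characters across all strings, recursively."""
    if not strings:  # base case: no strings, no characters
        return 0
    return len(strings[0]) + character_count(strings[1:])
```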

Given an array of numbers, write a recursive function to return only even numbers.
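A sketch of my solution:

```python
def even_numbers(numbers):
    """Returns only the even numbers from an array, recursively."""
    if not numbers:  # base case: nothing left to filter
        return []
    rest = even_numbers(numbers[1:])
    if numbers[0] % 2 == 0:
        return [numbers[0]] + rest
    return rest
```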

Given a number N, return the current number from the Triangular Numbers series.
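A sketch of my solution. The Triangular Numbers series is 1, 3, 6, 10, 15, ..., where each term is N plus the previous term:

```python
def triangular_number(n):
    """Returns the nth term of the Triangular Numbers series."""
    if n == 1:  # base case: the series starts at 1
        return 1
    return n + triangular_number(n - 1)
```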

Given a string, return the index of the first occurrence of x within the string.
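A sketch of my solution, assuming (as the problem does) that the string contains an “x”:

```python
def index_of_x(string):
    """Returns the index of the first x; assumes the string contains one."""
    if string[0] == "x":  # base case: x found at the front
        return 0
    return 1 + index_of_x(string[1:])
```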

Given a number of rows and columns, calculate the number of possible “shortest” paths from the upper-leftmost square to the lower-rightmost square.
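A sketch of my solution: a shortest path only ever moves right or down, so the count from any square is the sum of the counts from the square below and the square to the right.

```python
def unique_paths(rows, columns):
    """Number of shortest paths from top-left to bottom-right of a grid."""
    if rows == 1 or columns == 1:  # base case: a single row or column
        return 1                   # has exactly one path
    return unique_paths(rows - 1, columns) + unique_paths(rows, columns - 1)
```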

Chapter 12 - Dynamic Programming

Chapter 13 - Recursive Algorithms for Speed

Chapter 14 - Node-Based Data Structures

Chapter 15 - Speeding Up All the Things with Binary Search Trees

Chapter 16 - Keeping Your Priorities Straight with Heaps

Chapter 17 - It Doesn’t Hurt to Trie

Chapter 18 - Connecting Everything with Graphs

Chapter 19 - Dealing with Space Constraints

Chapter 20 - Techniques for Code Optimization