What’s the difference between controlled and uncontrolled inputs in ReactJS?

Uncontrolled inputs are input elements that have their state stored strictly in the browser document object model (DOM). They behave like vanilla HTML inputs that you create without using a framework like React.

Uncontrolled (out of control? lol) inputs

There are a couple of ways you can create uncontrolled inputs.

The first is to leave out the value attribute of an input.

function App() {
  return (
    <div>
      <input type="text" name="title"/>
    </div>
  );
}

This input behaves like a regular input. But if you need to access the value of this input inside your component – say, to submit the form or to do some other processing with the value – then you can’t get it without reading it from the DOM itself using the DOM API (e.g. document.getElementsByTagName).

If you’re using ReactJS in the first place, you’re probably trying to avoid having to work directly with the DOM API. Now there are situations when you do want to read DOM state directly and React offers a way to do that with uncontrolled inputs through its own API called refs.

Here’s an example of reading state from our uncontrolled component with refs:

function App() {
  const inputRef = useRef(null);
  const handleClick = () => {
    alert(inputRef.current.value);
  };
  return (
    <div>
      <input type="text" name="title" ref={inputRef} />
      <button onClick={handleClick}>Click</button>
    </div>
  );
}

We bind the input to a ref object that’s created using the useRef hook. This creates a connection to our input element and allows us to access the DOM value directly without having to use the DOM API. In most situations where you want or need to use uncontrolled inputs, refs are the way to go.

You can also create an uncontrolled input by setting a value attribute – but only if the value is null or undefined.

Here’s an example with undefined

function App() {
  return (
    <div>
      <input type="text" name="title" value={undefined} />
    </div>
  );
}

This behaves just like an input that doesn’t have a value attribute at all.

Controlled inputs

Controlled inputs get their value from React component state rather than from the DOM. The component is the source of truth for the value of the input.

For example

function App() {
  const [title, setTitle] = useState("dog day afternoon");
  return (
    <div>
      <input type="text" name="title" value={title} />
    </div>
  );
}

If you try to type into this input, the value won’t actually change! That’s because nothing in the component is currently writing to the title variable. That’s why most controlled inputs come with change handlers.

function App() {
  const [title, setTitle] = useState("dog day afternoon");
  return (
    <div>
      <input type="text" name="title" value={title} onChange={(e) => setTitle(e.target.value) }/>
    </div>
  );
}

Danger! Changing a controlled input to an uncontrolled input between renders

Now that we’ve covered the difference between a controlled and uncontrolled component – what happens if this same input element changes from a controlled element to an uncontrolled one?

To demonstrate what happens, let’s add a toggle handler that toggles the title state between a string value and undefined.

function App() {
  const [title, setTitle] = useState("dog day afternoon");
  const toggleTitle = () => {
    if (title) {
      setTitle(undefined);
    } else {
      setTitle("dog day afternoon");
    }
  }
  return (
    <div>
      <input type="text" name="title" value={title} onChange={(e) => setTitle(e.target.value) }/>
      <button onClick={toggleTitle}>Toggle</button>
    </div>
  );
}

Clicking Toggle produces the following warning:

Warning: A component is changing a controlled input to be uncontrolled. This is likely caused by the value changing from a defined to undefined, which should not happen. Decide between using a controlled or uncontrolled input element for the lifetime of the component. More info: https://reactjs.org/link/controlled-components

In most cases, you can fix this by ensuring that you don’t supply undefined or null to your inputs. You can do this with some additional data processing or validation before that value is bound to the input at render time, for example by falling back to an empty string (value={title ?? ""}).

Web Application Session Management Primer

Most web applications need to handle user sessions at some point. A common use-case is to remember an authenticated user across requests. Since HTTP is a stateless protocol, the only way for servers to know that the current request is related to a previous request by the same user is to associate them with some common identifier provided by the browser containing information about the user.

A popular and simple way that browser clients provide this identifier is to do it completely through cookies – this is also known as client side session management.

Here’s a typical scenario involving cookie sessions:

  1. User submits login credentials for web app
  2. Server authenticates the user credentials and asks the browser to set a cookie containing information (such as user ID) that allows the server to re-authenticate the user on subsequent requests
  3. After the cookie is set, the browser sends it along with every request to the app for as long as the cookie hasn’t expired.
  4. The app reads the cookie and the user ID that the cookie contains and then performs a user lookup using the ID. If the user exists, they’re authenticated and the application can respond accordingly.

In most production systems, the cookie contents are typically masked and cryptographically signed so that they can’t be tampered with. This is important because if cookie contents can be easily modified, any authenticated user can impersonate any other user just by changing a value in the cookie such as user_id.
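
To make the signing idea concrete, here’s a minimal Python sketch using the standard library’s hmac module. The SECRET_KEY value, the value.signature cookie layout and the function names are assumptions made up for this illustration, not what any particular framework does. Note that signing only detects tampering; it doesn’t hide the contents.

```python
import hashlib
import hmac

# Hypothetical server-side secret; a real key would be long, random and kept out of source control.
SECRET_KEY = b"replace-with-a-long-random-server-side-secret"

def sign_cookie(value, key=SECRET_KEY):
    """Return 'value.signature', where signature is an HMAC-SHA256 hex digest of value."""
    signature = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{value}.{signature}"

def verify_cookie(cookie, key=SECRET_KEY):
    """Return the original value if the signature matches, otherwise None."""
    value, separator, signature = cookie.rpartition(".")
    if not separator:
        return None  # no signature part at all
    expected = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest does a constant-time comparison to avoid timing leaks
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return value
    return None

cookie = sign_cookie("user_id=42")
# An attacker edits the user ID but cannot recompute the signature without the key
tampered = cookie.replace("user_id=42", "user_id=1")
```

With this in place, verify_cookie(cookie) hands back the original value, while verify_cookie(tampered) returns None because the signature no longer matches the contents.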

Drawbacks of client side management

While a fully client side approach is easy to set up – and very secure if done right – there are some drawbacks as well.

  • If your server keys are exposed, the contents of any cookie generated by the key can be tampered with. This can be bad if the cookies contain personally identifiable information (PII) or worse, user credentials.
  • You are reliant on the browser to expire user sessions since the entirety of session state lives in the browser. To get rid of a cookie, you need to make sure to set cookie expiry headers and wait for the browser to expire them. If cookies are being cryptographically signed, you can technically rotate your key to invalidate all sessions… but that’s a nuclear option.

The one-sided nature of an application’s control over session life and the potential for leaking sensitive information into the client lead many applications to adopt server side session management – a different, more operationally expensive approach that comes with a different set of trade-offs.

Server Side Management

Server side session management (SSM) is a bit of a misnomer because session related data is not completely server side. Remember, HTTP is a stateless protocol – the only way for a remote server to tell that requests are related is from information passed along by the client.

SSM still uses cookies for session management, so what part of it is server side?

Instead of storing the actual contents of a session like user id (encrypted or not, doesn’t matter) directly in the cookie, it stores a reference to the data in the form of a session ID – also known as a session token. When the client makes a request to the web application, it forwards this cookie containing the session token and the application uses it to look up the actual contents.

With a server side approach, you can use cookies to persist a common session state without risk of exposing data on the cookie itself since the actual data is stored server side. You can… have your cookie and eat it too.
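
Here’s a rough Python sketch of that token-to-data lookup. The in-memory SESSIONS dictionary and the function names are made up for illustration; a real application would keep this data in a proper store rather than a module-level dict.

```python
import secrets

# Hypothetical in-memory store standing in for wherever session contents really live.
SESSIONS = {}

def create_session(user_id):
    """Store the session contents server side and return an opaque token for the cookie."""
    token = secrets.token_urlsafe(32)  # random, unguessable session ID
    SESSIONS[token] = {"user_id": user_id}
    return token

def load_session(token):
    """Look up the session contents for the token sent back by the browser."""
    return SESSIONS.get(token)

def destroy_session(token):
    """Invalidate a session immediately, without waiting for the cookie to expire."""
    SESSIONS.pop(token, None)

token = create_session(42)
```

Only the token travels in the cookie. Calling destroy_session(token) ends the session on the server’s terms, which is exactly the control a cookie-only approach lacks.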

Okay, so where is the actual content stored?

The two primary locations for session contents are either on the same application servers that receive requests or in a database running on another server.

Storing session data on application servers

The easiest option is on the application server itself, either on disk or in memory. On a single server setup, sessions will go down with the server if in memory but can persist across restarts if persisted to disk.

Most high-traffic production apps are clustered so there are groups of app servers acting as a single system. In a clustered environment behind a load balancer, you can’t store session data on specific instances without using sticky sessions (routing requests from the same client to the same server). Otherwise, requests are likely going to end up on different servers each time and create a bad user experience; in one request, a user may be identified as being logged in but then in a subsequent request they’re suddenly logged out.

I would avoid sticky sessions because they require you to have long-lived servers. If you’re doing blue-green deployments and are frequently tearing down old servers and deploying new ones, you’re constantly going to be purging session data. This fundamentally limits your ability to horizontally scale your web instances.

Instead of storing session data on app servers directly, I recommend storing session data in a separate datastore on an external, shared server.

Storing session data on databases in external servers

With an external storage setup, there’s two primary options as far as databases go – either in-memory or on-disk. We’re also going to incur an additional network call for every request.

If we’re using an in-memory, distributed cache store like Redis or Memcached:

  • Storage is limited by RAM, but with the option of horizontally scaling out by adding more instances to a cluster. Ideal for short-lived, volatile sessions (ones that can be evicted at any time without serious impact on the application).
  • Fast reads and writes – a fetch from RAM is orders of magnitude faster than a fetch from disk.
  • The instance needs to be properly secured and fire-walled, otherwise you risk compromising potentially sensitive user data.
  • It’s not as easy to set up as a cookie-only session because you need to create and manage additional infrastructure.
  • Automatic expiry management and cleanup is available in stores like Redis.

If we’re using an on-disk store such as PostgreSQL or MySQL:

  • The storage limit per instance is much, much larger. This is ideal for large, persistent session data. At the time of writing you can get a 64 terabyte server from AWS.
  • Reads and writes will be slower compared to in-memory stores.
  • Expiry will be more manual. Unlike in-memory databases, you can’t just restart the instance if you want to wipe all sessions at once. You’ll likely need to run background jobs to expire session data.
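
As a sketch of what such an expiry job might look like, here’s a small Python example. It uses the standard library’s sqlite3 as a stand-in for PostgreSQL or MySQL so that it runs on its own, and the sessions table layout and function names are assumptions for illustration only.

```python
import sqlite3
import time

def create_sessions_table(conn):
    """Create a hypothetical sessions table with an explicit expiry timestamp."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "token TEXT PRIMARY KEY, data TEXT NOT NULL, expires_at REAL NOT NULL)"
    )

def purge_expired_sessions(conn, now=None):
    """Delete every session whose expiry time has passed; return how many were removed."""
    now = time.time() if now is None else now
    cursor = conn.execute("DELETE FROM sessions WHERE expires_at <= ?", (now,))
    conn.commit()
    return cursor.rowcount

conn = sqlite3.connect(":memory:")
create_sessions_table(conn)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [("old", "{}", 100.0), ("fresh", "{}", 300.0)],
)
# Pretend the current time is 200: "old" has expired, "fresh" has not
deleted = purge_expired_sessions(conn, now=200.0)
remaining = [row[0] for row in conn.execute("SELECT token FROM sessions")]
```

A scheduler (cron, a worker queue, and so on) would call purge_expired_sessions periodically. Lookups should also check expires_at, since a row can outlive its expiry until the next purge runs.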

In summary

While there’s no one-size-fits-all solution and the right choice depends on your use case, here are some general guidelines:

  • Avoid storing PII or any sensitive information as session content – this applies regardless of whether you use client or server side management. When encryption keys are compromised, you should assume exposure of user data so it’s best to be sure you’re not storing anything sensitive.
  • Keep session state to a minimum – if it seems like your per-session data storage requirements are high, maybe it shouldn’t be treated as session data? It may make sense to persist alongside your primary application state.
  • Lean on trusted, open-source libraries to ensure security best practices (Rails encrypts session cookies by default) instead of rolling your own home-grown session management solution. Most popular web frameworks like Rails have battle-tested solutions around session management, such as Devise.
  • Avoid sticky sessions. They’re unreliable and the drawbacks are usually not what you’re willing to sign up for. For clustered services that use server side sessions, just go with a distributed cache.
  • Know how to securely manage your instance and understand its failure modes and replication behavior, especially if you’re dealing with high traffic applications.
  • Avoid re-using your primary application database as a session database because they can experience very different traffic patterns. For example, with certain session configurations, Rails apps will create a new session for every new user that visits the application – this can put undue write load on your application database.

Learn Huffman Encoding With Python

Let’s walk through an implementation of Huffman encoding and decoding in Python. To demonstrate the key ideas, I’m going to treat bits as plaintext strings to keep things simple. While this means that the output isn’t truly compressed as bytes, you will hopefully take away a deeper understanding of how it works under the hood.
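
For the curious, here’s a rough sketch of how a plaintext bit string could be packed into real bytes. This isn’t part of the implementation in this post; the pack_bits and unpack_bits names and the one-byte padding header are assumptions of this sketch.

```python
def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes.

    The first byte records how many padding bits were appended so that
    the total length becomes a multiple of 8.
    """
    pad = (8 - len(bits) % 8) % 8
    padded = bits + "0" * pad
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([pad]) + body

def unpack_bits(data):
    """Reverse pack_bits: expand the bytes to a bit string and drop the padding."""
    pad = data[0]
    bits = "".join(f"{byte:08b}" for byte in data[1:])
    return bits[:len(bits) - pad] if pad else bits

packed = pack_bits("010110")
```

Six bits of text become two bytes here: one for the padding count and one for the bits themselves. The overhead of that header only pays off on longer inputs.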

Theoretical Overview

Huffman encoding works by exploiting the unequal distribution of character occurrences in text. Rather than encoding every single character with the same number of bits, it encodes characters that occur more frequently with a smaller number of bits and those that occur less frequently with a greater number of bits.

For example, let’s say we have the text abc.

Without compression, each character in abc will take up a fixed number of bits – let’s say a byte (8 bits) per character using an ASCII character set. That’s 3 bytes for 3 characters or 24 bits total.

If we use a variable length encoding, we can instead use as many bits as we need to identify a character. To illustrate this concept, let’s map each character to a variable number of bits like so:

a = 0
b = 10
c = 110

Then abc will be 010110 which is only 6 bits. That’s 18 bits (75%) less compared to the uncompressed version!
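
The arithmetic above can be checked with a few lines of Python. The codes dictionary simply mirrors the mapping from this example.

```python
# The variable-length codes from the example above
codes = {"a": "0", "b": "10", "c": "110"}
text = "abc"

# Concatenate the code for each character
encoded = "".join(codes[ch] for ch in text)

fixed_bits = 8 * len(text)               # one byte per character
saved_bits = fixed_bits - len(encoded)   # bits saved by the variable-length codes
saved_percent = 100 * saved_bits / fixed_bits

print(encoded, saved_bits, saved_percent)  # prints: 010110 18 75.0
```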

But here’s the catch: we need to make sure that these codes are prefix free, meaning that no code in our set is a prefix of another code. Why? This is best understood with an example. Let’s change the code for b and add another character, d, to our previous set.

a = 0
b = 01
c = 110
d = 10

Now consider what happens if we want to encode acb: we would get 011001. But upon decoding, it can be misinterpreted as bdb (01 – 10 – 01). That’s because the code for a (0) is a prefix of the code for b (01) – so if you read from left to right, you can either read 0 and stop (which gives you a) or read both 0 and 1 (which gives you b). When do you stop reading?

Unless we introduce a delimiter into our output bit stream, we can’t tell where the bits of one character end and another’s start. The only way to tell without a delimiter is to have codes that introduce no ambiguity, and you can accomplish that by ensuring that the codes are prefix free.
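
To see the ambiguity for yourself, here’s a small Python sketch that checks whether a code set is prefix free and brute-forces every possible decoding of a bit string. The helper names is_prefix_free and all_decodings are made up for this illustration and aren’t used elsewhere in the implementation.

```python
def is_prefix_free(codes):
    """Return True if no code is a prefix of (or identical to) another code."""
    values = list(codes.values())
    return not any(
        i != j and other.startswith(code)
        for i, code in enumerate(values)
        for j, other in enumerate(values)
    )

def all_decodings(bits, codes):
    """Return every way to split bits into a sequence of valid codes."""
    if not bits:
        return [""]
    results = []
    for char, code in codes.items():
        if bits.startswith(code):
            # Consume this code, then decode whatever is left
            rest = all_decodings(bits[len(code):], codes)
            results += [char + tail for tail in rest]
    return results

good = {"a": "0", "b": "10", "c": "110"}
bad = {"a": "0", "b": "01", "c": "110", "d": "10"}

print(is_prefix_free(good), all_decodings("010110", good))  # True ['abc']
print(is_prefix_free(bad), all_decodings("011001", bad))    # False ['acb', 'bdb']
```

The prefix free set decodes in exactly one way, while the second set yields both acb and bdb for the same bits.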

This presents two additional challenges that the creator of Huffman encoding solved:

  1. How do we generate these prefix free codes?
  2. How do we generate optimal prefix free codes such that we are able to assign shorter codes to higher frequency characters?

The prefix free codes are created by constructing a binary trie. The edges in the trie represent 0’s and 1’s and each leaf node represents a unique character in the text. Therefore, the path from the root to a leaf represents the code for the character at that leaf. Since the characters are only at the leaf nodes, all the paths to those nodes are unique and none of them is a prefix of another, making the codes prefix free. To attain optimality, the trie is constructed bottom up, starting with the characters that occur the least often, so that the eventual codes (the paths from the root of the trie to the leaf nodes) are shortest for the characters that occur the most often.

Implementation Overview

Here’s an overview of both compression and decompression steps:

Compression

  1. read the text and figure out character frequency
  2. use frequency to build trie – this generates the bit sequence in the form of trie paths for each character
  3. generate a code table using trie – this lets us find a corresponding bit sequence code for a character
  4. encode our text using table to produce a bit stream
  5. encode our trie as bits; this will be used by the decoder
  6. write both to a file

Decompression

  1. read the trie bits portion (header) to re-construct the trie; we’ll need this to decode the encoded text
  2. read the body / text bits portion (this is the encoded form of the actual value we’re trying to get)

The Trie Node

We’ll be using this class to construct our trie. It will be referenced throughout the implementation.

class Node:
    def __init__(self, char="", left=None, right=None, freq=0):
        self.char = char
        self.left = left
        self.right = right
        self.freq = freq

    def to_binary(self):
        return "{:08b}".format(ord(self.char))

    def is_leaf(self):
        return (self.left is None and self.right is None)

    # necessary for heapq comparisons
    # heapq compares (freq, node) tuples, so it falls back on comparing nodes when frequencies are equal
    def __lt__(self, other):
        return self.char < other.char

    def __eq__(self, other):
        if isinstance(other, Node):
            return (
                (self.char == other.char) and
                (self.left == other.left) and
                (self.right == other.right)
            )
        return False

    def __repr__(self):
        return "None" if self.char is None else self.char
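
To show why the __lt__ method matters, here’s a small standalone sketch. The Plain and Ordered classes are simplified stand-ins made up for this demo, not the Node class itself. Pushing two (priority, object) tuples with the same priority forces Python to compare the objects, which fails unless they define an ordering.

```python
import heapq

class Plain:
    """A stand-in node that defines no ordering."""
    def __init__(self, char):
        self.char = char

class Ordered(Plain):
    """The same stand-in, but orderable by char like Node is."""
    def __lt__(self, other):
        return self.char < other.char

def pop_smallest_of_two_ties(cls):
    queue = []
    heapq.heappush(queue, (1, cls("b")))
    # Same priority as the first item, so the tuple comparison
    # has to fall through to comparing the objects themselves
    heapq.heappush(queue, (1, cls("a")))
    return heapq.heappop(queue)[1].char

try:
    pop_smallest_of_two_ties(Plain)
    plain_outcome = "ok"
except TypeError:
    plain_outcome = "TypeError"

ordered_outcome = pop_smallest_of_two_ties(Ordered)
print(plain_outcome, ordered_outcome)  # prints: TypeError a
```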

Compression Process

This is the main method for compression – we’re encoding the text and then we’re including some header metadata for the decoder. The essence of the header metadata is the serialized trie that we constructed for the purposes of encoding.

def compress(text):
    trie_tree = build_trie(text)
    table = build_code_table(trie_tree)
    trie_bits = serialize_trie_to_binary(trie_tree)
    header = "{:016b}{}".format(len(trie_bits), trie_bits)
    body = encode_text(table, text)
    return header + body

The following method uses a min heap to ensure that the most frequently occurring characters (via the freq attribute) are included in our trie structure last.

def build_trie(text):
    from collections import Counter
    from heapq import heappush, heappop
    char_count = Counter(text)
    queue = []
    for char, freq in char_count.items():
        node = Node(char=char, freq=freq)
        heappush(queue, (node.freq, node))
    while len(queue) > 1:
        freq1, node1 = heappop(queue)
        freq2, node2 = heappop(queue)
        parent_node = Node(
            left=node1,
            right=node2,
            freq=freq1 + freq2
        )
        heappush(queue, (parent_node.freq, parent_node))
    freq, root_node = heappop(queue)
    return root_node

This method constructs our character-to-code hash table. Our trie lets us decode an encoded stream by following the binary node paths to the characters using the bit values in the stream. However, we need to create a character-to-code mapping in order for our constructed trie to be useful in the encoding process. Otherwise, we would need to scan our entire trie using either DFS or BFS, searching for a target character (for every character we want to encode).

def build_code_table(node):
    table = {}
    def map_char_to_code(node, value):
        if node.is_leaf():
            table[node.char] = value
            return
        map_char_to_code(node.left, value + "0")
        map_char_to_code(node.right, value + "1")
    map_char_to_code(node, "")
    return table

In order for a decoder to decode our encoded text, it needs to know the character-to-code mapping we used so this method serializes the trie used in the encoding into bits. It uses a pre-order traversal to encode our trie. If it’s a non-leaf node, we prefix the output with a zero. Otherwise, we prefix it with a 1 followed by the actual bits representing the character.

def serialize_trie_to_binary(node): 
    if not node:
        return ""
    if node.is_leaf():
        return "1" + node.to_binary()
    return "0" + serialize_trie_to_binary(node.left) + serialize_trie_to_binary(node.right)

This method makes use of our character-to-code table to convert characters into bits. This represents our compressed text!

def encode_text(table, text):
    output = ""
    for x in text:
        output += table[x]
    return output

Decompression Process

Here’s the main method for decompression. It essentially re-constructs the trie in memory using the bits in the input that represent the trie. Then it uses that in-memory trie to decode the bits of the input that represent our encoded text.

def decompress(bit_string):
    trie_size = int(bit_string[0:16], 2)
    trie_range_end = 16 + trie_size
    trie = deserialize_binary_trie(bit_string[16:trie_range_end])
    body = bit_string[trie_range_end:]
    return decode_text(trie, body)

This function does the reverse of serialize_trie_to_binary. The recursion here takes advantage of the fact that 1 bits mark the leaves of our trie, so a 1 bit can be used as the base case before continuing to de-serialize the next trie path. The curr_pos argument acts as a pointer to our current read position so we know where to start and stop reading input.

def deserialize_binary_trie(bits):
    def read_bits(curr_pos):
        if curr_pos >= len(bits):
            return None, curr_pos
        bit = bits[curr_pos]
        if bit == "1":
            char_range_left = curr_pos+1
            char_range_right = char_range_left + 8
            char_bits = bits[char_range_left:char_range_right]
            return Node(
                char=chr(int(char_bits, 2))
            ), char_range_right
        left_node, pos = read_bits(curr_pos + 1)
        right_node, pos = read_bits(pos)
        return Node(
                left = left_node,
                right = right_node
        ), pos
    node, pos = read_bits(0)
    return node

Finally, with our trie object on hand, this function follows the bits of the encoded text using the trie to find the characters.

def decode_text(node, data):
    out = ""
    root = node
    curr_node = root
    for bit in data:
        if bit == "0":
            curr_node = curr_node.left
        else:
            curr_node = curr_node.right
        if curr_node.is_leaf():
            out += curr_node.char
            curr_node = root
    return out

That completes the overview of this basic Python representation of the Huffman algorithm. In practice, some implementations may use pre-existing code tables rather than generating them on the fly as we did here. For example, if you need fast encoding and know the average character frequencies of the text you’re encoding, you may not want to be constructing a new trie on every encode operation.

References

Here’s a couple of resources I used in writing this implementation – I highly encourage you to check them out to understand huffman in even greater depth.

Nand2tetris Python Assembler

Here’s my source code for the assembler for the nand2tetris HACK assembly language written in Python 3.

This implementation emphasizes readability above all else. Therefore, there are more function calls than necessary and many parts of the implementation assume valid inputs. It has been tested to work with all files provided in the course.

I hope this can serve as a useful reference for others.

import re
import sys
import argparse

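def convert_assembly_to_binary_file(asm_file, binary_file):
    # read and translate the whole assembly file first
    with open(asm_file, "r") as source:
        result = translate_lines(source.readlines())
    # label lines translate to None, so drop them before writing
    output = "\n".join([line for line in result if line])
    with open(binary_file, "w") as target:
        target.write(output)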

def translate_lines(lines):
    lines = strip_whitespace_and_comments(lines)
    symbol_table = build_symbol_table(lines)
    translate_instruction = build_instruction_translator(symbol_table)
    return [translate_instruction(x) for x in lines]

def strip_whitespace_and_comments(lines):
    instructions = []
    for line in lines:
        stripped_line = line.strip() 
        if stripped_line:
            if not stripped_line.startswith("//"):
                if "//" in stripped_line:
                    instructions.append(stripped_line.split("//")[0].strip())
                else:
                    instructions.append(stripped_line)
    return instructions

def build_symbol_table(lines):
    symbols = {
        "R0": "0000000000000000",
        "R1": "0000000000000001",
        "R2": "0000000000000010",
        "R3": "0000000000000011",
        "R4": "0000000000000100",
        "R5": "0000000000000101",
        "R6": "0000000000000110",
        "R7": "0000000000000111",
        "R8": "0000000000001000",
        "R9": "0000000000001001",
        "R10": "0000000000001010",
        "R11": "0000000000001011",
        "R12": "0000000000001100",
        "R13": "0000000000001101",
        "R14": "0000000000001110",
        "R15": "0000000000001111",
        "SP": "0000000000000000",
        "ARG": "0000000000000010",
        "LCL": "0000000000000001",
        "THIS": "0000000000000011",
        "THAT": "0000000000000100",
        "KBD": "0110000000000000",
        "SCREEN": "0100000000000000"
    }
    is_address_instruction = lambda x: x.startswith("@")
    is_compute_instruction = lambda x: "=" in x or ";" in x
    label_value = lambda x: x.replace("(", "").replace(")", "").strip()
    current_line_num = 0
    for line in lines: 
        if is_address_instruction(line) or is_compute_instruction(line):
            current_line_num +=1 
        elif is_label(line):
            symbols[label_value(line)] = decimal_to_binary(current_line_num)
    base_address = 16
    for line in lines:
        if line.startswith("@"):
            value = line[1:]
            if value not in symbols and not value.isnumeric():
                symbols[value] = decimal_to_binary(base_address)
                base_address += 1
    return symbols

def build_instruction_translator(symbol_table):
    COMPUTATIONS = {
        "0": "0101010",
        "1": "0111111",
        "-1": "0111010",
        "D": "0001100",
        "A": "0110000",
        "!D": "0001101",
        "!A": "0110001",
        "-D": "0001111",
        "-A": "0110011",
        "D+1": "0011111",
        "A+1": "0110111",
        "D-1": "0001110",
        "A-1": "0110010",
        "D+A": "0000010",
        "D-A": "0010011",
        "A-D": "0000111",
        "D&A": "0000000",
        "D|A": "0010101",
        "M": "1110000",
        "!M": "1110001",
        "-M": "1110011",
        "M+1": "1110111",
        "M-1": "1110010",
        "D+M": "1000010",
        "D-M": "1010011",
        "M-D": "1000111",
        "D&M": "1000000",
        "D|M": "1010101"
    }
    DESTINATIONS = {
        "": "000",
        "M": "001",
        "D": "010",
        "MD": "011",
        "A": "100",
        "AM": "101",
        "AD": "110",
        "AMD": "111"
    }
    JUMPS = {
        "": "000",
        "JGT": "001",
        "JEQ": "010",
        "JGE": "011",
        "JLT": "100",
        "JNE": "101",
        "JLE": "110",
        "JMP": "111"
    }
    def fn(line):
        if is_label(line):
            return
        if line.startswith("@"):
            value = line[1:]
            if value in symbol_table:
                return symbol_table[value]
            return decimal_to_binary(int(value))
        dest, jump = "", ""
        comp = line.split("=").pop().split(";")[0]
        if "=" in line: 
            dest = line.split("=")[0]
        if ";" in line: 
            jump = line.split(";").pop()
        return f"111{COMPUTATIONS.get(comp, '0000000')}{DESTINATIONS.get(dest, '000')}{JUMPS.get(jump, '000')}"
    return fn

def is_label(line):
    return line.startswith("(") and line.endswith(")")

def decimal_to_binary(decimal_value):  
    return f"{decimal_value:0>16b}"

if __name__ == "__main__": 
    parser = argparse.ArgumentParser(description="Generates a hack binary file from assembly")
    parser.add_argument("asm_file", help="name of a HACK assembly file, i.e input.asm")
    parser.add_argument("binary_file", help="name of the HACK file, i.e output.hack")
    args = parser.parse_args()
    convert_assembly_to_binary_file(args.asm_file, args.binary_file)

Here’s what a translation for the Project 06 Max program looks like:

Line Number  Before           After
0            @R0              0000000000000000
1            D=M              1111110000010000
2            @R1              0000000000000001
3            D=D-M            1111010011010000
4            @OUTPUT_FIRST    0000000000001010
5            D;JGT            1110001100000001
6            @R1              0000000000000001
7            D=M              1111110000010000
8            @OUTPUT_D        0000000000001100
9            0;JMP            1110101010000111
10           (OUTPUT_FIRST)
11           @R0              0000000000000000
12           D=M              1111110000010000
13           (OUTPUT_D)
14           @R2              0000000000000010
15           M=D              1110001100001000
16           (INFINITE_LOOP)
17           @INFINITE_LOOP   0000000000001110
18           0;JMP            1110101010000111

The full specification for the nand2tetris HACK machine language can be found in the Project 6 materials on the course website.

Let me know if you have any questions!

Scraping Roger Ebert’s reviews and finding his favorite movies on Amazon Prime

My wife and I are big fans of the late film critic Roger Ebert. We also share an Amazon Prime membership.

I wondered: which of Roger Ebert’s favorite movies are available to watch for free on Prime? Since there are hundreds of reviews by Roger Ebert, I had the perfect excuse for writing a web scraper!

In this article, I will:

  • Show my not so pretty scraping code
  • Discuss some roadblocks / gotchas I ran into along the way
  • Share with you the list of movies rated as great by Roger Ebert. That’s what you’re here for, right?

PS: If you just want to see the list of movies, just jump to the end of this article.

Code Quality Warning: I hacked this together as fast as I could without much refactoring, so it’s not the most readable or optimized. But it mostly works… for now.

Roadblocks

I hit a few roadblocks while working on this that I think are worth calling out and will clarify some of the decisions I made in the implementation.

scraping rogerebert.com

Performing a regular GET without an Accept: application/json header (the requests library sends Accept: */* by default) against the URL assigned to the variable ebert_url will always return the first page of movies, regardless of what you set the page query parameter to.

Solution? The Accept header field needs to be set to application/json for the server to return JSON containing movies for that specific page.
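
Here’s a minimal sketch of building such a request. Two assumptions to flag: EXAMPLE_URL is a placeholder, since the real value of ebert_url isn’t reproduced in this write-up, and the sketch uses the standard library’s urllib instead of requests so that it runs without extra dependencies. It only constructs the request and doesn’t send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint: stands in for the real ebert_url, which is not shown here
EXAMPLE_URL = "https://example.com/reviews"

def build_page_request(base_url, page):
    """Build a GET request for one page of results that asks for JSON."""
    query = urlencode({"page": page})
    return Request(
        f"{base_url}?{query}",
        headers={"Accept": "application/json"},  # without this, page 1 always comes back
    )

request = build_page_request(EXAMPLE_URL, 2)
print(request.full_url)              # prints: https://example.com/reviews?page=2
print(request.get_header("Accept"))  # prints: application/json
```

Passing the request to urllib.request.urlopen and parsing the body with json.loads would then give the movies for that page, assuming the server behaves as described above.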

scraping amazon.com

No public API

First, there is no publicly available Amazon API for their catalog search. It seems like you could email them to get authorization, but I didn’t want to waste my time doing that.

Not automation friendly

I started off using the requests library. Turns out that if you don’t set a proper browser user agent, you’ll get a 503 and some message about how automation isn’t welcome. If you do fake a proper agent but you’re not setting cookies from the server response, you’ll get:

Sorry, we just need to make sure you’re not a robot. For best results, please make sure your browser is accepting cookies.

I got frustrated and switched over to using a more stateful HTTP tool: mechanize.

That worked… 80% of the time? I noticed that if I run my scraper repeatedly it starts to get the anti-robot message again. Maybe there’s some pattern detection going on on the Amazon servers?

Bad HTML …

You’ll notice that I’m using some regex in the function amazon_search to parse out the movie title search results on the page. The reason is that when I tried using BeautifulSoup’s find_all function on the search result tags, I got nothing. My guess is that there’s some invalid HTML on the page that confused BeautifulSoup’s html.parser parser, which isn’t super lenient.

Turns out, rather than using regex, I could have switched over to use the html5lib parser.

For example: BeautifulSoup(match, features="html5lib").

The html5lib parser is the most lenient parser – much more lenient than html.parser. So if I needed to make additional changes to this function, I’d refactor it to use that parser and get rid of the nasty looking regex.

Results

Without further ado, here’s a table of all the great movies that are included with Prime (sorted by most recent release).

If you want the full dataset, I’ve shared it via this google spreadsheet.

Title                             Year Released  Review URL  Prime URL
Moonstruck                        1987           Link        Link
Fitzcarraldo                      1982           Link        Link
Atlantic City                     1980           Link        Link
Nosferatu the Vampyre             1979           Link        Link
The Long Goodbye                  1973           Link        Link
Aguirre, the Wrath of God         1972           Link        Link
The Good, the Bad and the Ugly    1968           Link        Link
Gospel According to St. Matthew   1964           Link        Link
The Man Who Shot Liberty Valance  1962           Link        Link
Some Like It Hot                  1959           Link        Link
Paths of Glory                    1957           Link        Link
The Sweet Smell of Success        1957           Link        Link
The Night of the Hunter           1955           Link        Link
Johnny Guitar                     1954           Link        Link
Beat the Devil                    1954           Link        Link
Sunset Boulevard                  1950           Link        Link
It’s a Wonderful Life             1946           Link        Link
Detour                            1945           Link        Link
My Man Godfrey                    1936           Link        Link
The General                       1927           Link        Link

Enjoy.

Update (2020-6-10)

Lots of really neat discussion happened when I submitted this to Hacker News. I’ll just highlight a few additional resources / things I learned that are useful.

And, of course, that there are fans of Roger Ebert everywhere. I’m glad some of you found this useful. Thank you.