
Scraping Roger Ebert’s reviews and finding his favorite movies on Amazon Prime

My wife and I are big fans of the late film critic Roger Ebert. We also share an Amazon Prime membership.

I wondered: which of Roger Ebert’s favorite movies are available to watch for free on Prime? Since there are hundreds of reviews by Roger Ebert, I had the perfect excuse for writing a web scraper!

In this article, I will:

  • Show my not so pretty scraping code
  • Discuss some roadblocks / gotchas I ran into along the way
  • Share with you the list of movies rated as great by Roger Ebert. That’s what you’re here for, right?

PS: If you just want to see the list of movies, jump to the end of this article.

Code Quality Warning: I hacked this together as fast as I could without much refactoring, so it’s not the most readable or optimized. But it mostly works… for now.

Roadblocks

I hit a few roadblocks while working on this that I think are worth calling out, since they explain some of the decisions I made in the implementation.

scraping rogerebert.com

Performing a regular GET with an Accept: text/html header against the URL assigned to the variable ebert_url will always return the first page of movies, regardless of what you set the page query parameter to. (The requests library's default is actually Accept: */*, which appears to get the same treatment.)

Solution? The Accept header field needs to be set to application/json for the server to return JSON containing movies for that specific page.
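Here's a minimal sketch of that fix using requests. The helper names page_request_kwargs and fetch_page are made up for illustration, and I'm assuming the query parameter is literally named page and that the response body parses as JSON. The shape of that JSON isn't shown here.

```python
def page_request_kwargs(page):
    """Build the keyword arguments to pass to requests.get for one page."""
    return {
        # Assumption: the paging query parameter is literally named "page".
        "params": {"page": page},
        # Ask for JSON explicitly; otherwise the server keeps returning page 1.
        "headers": {"Accept": "application/json"},
    }


def fetch_page(ebert_url, page):
    """Fetch one page of reviews and return the decoded JSON body."""
    # Third-party dependency, imported here so the helper above
    # can be used (and inspected) without it installed.
    import requests

    response = requests.get(ebert_url, timeout=30, **page_request_kwargs(page))
    response.raise_for_status()
    return response.json()
```

Calling fetch_page(ebert_url, 2) should then return the second page's data rather than a repeat of the first.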

scraping amazon.com

No public API

First, there is no publicly available Amazon API for their catalog search. It seems like you could email them to get authorization, but I didn’t want to waste my time doing that.

Not automation friendly

I started off using the requests library. Turns out that if you don’t set a proper browser User-Agent, you’ll get a 503 and some message about how automation isn’t welcome. If you do fake a proper User-Agent but you’re not sending back the cookies set in the server’s response, you’ll get:

Sorry, we just need to make sure you’re not a robot. For best results, please make sure your browser is accepting cookies.

I got frustrated and switched over to using a more stateful HTTP tool: mechanize.

That worked… 80% of the time? I noticed that if I ran my scraper repeatedly, it started getting the anti-robot message again. Maybe there’s some pattern detection going on on the Amazon servers?
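To make the "stateful" part concrete, here's a rough standard-library sketch of what a tool like mechanize handles for you: a cookie jar that remembers what the server sets, plus a browser-like User-Agent on every request. This is not the mechanize code from my scraper. The name make_stateful_opener and the example User-Agent string are assumptions for illustration, and nothing here gets around the blocking that kicks in on repeated runs.

```python
import http.cookiejar
import urllib.request


def make_stateful_opener(user_agent):
    """Return (opener, jar): an opener that stores and replays cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        # Saves cookies from responses and sends them on later requests.
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace the default "Python-urllib/x.y" agent with a browser-like one.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


# Example User-Agent string; any realistic browser string is an assumption.
opener, jar = make_stateful_opener("Mozilla/5.0 (X11; Linux x86_64)")
# opener.open(some_url) would now send the User-Agent above and reuse
# whatever cookies earlier responses stored in `jar`.
```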

Bad HTML …

You’ll notice that I’m using some regex in the function amazon_search to parse out the movie title search results on the page. The reason is that when I tried using BeautifulSoup’s find_all function on the search result tags, I got nothing. My guess is that there’s some invalid HTML on the page that confused BeautifulSoup’s html.parser parser, which isn’t super lenient.

Turns out, rather than using regex, I could have switched over to use the html5lib parser.

For example: BeautifulSoup(match, features="html5lib").

The html5lib parser is the most lenient parser, much more lenient than html.parser. So if I needed to make additional changes to this function, I’d refactor it to use that parser and get rid of the nasty-looking regex.
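A sketch of what that refactor might look like. The function name find_titles, the tag and class arguments, and the sample markup are all hypothetical, since Amazon's real search-result markup isn't reproduced here. Note that the html5lib parser needs the separate html5lib package installed alongside beautifulsoup4.

```python
from bs4 import BeautifulSoup


def find_titles(html, tag, css_class, features="html5lib"):
    """Return the text of every <tag class=css_class> element in html."""
    soup = BeautifulSoup(html, features=features)
    return [
        node.get_text(strip=True)
        for node in soup.find_all(tag, class_=css_class)
    ]


# Hypothetical markup with an unclosed <div>, standing in for a messy page.
sample = '<div><h2 class="title">Moonstruck</h2><h2 class="title">Detour</h2>'

if __name__ == "__main__":
    # Uses html5lib by default, so that package must be installed.
    print(find_titles(sample, "h2", "title"))
```

Keeping the parser as a parameter makes it easy to compare html5lib against html.parser on the same page and see which one actually finds the tags.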

Results

Without further ado, here’s a table of all the great movies that are included with Prime (sorted by most recent release).

If you want the full dataset, I’ve shared it via this Google spreadsheet.

Title | Year Released | Review URL | Prime URL
Moonstruck | 1987 | Link | Link
Fitzcarraldo | 1982 | Link | Link
Atlantic City | 1980 | Link | Link
Nosferatu the Vampyre | 1979 | Link | Link
The Long Goodbye | 1973 | Link | Link
Aguirre, the Wrath of God | 1972 | Link | Link
The Good, the Bad and the Ugly | 1968 | Link | Link
Gospel According to St. Matthew | 1964 | Link | Link
The Man Who Shot Liberty Valance | 1962 | Link | Link
Some Like It Hot | 1959 | Link | Link
Paths of Glory | 1957 | Link | Link
The Sweet Smell of Success | 1957 | Link | Link
The Night of the Hunter | 1955 | Link | Link
Johnny Guitar | 1954 | Link | Link
Beat the Devil | 1954 | Link | Link
Sunset Boulevard | 1950 | Link | Link
It’s a Wonderful Life | 1946 | Link | Link
Detour | 1945 | Link | Link
My Man Godfrey | 1936 | Link | Link
The General | 1927 | Link | Link

Enjoy.

Update (2020-6-10)

Lots of really neat discussion happened when I submitted this to Hacker News. I’ll just highlight a few additional resources / things I learned that are useful.

And, of course, that there are fans of Roger Ebert everywhere. I’m glad some of you found this useful. Thank you.
