My wife and I are big fans of the late film critic Roger Ebert. We also share an Amazon Prime membership.
I wondered: which of Roger Ebert’s favorite movies are available to watch for free on Prime? Since there are hundreds of reviews by Roger Ebert, I had the perfect excuse for writing a web scraper!
In this article, I will:
- Show my not-so-pretty scraping code
- Discuss some roadblocks / gotchas I ran into along the way
- Share with you the list of movies rated as great by Roger Ebert. That’s what you’re here for, right?
PS: If you just want to see the list of movies, jump to the end of this article.
Code Quality Warning: I hacked this together as fast as I could without much refactoring, so it’s not the most readable or optimized. But it mostly works… for now.
Roadblocks
I hit a few roadblocks while working on this. They are worth calling out because they explain some of the decisions I made in the implementation.
Scraping rogerebert.com
Performing a regular GET with an Accept: text/html header (which I assumed was the default for the requests library, although it actually sends Accept: */* unless told otherwise) against the URL assigned to the variable ebert_url will always return the first page of movies, regardless of what you set the page query parameter to.
Solution? The Accept header field needs to be set to application/json for the server to return JSON containing the movies for that specific page.
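Here is a minimal sketch of that fix. It uses the standard library's urllib.request so it runs without extra dependencies; with requests, the equivalent is passing headers={"Accept": "application/json"} and params={"page": n} to requests.get. The URL is a placeholder, not the real value of ebert_url, and the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder endpoint: the real value of ebert_url is not shown in this article.
EBERT_URL = "https://example.com/great-movies"


def build_page_request(page, base_url=EBERT_URL):
    """Build a GET request for one page of results, asking for JSON."""
    url = f"{base_url}?{urlencode({'page': page})}"
    # Without this header, the server described above ignores the page parameter.
    return Request(url, headers={"Accept": "application/json"})


def fetch_page(page):
    """Send the request and decode the JSON body (performs a network call)."""
    with urlopen(build_page_request(page)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the request without sending it.
req = build_page_request(3)
print(req.full_url)
print(req.get_header("Accept"))
```

Building the request separately from sending it makes it easy to check the header and query string before hitting the server.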
Scraping amazon.com
No public API
First, there is no publicly available Amazon API for their catalog search. It seems like you could email them to get authorization, but I didn’t want to waste my time doing that.
Not automation friendly
I started off using the requests library. Turns out that if you don’t set a realistic browser User-Agent header, you’ll get a 503 and some message about how automation isn’t welcome. If you do fake a proper User-Agent but don’t send back the cookies set in the server’s response, you’ll get:
Sorry, we just need to make sure you’re not a robot. For best results, please make sure your browser is accepting cookies.
I got frustrated and switched over to using a more stateful HTTP tool: mechanize.
That worked… 80% of the time? I noticed that if I ran my scraper repeatedly, it started getting the anti-robot message again. Maybe there’s some pattern detection going on on Amazon’s servers?
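To illustrate what "stateful" means here, this is a rough sketch of a client that sends a browser-like User-Agent and keeps cookies between requests. It is not the mechanize code I actually used: it is a standard-library stand-in, and the User-Agent string and helper names are assumptions chosen for the example.

```python
import http.cookiejar
import urllib.request

# Assumed User-Agent string; any realistic browser value serves the same purpose.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0"


def make_stateful_opener(user_agent=BROWSER_UA):
    """Return an opener that sends a browser-like User-Agent and keeps cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replaces the default "Python-urllib/x.y" agent on every request.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def looks_like_robot_check(page_html):
    """Detect the anti-automation page by a phrase from the message quoted above."""
    return "not a robot" in page_html


opener, jar = make_stateful_opener()
# Cookies set by one response are stored in `jar` and sent on later requests:
# page_html = opener.open(search_url).read().decode("utf-8", "replace")
# (search_url is a hypothetical variable holding the search page address.)
```

Checking each response with something like looks_like_robot_check lets the scraper stop or back off instead of silently parsing the wrong page.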
Bad HTML …
You’ll notice that I’m using some regex in the function amazon_search to parse out the movie title search results on the page. The reason is that when I tried using beautifulsoup’s find_all function on the search result tags, I got nothing. My guess is that some invalid HTML on the page confused beautifulsoup’s html.parser parser, which isn’t super lenient.
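The general idea behind the regex approach can be sketched as follows. The markup below is hypothetical (the real Amazon search page uses different tags and class names), and this is not the actual body of amazon_search; it only shows why a regex still works when a tag is left unclosed.

```python
import html
import re

# Hypothetical markup: the real Amazon search page uses different tags and classes.
# Note the second result is missing its closing </div>.
SAMPLE_HTML = """
<div class="result"><span class="title">Moonstruck</span></div>
<div class="result"><span class="title">It&#39;s a Wonderful Life</span>
<div class="result"><span class="title">Detour</span></div>
"""

# Non-greedy match so each title stops at its own closing tag.
TITLE_RE = re.compile(r'<span class="title">(.*?)</span>', re.DOTALL)


def extract_titles(page_html):
    """Pull title text out with a regex, without building a document tree."""
    return [html.unescape(m.strip()) for m in TITLE_RE.findall(page_html)]


print(extract_titles(SAMPLE_HTML))
```

The trade-off is that the pattern is tied to the exact markup, so any change to the page layout breaks it.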
Turns out, rather than using regex, I could have switched over to the html5lib parser. For example: BeautifulSoup(match, features="html5lib"). The html5lib parser is the most lenient parser, much more lenient than html.parser. So if I needed to make additional changes to this function, I’d refactor it to use that parser and get rid of the nasty-looking regex.
Results
Without further ado, here’s a table of all the great movies that are included with Prime (sorted by most recent release).
If you want the full dataset, I’ve shared it via this Google spreadsheet.
Title | Year Released | Review URL | Prime URL |
---|---|---|---|
Moonstruck | 1987 | Link | Link |
Fitzcarraldo | 1982 | Link | Link |
Atlantic City | 1980 | Link | Link |
Nosferatu the Vampyre | 1979 | Link | Link |
The Long Goodbye | 1973 | Link | Link |
Aguirre, the Wrath of God | 1972 | Link | Link |
The Good, the Bad and the Ugly | 1968 | Link | Link |
Gospel According to St. Matthew | 1964 | Link | Link |
The Man Who Shot Liberty Valance | 1962 | Link | Link |
Some Like It Hot | 1959 | Link | Link |
Paths of Glory | 1957 | Link | Link |
The Sweet Smell of Success | 1957 | Link | Link |
The Night of the Hunter | 1955 | Link | Link |
Johnny Guitar | 1954 | Link | Link |
Beat the Devil | 1954 | Link | Link |
Sunset Boulevard | 1950 | Link | Link |
It’s a Wonderful Life | 1946 | Link | Link |
Detour | 1945 | Link | Link |
My Man Godfrey | 1936 | Link | Link |
The General | 1927 | Link | Link |
Enjoy.
Update (2020-6-10)
Lots of really neat discussion happened when I submitted this to Hacker News. I’ll highlight a few additional resources and things I learned that are useful.
- More streaming info on the rogerebert site itself: https://www.rogerebert.com/features/where-to-find-roger-eberts-great-movies-streaming
- requests.session() is another way to get a more stateful HTTP client
- Available movies on Amazon can differ substantially between countries! I did not know that. The list I made has only been tested with the U.S. Amazon.
- Roger Ebert biopic: https://www.rogerebert.com/reviews/life-itself-2014
And, of course, that there are fans of Roger Ebert everywhere. I’m glad some of you found this useful. Thank you.