Wednesday, April 4, 2012

Bookies Project

For our group project, my group (Bookies) is comparing books on the New York Times bestseller list (i.e., books that people are reading) to books that have New York Times book reviews written about them (i.e., books that the New York Times thinks are worthy of a review).  We want to see whether any particular pieces of data correlate strongly between these two groups of books (and the implicit third group: books that are both bestsellers and have New York Times book reviews).

The data that we collected came from three main sources.  First, we downloaded all of the New York Times bestseller lists from 2008 to the present, extracted the relevant information about each book on each list, and put that information into a database.  This was relatively straightforward: the API is set up to download a particular week's bestseller list, so we just had to iterate through all the weeks in our data set, which was pretty simple, since Python is a magical language and everything is built in.  After the download finished, we consolidated the data so that we had only one entry per book (by joining entries with the same ISBN), and we put the book's performance on the list over time into that single entry (to make future analysis more manageable).
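To make the two steps concrete, here is a minimal sketch of the week-by-week iteration and the ISBN consolidation.  It makes no API calls; the function names and the entry fields ("isbn", "title", "week", "rank") are made up for illustration and are not the API's actual field names or our exact code.

```python
from datetime import date, timedelta


def weekly_dates(start, end):
    """Yield one date per week from start through end (inclusive)."""
    current = start
    while current <= end:
        yield current
        current += timedelta(weeks=1)


def consolidate_by_isbn(weekly_entries):
    """Merge per-week list entries into one record per ISBN.

    Each entry is assumed (hypothetically) to be a dict with the keys
    "isbn", "title", "week" (a date), and "rank" (an int).
    """
    books = {}
    for entry in weekly_entries:
        # Create the record the first time we see this ISBN.
        record = books.setdefault(
            entry["isbn"],
            {"isbn": entry["isbn"], "title": entry["title"], "history": []},
        )
        # Track the book's position on the list over time.
        record["history"].append((entry["week"], entry["rank"]))
    for record in books.values():
        record["history"].sort()  # chronological order
    return books
```

In a real run, each date from weekly_dates would be passed to whatever function downloads that week's list, and the combined results would be fed to consolidate_by_isbn.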

Next, we searched through all of the New York Times book reviews from the same time period using the New York Times Article Search API.  This proved more challenging than we initially expected.  The book review results also include articles about the bestseller lists that aren't actually book reviews, and extracting the book being reviewed from an article was difficult.  We eventually came up with a system based on the basic format of New York Times book reviews.  We realized that all of the reviews start with the title of the book, so using the "works mentioned" piece of information, we find the work mentioned that matches the beginning of the article and assume that it is the actual book being reviewed.
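The matching idea can be sketched roughly like this.  It assumes we already have the article text and the list of works mentioned as plain strings; the function names and the normalization rules (lowercasing, dropping punctuation) are illustrative guesses, not necessarily what our final code does.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def reviewed_book(article_text, works_mentioned):
    """Return the work whose title opens the article, or None if none does."""
    opening = normalize(article_text)
    matches = []
    for work in works_mentioned:
        title = normalize(work)
        if not title:
            continue  # skip empty titles, which would match everything
        # Require a whole-word match at the very start of the article.
        if opening == title or opening.startswith(title + " "):
            matches.append(work)
    # If several titles match, prefer the longest (most specific) one.
    return max(matches, key=len) if matches else None
```

For example, with a made-up article that begins "A MADE-UP NOVEL By Jane Doe" and works mentioned of "Some Other Book" and "A Made-Up Novel", this picks "A Made-Up Novel".  An article that returns None is a candidate for being something other than a review.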

Most of the challenges we encountered in this data collection came from poorly documented APIs.  We spent about 45 minutes trying to figure out whether there was any way to search for book reviews, and then almost an hour trying different permutations of the words "Book Review Desk" to figure out how to download just book reviews from our article search.  Unfortunately, the only solution for these problems was to keep trying things, but perseverance won out in the end :)