Wednesday, April 4, 2012

Bookies Project

For our group project, my group (Bookies) is comparing books on the New York Times bestseller list (i.e., books that people are reading) to books that have New York Times book reviews written about them (i.e., books that the New York Times thinks worthy of a review).  We want to see whether any particular pieces of data about the books correlate strongly between these two groups (and the implicit third group, books that both are bestsellers and have New York Times book reviews).

The data that we collected came from three main sources.  First, we downloaded all of the New York Times bestseller lists from 2008 to the present, extracted the relevant information about each book in the list, and then put that information into a database.  This was relatively straightforward: the API is set up to download a particular week's bestseller list, so we just had to iterate through all the weeks in our data set.  That was pretty simple, since Python is a magical language and everything is built-in.  After we had completed the downloading, we consolidated the data so that we had only one entry per book (which we did by joining up entries with the same ISBN), and we put the information about each book's performance on the list over time into that single entry (to make it more manageable for later analysis).
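The week-by-week iteration and the ISBN consolidation described above can be sketched roughly like this.  It's only an illustration, not our actual script: the function names and the dictionary keys ('isbn', 'title', 'week', 'rank') are made up for the example, and the call that actually downloads a week's list from the API is left out.

```python
from datetime import date, timedelta


def weekly_dates(start, end):
    """Yield one date per week from start through end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(weeks=1)


def consolidate_by_isbn(entries):
    """Merge weekly list entries into a single record per ISBN.

    Each entry is assumed to be a dict with the hypothetical keys
    'isbn', 'title', 'week' (a date), and 'rank'.
    """
    books = {}
    for entry in entries:
        # Create the book's record the first time its ISBN shows up.
        book = books.setdefault(
            entry["isbn"],
            {"isbn": entry["isbn"], "title": entry["title"], "history": []},
        )
        # Keep the book's performance over time in that one record.
        book["history"].append((entry["week"], entry["rank"]))
    for book in books.values():
        book["history"].sort()  # chronological order
    return books
```

Each week's downloaded entries would be appended to one big list, and consolidate_by_isbn would then be run once over the whole list.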

Next, we searched through all of the New York Times book reviews from the same time period using the New York Times Article Search API.  This proved more challenging than we initially expected.  The search results for book reviews also include some articles about the bestseller lists that aren't actually book reviews, and extracting any information about which book was being reviewed was difficult.  We eventually came up with a system based on the basic format of New York Times book reviews.  We realized that all of the reviews start with the title of the book, so by using the "works mentioned" piece of information, we can find the work mentioned that matches the beginning of the article and assume that it's the actual book being reviewed.
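Here is a minimal sketch of that matching idea.  The function names are hypothetical, and it assumes the article text and the list of "works mentioned" titles have already been pulled out of the API response as plain strings.  Ignoring case and extra whitespace, and preferring the longest matching title, are guesses added for the example rather than something the review format guarantees.

```python
def normalize(text):
    """Lowercase and collapse whitespace so small formatting differences don't matter."""
    return " ".join(text.lower().split())


def find_reviewed_title(article_text, works_mentioned):
    """Return the mentioned work whose title opens the article, or None."""
    opening = normalize(article_text)
    matches = [
        title
        for title in works_mentioned
        if title.strip() and opening.startswith(normalize(title))
    ]
    if not matches:
        # Probably not a real review (e.g., an article about the bestseller lists).
        return None
    # If one title is a prefix of another, take the longer, more specific one.
    return max(matches, key=lambda title: len(normalize(title)))
```

An article for which this returns None can be set aside, which also helps filter out the results that aren't really book reviews.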

Most of the challenges we encountered in this data collection were because of poorly documented APIs.  We spent about 45 minutes trying to figure out if there was any way to search for book reviews, and then almost an hour trying different permutations of the words "Book Review Desk" to figure out how to download just book reviews from our article search.  Unfortunately, the only solution for these problems was to keep trying more variations, but perseverance won out in the end :)

Friday, February 3, 2012

The Census: Metropolitan Area Populations

The US Census provides data on the estimated size of metropolitan and micropolitan areas over time.  While the Census collects exact data once every ten years, this information is estimated for the years when the Census did not collect data.  I enjoyed looking at this data set because enough of the listed cities and towns are familiar to me that I'm interested in just seeing which cities are grouped together, and how they choose to list the cities.  (The San Diego metropolitan area is listed as San Diego-Carlsbad-San Marcos.  I'm left curious: does this include Oceanside, the large city north of Carlsbad that adjoins the naval base?  Why did they choose to list Carlsbad, which, while a perfectly lovely town, isn't any more populous than any town adjoining it?  Who made this decision?)  While these questions can't be answered by the data in the table, the data can be used to find other correlations between absolute population (or between population growth or decline) and other data sets.

[Image: the first few rows of the available data]

The numerical variable is the estimated population of the metropolitan area in a given year.  The obvious categorical variable is the city and state in which each metropolitan area is located; this can be extended to discuss metropolitan areas in a given region (like New England or the Southwest).  The year is another categorical variable.  Since an estimate is given for each year, this data can be graphed as a time series.

This data would be most interesting if analyzed in conjunction with other data; for instance, I would be interested to see if there is any correlation between economic measures (median income, percent of population below the poverty line, job growth or lack thereof) and the population, or between these same measures and the change in population over time (e.g., a correlation between cities with high job loss and cities whose populations are decreasing over time).  I would also want to compare it with other measures of the general well-being of people in a city, like happiness, unemployment rates, age distributions, and education levels.  This could reveal how population relates to these other variables.
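As a rough sketch of how one of these comparisons could be computed: take the percent change in each area's population between two years and correlate it with an economic measure for the same areas.  The Pearson correlation coefficient used here is just one reasonable choice of measure, the function names are made up for the example, and the economic data would have to come from some other source, since it isn't in this table.

```python
from math import sqrt


def percent_change(first, last):
    """Percent change from the first population estimate to the last."""
    return (last - first) / first * 100.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)
```

With one percent_change value and one job-growth value per metropolitan area, a result from pearson near 1 or -1 would suggest a strong linear relationship, and a result near 0 would suggest little or none.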

Welcome!

Welcome to my blog for CS349B, Quantifying the World!  I'll be posting writeups relevant to what I'm doing in the class.