Chicago Bikeability

Bicycling and walking are two methods of transportation with significant benefits over cars or even public transportation. The costs of supporting the necessary infrastructure and individuals' costs of travel are lower, and higher activity levels are linked to better health and wellbeing. It has become common in the last decade for urban neighborhoods to promote themselves with a walk score, indicating how easy it is to live in that neighborhood with walking as a primary mode of transportation. These scores typically rely on the availability and accessibility of sidewalks, shops, grocery stores, parks, etc.

This project aims to understand the socioeconomic factors that lead people to choose bicycling and to use that knowledge to develop a "bikeability" score for neighborhoods in the city of Chicago.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
sns.set_context("notebook")
plt.style.use('ggplot')

Obtaining the data

The American Community Survey provides tabular data on a number of socioeconomic features at a very small spatial scale, with over 2500 "block groups" within the Chicago city limits. The data include a question on the method used for commuting to work, including by bicycle. The ACS report Modes Less Traveled - Bicycling and Walking to Work in the United States explores this data at the national level and uncovers some very useful trends that we will look for in the Chicago-centric data:

  • Men commute by bicycle more often than women (0.8% to 0.3%, respectively)
  • The rate of commuting steadily drops with age, from ~1% for workers aged 16-24 to 0.3% for workers 55 and older
  • Workers with a graduate or professional degree were 3 times as likely to commute by bicycle as workers with only a high school diploma.
  • Households with income less than \$25,000 were 2-3 times more likely to commute by bicycle

This suggests two different populations of bicycle commuters: those with low education and income, for whom bicycling is an economical choice, and those with high education and income. In both cases young men in households without children predominate.

Initial data exploration

In this section we will load the commuting dataset for Chicago and perform some simple cleaning and analysis to ensure it is suitable for this project. The ACS is downloadable at the census block group level for individual counties, in this case Cook county, which contains virtually all of Chicago. We use the GEOID encoding to map block groups to census tracts and this helpful mapping of the 2010 census tracts to Chicago community areas (slightly larger than neighborhoods).
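As a concrete (hypothetical) example of the GEOID parsing used below: a block-group GEOID such as 170310839001 concatenates the 2-digit state FIPS code (17, Illinois), the 3-digit county FIPS code (031, Cook), the 6-digit census tract (083900), and a 1-digit block group (1). The [5:11] slice in the code therefore extracts the tract code used to join against the community mapping.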

In [2]:
def load_acs2013_commuting():
    """ Load the ACS 2013 survey on method of commuting to work for all of Cook county
        and restrict to those census blocks located within the city limits"""
        
    # load all block groups in Cook county
    acs = pd.read_csv("../ACS/ACS_13_5YR_B08301_with_ann.csv", header=1, index_col=0,
                      usecols=["Id2", "Estimate; Total:", "Margin of Error; Total:",
                               "Estimate; Bicycle", "Margin of Error; Bicycle"])
    # rename columns
    acs.columns = ["total", "total_error", "bicycle", "bicycle_error"]

    # parse tract codes from Id2 (GEOID) for collation
    acs['tract'] = [int(str(i)[5:11]) for i in acs.index]
    
    # load list of Chicago tracts and associated communities
    community = pd.read_csv("../communities/chicago_census_tract_to_community.csv",
                            header=0, index_col=0, comment="#")
    community.columns = ['community_id','community']
    
    # eliminate tracts not in Chicago city boundaries
    acs = acs.join(community['community'], on='tract', how='inner')
    
    # remove handful of census blocks with no working population
    # (generally non-residential locations like O'Hare and Midway airports)
    acs = acs[acs['total'] > 0]

    # Convert from ACS 90% margin of error to 68% 1-sigma S/N
    # erf(1.645/sqrt(2)) ~ 0.9
    acs[['total_error','bicycle_error']] /= 1.645

    return acs
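
A note on the final conversion above: the ACS publishes 90% margins of error, and for a normal distribution a 90% interval corresponds to \(\pm 1.645\sigma\), so the 1-sigma (68%) uncertainty is recovered as \(\sigma = \mathrm{MOE}_{90}/1.645\).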
In [3]:
acs = load_acs2013_commuting()
In [4]:
def aggregate_error(x):
    """ A margin of error of 11 represents the sample error when the estimate is 0.
        ACS recommends adding only one such value when aggregating
        see ACS "Multiyear Accuracy of the Data", however it appears to underestimate
        uncertainty for community level aggregates
    """
    #if (x == 11).any():
    #    return np.sqrt((x[x != 11]**2).sum() + 11**2)
    #else:
    return np.sqrt(x.dot(x))

def ratio_error(X, Y, Xerr, Yerr):
    """ Compute the error of the ratio of two population statistics
        where X is a subset of Y, and Xerr and Yerr are the sample 
        margins of error on X and Y respectively."""
    if (np.asarray(Y) == 0).any():
        raise ValueError("Zero value detected in numerator of ratio estimate")
    YXerr = Y**2*Xerr**2
    XYerr = X**2*Yerr**2
    # ACS recommends using standard error for proportion when
    # X is subset of Y, and normal ratio form if any terms are
    # negative (Eqs 3 & 4 in ACS "Multiyear Accuracy of the Data")
    if (XYerr > YXerr).any():    
        return np.sqrt(YXerr + XYerr)/Y**2
    else:
        return np.sqrt(YXerr - XYerr)/Y**2
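
Written out, the two branches of ratio_error are the ACS proportion and ratio approximations. With \(p = X/Y\), the proportion form is \(\sigma_p = \frac{1}{Y}\sqrt{\sigma_X^2 - p^2\,\sigma_Y^2}\) (used when the term under the root is positive), and the ratio form is \(\sigma_{X/Y} = \frac{1}{Y}\sqrt{\sigma_X^2 + p^2\,\sigma_Y^2}\) (used otherwise); these are algebraically identical to the \(\sqrt{Y^2\sigma_X^2 \mp X^2\sigma_Y^2}/Y^2\) expressions in the code.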
In [5]:
# compute the block-by-block bicycle commuter fraction and add it to the dataframe
acs['bicycle_frac'] = acs['bicycle']/acs['total']
acs['bicycle_frac_error'] = ratio_error(acs['bicycle'], acs['total'],
                                        acs['bicycle_error'],
                                        acs['total_error'])


acs['bicycle_SN'] = acs['bicycle']/acs['bicycle_error']

f, ax = plt.subplots(1,2, figsize=(12,4))
acs['bicycle_frac'].hist(ax=ax[0], bins=30)
ax[0].set_xlabel("Bicycle Commuter fraction")
ax[0].set_ylabel("count")
ax[0].set_yscale('log')

acs['bicycle_SN'].hist(ax=ax[1], bins=30)
ax[1].set_xlabel("S/N")
ax[1].set_ylabel("count")
ax[1].set_yscale('log')
ax[1].set_ylim(ax[0].get_ylim())

bicycle_total = float(acs['bicycle'].sum())
bicycle_err = np.sqrt((acs['bicycle_error']**2).sum())
total = float(acs['total'].sum())
total_err = np.sqrt((acs['total_error']**2).sum())

pop_average = bicycle_total / total
pop_average_err = ratio_error(bicycle_total, total,
                              bicycle_err, total_err)

print "Average block-by-block bicycle commuter fraction: {:.3f}".format(acs['bicycle_frac'].mean())
print "Number of blocks with measured bicycle commuters: {:d}".format((acs['bicycle'] > 0).sum())
print "Population weighted average: {:.3f} +/- {:.3f}".format(pop_average, pop_average_err)
Average block-by-block bicycle commuter fraction: 0.011
Number of blocks with measured bicycle commuters: 636
Population weighted average: 0.013 +/- 0.001

This figure shows that a large fraction of the block groups have no bicycle commuters within the margin of error, which is unsurprising given the city-wide average of \(1.3\%\). It is tempting to simply remove those blocks from the sample; however, that would introduce a correlation with the population of each block (a smaller total population increases the likelihood of no detection). There are also correlations between the \(S/N\) and the fraction at the detection threshold, with some of the largest fractions having \(S/N \sim 1\); these may simply be outliers.

In [157]:
ax = acs.plot("bicycle_frac","bicycle_SN", kind="scatter", label="Census blocks")
ax.set_ylabel("S/N")
ax.set_xlabel("Bicycle Commuter Fraction");

Aggregating into larger geographical areas

Because the data is so noisy at the census block level, we explore the possibility of aggregating geographical regions with low \(S/N\) to increase their signal.
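
The intuition, under the simplifying assumption of roughly comparable regions: when \(N\) block groups are summed, the counts add linearly while the errors add in quadrature, so \(S/N \approx \sum_i X_i / \sqrt{\sum_i \sigma_i^2} \sim \sqrt{N}\,\bar{X}/\bar{\sigma}\), i.e. aggregation buys roughly a factor of \(\sqrt{N}\) in significance.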

In [7]:
# aggregate by Census tract
acs_tract = acs.groupby('tract').agg({"total":"sum",
                                      "total_error":aggregate_error,
                                      "bicycle":"sum",
                                      "bicycle_error":aggregate_error,
                                      "community":"first"})
acs_tract['bicycle_frac'] = acs_tract['bicycle']/acs_tract['total']
acs_tract['bicycle_frac_error'] = ratio_error(acs_tract['bicycle'], 
                                              acs_tract['total'],
                                              acs_tract['bicycle_error'], 
                                              acs_tract['total_error'])
acs_tract['bicycle_SN'] = acs_tract['bicycle']/acs_tract['bicycle_error']

# aggregate by Community (name)
acs_community = acs.groupby('community').agg({"total":"sum",
                                              "total_error":aggregate_error,
                                              "bicycle":"sum",
                                              "bicycle_error":aggregate_error})
acs_community['bicycle_frac'] = acs_community['bicycle']/acs_community['total']
acs_community['bicycle_frac_error'] = ratio_error(acs_community['bicycle'], 
                                                  acs_community['total'],
                                                  acs_community['bicycle_error'], 
                                                  acs_community['total_error'])
acs_community['bicycle_SN'] = acs_community['bicycle']/acs_community['bicycle_error']

print "Chicago communities with no measured bicycle commuters:"
acs_community[['total']][acs_community['bicycle'] == 0]
Chicago communities with no measured bicycle commuters:

Out[7]:
total
community
Ashburn 18317
Auburn Gresham 14878
Avalon Park 3845
Burnside 809
Chatham 10919
Chicago Lawn 18504
Fuller Park 683
Garfield Ridge 15385
Hegewisch 3961
O'Hare 6770
Pullman 2460
Riverdale 1321
Washington Park 2992
West Pullman 8668

By aggregating into the much larger Chicago communities, we have reduced the number of data points without bicycle commuter samples to 14 (out of 77), and a by-eye inspection shows they are mostly outlying, suburban areas where we would expect very low rates of bicycle commuting. Plotting the distribution of fractions and \(S/N\) for the census tract and Chicago community aggregations, we also see significant improvement for the aggregated samples. The \(S/N\) for Chicago communities in particular shows a nice linear trend with fraction, as you would expect if the uncertainty were approximately constant.

In [8]:
f, ax = plt.subplots(2, 2, figsize=(12,8), sharex='col')

acs_tract.plot('bicycle_frac', 'bicycle_SN', kind="scatter", 
               label="Census Tracts", ax=ax[1,0])
ax[1,0].set_ylabel("S/N")
ax[1,0].set_xlabel("Bicycle Commuter Fraction")

acs_community.plot('bicycle_frac', 'bicycle_SN', kind="scatter", 
                   label="Communities", ax=ax[1,1])
ax[1,1].set_ylabel("S/N")
ax[1,1].set_xlabel("Bicycle Commuter Fraction")

acs_tract.hist('bicycle_frac', ax=ax[0,0], bins=30)
ax[0,0].set_title("By Census Tract")
ax[0,0].set_xlabel("Bicycle Commuter Fraction", visible=True)
plt.setp(ax[0,0].get_xticklabels(), visible=True)
ax[0,0].set_ylabel("# Census Tracts")
ax[0,0].set_yscale('log')

acs_community.hist('bicycle_frac', ax=ax[0,1], bins=30)
ax[0,1].set_title("By Chicago Community")
ax[0,1].set_xlabel("Bicycle Commuter Fraction", visible=True)
plt.setp(ax[0,1].get_xticklabels(), visible=True)
ax[0,1].set_ylabel("# Communities");
ax[0,1].set_yscale('log')

It would be a shame to coarsen the ACS data to the community scale in regions where the measurements are statistically significant at census tract or even block scales. Instead we keep data at the census tract level when all tracts in a community have data, and from the block level when all blocks in a tract have data. In principle we should make a further restriction on the \(S/N\) of those regions.

Note that this aggregation is dangerous. If the underlying distribution of bicycle commuters is not linear in the aggregated variables, this will bias any recovered trends. Preferentially aggregating regions with low probability also creates a bias (known as Malmquist bias in astronomy), whereby low counts of bicycle commuters are preferentially retained from census block groups with higher-than-average bicycle commuting rates (allowing them to remain in the sample). This can be somewhat ameliorated by fitting not raw counts but population fractions, which also helps eliminate the underlying correlations between variables proportional to the total population.

The final binned sample is computed below.

In [9]:
divide_communities = acs_community[acs_tract.groupby('community')['bicycle'].apply(lambda x: x.all())].index
keep_communities = acs_community[~acs_community.index.isin(divide_communities)]
print "Number of communities:", len(keep_communities)

divide_tracts = acs_tract[(acs.groupby('tract')['bicycle'].apply(lambda x: x.all())) & 
                          (acs_tract['community'].isin(divide_communities))].index
keep_tracts = acs_tract[~acs_tract.index.isin(divide_tracts) &
                        acs_tract['community'].isin(divide_communities)].drop(['community'], axis=1)
print "Number of census tracts:", len(keep_tracts)

keep_blocks = acs[acs['tract'].isin(divide_tracts)].drop(['tract','community'], axis=1)
print "Number of census blocks:", len(keep_blocks)

print "Total regions included in sample:", len(keep_blocks)+len(keep_tracts)+len(keep_communities)

acs_sample = pd.concat([keep_blocks, keep_tracts, keep_communities])
Number of communities: 69
Number of census tracts: 41
Number of census blocks: 109
Total regions included in sample: 219

Bicycle commuting fraction

Before proceeding to the analysis, it's always nice (and a good idea) to plot the data after cleaning. The Census Bureau kindly provides cartographic boundary shapefiles for census blocks in KML format. The following uses pyKML to parse the boundary database for Illinois by GEOID (identifying unique census block groups), then merges them into census tract and Chicago community level boundaries using shapely. Finally, we plot all regions kept in the sample, colored by the sampled bicycle commuter fraction.

In [10]:
from itertools import ifilter
from shapely.geometry import Polygon
from shapely.ops import unary_union
from zipfile import ZipFile
from pykml import parser as kml_parser

def load_block_shapes():
    blocks = {}
    with ZipFile("../boundaries/cb_2013_17_bg_500k.kmz") as zf:
        with zf.open("cb_2013_17_bg_500k.kml") as f:
            doc = kml_parser.parse(f).getroot().Document
            for p in doc.Folder.Placemark:
                geoid, = ifilter(lambda x: 'name' in x.attrib and x.attrib['name'] == 'GEOID', p.iter())
                blocks[int(geoid.text)] = [tuple(float(i) for i in triplet.split(',')[0:2])
                                 for triplet in p.Polygon.outerBoundaryIs.LinearRing.coordinates.text.split()]
    return blocks

def create_merged_shapes():
    block_polygons = load_block_shapes()

    polygons = {}
    for block in acs.index:
        polygons[block] = Polygon(block_polygons[block])

    for tract in acs_tract.index:
        polygons[tract] = unary_union([Polygon(block_polygons[block]) for block in acs[acs['tract']==tract].index])

    for community in acs_community.index:
        polygons[community] = unary_union([Polygon(block_polygons[block])
                                           for tract in acs_tract[acs_tract['community']==community].index
                                           for block in acs[acs['tract']==tract].index])
    return polygons

def load_bike_routes():
    routes = []
    with open("../chicagodata/Bike Routes.kml") as f:
        doc = kml_parser.parse(f).getroot().Document
        for p in doc.Placemark:
            try:
                route_type = re.search('<span class="atr-name">TYPE</span>:</strong> <span class="atr-value">(\d*)</span>', p.description.text).groups(0)[0]
                #bikeroute = re.search('<span class="atr-name">BIKEROUTE</span>:</strong> <span class="atr-value">([A-Z\- ]*)</span>', p.description.text).groups(0)[0]
            except:
                route_type = 0
                print p.description
                
            if route_type in ['13', '1', '3', '9', '2', '7', '45', '4', '5']:
                try:
                    routes.append([[float(i) for i in x.split(',')]
                                   for x in p.MultiGeometry.LineString.coordinates.text.split()])
                except:
                    print "Error parsing route", p.description
    return routes
In [11]:
# a dictionary mapping sample (block, tract, or community) to shape
polygons = create_merged_shapes()
In [12]:
from matplotlib.collections import PatchCollection
from matplotlib.colors import Normalize
from descartes import PolygonPatch

def plot_regions(regions, fill_value=None,
                 plot_communities=False, cm=plt.cm.Blues, 
                 vmin=None, vmax=None, label=None, ec='k', lw=0.5,
                 plot_bike_routes=False):
    if not fill_value is None:
        if not vmax:
            vmax = fill_value.max()
        if not vmin:
            vmin = fill_value.min()
        norm = Normalize(vmin=vmin, vmax=vmax) 
        # hack to get a colorbar without a mappable
        Z = [[0,0],[0,0]]
        fcm = plt.figure()
        levels = np.linspace(vmin,vmax,100)
        contours = plt.contourf(Z, levels, cmap=cm)
        fcm.clf()

    f = plt.figure(figsize=(12,12))
    ax = f.add_axes([0,0,.9,1])
    ax.set_xlim(-87.89,-87.5)
    ax.set_ylim(41.6,42.05)
    ax.set_aspect(1)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    plt.axis('off')

    if plot_communities:
        patches = [PolygonPatch(polygons[region], fill=0, ec='k')
                   for region in acs_community.index]
    else:
        patches = []
    
    if not fill_value is None:
        patches.extend([PolygonPatch(polygons[r], fill=1, ec=ec, lw=lw,
                                     fc=cm(norm(fill_value[r])))
                        for r in regions])
    else:
        patches.extend([PolygonPatch(polygons[r], fill=0, ec='k')
                        for r in regions])

    ax.add_collection(PatchCollection(patches, match_original=True))

    if plot_bike_routes:
        routes = load_bike_routes()
        for r in routes:
            x, y = zip(*[(x[0],x[1]) for x in r])
            ax.plot(x, y, color='c', lw=1)
    
    if not fill_value is None:
        f.colorbar(contours, cmap=cm, shrink=0.6, label=label, norm=norm)
        # close the dummy figure used only to create the colorbar mappable
        plt.close(fcm)

plot_regions(acs_sample.index, fill_value=100.*acs_sample['bicycle_frac'],
             vmin=0., label="% Workers Commute by Bicycle")
print "Largest observed bicycle commuter fraction: {:.2f}%".format(100.*acs_sample['bicycle_frac'].max())
Largest observed bicycle commuter fraction: 13.72%

A major question for the remaining analysis is whether the observed spatial variation at block and tract scales is due mostly to noise or whether it is predictable from some measurable socioeconomic factors.

Collecting feature data

Now that we've established the geographical boundaries of our dataset, we can collect additional data to define features of the model.

ACS 2013 census data

We naturally look to other population characteristics measured in the ACS 2013 dataset, since they are directly comparable in both methodology and geographic breakdown. We also expect from the ACS report on national trends that age, education level, income, and sex are all factors that we should include in our model. The ACS data are available as separate tables in CSV format, which we read and join with the bicycle commuter fraction dataset.

In [13]:
def load_acs2013_dataset(dataset, sample=None, measure_distribution=False):
    """ Load specified ACS table and match aggregation of provided the provided
        dataframe. measure_distribution computes the cumulative distribution which
        can be used to select 
    """
    acs_dataset = pd.read_csv("../ACS/ACS_13_5YR_{}_with_ann.csv".format(dataset), 
                              header=1, index_col=1)
    acs_dataset.drop(["Id","Geography"], axis=1, inplace=True)    
    acs_dataset['tract'] = [int(str(i)[5:11]) for i in acs_dataset.index]

    if measure_distribution:
        cols = list(set(acs_dataset.columns) - set(['Estimate; Total:',
                                                    'Estimate; Male:',
                                                    'Estimate; Female:',
                                                    'Estimate; With cash rent:',
                                                    'Estimate; No cash rent:']))
        cols.sort(key=acs_dataset.columns.tolist().index)
        cumulative = acs_dataset[cols].filter(like="Estimate").sum().cumsum()
        cumulative /= cumulative[-1]
        print cumulative
    
    # eliminate block groups not in Chicago city boundaries
    community = pd.read_csv("../communities/chicago_census_tract_to_community.csv",
                            header=0, index_col=0, comment="#")
    community.columns = ['community_id','community']
    acs_dataset = acs_dataset.join(community['community'], on='tract', how='inner')

    # convert margin of error to 68% 1-sigma
    acs_dataset[acs_dataset.filter(like="Margin of Error").columns] /= 1.645
    
    # aggregation functions by column type
    aggf = {}
    for c in acs_dataset.columns:
        if c.startswith("Estimate;"):
            aggf[c] = "sum"
        elif c.startswith("Margin of Error;"):
            aggf[c] = aggregate_error
        else:
            aggf[c] = "first"
    
    if not sample is None:
        # aggregate into census tracts and communities
        acs_dataset_by_tract = acs_dataset.groupby('tract').agg(aggf).drop(['tract','community'], axis=1)
        acs_dataset_by_community = acs_dataset.groupby('community').agg(aggf).drop(['tract','community'], axis=1)
        acs_dataset = acs_dataset.drop(['tract','community'], axis=1)

        # reduce dataset to bicycle commuter defined sample
        return pd.concat([acs_dataset[acs_dataset.index.isin(sample.index)],
                          acs_dataset_by_tract[acs_dataset_by_tract.index.isin(sample.index)],
                          acs_dataset_by_community[acs_dataset_by_community.index.isin(sample.index)]])
    else:
        acs_dataset = acs_dataset.drop(['tract','community'], axis=1)
        return acs_dataset

def compute_proportion(acs_df, groups, group_type, group_name, col_class="Total"):
    frac = "Frac{}; {}".format(group_type, group_name)
    frac_err = frac + " Error"
    
    cols = ["Estimate; {}: - {}".format(col_class, r) for r in groups]
    group_total = acs_df[cols].sum(axis=1)
    cols = ["Margin of Error; {}: - {}".format(col_class, r) for r in groups]
    group_total_err = acs_df[cols].apply(aggregate_error, axis=1)
    
    total = acs_df["Estimate; Total:"]
    total_err = acs_df["Margin of Error; Total:"]
 
    # compute the proportion of the group relative to the table total, with error
    acs_df[frac] = group_total / total
    acs_df[frac_err] = ratio_error(group_total, total,
                                   group_total_err, total_err)
B01001 - Sex by Age

This table includes both sex and age characteristics. While nationwide data show a significant difference in commuter fraction between male and female commuters, the region-by-region differences are minimal (no more than 10%), so we find that sex alone is not a useful feature for characterizing the bicycle commuter fraction. We combine both sexes to obtain population counts by age, and divide into approximate thirds: 18-29, 30-54, and 55+.

In [14]:
def load_acs2013_sex_by_age(data_expression="fraction", sample=acs_sample):
    acs_sex_by_age = load_acs2013_dataset("B01001", sample)
    
    if data_expression == "fraction":
        # Compute overall male fraction
        acs_sex_by_age["FracMale; all"] = acs_sex_by_age["Estimate; Male:"]/acs_sex_by_age["Estimate; Total:"]
        acs_sex_by_age["FracMale; all Error"] = ratio_error(acs_sex_by_age["Estimate; Male:"],
                                                            acs_sex_by_age["Estimate; Total:"],
                                                            acs_sex_by_age["Margin of Error; Male:"],
                                                            acs_sex_by_age["Margin of Error; Total:"])

        age_groups = {"18-29": ["18 and 19 years", "20 years",
                                "21 years", "22 to 24 years",
                                "25 to 29 years"],
                      "30-54": ["30 to 34 years", "35 to 39 years",
                                "40 to 44 years", "45 to 49 years",
                                "50 to 54 years"],
                      "55+": ["55 to 59 years", "60 and 61 years",
                              "62 to 64 years", "65 and 66 years",
                              "67 to 69 years", "70 to 74 years",
                              "75 to 79 years", "80 to 84 years",
                              "85 years and over"]
                      }
        
        for age, groups in age_groups.items():        
            frac_age = "FracAge; {}".format(age)
            frac_age_err = frac_age + " Error"
            male = ["Estimate; Male: - {}".format(g) for g in groups]
            female = ["Estimate; Female: - {}".format(g) for g in groups]    
            male_err = ["Margin of Error; Male: - {}".format(g) for g in groups]
            female_err = ["Margin of Error; Female: - {}".format(g) for g in groups]
        
            age_total = acs_sex_by_age[male+female].sum(axis=1)
            age_total_err = acs_sex_by_age[male_err+female_err].apply(aggregate_error, axis=1)
        
            acs_sex_by_age[frac_age] = age_total / acs_sex_by_age["Estimate; Total:"]
            acs_sex_by_age[frac_age_err] = ratio_error(age_total,
                                                       acs_sex_by_age["Estimate; Total:"],
                                                       age_total_err,
                                                       acs_sex_by_age["Margin of Error; Total:"])    
        return acs_sex_by_age.filter(like="Frac")
    else: #data expression
        # remove Total columns
        acs_sex_by_age.drop(['Estimate; Total:', 'Margin of Error; Total:',
                             'Estimate; Male:', 'Margin of Error; Male:',
                             'Estimate; Female:', 'Margin of Error; Female:'], axis=1, inplace=True)
        return acs_sex_by_age
C17002 - Ratio of income to poverty level

This table is currently not used in the analysis as more than 2/3 of the population falls in the last bin (2.00 and over). See B19001 for household income.

In [15]:
def load_acs2013_poverty_level(data_expression="fraction", sample=acs_sample):
    acs_poverty = load_acs2013_dataset("C17002", sample)
    
    if data_expression == "fraction":
        poverty_groups = { "below":[".50 to .99", "Under .50"],
                           "above":["1.00 to 1.24", "1.25 to 1.49",
                                    "1.50 to 1.84", "1.85 to 1.99",
                                    "2.00 and over"]}
    
        for group, poverty_ranges in poverty_groups.items():
            compute_proportion(acs_poverty, poverty_ranges,
                               "Poverty", group)
            
        return acs_poverty.filter(like="Frac")
    else: # data_expression
        acs_poverty.drop(["Estimate; Total:",
                          "Margin of Error; Total:"], axis=1, inplace=True)
        return acs_poverty
B19001 - Household income

This table breaks down total household income (expressed in 2013 dollars) into a number of ranges from less than \$10,000 to more than \$200,000. In Chicago, approximately one third of households each have incomes below \$35k, between \$35k and \$75k, and above \$75k.

In [16]:
def load_acs2013_household_income(data_expression="fraction", sample=acs_sample):
    acs_income = load_acs2013_dataset("B19001", sample)
    
    if data_expression == "fraction":
        income_groups = {"$0-35k":["Less than $10,000", "$10,000 to $14,999",
                                   "$15,000 to $19,999", "$20,000 to $24,999",
                                   "$25,000 to $29,999", "$30,000 to $34,999"],
                         "$35k-75k":["$35,000 to $39,999", "$40,000 to $44,999",
                                     "$45,000 to $49,999", "$50,000 to $59,999",
                                     "$60,000 to $74,999"],
                         "$75k+":["$75,000 to $99,999", "$100,000 to $124,999",
                                  "$125,000 to $149,999", "$150,000 to $199,999",
                                  "$200,000 or more"]
                        }
        
        for group, income_ranges in income_groups.items():
            compute_proportion(acs_income, income_ranges,
                               "Income", group)
            
        return acs_income.filter(like="Frac")
    else: # data_expression
        acs_income.drop(["Estimate; Total:",
                         "Margin of Error; Total:"], axis=1, inplace=True)
        return acs_income
B25063 - Gross rent for renter occupied households

This table gives the number of households in each region broken down by rent paid. Approximate thirds fall at up to \$799, \$800-\$1,249, and \$1,250+.

In [17]:
def load_acs2013_rent(data_expression="fraction", sample=acs_sample):
    acs_rent = load_acs2013_dataset("B25063", sample)
    
    if data_expression == "fraction":
        rent_groups = {"$0-799": ["Less than $100", "$100 to $149",
                                 "$150 to $199", "$200 to $249",
                                 "$250 to $299", "$300 to $349",
                                 "$350 to $399", "$400 to $449",
                                 "$450 to $499", "$500 to $549",
                                 "$550 to $599", "$600 to $649",
                                 "$650 to $699", "$700 to $749",
                                 "$750 to $799"],
                       "$800-1249": ["$800 to $899", "$900 to $999",
                                    "$1,000 to $1,249"],
                       "$1250+": ["$1,250 to $1,499", "$1,500 to $1,999",
                                "$2,000 or more"]
                       }
    
        for rent, rent_ranges in rent_groups.items():
            compute_proportion(acs_rent, rent_ranges,
                               "Rent", rent, "With cash rent")
        return acs_rent.filter(like="Frac")
    else:
        acs_rent.drop(["Estimate; Total:", "Margin of Error; Total:",
                       "Estimate; With cash rent:", "Margin of Error; With cash rent:",
                       "Estimate; No cash rent", "Margin of Error; No cash rent"], 
                      axis=1, inplace=True)
        return acs_rent
B15003 - Education level

This table gives the number of residents broken down by the maximum education level attained. Approximately one third of the Chicago population each fall in the ranges from no schooling through a high school diploma, from a GED through an associate's degree, and a bachelor's degree or higher.

In [18]:
def load_acs2013_education(data_expression="fraction", sample=acs_sample):
    acs_education = load_acs2013_dataset("B15003", sample)
    
    if data_expression == "fraction":
        # breaking into approximate Chicago-wide thirds
        education_groups = {"None to High School": ["No schooling completed", 
                                                    "Kindergarten",
                                                    "1st grade", "2nd grade",
                                                    "3rd grade", "4th grade",
                                                    "5th grade", "6th grade",
                                                    "7th grade", "8th grade",
                                                    "9th grade", "10th grade",
                                                    "11th grade", 
                                                    "12th grade, no diploma",
                                                    "Regular high school diploma"],
                            "GED to Associate's": ["GED or alternative credential",
                                                   "Some college, less than 1 year",
                                                   "Some college, 1 or more years, no degree",
                                                   "Associate's degree"],
                            "Bachelor's+": ["Bachelor's degree", 
                                            "Master's degree",
                                            "Professional school degree",
                                            "Doctorate degree"]
                            }
    
        for education, education_ranges in education_groups.items():
            compute_proportion(acs_education, education_ranges,
                               "Education", education)

        return acs_education.filter(like="Frac")
    else:
        acs_education.drop(["Estimate; Total:",
                            "Margin of Error; Total:"], axis=1, inplace=True)
        return acs_education

Regression Analysis

In this section we use the ACS socioeconomic factors to model the bicycle commuting fraction through standard regression techniques.

"Traditional" linear modeling

In this section we perform regressions on data that is binned in both dependent and independent variables. This ensures the statistical significance of all measured points and allows us to use methods that account for measurement uncertainty. We select the top and bottom third of each independent variable, which provides sensitivity to trends with those variables.

The list of variables and their general distributions is given below:

In [158]:
acs_sex_by_age = load_acs2013_sex_by_age()
acs_income = load_acs2013_household_income()
acs_rent = load_acs2013_rent()
acs_education = load_acs2013_education()

acs_new_sample = pd.concat([acs_sample[['bicycle_frac', 'bicycle_frac_error']],
                            acs_sex_by_age,
                            acs_income,
                            acs_rent,
                            acs_education], axis=1)
In [20]:
# subset of columns used for regression
cols = ["FracEducation; None to High School", "FracEducation; Bachelor's+", 
        "FracAge; 18-29", "FracAge; 55+",
        "FracIncome; $0-35k", "FracIncome; $75k+",
        "FracRent; $0-799", "FracRent; $1250+"]
acs_new_sample[cols].describe().T
Out[20]:
count mean std min 25% 50% 75% max
FracEducation; None to High School 219 0.344890 0.196221 0.000000 0.180418 0.326888 0.496789 0.777186
FracEducation; Bachelor's+ 219 0.423149 0.240654 0.055702 0.210408 0.389878 0.626630 0.941088
FracAge; 18-29 219 0.237657 0.094862 0.083645 0.172749 0.207591 0.274712 0.834556
FracAge; 55+ 219 0.194828 0.077787 0.016452 0.145226 0.189762 0.248002 0.433976
FracIncome; $0-35k 219 0.376819 0.150587 0.067474 0.259853 0.366172 0.491065 0.811164
FracIncome; $75k+ 219 0.330181 0.159742 0.042739 0.191575 0.315946 0.442825 0.707612
FracRent; $0-799 219 0.300577 0.173233 0.000000 0.164137 0.292532 0.406480 0.776803
FracRent; $1250+ 219 0.248127 0.181972 0.000000 0.118428 0.201313 0.333974 0.982318

After trying a number of different regression methods, we settle on a linear model and use the Lasso method to reduce the number of non-zero coefficients.
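
For reference, the Lasso (in scikit-learn's parameterization) minimizes \(\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1\); the \(\ell_1\) penalty is what drives some coefficients exactly to zero, effectively performing feature selection.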

In [166]:
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn import cross_validation

class ScaleError(TransformerMixin):
    """ Allows the scaling of all features to the error in 
        the predicted variable. Must scale input y manually """
    def fit_transform(self, X, y=None, yerr=None):
        if yerr is None: return X
        return X / yerr[:, np.newaxis]
    
    def transform(self, X, y=None):
        return X
        
model = Pipeline([ ('poly', PolynomialFeatures(degree=1, include_bias=True)),
                   ('scale', ScaleError()),
#                    ('linear', linear_model.LinearRegression(fit_intercept=False))
#                    ('ridge', linear_model.Ridge(fit_intercept=False, alpha=1.0))
#                    ('ridge', linear_model.RidgeCV(fit_intercept=False))
#                    ('ridge', linear_model.BayesianRidge(fit_intercept=False))
                    ('lasso', linear_model.Lasso(fit_intercept=False, alpha=1.))
#                    ('lasso', linear_model.LassoCV(fit_intercept=False))
#                    ('lasso', linear_model.LassoLarsCV(fit_intercept=False))
#                    ('lasso', linear_model.LassoLarsIC(fit_intercept=False))
#                    ('elastic', linear_model.ElasticNet(fit_intercept=False))
                 ])

X = acs_new_sample[cols]
Y = acs_new_sample["bicycle_frac"]/acs_new_sample['bicycle_frac_error']
model.fit(X, Y, scale__yerr=acs_new_sample['bicycle_frac_error'])
yest_poly = model.predict(X)

if hasattr(model.steps[-1][-1], "alpha_"):
    print "Alpha:", model.steps[-1][-1].alpha_

scores = cross_validation.cross_val_score(model,
            X, Y, scoring="mean_squared_error", cv=len(acs_new_sample)/3)
print "MSE:", -scores.mean()
print "std:", scores.std()

print "Model: bicycle fraction = "
if model.named_steps['poly'].n_output_features_ == 1:
    print "\t\t%+8.3e" % (model.steps[-1][-1].coef_,)
else:
    for terms, coef in zip(model.named_steps['poly'].powers_,
                           model.steps[-1][-1].coef_):
        if coef != 0.0:
            print "\t\t{:+8.3e} ".format(coef) + \
                  " ".join("[{}]{}".format(c, "**{}".format(terms[i])
                                           if terms[i] > 1 else "")
                           for i, c in enumerate(cols) if terms[i])
MSE: 2.48954339279
std: 6.2100655508
Model: bicycle fraction = 
		+1.052e-02 [FracEducation; None to High School]
		+3.587e-02 [FracEducation; Bachelor's+]
		+2.247e-03 [FracAge; 18-29]
		-2.298e-02 [FracAge; 55+]
		-1.187e-02 [FracIncome; $75k+]
		+2.097e-03 [FracRent; $0-799]
		-1.922e-03 [FracRent; $1250+]

The results of the fit are as expected from the national trends. Positive indicators of bicycle commuting are youth, low income, low rent, and low education, as well as high education. Older populations and those with higher income and higher rent are negative indicators of bicycle commuting.

We plot the distribution of residuals from the model and compare it to a simple citywide average. The comparison clearly shows improvement in the model predictions; however, a number of regions are mis-predicted by several times their measurement error.

In [168]:
# LevM linear fit (for checking)
from scipy import optimize
def f(theta, x, y, yerr):
    return (y - theta[0] - x.dot(theta[1:]))/yerr
theta = np.zeros(len(cols)+1)
theta_min, flags = optimize.leastsq(f, theta, args=(acs_new_sample[cols], acs_new_sample['bicycle_frac'], acs_new_sample['bicycle_frac_error']))
yest_linear = theta_min[0] + acs_new_sample[cols].dot(theta_min[1:])

print "Linear model found by Levenberg-Marquardt minimization:"
print "Model: bicycle fraction = "
print "       {:+8.3e} ".format(theta_min[0])
for i, c in enumerate(cols):
    print "       {:+8.3e} [{}]".format(theta_min[i+1], c)
Linear model found by Levenberg-Marquardt minimization:
Model: bicycle fraction = 
       +1.213e-02 
       +1.296e-02 [FracEducation; None to High School]
       +4.316e-02 [FracEducation; Bachelor's+]
       +2.270e-03 [FracAge; 18-29]
       -2.666e-02 [FracAge; 55+]
       -2.146e-02 [FracIncome; $0-35k]
       -3.439e-02 [FracIncome; $75k+]
       +5.608e-03 [FracRent; $0-799]
       -3.138e-03 [FracRent; $1250+]

In [23]:
yest_mean = acs_new_sample['bicycle_frac'].mean()*np.ones_like(acs_new_sample['bicycle_frac'])

dyest_linear = (yest_linear.clip(0.0) - acs_new_sample["bicycle_frac"])/acs_new_sample["bicycle_frac_error"]
dyest_mean = (yest_mean - acs_new_sample["bicycle_frac"])/acs_new_sample["bicycle_frac_error"]
dyest_poly = (yest_poly.clip(0.0) - acs_new_sample['bicycle_frac'])/acs_new_sample["bicycle_frac_error"]

results = pd.DataFrame({"mean":dyest_mean, "linear":dyest_linear, "polynomial":dyest_poly}, 
                       columns=["mean", "linear", "polynomial"])
print results.describe()

np.abs(results).plot(kind='hist', bins=100, histtype="step", cumulative=-1, 
                     figsize=(8,6), normed=True, logx=True, logy=True);
             mean      linear  polynomial
count  219.000000  219.000000  219.000000
mean     1.850090   -0.461898   -0.461945
std      3.234413    1.073156    1.073142
min     -2.601671   -5.342885   -5.343338
25%     -0.408414   -1.126215   -1.126170
50%      0.786095   -0.524977   -0.525850
75%      3.500079    0.111500    0.111009
max     13.953788    4.033402    4.033942

This map shows where the model particularly fails (regions where the prediction is off by more than twice the measurement error). In particular, the model seems to fail along a line running NW-SE in the northern half of the city. The neighborhoods along this line tend to have large numbers of young bicyclists, and have bike routes that lead directly downtown.

In [24]:
outliers = dyest_linear[np.abs(dyest_linear) > 2].index
extremum = np.abs(dyest_linear).max()
plot_regions(outliers, fill_value=dyest_linear[outliers],
                 plot_communities=True, cm=plt.cm.coolwarm, 
                 vmin=-extremum, vmax=extremum, label="Normalized model error")

Modeling using the full dataset

The previous analysis loses some information by aggregating geographically and by binning the independent variables. In this section we create a model based on the entire dataset, including low-significance data and non-detections.

In [170]:
acs_raw_sex_by_age = load_acs2013_sex_by_age(data_expression="raw", sample=None)
acs_raw_income = load_acs2013_household_income(data_expression="raw", sample=None)
acs_raw_rent = load_acs2013_rent(data_expression="raw", sample=None)
acs_raw_education = load_acs2013_education(data_expression="raw", sample=None)

acs_raw_sample = pd.concat([acs[['bicycle', 'bicycle_error']],
                            acs_raw_sex_by_age,
                            acs_raw_income,
                            acs_raw_rent,
                            acs_raw_education], axis=1, join="inner")

print "Description of the full sample:"
print len(acs_raw_sample.filter(like="Estimate;").columns), "features,", len(acs_raw_sample), "data points"
Description of the full sample:
107 features, 2165 data points

The full set of features contains significant covariance, as can be seen in the plot below. Areas of high correlation in the figure can be attributed to different subgroups, including college students, young families, and professionals who stop renting and buy a home or move to the suburbs in their late 30s.

Standard regression techniques have difficulty with these highly correlated, collinear features, and we explore ways of mitigating this before fitting a model.

In [173]:
def centered_label(label, columns):
    if len(columns) % 2:
        return [""]*(len(columns)/2) + [label] + [""]*(len(columns)/2)
    else:
        return [""]*(len(columns)/2-1) + [label] + [""]*(len(columns)/2)

labels = centered_label("Male by Age", acs_raw_sex_by_age.filter(like="Estimate; Male").columns)
labels += centered_label("Female by Age", acs_raw_sex_by_age.filter(like="Estimate; Female").columns)
labels += centered_label("Income", acs_raw_income.filter(like="Estimate;").columns)
labels += centered_label("Rent", acs_raw_rent.filter(like="Estimate;").columns)
labels += centered_label("Education", acs_raw_education.filter(like="Estimate;").columns)
features = [c for c in acs_raw_sample.columns if "Estimate;" in c]

f, ax = plt.subplots(1,1, figsize=(16,12))
sns.heatmap(acs_raw_sample[features].corr(), linewidths=0,
            square=True, xticklabels=labels, yticklabels=labels)

x = len(acs_raw_sex_by_age.filter(like="Estimate;").columns)/2
ax.axvline(x=x, color='k')
ax.axhline(y=(len(features)-x), color='k')
x += len(acs_raw_sex_by_age.filter(like="Estimate;").columns)/2
ax.axvline(x=x, color='k')
ax.axhline(y=(len(features)-x), color='k')
x += len(acs_raw_income.filter(like="Estimate;").columns)
ax.axvline(x=x, color='k')
ax.axhline(y=(len(features)-x), color='k')
x += len(acs_raw_rent.filter(like="Estimate;").columns)
ax.axvline(x=x, color='k')
ax.axhline(y=(len(features)-x), color='k');
Using NMF to reduce dimensionality

Non-negative matrix factorization is a technique similar to principal component analysis, but with the useful property that all components are non-negative. This is advantageous for our dataset, which consists of population counts for which negative values are nonsensical.
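
Concretely, NMF approximates the data matrix \(X\) (regions \(\times\) features) as \(X \approx WH\) by solving \(\min_{W \ge 0,\, H \ge 0} \|X - WH\|_F^2\), where each row of \(W\) gives a region's loading on the \(k\) components and each row of \(H\) gives a component's profile over the original population-count features.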

This plot shows the error in the reconstructed data matrix \(||X - WH||^2\) as a function of the number of components, where \(X\) is the original matrix, and \(W\) and \(H\) are the matrix decompositions provided by NMF. The graph is relatively flat, showing only a few components are necessary to describe the data's coarsest features. Only when the number of components nears the number of features does the error drop steeply, at which point the decomposition is as complex as the original data.

In [30]:
from sklearn.decomposition import NMF

X = acs_raw_sample[features]
Y = acs_raw_sample["bicycle"]
Yerr = acs_raw_sample["bicycle_error"]

err = []
components = list(range(1,10))+list(range(11,len(X.columns),10))
for i in components:
    n = NMF(n_components=i)
    n.fit(X)
    err.append(n.reconstruction_err_)
In [31]:
f, ax = plt.subplots(1, 1)
ax.plot(components, err)
ax.set_yscale("log")
ax.set_xscale("log")
ax.set_xlabel("Number of components")
ax.set_ylabel("Reconstruction Error ||X - WH||^2");

In this analysis we use 8 components from the NMF decomposition and a linear model. We account for regions with no detected bicycle commuters by adding their upper limits to the likelihood. These points have a significant impact on the model, which struggles to jointly fit both sets of data.
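
Under the assumptions coded below (Gaussian errors, with the margin of error \(\sigma_i\) of a zero estimate treated as an upper limit), the quantity minimized is the negative log-likelihood \(-\ln\mathcal{L} = -\sum_{y_i>0} \ln\mathcal{N}(y_i;\, h_i, \sigma_i) - \sum_{y_i=0} \ln\Phi\!\left(\frac{\sigma_i - h_i}{\sigma_i}\right)\), where \(h_i = \theta_0 + \theta\cdot x_i\) is the linear model in the NMF components and \(\Phi\) is the standard normal CDF.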

In [182]:
from scipy.stats import norm
from scipy.special import gammaln

features = [c for c in acs_raw_sample.columns if "Estimate;" in c]
X = acs_raw_sample[features]
Y = acs_raw_sample["bicycle"]
Yerr = acs_raw_sample["bicycle_error"]

n = NMF(n_components=8, random_state=32)
n.fit(X)
X_ = n.transform(X)

def lnlike(theta, x, y, yerr):
    # negative log-likelihood; regions with no measured commuters are
    # treated as upper limits via the normal CDF
    x, y, yerr = map(np.asarray, [x, y, yerr])
    upper = y == 0
    hx = theta[0] + x.dot(theta[1:])
    A = -norm.logcdf(yerr[upper], hx[upper], yerr[upper]).sum()
    B = -norm.logpdf(y[~upper], hx[~upper], yerr[~upper]).sum()
    return A+B
 
    # Poisson distribution
    #return (-y[~upper]*np.log(hx[~upper]) + hx[~upper] + gammaln(y[~upper]+1)).sum()
    #return (-y*np.log(hx) + hx + gammaln(y+1)).sum()

# initialize model parameters at zero and minimize the negative log-likelihood
theta = np.zeros(X_.shape[1]+1)
theta_min = optimize.fmin(lnlike, theta, args=(X_, Y, Yerr), disp=True, maxiter=10000, maxfun=10000)
print "Linear model coefficients: ", theta_min
yest_nmflinear = theta_min[0] + X_.dot(theta_min[1:])
dyest_nmflinear = (yest_nmflinear - Y)/Yerr

print "Full sample:"
print dyest_nmflinear.describe()
print "Only blocks with bicycle commuters:"
print dyest_nmflinear[Y > 0].describe()
Optimization terminated successfully.
         Current function value: 3064.615590
         Iterations: 1494
         Function evaluations: 2149
Linear model coefficients:  [-0.02440859  1.67898001 -0.31102721 -1.89539323 -0.25687289 -0.34821534
  0.26369392  0.68038343  0.32542088]
Full sample:
count    2165.000000
mean       -0.462148
std         0.627872
min        -3.170577
25%        -0.886784
50%        -0.347566
75%        -0.080579
max         2.624843
dtype: float64
Only blocks with bicycle commuters:
count    636.000000
mean      -1.162571
std        0.493628
min       -3.170577
25%       -1.414806
50%       -1.125151
75%       -0.913789
max        2.624843
dtype: float64

These plots show the distribution of residuals from the best-fit linear model. The histogram is bimodal, showing that the model tends to over-estimate those regions where no bicycle commuters were measured (within the error), while under-estimating in regions where the data is non-zero.

However, when we plot the residuals as a function of the measured bicycle commuters, we see that the trend exists in the full dataset, not only in the non-detected points. We take this as evidence that the model lacks a necessary degree of freedom to completely explain the data.

In [183]:
f, ax = plt.subplots(1, 2, squeeze=True, figsize=(12,4))
dyest_nmflinear.hist(bins=30, ax=ax[0], label="Full sample")
dyest_nmflinear[Y > 0].hist(bins=30, ax=ax[0], label="Regions with commuters")
ax[0].set_xlabel("Normalized Error")
ax[0].legend(loc="upper left")
ax[1].scatter(Y, dyest_nmflinear);
ax[1].set_xlabel("Bicycle Commuters")
ax[1].set_ylabel("Normalized error");

In the map below we show the normalized error of the model fit and overlay the map of Chicago bike routes. Visually we see a higher density of bike routes in blue areas, where the model underpredicts the number of commuters.

In [186]:
#extremum = np.abs(dyest_nmflinear).max()
extremum = 3.5
plot_regions(Y.index, fill_value=dyest_nmflinear,
                 plot_communities=False, cm=plt.cm.coolwarm,
                 vmin=-extremum, vmax=extremum, label="Normalized model error",
                 ec=None, lw=0, plot_bike_routes=True)

Looking for additional features

Since both of our previous analyses show evidence for an additional, hidden variable, and there is some hint of a spatial component, we broaden our feature set.

Bike routes

As we showed in the previous sections, there seems to be a connection between the availability of bike routes and the ability of the model to predict bicycle commuters. Naturally, people are more likely to commute by bicycle when they have a safe means of doing so.

A complete list of Chicago bike routes, both on and off street, including shared lanes, is provided by the Chicago Data Portal in KML format. We match each bike route to our city regions by stepping along the route at 1 m resolution. We use a tree-based spatial index to keep the calculation tractable, and the resulting map, overlaid with the original bike routes, is shown below.

In [187]:
from math import sin,cos,pi,degrees,radians,asin,sqrt
from itertools import izip
from shapely.geometry import Point
import rtree
import csv
import os

EARTH_RADIUS = 6.371e6   # in meters

def haversine(angle_radians):
    return sin(angle_radians / 2.0) ** 2

def inverse_haversine(h):
    return 2 * asin(sqrt(h)) # radians

def distance_between_points(lon1, lat1, lon2, lat2):
    # all args are in degrees
    # WARNING: loss of absolute precision when points are near-antipodal
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    dlat = lat2 - lat1
    dlon = radians(lon2 - lon1)
    h = haversine(dlat) + cos(lat1) * cos(lat2) * haversine(dlon)
    return EARTH_RADIUS * inverse_haversine(h)

def bike_route_lengths(sample, datafile="../chicagodata/region_bike_route_lengths.csv"):
    if not os.path.exists(datafile):
        routes = load_bike_routes()
        lengths = {region:0 for region in polygons}
        
        # use libspatialindex to speed lookups
        index = rtree.index.Index()
        for i, region in enumerate(polygons):
            index.insert(i, polygons[region].bounds, obj=region)
    
        for r in routes:            
            for i in range(len(r)-1):
                d = distance_between_points(r[i][0], r[i][1],
                                            r[i+1][0], r[i+1][1])
            
                # sample points at 1m resolution along segment
                for p in izip(np.linspace(r[i][0],r[i+1][0],d),
                              np.linspace(r[i][1],r[i+1][1],d)):
                    pt = Point(p)
                    for region in index.intersection((p[0], p[1], p[0], p[1]),
                                                     objects="raw"):
                        if polygons[region].contains(pt):
                            lengths[region] += 1

        # save dictionary to file
        with open(datafile, "w") as f:
            writer = csv.writer(f)
            for row in lengths.items():
                writer.writerow(row)
    
    # initialize from cached data
    lengths = {}
    with open(datafile, "r") as f:
        reader = csv.reader(f)
        for row in reader:
            lengths[row[0]] = int(row[1])
            
    # initialize new column
    sample["bike route lengths"] = np.array([lengths[str(region)] for region in sample.index])
In [188]:
bike_route_lengths(acs_raw_sample)

extremum = np.abs(acs_raw_sample["bike route lengths"]).max()
plot_regions(acs_raw_sample.index, fill_value=acs_raw_sample["bike route lengths"]/1000.,
                 plot_communities=True, cm=plt.cm.Blues,
                 vmin=0, vmax=acs_raw_sample["bike route lengths"].max()/1000.,
                 label="Bike Routes (km)",
                 ec=None, lw=0, plot_bike_routes=True)