Database and Geographical Data Analysis

Wenyu Li

March 20, 2017


The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:

| Table | Description |
| --- | --- |
| crime | Crime reports dating back to 2010. |
| mobile_food_locations | List of all locations where mobile food vendors sell. |
| mobile_food_permits | List of all mobile food vendor permits. More details here. |
| mobile_food_schedule | Schedules for mobile food vendors. |
| noise | Noise complaints dating back to August 2015. |
| parking | List of all parking lots. |
| parks | List of all parks. |
| schools | List of all schools. |
| zillow | Zillow rent and housing statistics dating back to 1996. More details here. |

The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.

Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.

Shapefiles for San Francisco Neighborhoods are available here.
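
As a side note, one way to convert latitude/longitude to postal codes with these shapefiles is a point-in-polygon spatial join in geopandas. The sketch below is only illustrative: the shapefile name and its ZCTA5CE10 column are assumptions based on the Census ZCTA download and may differ from the files linked above.

import geopandas as gpd
from shapely.geometry import Point

def points_to_zip(df, shp_path='cb_2016_us_zcta510_500k.shp'):
    # df is any table with Lon/Lat columns (e.g. the crime table).
    # shp_path and the ZCTA5CE10 column name are assumed; adjust to your download.
    zips = gpd.read_file(shp_path).to_crs('EPSG:4326')
    pts = gpd.GeoDataFrame(df.copy(),
                           geometry=[Point(xy) for xy in zip(df['Lon'], df['Lat'])],
                           crs='EPSG:4326')
    # Each point picks up the ZIP polygon it falls inside.
    return gpd.sjoin(pts, zips[['ZCTA5CE10', 'geometry']], how='left', op='within')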

1.1. Which mobile food vendor(s) sells at the most locations?

import pandas as pd
from matplotlib import pyplot as plt
import sqlalchemy as sqla
import sqlite3
%matplotlib inline
plt.style.use('ggplot')
import os.path
import numpy as np

Let’s point to the SQLite file and create the database connection.

BASE_DIR = os.path.dirname(os.path.abspath("sf_data.sqlite"))
db_path = os.path.join(BASE_DIR, "sf_data.sqlite")
sqlite_file = db_path
sf_conn = sqla.create_engine('sqlite:///' + sqlite_file)
sql_query = """
select Applicant,count(distinct locationid) as loc_num
from mobile_food_permits,mobile_food_schedule
where mobile_food_schedule.permit = mobile_food_permits.permit
and mobile_food_permits.Status = 'APPROVED'
group by Applicant
"""
ven_loc = pd.read_sql_query(sql_query,sf_conn)
ven_loc = ven_loc.sort_values('loc_num', ascending=False)
ven_loc = ven_loc.reset_index(drop=True)
ven_loc.head()

|   | Applicant | loc_num |
| --- | --- | --- |
| 0 | Park's Catering | 23 |
| 1 | D & T Catering | 17 |
| 2 | F & C Catering | 13 |
| 3 | Munch A Bunch | 10 |
| 4 | Eva's Catering | 8 |

ven_loc.loc[ven_loc['loc_num'].idxmax()]
Applicant    Park's Catering
loc_num                   23
Name: 29, dtype: object
plot = ven_loc.plot(x = 'Applicant', kind = 'bar', legend = None,
                            title = 'Number of Locations for Different Applicants')
plot.set_ylabel('Number of Locations')
fig = plot.get_figure()

(Figure: bar chart of the number of locations per applicant.)

For this question, I joined the mobile_food_permits and mobile_food_schedule tables on the shared permit column, keeping only approved permits. After grouping by Applicant and counting the distinct location IDs, I got the ven_loc dataframe. From the result, we can see that the vendor Park’s Catering sells at the most locations: 23.
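
As a quick cross-check, the same count can be reproduced in pandas by merging the two tables on permit and counting distinct locationid values per applicant. A minimal sketch using the same sf_conn engine:

permits = pd.read_sql_query("select * from mobile_food_permits", sf_conn)
schedule = pd.read_sql_query("select * from mobile_food_schedule", sf_conn)

# Keep approved permits only, join on permit, then count distinct locations per applicant.
merged = schedule.merge(permits[permits['Status'] == 'APPROVED'], on='permit')
check = (merged.groupby('Applicant')['locationid']
               .nunique()
               .sort_values(ascending=False))
check.head()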

1.2. Here I pose five questions about San Francisco and use the database to analyze them.

import geopandas
import folium
from pyzipcode import ZipCodeDatabase
import shapefile
from IPython.display import HTML
from folium.plugins import MarkerCluster
from folium import features

Question 1 Which parts of the city are the most and least expensive?

house = pd.read_sql_query("""select * from zillow""",sf_conn)
house.head()

|   | RegionName | Date | ZriPerSqft_AllHomes | MedianSoldPricePerSqft_AllHomes | PriceToRentRatio_AllHomes | Turnover_AllHomes |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 94109 | 2010-11-01 00:00:00 | 3.156 | 675.1913 | 19.14 | 6.0771 |
| 1 | 94110 | 2010-11-01 00:00:00 | 2.566 | 599.6785 | 18.10 | 5.4490 |
| 2 | 94122 | 2010-11-01 00:00:00 | 2.168 | 495.4432 | 17.99 | 2.4198 |
| 3 | 94080 | 2010-11-01 00:00:00 | 1.666 | 369.5538 | 18.31 | 3.8757 |
| 4 | 94112 | 2010-11-01 00:00:00 | 2.322 | 422.4538 | 14.44 | 3.1288 |

house = house[np.isfinite(house['MedianSoldPricePerSqft_AllHomes'])]
house = house.reset_index(drop=True)
plt.hist(house['MedianSoldPricePerSqft_AllHomes'], bins=30, alpha=.4)
plt.xlabel('Median Sold Price per Sqft')
plt.show()

(Figure: histogram of MedianSoldPricePerSqft_AllHomes.)

most_exp = house.loc[house['MedianSoldPricePerSqft_AllHomes'].idxmax()]
least_exp = house.loc[house['MedianSoldPricePerSqft_AllHomes'].idxmin()]
most_zip = most_exp['RegionName']
least_zip = least_exp['RegionName']
def get_location(code):
    zcdb = ZipCodeDatabase()
    zipcode = zcdb[code]
    lat = zipcode.latitude
    lon = zipcode.longitude
    return [lat, lon]
most_loc = get_location(most_zip)
least_loc = get_location(least_zip)
SF_COORDINATES = (37.76, -122.45)
map_osm = folium.Map(location= SF_COORDINATES, zoom_start= 12.5)
folium.CircleMarker(least_loc, popup='Least Expensive Parts: ' + str(least_exp['MedianSoldPricePerSqft_AllHomes']),radius = 30,color='#3186cc',
                    fill_color='#3186cc').add_to(map_osm)
folium.CircleMarker(most_loc, popup='Most Expensive Parts: '+str(most_exp['MedianSoldPricePerSqft_AllHomes']),radius = 30,color='#3186cc',
                    fill_color='red').add_to(map_osm)
map_osm

I queried the zillow table and used the MedianSoldPricePerSqft_AllHomes field to pick out the most and least expensive RegionName values. The region names are ZIP codes, so I used the pyzipcode package to convert them to latitude and longitude and drew an interactive map with folium. The circle markers mark the most and least expensive parts of San Francisco: the red one is the most expensive area, the blue one is the least expensive area, and the popup text shows their respective median sold prices per square foot. Zooming in, we can see the most expensive area is around Montgomery Street and Pine Street, and the least expensive area is around Bayview.
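
One caveat: idxmax and idxmin pick out a single monthly record, while the zillow table goes back to 1996. A variant that is less sensitive to any single month, sketched below, averages each region over time before ranking:

# Average price per square foot for each ZIP code over all available months.
region_price = (house.groupby('RegionName')['MedianSoldPricePerSqft_AllHomes']
                     .mean()
                     .sort_values(ascending=False))
print(region_price.head(1))   # most expensive ZIP on average
print(region_price.tail(1))   # least expensive ZIP on average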

Question 2 Which parts of the city are the most dangerous?

crimedata = pd.read_sql_query("""select * from crime""",sf_conn)
crimedata.head()

|   | IncidntNum | Category | Descript | DayOfWeek | Datetime | PdDistrict | Resolution | Address | Lon | Lat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 150060275 | NON-CRIMINAL | LOST PROPERTY | Monday | 2015-01-19 14:00:00 | MISSION | NONE | 18TH ST / VALENCIA ST | -122.421582 | 37.761701 |
| 1 | 150098210 | ROBBERY | ROBBERY, BODILY FORCE | Sunday | 2015-02-01 15:45:00 | TENDERLOIN | NONE | 300 Block of LEAVENWORTH ST | -122.414406 | 37.784191 |
| 2 | 150098210 | ASSAULT | AGGRAVATED ASSAULT WITH BODILY FORCE | Sunday | 2015-02-01 15:45:00 | TENDERLOIN | NONE | 300 Block of LEAVENWORTH ST | -122.414406 | 37.784191 |
| 3 | 150098210 | SECONDARY CODES | DOMESTIC VIOLENCE | Sunday | 2015-02-01 15:45:00 | TENDERLOIN | NONE | 300 Block of LEAVENWORTH ST | -122.414406 | 37.784191 |
| 4 | 150098226 | VANDALISM | MALICIOUS MISCHIEF, VANDALISM OF VEHICLES | Tuesday | 2015-01-27 19:00:00 | NORTHERN | NONE | LOMBARD ST / LAGUNA ST | -122.431119 | 37.800469 |

# definition of the boundaries in the map
district_geo = r'sfpddistricts.geojson'
# calculating total number of incidents per district
crimedata2 = pd.DataFrame(crimedata['PdDistrict'].value_counts().astype(float))
crimedata2.to_json('crimeagg.json')
crimedata2 = crimedata2.reset_index()
crimedata2.columns = ['District', 'Number']
plot = crimedata2.plot(x = 'District', kind = 'bar', legend = None, 
                            title = 'Number of Crime Incidents per District')
plot.set_ylabel('Count')
fig = plot.get_figure()

(Figure: bar chart of crime incident counts per police district.)

# creation of the choropleth
map1 = folium.Map(location=SF_COORDINATES, zoom_start=12,tiles = "Stamen Terrain")
map1.choropleth(geo_path = district_geo,
              data_out = 'crimeagg.json',
              data = crimedata2,
              columns = ['District', 'Number'],
              key_on = 'feature.properties.DISTRICT',
              fill_color = 'YlOrRd',
              fill_opacity = 0.7,
              line_opacity = 0.2,
              legend_name = 'Number of incidents per district')
map1

To analyze how frequently crimes occur in different parts of San Francisco, I created a choropleth map. First, I needed a GeoJSON file whose areas match the San Francisco police districts in the data file. A Google search for “sfpd districts geojson” turned up a file that matches my needs.

The choropleth is based on the aggregated crime counts per district and is drawn with a legend in the upper right corner. Different colors represent different numbers of crime incidents. From the map, we can see that crimes happen most frequently in the deepest red region, which is the most dangerous part of the city. We can conclude that the Southern district is the most dangerous and Richmond is the least dangerous.
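
The same ranking can also be read directly from the aggregated counts, without the map; a quick check:

# Police districts with the highest and lowest crime counts.
print(crimedata2.loc[crimedata2['Number'].idxmax()])
print(crimedata2.loc[crimedata2['Number'].idxmin()])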

Question 3 How are schools distributed in the city? Which areas are most schools located in? Is there any relationship between school locations and crime incidents?

school = pd.read_sql_query("""select * from schools""",sf_conn)
school = school[np.isfinite(school['Lat'])]
lons = school['Lon']
lats= school['Lat']
school.head()

|   | Name | Entity | LowerGrade | UpperGrade | GradeRange | Category | LowerAge | UpperAge | GeneralType | Address | Lat | Lon |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Alamo Elementary School | SFUSD | 0 | 5 | K-5 | USD Grades K-5 | 5 | 10 | PS | 250 23RD AVE, San Francisco, CA 94121 | 37.783005 | -122.482300 |
| 1 | Alvarado Elementary School | SFUSD | 0 | 5 | K-5 | USD Grades K-5 | 5 | 10 | PS | 625 DOUGLASS ST, San Francisco, CA 94114 | 37.753681 | -122.438194 |
| 2 | Aptos Middle School | SFUSD | 6 | 8 | 6-8 | USD Grades 6-8 | 11 | 13 | PS | 105 APTOS AVE, San Francisco, CA 94127 | 37.729672 | -122.465782 |
| 3 | Argonne Early Education School | SFUSD | -2 | 0 | PK-TK | USD PreK/TK | 3 | 5 | PS | 750 16TH AVE, San Francisco, CA 94118 | 37.773968 | -122.474060 |
| 4 | Argonne Elementary School | SFUSD | 0 | 5 | K-5 | USD Grades K-5 | 5 | 10 | PS | 680 18TH AVE, San Francisco, CA 94121 | 37.775307 | -122.476311 |

data = np.array(
    [
        np.array(lats),  
        np.array(lons), 
        range(len(school)),  # Popups text will be simple numbers.
    ]
).T

m = folium.Map(location= SF_COORDINATES, zoom_start= 12)
mc = MarkerCluster()  # use the MarkerCluster plugin imported above

for k in range(len(school)):
    mk = features.Marker([data[k][0], data[k][1]])
    p = features.Popup(str(data[k][2]))
    mk.add_child(p)
    mc.add_child(mk)

m.add_child(mc)

# add choropleth map of crime to the school map
m.choropleth(geo_path = district_geo,
              data_out = 'crimeagg.json',
              data = crimedata2,
              columns = ['District', 'Number'],
              key_on = 'feature.properties.DISTRICT',
              fill_color = 'YlOrRd',
              fill_opacity = 0.7,
              line_opacity = 0.2,
              legend_name = 'Number of incidents per district')
m

The map above shows the distribution of schools in San Francisco. The number inside each circle represents the total number of schools in that area, and you can zoom in and out to see the exact location of each school. Based on the result of Question 2, crimes happen most frequently in the deepest red region, which is the most dangerous. From the map, we can see that most schools are located in or near those dangerous areas; in other words, most crime incidents occur near school areas. This suggests a spatial relationship between school locations and crime incidents, and one might infer that many crimes involve children and students, although the map alone cannot confirm that.
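
To make this comparison a bit more quantitative, the schools can be assigned to police districts with a point-in-polygon spatial join against the same GeoJSON file, and the per-district school counts placed next to the crime counts. This is only a sketch; it assumes geopandas can read sfpddistricts.geojson and that the district name property is DISTRICT, as used in key_on above.

import geopandas as gpd
from shapely.geometry import Point

districts = gpd.read_file(district_geo)  # SFPD district polygons
school_pts = gpd.GeoDataFrame(school.copy(),
                              geometry=[Point(xy) for xy in zip(school['Lon'], school['Lat'])],
                              crs=districts.crs)

# Count the schools that fall inside each district and line them up with the crime counts.
joined = gpd.sjoin(school_pts, districts[['DISTRICT', 'geometry']], how='left', op='within')
school_counts = joined['DISTRICT'].value_counts().rename('Schools')
comparison = crimedata2.set_index('District').join(school_counts)
comparison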

Question 4 Which types of noise are complained about the most and the least? Which parts of the city have the most noise cases?

noise = pd.read_sql_query("""select * from noise""",sf_conn)
noise = noise[np.isfinite(noise['Lat'])]
# calculating total number of noise case per type
noise2 = pd.DataFrame(noise['Type'].value_counts().astype(float))
noise2 = noise2.reset_index()
noise2.columns = ['Type', 'Number']
plot = noise2.plot(x = 'Type', kind = 'bar', legend = None, 
                            title = 'Number of Noise Cases per Type')
plot.set_ylabel('Count')
fig = plot.get_figure() 

(Figure: bar chart of noise case counts per type.)

lat = noise['Lat']
lon = noise['Lon']
type_ = noise['Type']
data = np.array(
    [
        np.array(lat),  
        np.array(lon), 
        np.array(type_),  # Popup text will be the noise type.
    ]
).T

noise_m =  folium.Map(location= SF_COORDINATES, zoom_start= 12)
mc = MarkerCluster()  # use the MarkerCluster plugin imported above

for k in range(len(noise)):
    # data is a string array (it mixes floats and text), so cast lat/lon back to float.
    mk = features.Marker([float(data[k][0]), float(data[k][1])])
    p = features.Popup(str(data[k][2]))
    mk.add_child(p)
    mc.add_child(mk)

noise_m.add_child(mc)

From the table above we can see that “other excessive noise” is the most complained-about type, with 1080 cases, followed by “construction private property” with 744 cases. The fewest cases concern hospitals: only 4 since August 2015. Across all types, most noise complaints come from construction and electronic equipment, while noise from vehicles and traffic is comparatively rare. On the map, the number in each circle represents the total number of noise cases in that area; you can zoom in and out to see the specific location of each case, and the popup text shows its noise type. Most noise complaints were reported in the central San Francisco area, likely because it has more residents and stores, which generate more noise.
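
For reference, the counts quoted above come straight from the aggregated noise2 table and can be printed directly; a small sketch that also shows each type’s share of all cases:

# Most and least frequent complaint types, with their share of all cases.
noise2['Share'] = noise2['Number'] / noise2['Number'].sum()
print(noise2.head(2))   # the two most complained-about types
print(noise2.tail(1))   # the rarest type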

Question 5 How are the mobile food vendors distributed for different facility types?

sql_query1 = """
select locationid, Applicant, FacilityType, Longitude, Latitude
from (
    select *
    from (
        select *
        from mobile_food_schedule
        left join mobile_food_permits
            on mobile_food_schedule.permit = mobile_food_permits.permit
        where mobile_food_permits.Status = 'APPROVED'
    ) a
    left join mobile_food_locations
        on mobile_food_locations.locationid = a.locationid
)
"""
vendor = pd.read_sql_query(sql_query1,sf_conn)
vendor = vendor.drop_duplicates()
vendor = vendor[vendor['Longitude']!= 0]
vendor = vendor.reset_index(drop=True)
vendor.head()

|   | locationid | Applicant | FacilityType | Longitude | Latitude |
| --- | --- | --- | --- | --- | --- |
| 0 | 654163 | Maiwand Halal Kabob Truck | Truck | -122.402314 | 37.793393 |
| 1 | 654165 | Maiwand Halal Kabob Truck | Truck | -122.397043 | 37.793214 |
| 2 | 676176 | The Huge Hotdog Concession | Truck | -122.420091 | 37.788877 |
| 3 | 676797 | Peruchi Food Truck,LLC | Truck | -122.397273 | 37.796123 |
| 4 | 680504 | Halal Cart of San Francisco | Push Cart | -122.402567 | 37.793149 |

Lat = vendor['Latitude']
Lon = vendor['Longitude']
Type = vendor['FacilityType']
Name = vendor['Applicant']
vm = folium.Map(location= SF_COORDINATES, zoom_start= 12)
for i in range(len(Type)):
    if Type[i] == 'Truck':
        folium.RegularPolygonMarker([Lat[i],Lon[i]],
                     fill_color='red', radius=12, popup=Name[i]).add_to(vm)
    if Type[i] == 'Push Cart':
        folium.RegularPolygonMarker([Lat[i],Lon[i]],
                     fill_color='#43d9de', radius=12, popup=Name[i]).add_to(vm)
vm

In this map, I used polygon markers to mark the mobile food vendors in San Francisco. The red markers represent truck vendors and the blue ones represent push cart vendors; the popup text shows the vendor name. From the map we can see that most vendors are trucks, and the truck vendors are spread all over the city, while the push cart vendors are more centralized. Based on the result of question 1.1, we know that some vendors, such as Park’s Catering and D & T Catering, sell at many locations. For most of these vendors the locations are not far from each other (D & T Catering, for example), although Eva’s Catering is a notable exception: its locations are quite dispersed.
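
The visual impression that some multi-location vendors cluster while others spread out can be roughly quantified, for example by the standard deviation of each applicant’s coordinates (in degrees). A sketch:

# Rough geographic spread of each applicant's selling locations.
spread = (vendor.groupby('Applicant')[['Latitude', 'Longitude']]
                .std()
                .dropna()          # vendors with a single location have no spread
                .sum(axis=1)
                .sort_values(ascending=False))
spread.head()   # larger values correspond to more dispersed vendors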
