March 20, 2017
The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:
| Table | Description |
|---|---|
| crime | Crime reports dating back to 2010. |
| mobile_food_locations | List of all locations where mobile food vendors sell. |
| mobile_food_permits | List of all mobile food vendor permits. More details here. |
| mobile_food_schedule | Schedules for mobile food vendors. |
| noise | Noise complaints dating back to August 2015. |
| parking | List of all parking lots. |
| parks | List of all parks. |
| schools | List of all schools. |
| zillow | Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
Shapefiles for San Francisco Neighborhoods are available here.
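Converting latitude/longitude to a postal code with those shapefiles amounts to a point-in-polygon test against each ZIP-code polygon. A minimal ray-casting sketch of that idea, using a made-up square polygon rather than real ZIP geometry:

```python
# Hypothetical sketch: test whether a (lon, lat) point falls inside a
# polygon. Real ZIP polygons would come from the shapefiles linked above;
# the square below is invented for illustration.

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside `polygon`,
    given as a list of (lon, lat) vertex pairs."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Made-up square roughly covering part of San Francisco
fake_zip_polygon = [(-122.45, 37.75), (-122.40, 37.75),
                    (-122.40, 37.80), (-122.45, 37.80)]

print(point_in_polygon(-122.42, 37.76, fake_zip_polygon))  # inside the square
print(point_in_polygon(-122.30, 37.76, fake_zip_polygon))  # outside the square
```

In practice a library such as geopandas does this lookup (and handles edge cases) for you; the sketch just shows what the lookup is doing underneath.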
1.1. Which mobile food vendor(s) sells at the most locations?
import pandas as pd
from matplotlib import pyplot as plt
import sqlalchemy as sqla
import sqlite3
%matplotlib inline
plt.style.use('ggplot')
import os.path
import numpy as np
Let’s locate the SQLite file and create the connection.
# resolve the database path relative to the working directory
db_path = os.path.abspath("sf_data.sqlite")
sf_conn = sqla.create_engine('sqlite:///' + db_path)
sql_query = """
select Applicant, count(distinct locationid) as loc_num
from mobile_food_permits
join mobile_food_schedule
  on mobile_food_schedule.permit = mobile_food_permits.permit
where mobile_food_permits.Status = 'APPROVED'
group by Applicant
"""
ven_loc = pd.read_sql_query(sql_query,sf_conn)
ven_loc = ven_loc.sort_values('loc_num', ascending=False)  # DataFrame.sort() was removed in newer pandas
ven_loc = ven_loc.reset_index(drop=True)
ven_loc.head()
| | Applicant | loc_num |
|---|---|---|
| 0 | Park's Catering | 23 |
| 1 | D & T Catering | 17 |
| 2 | F & C Catering | 13 |
| 3 | Munch A Bunch | 10 |
| 4 | Eva's Catering | 8 |
ven_loc.loc[ven_loc['loc_num'].idxmax()]
Applicant Park's Catering
loc_num 23
Name: 29, dtype: object
plot = ven_loc.plot(x = 'Applicant', kind = 'bar', legend = None,
                    title = 'Number of Locations per Applicant')
plot.set_ylabel('Number of Locations')
fig = plot.get_figure()
For this question, I joined the mobile_food_permits and mobile_food_schedule tables on the permit column. After selecting the Applicant field and counting the distinct location IDs, I got my ven_loc dataframe. From the result, we can see that the vendor Park’s Catering sells at the most locations, 23.
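As a sanity check, the same join-filter-count aggregation can be reproduced in pandas. The sketch below uses tiny made-up versions of the two tables (same column names as the real schema), not the actual data:

```python
import pandas as pd

# Hypothetical miniature versions of the two mobile_food_ tables
permits = pd.DataFrame({
    'permit': ['P1', 'P2', 'P3'],
    'Applicant': ["Park's Catering", 'D & T Catering', 'Expired Vendor'],
    'Status': ['APPROVED', 'APPROVED', 'EXPIRED'],
})
schedule = pd.DataFrame({
    'permit':     ['P1', 'P1', 'P1', 'P2', 'P3'],
    'locationid': [101, 102, 101, 201, 301],
})

# Join on permit, keep approved permits, count distinct locations per vendor
merged = schedule.merge(permits, on='permit')
approved = merged[merged['Status'] == 'APPROVED']
loc_counts = (approved.groupby('Applicant')['locationid']
              .nunique()
              .sort_values(ascending=False))
print(loc_counts)
# Park's Catering is scheduled twice at locationid 101, so it still
# counts only 2 distinct locations - the same semantics as
# count(distinct locationid) in the SQL query.
```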
1.2. Here I pose five questions about San Francisco and use the database to analyze them.
import geopandas
import folium
from pyzipcode import ZipCodeDatabase
import shapefile
from IPython.display import HTML
from folium.plugins import MarkerCluster
from folium import features
Question 1 Which parts of the city are the most and least expensive?
house = pd.read_sql_query("""select * from zillow""",sf_conn)
house.head()
| | RegionName | Date | ZriPerSqft_AllHomes | MedianSoldPricePerSqft_AllHomes | PriceToRentRatio_AllHomes | Turnover_AllHomes |
|---|---|---|---|---|---|---|
| 0 | 94109 | 2010-11-01 00:00:00 | 3.156 | 675.1913 | 19.14 | 6.0771 |
| 1 | 94110 | 2010-11-01 00:00:00 | 2.566 | 599.6785 | 18.10 | 5.4490 |
| 2 | 94122 | 2010-11-01 00:00:00 | 2.168 | 495.4432 | 17.99 | 2.4198 |
| 3 | 94080 | 2010-11-01 00:00:00 | 1.666 | 369.5538 | 18.31 | 3.8757 |
| 4 | 94112 | 2010-11-01 00:00:00 | 2.322 | 422.4538 | 14.44 | 3.1288 |
house = house[np.isfinite(house['MedianSoldPricePerSqft_AllHomes'])]
house = house.reset_index(drop=True)
plt.hist(house['MedianSoldPricePerSqft_AllHomes'], bins = 30, alpha = .4)
plt.xlabel('Median Sold House Price')
plt.show()
most_exp = house.loc[house['MedianSoldPricePerSqft_AllHomes'].idxmax()]
least_exp = house.loc[house['MedianSoldPricePerSqft_AllHomes'].idxmin()]
most_zip = most_exp['RegionName']
least_zip = least_exp['RegionName']
def get_location(code):
zcdb = ZipCodeDatabase()
zipcode = zcdb[code]
lat = zipcode.latitude
lon = zipcode.longitude
return [lat, lon]
most_loc = get_location(most_zip)
least_loc = get_location(least_zip)
SF_COORDINATES = (37.76, -122.45)
map_osm = folium.Map(location= SF_COORDINATES, zoom_start= 12.5)
folium.CircleMarker(least_loc, popup='Least Expensive Parts: ' + str(least_exp['MedianSoldPricePerSqft_AllHomes']),
                    radius=30, color='#3186cc', fill_color='#3186cc').add_to(map_osm)
folium.CircleMarker(most_loc, popup='Most Expensive Parts: ' + str(most_exp['MedianSoldPricePerSqft_AllHomes']),
                    radius=30, color='red', fill_color='red').add_to(map_osm)
map_osm
I queried the zillow table and used the MedianSoldPricePerSqft_AllHomes field to select the most and least expensive RegionName values. The region names are ZIP codes, so I used the pyzipcode package to convert them to latitude and longitude, then drew an interactive map with folium. The circle markers mark the most and least expensive parts of San Francisco: the red one is the most expensive and the blue one is the least expensive, and the popup text shows each region's median sold price. Zooming in, the most expensive part is located around Montgomery Street and Pine Street, while the least expensive part is in the Bayview area.
Question 2 Which parts of the city are the most dangerous?
crimedata = pd.read_sql_query("""select * from crime""",sf_conn)
crimedata.head()
| | IncidntNum | Category | Descript | DayOfWeek | Datetime | PdDistrict | Resolution | Address | Lon | Lat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 150060275 | NON-CRIMINAL | LOST PROPERTY | Monday | 2015-01-19 14:00:00 | MISSION | NONE | 18TH ST / VALENCIA ST | -122.421582 | 37.761701 |
| 1 | 150098210 | ROBBERY | ROBBERY, BODILY FORCE | Sunday | 2015-02-01 15:45:00 | TENDERLOIN | NONE | 300 Block of LEAVENWORTH ST | -122.414406 | 37.784191 |
| 2 | 150098210 | ASSAULT | AGGRAVATED ASSAULT WITH BODILY FORCE | Sunday | 2015-02-01 15:45:00 | TENDERLOIN | NONE | 300 Block of LEAVENWORTH ST | -122.414406 | 37.784191 |
| 3 | 150098210 | SECONDARY CODES | DOMESTIC VIOLENCE | Sunday | 2015-02-01 15:45:00 | TENDERLOIN | NONE | 300 Block of LEAVENWORTH ST | -122.414406 | 37.784191 |
| 4 | 150098226 | VANDALISM | MALICIOUS MISCHIEF, VANDALISM OF VEHICLES | Tuesday | 2015-01-27 19:00:00 | NORTHERN | NONE | LOMBARD ST / LAGUNA ST | -122.431119 | 37.800469 |
# definition of the boundaries in the map
district_geo = r'sfpddistricts.geojson'
# calculating total number of incidents per district
crimedata2 = pd.DataFrame(crimedata['PdDistrict'].value_counts().astype(float))
crimedata2.to_json('crimeagg.json')
crimedata2 = crimedata2.reset_index()
crimedata2.columns = ['District', 'Number']
plot = crimedata2.plot(x = 'District', kind = 'bar', legend = None,
                       title = 'Number of Crime Incidents per District')
plot.set_ylabel('Count')
fig = plot.get_figure()
# creation of the choropleth
# (note: newer folium versions replace Map.choropleth(geo_path=...) with
# folium.Choropleth(geo_data=...).add_to(map1))
map1 = folium.Map(location=SF_COORDINATES, zoom_start=12, tiles="Stamen Terrain")
map1.choropleth(geo_path = district_geo,
                data_out = 'crimeagg.json',
                data = crimedata2,
                columns = ['District', 'Number'],
                key_on = 'feature.properties.DISTRICT',
                fill_color = 'YlOrRd',
                fill_opacity = 0.7,
                line_opacity = 0.2,
                legend_name = 'Number of incidents per district')
map1
To analyze how frequently crime occurs in different parts of San Francisco, I created a choropleth. First, I needed a GeoJSON file describing the San Francisco police districts referenced in the data file. A Google search for “sfpd districts geojson” turned up a file that matched my needs.
The choropleth is based on the aggregated counts of crimes per district, and appears as above with a legend in the upper right corner; different colors represent different crime counts. From the map, we can see crimes happen most frequently in the deepest red region, which is the most dangerous part. We can conclude that the Southern district is the most dangerous and Richmond is the least dangerous.
Question 3 How are schools distributed in the city? In which areas are most schools located? Is there any relationship between school locations and crime incidents?
school = pd.read_sql_query("""select * from schools""",sf_conn)
school = school[np.isfinite(school['Lat'])]
lons = school['Lon']
lats= school['Lat']
school.head()
| | Name | Entity | LowerGrade | UpperGrade | GradeRange | Category | LowerAge | UpperAge | GeneralType | Address | Lat | Lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alamo Elementary School | SFUSD | 0 | 5 | K-5 | USD Grades K-5 | 5 | 10 | PS | 250 23RD AVE, San Francisco, CA 94121 | 37.783005 | -122.482300 |
| 1 | Alvarado Elementary School | SFUSD | 0 | 5 | K-5 | USD Grades K-5 | 5 | 10 | PS | 625 DOUGLASS ST, San Francisco, CA 94114 | 37.753681 | -122.438194 |
| 2 | Aptos Middle School | SFUSD | 6 | 8 | 6-8 | USD Grades 6-8 | 11 | 13 | PS | 105 APTOS AVE, San Francisco, CA 94127 | 37.729672 | -122.465782 |
| 3 | Argonne Early Education School | SFUSD | -2 | 0 | PK-TK | USD PreK/TK | 3 | 5 | PS | 750 16TH AVE, San Francisco, CA 94118 | 37.773968 | -122.474060 |
| 4 | Argonne Elementary School | SFUSD | 0 | 5 | K-5 | USD Grades K-5 | 5 | 10 | PS | 680 18TH AVE, San Francisco, CA 94121 | 37.775307 | -122.476311 |
data = np.array(
[
np.array(lats),
np.array(lons),
range(len(school)), # Popups text will be simple numbers.
]
).T
m = folium.Map(location= SF_COORDINATES, zoom_start= 12)
mc = MarkerCluster()  # imported from folium.plugins above
for k in range(len(school)):
    mk = folium.Marker([data[k][0], data[k][1]])
    p = folium.Popup(str(data[k][2]))
    mk.add_child(p)
    mc.add_child(mk)
m.add_child(mc)
# add choropleth map of crime to the school map
m.choropleth(geo_path = district_geo,
data_out = 'crimeagg.json',
data = crimedata2,
columns = ['District', 'Number'],
key_on = 'feature.properties.DISTRICT',
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.2,
legend_name = 'Number of incidents per district')
m
The map above shows the distribution of schools in San Francisco. The number in each circle represents the total number of schools in that area; you can zoom in and out to see the exact location of each school. Based on the result of Question 2, crimes happen most frequently in the deepest red region, which is the most dangerous. From the map, we can see that most schools are located near those dangerous areas; in other words, most crimes occur near school areas. This suggests an association between school locations and crime incidents, though it should not be read as meaning that crimes are caused by schools: dense central neighborhoods simply contain both more schools and more reported crime.
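One way to make the school/crime relationship concrete is to count schools and crime incidents per police district and compute a correlation. A sketch with made-up per-district counts (the real values would come from the crime aggregation above and from assigning each school to a district):

```python
import numpy as np

# Hypothetical per-district counts, invented for illustration only
districts = ['SOUTHERN', 'MISSION', 'NORTHERN', 'RICHMOND', 'TARAVAL']
crime_counts  = np.array([390, 310, 280, 110, 150])
school_counts = np.array([14, 12, 10, 6, 7])

# Pearson correlation between crime counts and school counts
r = np.corrcoef(crime_counts, school_counts)[0, 1]
print(round(r, 3))  # close to 1 for these invented numbers
```

A strongly positive r would support the visual impression of overlap, but, as noted above, correlation alone does not identify a cause.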
Question 4 Which types of noise draw the most and the fewest complaints? Which parts of the city have the most noise cases?
noise = pd.read_sql_query("""select * from noise""",sf_conn)
noise = noise[np.isfinite(noise['Lat'])]
# calculating total number of noise case per type
noise2 = pd.DataFrame(noise['Type'].value_counts().astype(float))
noise2 = noise2.reset_index()
noise2.columns = ['Type', 'Number']
plot = noise2.plot(x = 'Type', kind = 'bar', legend = None,
                   title = 'Number of Noise Cases per Type')
plot.set_ylabel('Count')
fig = plot.get_figure()
lat = noise['Lat']
lon = noise['Lon']
type_ = noise['Type']
# note: stacking floats and strings promotes everything to strings
data = np.array(
    [
        np.array(lat),
        np.array(lon),
        np.array(type_),  # Popup text will be the noise type.
    ]
).T
noise_m = folium.Map(location= SF_COORDINATES, zoom_start= 12)
mc = MarkerCluster()  # imported from folium.plugins above
for k in range(len(noise)):
    # cast back to float, since the stacked array stores strings
    mk = folium.Marker([float(data[k][0]), float(data[k][1])])
    p = folium.Popup(str(data[k][2]))
    mk.add_child(p)
    mc.add_child(mk)
noise_m.add_child(mc)
From the table above, we can see that “other excessive noise” is the most complained-about type, with 1080 cases, followed by “construction private property” with 744 cases. The fewest cases occurred at hospitals: only 4 since August 2015. Among all types, most noise comes from construction and electronic equipment; noise from vehicles and traffic is comparatively rare. On the map, the number in each circle represents the total number of noise cases in that area; you can zoom in and out to see the location of each case, and the popup text shows its noise type. Most noise complaints were reported in central San Francisco, which has a larger population and more stores, and therefore more noise.
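The most and least complained-about types can also be read off programmatically with value_counts, shown here on a tiny made-up sample of the Type column rather than the real noise table:

```python
import pandas as pd

# Hypothetical sample of the noise table's Type column
noise_types = pd.Series([
    'other excessive noise', 'other excessive noise',
    'construction private property', 'construction private property',
    'other excessive noise', 'hospital',
])

counts = noise_types.value_counts()
most_complained = counts.idxmax()   # type with the most cases
least_complained = counts.idxmin()  # type with the fewest cases
print(most_complained, least_complained)
```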
Question 5 How are the mobile food vendors distributed for different facility types?
sql_query1 = """
select mobile_food_schedule.locationid, Applicant, FacilityType, Longitude, Latitude
from mobile_food_schedule
left join mobile_food_permits
  on mobile_food_schedule.permit = mobile_food_permits.permit
left join mobile_food_locations
  on mobile_food_locations.locationid = mobile_food_schedule.locationid
where mobile_food_permits.Status = 'APPROVED'
"""
vendor = pd.read_sql_query(sql_query1,sf_conn)
vendor = vendor.drop_duplicates()
vendor = vendor[vendor['Longitude']!= 0]
vendor = vendor.reset_index(drop=True)
vendor.head()
| | locationid | Applicant | FacilityType | Longitude | Latitude |
|---|---|---|---|---|---|
| 0 | 654163 | Maiwand Halal Kabob Truck | Truck | -122.402314 | 37.793393 |
| 1 | 654165 | Maiwand Halal Kabob Truck | Truck | -122.397043 | 37.793214 |
| 2 | 676176 | The Huge Hotdog Concession | Truck | -122.420091 | 37.788877 |
| 3 | 676797 | Peruchi Food Truck,LLC | Truck | -122.397273 | 37.796123 |
| 4 | 680504 | Halal Cart of San Francisco | Push Cart | -122.402567 | 37.793149 |
Lat = vendor['Latitude']
Lon = vendor['Longitude']
Type = vendor['FacilityType']
Name = vendor['Applicant']
vm = folium.Map(location= SF_COORDINATES, zoom_start= 12)
for i in range(len(Type)):
if Type[i] == 'Truck':
folium.RegularPolygonMarker([Lat[i],Lon[i]],
fill_color='red', radius=12, popup=Name[i]).add_to(vm)
if Type[i] == 'Push Cart':
folium.RegularPolygonMarker([Lat[i],Lon[i]],
fill_color='#43d9de', radius=12, popup=Name[i]).add_to(vm)
vm
On the map, I used polygon markers to mark the mobile food vendors in San Francisco. The red markers represent truck vendors and the blue ones represent push-cart vendors; the popup text shows each vendor’s name. From the map, we can see that most vendors are trucks, and the trucks are located all over the city, while push-cart vendors are more concentrated. Based on the result of Problem 1, we know that some vendors, such as Park’s Catering and D & T Catering, sell at many locations. For most of these vendors the locations are not far from each other (D & T Catering, for example), though Eva’s Catering is a notable exception, with widely dispersed locations.
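The visual impression of dispersion can be quantified, for example by the standard deviation of each vendor's coordinates. A sketch on made-up positions (the real values would come from the vendor dataframe above):

```python
import pandas as pd

# Hypothetical vendor locations, invented for illustration
vendor = pd.DataFrame({
    'Applicant': ['Tight Vendor', 'Tight Vendor', 'Spread Vendor', 'Spread Vendor'],
    'Latitude':  [37.761, 37.762, 37.710, 37.800],
    'Longitude': [-122.421, -122.422, -122.510, -122.390],
})

# Per-vendor spread: standard deviation of latitude and longitude
spread = vendor.groupby('Applicant')[['Latitude', 'Longitude']].std()
most_dispersed = spread.sum(axis=1).idxmax()
print(most_dispersed)  # the vendor whose locations are farthest apart
```

Running the same groupby on the real vendor dataframe would rank vendors such as Eva’s Catering by how spread out their selling locations are.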