I saw a post a while back about drug use in different SF neighborhoods. The most basic example was how there was a higher percentage of crack and cocaine users in the Tenderloin compared to higher percentages of weed users in the Haight.

It affirmed stuff that people already knew, but it was still pretty interesting how the data reflected the theory in the police reports. I decided to use the same dataset to just see if any of the neighborhoods had higher percentages that stood out in terms of certain crimes. Essentially I wanted to ask the question: Does each neighborhood have a stand-out crime it would be known for?

With the SF Kaggle competition going on right now on predicting crimes in San Francisco, there’s got to be some feature selection that’s looking at some of the more common crimes that would get classified based on area and location.

The SF crime dataset is hosted by Socrata (note: I used to work for them as an intern) and consists of open SF crime data since 2003. I used an 800-meter radius, approximately half a mile, around a center point in each neighborhood. I picked the center points at my own discretion, based on where the Google Maps neighborhood label was (you can check out the coordinates at the bottom of this post). I grouped the data by category and used crime data since January 2014 as a recency cutoff.
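To make the radius idea concrete, here’s a minimal sketch of the distance math behind a circle filter. The API does this filtering on the server, so this is only an illustration. The `haversine_m` helper and the Earth radius constant are my own assumptions, not part of the project code. It also shows that two neighboring center points can sit less than two radii apart, in which case an incident in the overlap would be counted for both neighborhoods.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (an approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


RADIUS_M = 800
radius_miles = RADIUS_M / 1609.344  # just under half a mile

# Center points copied from the coordinate list at the bottom of the post.
mission = (37.759656, -122.414797)
potrero = (37.759798, -122.399495)

distance_m = haversine_m(*mission, *potrero)   # about 1.3 km
circles_overlap = distance_m < 2 * RADIUS_M    # True when the two circles intersect

print(round(radius_miles, 3), round(distance_m), circles_overlap)
```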

Most Common Crime in Each Neighborhood

Top Crime in Each Neighborhood and its occurrences from Jan 1st 2014 to August 31st 2015

The graph displays the most common crime in each neighborhood and its count. Larceny and theft really dominate the scene, mostly because the category includes car prowls, shoplifting, and so on, but also because it’s the easiest and most common petty crime of all time. “Other Offenses” also leads in certain neighborhoods; those are usually a variety of crimes that just don’t get categorized. If you subset for “Other Offenses” and look down the list of descriptions, it goes from traffic violation arrests to permit violations to animal cruelty. It might also just be a strong indicator of homelessness, but there is no real way to prove that; the only hint is that the Mission District and Portola lead in this category.
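Picking the top crime per neighborhood comes down to a group-by and a max. Here’s a small sketch in pandas; the neighborhood names and counts are made up for illustration and are not figures from the dataset.

```python
import pandas as pd

# Hypothetical counts, invented for illustration only.
counts = pd.DataFrame({
    "neighborhood": ["Neighborhood A"] * 3 + ["Neighborhood B"] * 3,
    "category": ["LARCENY/THEFT", "OTHER OFFENSES", "ASSAULT"] * 2,
    "count": [120, 80, 40, 30, 55, 20],
})

# For each neighborhood, find the row label of the largest count...
top_idx = counts.groupby("neighborhood")["count"].idxmax()

# ...and keep only those rows, one per neighborhood.
top_crime = counts.loc[top_idx].set_index("neighborhood")
print(top_crime)
```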


San Francisco Neighborhood Map



With the total counts of each crime within each neighborhood covered, it gets more interesting when we look at each crime as a percentage of the neighborhood’s total crime count, which normalizes the crime categories. Once we have the percentages for each crime category, we can select the top percentages among the neighborhoods in SF and see which ones stand out. In the end, some neighborhoods never showed up at all in specific cases.
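The normalization step can be sketched like this. The table is hypothetical (rows are neighborhoods, columns are categories), and the point is that a neighborhood with fewer incidents of a category can still lead in that category’s share.

```python
import pandas as pd

# Hypothetical counts, invented for illustration only.
counts = pd.DataFrame(
    {"LARCENY/THEFT": [500, 60], "MISSING PERSON": [20, 12], "ASSAULT": [90, 48]},
    index=["Busy neighborhood", "Quiet neighborhood"],
)

# Share of each category within its own neighborhood, in percent.
# Dividing row-wise by the row totals makes every row sum to 100.
pct = counts.div(counts.sum(axis=1), axis=0) * 100

# For each category, the neighborhood where that category has the largest share.
leader = pct.idxmax(axis=0)
print(pct.round(1))
print(leader)
```

Here the quiet neighborhood has fewer missing person cases in absolute terms but a larger share of them, so it “leads” that category.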

Bar chart showing neighborhoods and the crimes that rank the highest when compared to the other neighborhoods. The y-axis is the percentage of that specific crime within its neighborhood.


There are a couple of ways to interpret this graph. Here’s the breakdown:

  • The nicer neighborhoods such as SOMA, the Marina District, and Hayes Valley tend to lead in larceny and theft because they don’t have too much of the other, more violent, “scary” kinds of crime.
  • Potrero Hill leads the Missing Person category over the other neighborhoods by a substantial amount. Missing Person makes up ten percent of its overall crimes and is its second-highest category. Somehow people end up missing in Potrero Hill? It also leads the other neighborhoods in percentages of trespass, sex offenses, runaway, and suspicious occurrences, even though its overall crime rate is very low.
  • The Mission District leads in percentages of Other Offenses and Prostitution, but I labeled it on the map as Prostitution because of how much higher it was comparatively. In the last year and a half, the Mission has had 170 cases of prostitution, which is only 1.3 percent of its total crimes yet still an order of magnitude higher than in other neighborhoods.
  • Overall, a lot of the neighborhoods share common characteristics in their crime type distributions. Many neighborhoods are similar across the board, but some stand out.
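
As a quick sanity check on the Mission numbers, the two figures quoted in the list imply the size of the Mission sample. This is only back-of-the-envelope arithmetic, and since the 1.3 percent is rounded, the result is approximate.

```python
prostitution_cases = 170   # count quoted above for the Mission
share_of_total = 0.013     # 1.3 percent, also quoted above (rounded)

# Implied total number of incidents in the Mission sample: roughly 13,000.
implied_total = prostitution_cases / share_of_total
print(round(implied_total))
```

A small share of a very large total can still be a large count, which is why percentages and raw counts tell different stories.
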
Worst Crime Map Ever Drawn


Overall, that is my crime map drawn out and labeled. The Mac doesn’t even have a real Paint application, so I used something off the App Store. It’s the true data scientist design work that really shows through. It’s kind of comical but still readable.

One important thing to note about the visualization is that it’s a huge generalization based on a couple of years of data and obviously does not tell the whole story. Inner Sunset will never be the arson capital of San Francisco. Arson makes up 0.5 percent of the crimes in Inner Sunset and maybe occurred just a few times, but I really couldn’t leave that area without a labeled crime type for aesthetic purposes, so that’s on me. Please do not confuse highlights with higher occurrences. The Mission and the Tenderloin both have more crime than many of the other neighborhoods combined, but they just can’t be labeled with that many different categories.

Here’s some of the code used to complete the project. I also put in a Python module that will query the SF crime dataset based on any kind of conditions. The API is very useful, and I’m thinking about wrapping it in a Google Maps user interface soon to compare different locations based on user pinpoints.


import pandas as pd
import requests

sf = 'San Francisco, CA'

neighborhoods = [
    {'label': 'SOMA', 'city': sf, 'lat': 37.778955, 'lng': -122.402347},
    {'label': 'Mission District', 'city': sf, 'lat': 37.759656, 'lng': -122.414797},
    {'label': 'Potrero Hill', 'city': sf, 'lat': 37.759798, 'lng': -122.399495},
    {'label': 'Noe Valley', 'city': sf, 'lat':37.749145, 'lng': -122.431090},
    {'label': 'The Castro', 'city': sf, 'lat':37.758010, 'lng': -122.434726},
    {'label': 'Bernal Heights', 'city': sf, 'lat':37.741662, 'lng': -122.414473},
    {'label': 'Tenderloin', 'city':sf, 'lat':37.783784, 'lng': -122.415870},
    {'label': 'North Beach', 'city':sf, 'lat':37.800578, 'lng': -122.412786},
    {'label': 'Portola Place', 'city':sf, 'lat':37.729333, 'lng':-122.397045},
    {'label': 'Excelsior', 'city':sf, 'lat':37.725222, 'lng': -122.426830},
    {'label': 'Hayes/Western Addition', 'city':sf, 'lat':37.780036, 'lng':-122.431399},
    {'label': 'Haight Ashbury', 'city':sf, 'lat':37.772437, 'lng':-122.440731},
    {'label': 'Marina District', 'city':sf, 'lat':37.800685, 'lng':-122.436623},
    {'label': 'Laurel Heights', 'city':sf, 'lat':37.781839, 'lng': -122.450356},
    {'label': 'Inner Richmond/Richmond', 'city':sf, 'lat':37.780175, 'lng': -122.471751},
    {'label': 'Outer Richmond', 'city':sf, 'lat':37.776502, 'lng':-122.498408},
    {'label': 'Outer Sunset', 'city':sf, 'lat':37.757396, 'lng': -122.492781},
    {'label': 'Sunset District', 'city':sf, 'lat':37.747936, 'lng':-122.488887},
    {'label': 'Inner Sunset', 'city':sf, 'lat':37.759149, 'lng': -122.467827},
    {'label': 'Forest Hill', 'city':sf, 'lat':37.744198, 'lng':-122.463872}
]

def sf_crime_api(shape, rad, lat, lng, s_date, e_date, granular):
    """
    Builds the SF crime API query URL. Requesting it returns a JSON list of
    crime categories, dates, and counts of occurrences.
    - shape: geo filter function name, e.g. 'within_circle' (the argument
      order built below, location/lat/lng/radius, matches within_circle)
    - rad: Number of meters radius from the point
    - lat: latitude of the center point
    - lng: longitude of the center point
    - s_date: start date in 'yyyy-mm-dd' format
    - e_date: end date in 'yyyy-mm-dd' format
    - granular: 'ym' represents by month, 'y' represents by year
    """
    return "https://data.sfgov.org/resource/cuks-n6tp.json?$select=COUNT(*),category,date_trunc_" + \
        granular + "(date) as date_key&$where=" + shape + "(location," + str(lat) + "," + str(lng) + "," + str(rad) + ") AND date>" + \
        "'" + s_date + "' AND date<'" + e_date + "'&$group=date_key,category&$limit=50000"


# Example: grab all crimes within a 500 meter radius of the Outer Sunset coordinates
url = sf_crime_api('within_circle', 500, 37.757396, -122.492781, '2015-01-01', '2015-10-01', 'ym')
response = requests.get(url).json()  # list of dicts, one per (month, category) group
df = pd.DataFrame(response)
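
To run the same query for every neighborhood, one option is to build the URLs in a loop. The sketch below is a variation on the function above, not the code I originally ran. `build_crime_url` is a hypothetical helper that supports only the circle filter and uses the standard library’s `urlencode` so spaces and quotes in the query get escaped. The two sample entries come from the list above, and the date range is just an example.

```python
from urllib.parse import urlencode

# Dataset endpoint taken from the code above.
BASE_URL = "https://data.sfgov.org/resource/cuks-n6tp.json"


def build_crime_url(lat, lng, rad, s_date, e_date, granular="ym"):
    """Hypothetical helper: same query as sf_crime_api, circle filter only."""
    params = {
        "$select": f"COUNT(*),category,date_trunc_{granular}(date) as date_key",
        "$where": (f"within_circle(location,{lat},{lng},{rad}) "
                   f"AND date>'{s_date}' AND date<'{e_date}'"),
        "$group": "date_key,category",
        "$limit": 50000,
    }
    # urlencode percent-escapes spaces, quotes, and the leading '$' of each key.
    return BASE_URL + "?" + urlencode(params)


# Two entries copied from the neighborhoods list; the full list works the same way.
sample = [
    {"label": "SOMA", "lat": 37.778955, "lng": -122.402347},
    {"label": "Tenderloin", "lat": 37.783784, "lng": -122.415870},
]

# One URL per neighborhood, using the 800 meter radius from the analysis.
urls = {
    n["label"]: build_crime_url(n["lat"], n["lng"], 800, "2014-01-01", "2015-09-01")
    for n in sample
}
for label, url in urls.items():
    print(label, url)
```

Each URL can then be passed to `requests.get(...).json()` exactly as in the Outer Sunset example.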