# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from janitor import clean_names
# Theme setting
="talk", style="whitegrid",
sns.set_theme(context="colorblind", color_codes=True,
palette={"figure.figsize": [12, 8]}
rc )
Exploratory Analysis of the Calgary’s Crime Data
Background
Chol Aruei and I recently formed a Python meetup group to meet every Sunday around 3 pm (CST) to further our data science learning, particularly with Python. We aim to have each of us bring his/her data and present it to the team in the future. And this project is our first project since we started to learn together. Chol provided the data, and I am leading the analysis because I have been programming in Python a bit longer than him.
With that in mind, let’s switch gears to walk you through our project.
We are exploring Calgary’s crime data from 2012 to 2019. The data was obtained from Calgary website. Below is a brief description of the dataset.
Data is provided monthly by the Calgary Police Service. This includes both crime and disorder statistics.
Crime count is based on the most serious violation (MSV) per incident. Violence: These figures include all violent crime offences as defined by the Centre for Canadian Justice Statistics Universal Crime Reporting (UCR) rules. Domestic violence is excluded. Break and Enter: Residential B&E includes both House and ‘Other’ structure break and enters due to the predominantly residential nature of this type of break in (e.g. detached garages, sheds). B&Es incidents include attempts.
Disorder events consist of Social and Physical disorder. Only public-generated calls are included.
Social disorder includes: Drunk, Disturbance, Indecent Act, Juvenile Complaint, Landlord/tenant, Mental health concern, Neighbor dispute, Party complaint, Prowler, Suspicious person, Threats, Drugs, Noise complaint, Possible gunshots, Unwanted guest/patron, Prostitution, Speeder, Suspicious Auto.
Physical disorder includes: Fire, Property damage and Abandoned auto.
**Resident counts are pulled from census data which is collected Apr/May of each year and apply to community crime stats from June of current year to May of the following year.
ex) 2018 Census data applies to June 2018-May 2019 Community Crime stats
Importing the Libraries
Here we will load pandas
as pd, numpy
as np, matplotlib.pyplot
as plt, and seaborn
as sns. Also, we will use Quarto
to display all the columns as shown in the below code chunk.
Loading the Dataset
# Importing the dataset from GitHub
= 'https://raw.githubusercontent.com/tongakuot/cushanalytics/main/00_data/Community_Crime_Statistics%20(1).csv'
url
= pd.read_csv(url)
crime_raw
# Inspect the first 5 rows
crime_raw.head()
Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | Month | long | lat | ID | Community Center Point | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NORTHWEST | BOWNESS | Theft OF Vehicle | 4 | 11150.0 | 2022/07 | 2022 | JUL | -114.187826 | 51.083359 | 2022-JUL-BOWNESS-Theft OF Vehicle | POINT (-114.187826026643 51.083359442425) |
1 | CENTRE | INGLEWOOD | Theft OF Vehicle | 1 | NaN | 2017/04 | 2017 | APR | -114.019120 | 51.034141 | 2017-APR-INGLEWOOD-Theft OF Vehicle | POINT (-114.01912 51.03414052) |
2 | NORTHEAST | MERIDIAN | Theft OF Vehicle | 10 | 0.0 | 2022/07 | 2022 | JUL | -113.996513 | 51.056587 | 2022-JUL-MERIDIAN-Theft OF Vehicle | POINT (-113.996513423841 51.056586683694) |
3 | EAST | SOUTHVIEW | Theft FROM Vehicle | 3 | NaN | 2017/04 | 2017 | APR | -113.997338 | 51.034156 | 2017-APR-SOUTHVIEW-Theft FROM Vehicle | POINT (-113.9973381 51.03415568) |
4 | EAST | SOUTHVIEW | Assault (Non-domestic) | 4 | 1805.0 | 2022/07 | 2022 | JUL | -113.997338 | 51.034156 | 2022-JUL-SOUTHVIEW-Assault (Non-domestic) | POINT (-113.997338140443 51.034155684723) |
# Inspect the last 5 rows
crime_raw.tail()
Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | Month | long | lat | ID | Community Center Point | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
63586 | SOUTH | SOUTHWOOD | Break & Enter - Other Premises | 2 | NaN | 2021/01 | 2021 | JAN | -114.083201 | 50.957774 | 2021-JAN-SOUTHWOOD-Break & Enter - Other Premises | POINT (-114.08320089308144 50.9577735651428) |
63587 | CENTRE | HIGHWOOD | Break & Enter - Other Premises | 2 | NaN | 2021/05 | 2021 | MAY | -114.079901 | 51.092356 | 2021-MAY-HIGHWOOD-Break & Enter - Other Premises | POINT (-114.07990104743979 51.0923557749049) |
63588 | CENTRE | CLIFF BUNGALOW | Break & Enter - Dwelling | 1 | 1895.0 | 2022/06 | 2022 | JUN | -114.074895 | 51.033930 | 2022-JUN-CLIFF BUNGALOW-Break & Enter - Dwelling | POINT (-114.074894864056 51.033930231501) |
63589 | WEST | ROSSCARROCK | Assault (Non-domestic) | 2 | 3625.0 | 2022/06 | 2022 | JUN | -114.145494 | 51.043277 | 2022-JUN-ROSSCARROCK-Assault (Non-domestic) | POINT (-114.145493740992 51.043277392563) |
63590 | NORTHWEST | COLLINGWOOD | Street Robbery | 1 | 2249.0 | 2022/05 | 2022 | MAY | -114.102401 | 51.083504 | 2022-MAY-COLLINGWOOD-Street Robbery | POINT (-114.102401360913 51.083504232298) |
Next, we’ll check for Missing Values
Below, we use .isna() together with .sum() method to check for missing values in each column.
# Check for nas using .isna() method. We can also use .isnull(); I prefer .isna().
# df.isnull().sum()
sum() crime_raw.isna().
Sector 182
Community Name 0
Category 0
Crime Count 0
Resident Count 56986
Date 0
Year 0
Month 0
long 182
lat 182
ID 0
Community Center Point 182
dtype: int64
The above printout shows that our data is complete; there are no missing values.
Checking for Duplicates
Sometimes, it might be necessary to check for duplicates in your dataset. Here, we use the Sector
column to illustrate how to do this in pandas
. However, it’s crucial to better understand your dataset before performing this step because some duplicates might be necessary, just as in this dataset.
# Check for duplicated values
'Sector'].duplicated().sum() crime_raw[
63582
Writing Custom Functions
It’s always a good practice not to repeat yourself. So, automating your data cleaning processes when possible is of paramount importance. Below are the functions we wrote for transforming our dataset.
Checking for Unique Values
We may also want to know the number of unique elements in each column, as shown below. However, checking each column at a time may become tedious and time-consuming for a dataset with too many columns. So, we will write a function to automate this task in the next section.
# Display the sector column values
'Sector'].unique() crime_raw[
array(['NORTHWEST', 'CENTRE', 'NORTHEAST', 'EAST', 'NORTH', 'SOUTHEAST',
nan, 'SOUTH', 'WEST'], dtype=object)
# Display the category column values
'Category'].unique() crime_raw[
array(['Theft OF Vehicle', 'Theft FROM Vehicle', 'Assault (Non-domestic)',
'Violence Other (Non-domestic)', 'Break & Enter - Commercial',
'Break & Enter - Other Premises', 'Break & Enter - Dwelling',
'Street Robbery', 'Commercial Robbery'], dtype=object)
# Inspect column names
crime_raw.columns
Index(['Sector', 'Community Name', 'Category', 'Crime Count', 'Resident Count',
'Date', 'Year', 'Month', 'long', 'lat', 'ID', 'Community Center Point'],
dtype='object')
# Investigate all the elements within each Feature
def check_column_elements(df):
"""
Investigate the number of elements in each column in a pandas data frame and print the column name, count, and values if a column contains less than 12. Otherwise, print the column name and the number of elements.
Args:
df (pandas.DataFrame): The data frame whose columns will be investigated.
"""
for column in df:
= np.unique(df[column])
unique_vals = len(unique_vals)
number_values if number_values < 12:
print(f'The number of values for feature {column} :{number_values} -- {unique_vals}')
else:
print(f'The number of values for feature {column} :{number_values}')
# Investigate column elements
# check_column_elements(crime_raw)
Wrangling the Data
We clean and transform our dataset in this section with a custom function.
# Write a function for transforming the data
def tweak_calgary_crime_data(df):
"""
Tweak Calgary's Crime dataset.
Args:
df (pandas.DataFrame): The data frame to be tweaked.
"""
return(df
'Community Center Point', 'ID', 'Resident Count', 'Year', 'Month'],
.drop([= 'columns')
axis
.assign(= lambda df: pd.to_datetime(df['Date']),
Date = lambda df_: df_['Category'].str.lower(),
Category = lambda df_: df_['Date'].dt.year,
year = lambda df_: df_['Date'].dt.month_name()
month
)
.clean_names()'~ sector.isna()')
.query( )
# Clean the data with our new function
= tweak_calgary_crime_data(crime_raw)
df df.head()
sector | community_name | category | crime_count | date | long | lat | year | month | |
---|---|---|---|---|---|---|---|---|---|
0 | NORTHWEST | BOWNESS | theft of vehicle | 4 | 2022-07-01 | -114.187826 | 51.083359 | 2022 | July |
1 | CENTRE | INGLEWOOD | theft of vehicle | 1 | 2017-04-01 | -114.019120 | 51.034141 | 2017 | April |
2 | NORTHEAST | MERIDIAN | theft of vehicle | 10 | 2022-07-01 | -113.996513 | 51.056587 | 2022 | July |
3 | EAST | SOUTHVIEW | theft from vehicle | 3 | 2017-04-01 | -113.997338 | 51.034156 | 2017 | April |
4 | EAST | SOUTHVIEW | assault (non-domestic) | 4 | 2022-07-01 | -113.997338 | 51.034156 | 2022 | July |
Excluding Unwanted Column Values
After carefully examining our data, we noticed that the Community Name
column contains elements that start with at least one digit. For example, '10D', '05B', '01C', '05D', '13G'
, etc. And these “substrings” do not make sense since the Community Name
column represents various communities within each region or sector of Calgary. So, we will remove them using the below function (Credits: José Gonçalves)
.
# Credits: José Gonçalves
def exclude_substring(df, col, substr):
"""
Exclude a given list of substrings from the "community" column that start with at least one digit in a pandas data frame.
Args:
df (pandas.DataFrame): The data frame to be filtered.
col (str): The name of the column to be filtered.
substr (list): The list of substrings to be excluded.
Returns:
pandas.DataFrame: The filtered data frame.
"""
return df[~df[col].str.contains('|'.join(substr))]
# Remove the unwanted values in the community name column
= exclude_substring(df, 'community_name', ['^\d+'])
df
# Inspect the first 5 rows
df.head()
sector | community_name | category | crime_count | date | long | lat | year | month | |
---|---|---|---|---|---|---|---|---|---|
0 | NORTHWEST | BOWNESS | theft of vehicle | 4 | 2022-07-01 | -114.187826 | 51.083359 | 2022 | July |
1 | CENTRE | INGLEWOOD | theft of vehicle | 1 | 2017-04-01 | -114.019120 | 51.034141 | 2017 | April |
2 | NORTHEAST | MERIDIAN | theft of vehicle | 10 | 2022-07-01 | -113.996513 | 51.056587 | 2022 | July |
3 | EAST | SOUTHVIEW | theft from vehicle | 3 | 2017-04-01 | -113.997338 | 51.034156 | 2017 | April |
4 | EAST | SOUTHVIEW | assault (non-domestic) | 4 | 2022-07-01 | -113.997338 | 51.034156 | 2022 | July |
df.community_name.unique()
array(['BOWNESS', 'INGLEWOOD', 'MERIDIAN', 'SOUTHVIEW', 'WEST HILLHURST',
'ABBEYDALE', 'ACADIA', 'ALBERT PARK/RADISSON HEIGHTS', 'ALTADORE',
'ALYTH/BONNYBROOK', 'APPLEWOOD PARK', 'ARBOUR LAKE', 'ASPEN WOODS',
'AUBURN BAY', 'BANFF TRAIL', 'BANKVIEW', 'BEDDINGTON HEIGHTS',
'BELMONT', 'BELTLINE', 'BRAESIDE', 'BRENTWOOD',
'BRIDGELAND/RIVERSIDE', 'BRIDLEWOOD', 'BRITANNIA',
'BURNS INDUSTRIAL', 'CALGARY INTERNATIONAL AIRPORT',
'CAMBRIAN HEIGHTS', 'CANYON MEADOWS', 'CAPITOL HILL', 'CARRINGTON',
'CASTLERIDGE', 'CEDARBRAE', 'CHAPARRAL', 'CHARLESWOOD',
'CHINATOWN', 'CHRISTIE PARK', 'CITADEL', 'CITYSCAPE',
'CLIFF BUNGALOW', 'COACH HILL', 'COLLINGWOOD', 'COPPERFIELD',
'CORAL SPRINGS', 'CORNERSTONE', 'COUNTRY HILLS VILLAGE',
'COUNTRY HILLS', 'COVENTRY HILLS', 'CRANSTON', 'CRESCENT HEIGHTS',
'CRESTMONT', 'DALHOUSIE', 'DEER RIDGE', 'DEER RUN',
'DEERFOOT BUSINESS CENTRE', 'DISCOVERY RIDGE', 'DOUGLASDALE/GLEN',
'DOVER', 'DOWNTOWN COMMERCIAL CORE', 'DOWNTOWN EAST VILLAGE',
'DOWNTOWN WEST END', 'EAGLE RIDGE', 'EAST FAIRVIEW INDUSTRIAL',
'EAST SHEPARD INDUSTRIAL', 'EASTFIELD', 'EAU CLAIRE', 'EDGEMONT',
'ELBOW PARK', 'ELBOYA', 'ERIN WOODS', 'ERLTON', 'EVANSTON',
'EVERGREEN', 'FAIRVIEW INDUSTRIAL', 'FAIRVIEW', 'FALCONRIDGE',
'FISH CREEK PARK', 'FOOTHILLS', 'FOREST HEIGHTS',
'FOREST LAWN INDUSTRIAL', 'FOREST LAWN', 'FRANKLIN',
'GARRISON GREEN', 'GARRISON WOODS', 'GLAMORGAN', 'GLENBROOK',
'GLENDALE', 'GLENMORE PARK', 'GOLDEN TRIANGLE', 'GREAT PLAINS',
'GREENVIEW INDUSTRIAL PARK', 'GREENVIEW', 'GREENWOOD/GREENBRIAR',
'HAMPTONS', 'HARVEST HILLS', 'HAWKWOOD', 'HAYSBORO',
'HIDDEN VALLEY', 'HIGHFIELD', 'HIGHLAND PARK', 'HIGHWOOD',
'HILLHURST', 'HOMESTEAD', 'HORIZON',
'HOUNSFIELD HEIGHTS/BRIAR HILL', 'HUNTINGTON HILLS',
'KELVIN GROVE', 'KILLARNEY/GLENGARRY', 'KINGSLAND',
'LAKE BONAVISTA', 'LAKEVIEW', 'LEGACY', 'LINCOLN PARK',
'LIVINGSTON', 'LOWER MOUNT ROYAL', 'MAHOGANY',
'MANCHESTER INDUSTRIAL', 'MANCHESTER', 'MAPLE RIDGE',
'MARLBOROUGH PARK', 'MARLBOROUGH', 'MARTINDALE', 'MAYLAND HEIGHTS',
'MAYLAND', 'MCCALL', 'MCKENZIE LAKE', 'MCKENZIE TOWNE',
'MEADOWLARK PARK', 'MIDNAPORE', 'MILLRISE', 'MISSION',
'MONTEREY PARK', 'MONTGOMERY', 'MOUNT PLEASANT', 'NEW BRIGHTON',
'NOLAN HILL', 'NORTH AIRWAYS', 'NORTH GLENMORE PARK',
'NORTH HAVEN', 'NOSE HILL PARK', 'OAKRIDGE', 'OGDEN SHOPS',
'OGDEN', 'PALLISER', 'PANORAMA HILLS', 'PARKDALE', 'PARKHILL',
'PATTERSON', 'PEGASUS', 'PENBROOKE MEADOWS', 'PINERIDGE',
'POINT MCKAY', 'PUMP HILL', 'QUEENSLAND', 'RAMSAY', 'RANCHLANDS',
'RED CARPET', 'REDSTONE', 'RENFREW', 'RICHMOND', 'RIVERBEND',
'ROCKY RIDGE', 'ROSEDALE', 'ROSEMONT', 'ROSSCARROCK', 'ROYAL OAK',
'ROYAL VISTA', 'RUNDLE', 'RUTLAND PARK', 'SADDLE RIDGE INDUSTRIAL',
'SADDLE RIDGE', 'SAGE HILL', 'SANDSTONE VALLEY', 'SCENIC ACRES',
'SECTION 23', 'SETON', 'SHAGANAPPI', 'SHAWNEE SLOPES', 'SHAWNESSY',
'SHEPARD INDUSTRIAL', 'SHERWOOD', 'SIGNAL HILL', 'SILVER SPRINGS',
'SILVERADO', 'SKYLINE EAST', 'SKYLINE WEST', 'SKYVIEW RANCH',
'SOMERSET', 'SOUTH AIRWAYS', 'SOUTH CALGARY', 'SOUTH FOOTHILLS',
'SOUTHWOOD', 'SPRINGBANK HILL', 'SPRUCE CLIFF',
'ST. ANDREWS HEIGHTS', 'STARFIELD', 'STONEGATE LANDING',
'STONEY 1', 'STONEY 2', 'STONEY 3', 'STONEY 4', 'STRATHCONA PARK',
'SUNALTA', 'SUNDANCE', 'SUNNYSIDE', 'SUNRIDGE', 'TARADALE',
'TEMPLE', 'THORNCLIFFE', 'TUSCANY', 'TUXEDO PARK',
'UNIVERSITY DISTRICT', 'UNIVERSITY HEIGHTS',
'UNIVERSITY OF CALGARY', 'UPPER MOUNT ROYAL', 'VALLEY RIDGE',
'VALLEYFIELD', 'VARSITY', 'VISTA HEIGHTS', 'WALDEN',
'WEST SPRINGS', 'WESTGATE', 'WESTWINDS', 'WHITEHORN', 'WILDWOOD',
'WILLOW PARK', 'WINDSOR PARK', 'WINSTON HEIGHTS/MOUNTVIEW',
'WOLF WILLOW', 'WOODBINE', 'WOODLANDS', 'YORKVILLE',
'AURORA BUSINESS PARK', 'BAYVIEW', 'BELVEDERE',
'CANADA OLYMPIC PARK', 'CHINOOK PARK', 'COUGAR RIDGE',
'DIAMOND COVE', 'KINCORA', 'MACEWAN GLEN', 'NORTH HAVEN UPPER',
'PARKLAND', 'QUEENS PARK VILLAGE', 'SCARBORO', 'BEL-AIRE',
'GLENDEER BUSINESS PARK', 'MAYFAIR', 'RIDEAU PARK',
'CURRIE BARRACKS', 'HASKAYNE', 'RANGEVIEW', 'ROXBORO',
'PINE CREEK', 'HOTCHKISS', 'MEDICINE HILL',
'SCARBORO/ SUNALTA WEST', 'BONAVISTA DOWNS', 'KEYSTONE HILLS'],
dtype=object)
Group by and Summarization
= (df
region_df 'sector'])
.groupby(['crime_count']]
[[sum()
.'crime_count', ascending=False)
.sort_values(
.reset_index()
)
# Inspect the first 5 rows
10) region_df.head(
sector | crime_count | |
---|---|---|
0 | CENTRE | 61926 |
1 | NORTHEAST | 34767 |
2 | SOUTH | 22623 |
3 | NORTHWEST | 16518 |
4 | EAST | 15262 |
5 | NORTH | 12649 |
6 | WEST | 11238 |
7 | SOUTHEAST | 10775 |
= (df
vehicle_df 'sector', 'category'])
.groupby([= ('crime_count', 'sum'))
.agg(crime_total 'sector', 'crime_total'], ascending=(True, False))
.sort_values([
.reset_index()
)
# Inspect the first 5 rows
10)
vehicle_df.head(
# Save dataset as a CSV
'theft of vehicle dataset.csv') vehicle_df.to_csv(
= (df
sector_by_year_df 'sector', 'year', 'category'])
.groupby([= ('crime_count', 'sum'))
.agg(crime_total 'sector', 'crime_total'], ascending=(True,False))
.sort_values([
.reset_index()
)
# Inspect the first 5 rows
10) sector_by_year_df.head(
sector | year | category | crime_total | |
---|---|---|---|---|
0 | CENTRE | 2019 | theft from vehicle | 5550 |
1 | CENTRE | 2020 | theft from vehicle | 4273 |
2 | CENTRE | 2018 | theft from vehicle | 3898 |
3 | CENTRE | 2017 | theft from vehicle | 3615 |
4 | CENTRE | 2021 | theft from vehicle | 3438 |
5 | CENTRE | 2019 | break & enter - commercial | 2737 |
6 | CENTRE | 2018 | break & enter - commercial | 2290 |
7 | CENTRE | 2017 | break & enter - commercial | 2150 |
8 | CENTRE | 2022 | theft from vehicle | 1988 |
9 | CENTRE | 2020 | break & enter - commercial | 1896 |
= (df
sector_by_year_n_month_df 'sector', 'year', 'month', 'category'])
.groupby([= ('crime_count', 'sum'))
.agg(crime_total
.sort_values('sector', 'crime_total'],
[=(True,False)
ascending
)
.reset_index()
)
# Inspect the first 5 rows
10) sector_by_year_n_month_df.head(
sector | year | month | category | crime_total | |
---|---|---|---|---|---|
0 | CENTRE | 2019 | September | theft from vehicle | 614 |
1 | CENTRE | 2019 | August | theft from vehicle | 566 |
2 | CENTRE | 2019 | July | theft from vehicle | 551 |
3 | CENTRE | 2019 | November | theft from vehicle | 524 |
4 | CENTRE | 2019 | June | theft from vehicle | 519 |
5 | CENTRE | 2019 | October | theft from vehicle | 495 |
6 | CENTRE | 2019 | May | theft from vehicle | 472 |
7 | CENTRE | 2020 | August | theft from vehicle | 458 |
8 | CENTRE | 2018 | July | theft from vehicle | 457 |
9 | CENTRE | 2020 | September | theft from vehicle | 452 |
= sns.catplot(
bar_g = 'year',
x = 'crime_total',
y = 'bar',
kind = 'sector',
hue = False,
dodge = None,
ci = sector_by_year_df
data
)
"lower center", ncol = 4,
sns.move_legend(bar_g, = None, frameon = False, fontsize = 7
title
)
=3)
plt.tight_layout(pad= '')
plt.xlabel(xlabel 'Crime Events', fontsize=8)
plt.ylabel(=8)
plt.xticks(fontsize=8)
plt.yticks(fontsize'Calgary Crime Events by Region', fontsize = 15)
plt.title( plt.show()
= sns.catplot(x="sector",
g ="crime_count",
y="bar",
kind='gray',
color='sd',
ci=region_df)
data
# Add the title
set(xlabel = '',
g.= 'Number of Crime Events'
ylabel
)
=30, horizontalalignment='right')
g.set_xticklabels(rotation=2)
plt.tight_layout(pad plt.show()