Exploratory Analysis of the Calgary’s Crime Data

Data Wrangling
Python
Data Science
Data Analytics
Author

Alier Reng, Chol Aruei

Published

Wednesday, July 6, 2022

Background

Chol Aruei and I recently formed a Python meetup group to meet every Sunday around 3 pm (CST) to further our data science learning, particularly with Python. We aim to have each of us bring his/her data and present it to the team in the future. And this project is our first project since we started to learn together. Chol provided the data, and I am leading the analysis because I have been programming in Python a bit longer than him.

With that in mind, let’s switch gears to walk you through our project.

We are exploring Calgary’s crime data from 2012 to 2019. The data was obtained from Calgary website. Below is a brief description of the dataset.

Data is provided monthly by the Calgary Police Service. This includes both crime and disorder statistics.

Crime count is based on the most serious violation (MSV) per incident. Violence: These figures include all violent crime offences as defined by the Centre for Canadian Justice Statistics Universal Crime Reporting (UCR) rules. Domestic violence is excluded. Break and Enter: Residential B&E includes both House and ‘Other’ structure break and enters due to the predominantly residential nature of this type of break in (e.g. detached garages, sheds). B&Es incidents include attempts.

Disorder events consist of Social and Physical disorder. Only public-generated calls are included.

Social disorder includes: Drunk, Disturbance, Indecent Act, Juvenile Complaint, Landlord/tenant, Mental health concern, Neighbor dispute, Party complaint, Prowler, Suspicious person, Threats, Drugs, Noise complaint, Possible gunshots, Unwanted guest/patron, Prostitution, Speeder, Suspicious Auto.

Physical disorder includes: Fire, Property damage and Abandoned auto.

**Resident counts are pulled from census data which is collected Apr/May of each year and apply to community crime stats from June of current year to May of the following year.

ex) 2018 Census data applies to June 2018-May 2019 Community Crime stats

Importing the Libraries

Here we will load pandas as pd, numpy as np, matplotlib.pyplot as plt, and seaborn as sns. Also, we will use Quarto to display all the columns as shown in the below code chunk.

# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from janitor import clean_names

# Theme setting
sns.set_theme(context="talk", style="whitegrid",
              palette="colorblind", color_codes=True,
              rc={"figure.figsize": [12, 8]}
)

Loading the Dataset

# Importing the dataset from GitHub
url = 'https://raw.githubusercontent.com/tongakuot/cushanalytics/main/00_data/Community_Crime_Statistics%20(1).csv'


crime_raw = pd.read_csv(url)

# Inspect the first 5 rows
crime_raw.head()
Sector Community Name Category Crime Count Resident Count Date Year Month long lat ID Community Center Point
0 NORTHWEST BOWNESS Theft OF Vehicle 4 11150.0 2022/07 2022 JUL -114.187826 51.083359 2022-JUL-BOWNESS-Theft OF Vehicle POINT (-114.187826026643 51.083359442425)
1 CENTRE INGLEWOOD Theft OF Vehicle 1 NaN 2017/04 2017 APR -114.019120 51.034141 2017-APR-INGLEWOOD-Theft OF Vehicle POINT (-114.01912 51.03414052)
2 NORTHEAST MERIDIAN Theft OF Vehicle 10 0.0 2022/07 2022 JUL -113.996513 51.056587 2022-JUL-MERIDIAN-Theft OF Vehicle POINT (-113.996513423841 51.056586683694)
3 EAST SOUTHVIEW Theft FROM Vehicle 3 NaN 2017/04 2017 APR -113.997338 51.034156 2017-APR-SOUTHVIEW-Theft FROM Vehicle POINT (-113.9973381 51.03415568)
4 EAST SOUTHVIEW Assault (Non-domestic) 4 1805.0 2022/07 2022 JUL -113.997338 51.034156 2022-JUL-SOUTHVIEW-Assault (Non-domestic) POINT (-113.997338140443 51.034155684723)
# Inspect the last 5 rows
crime_raw.tail()
Sector Community Name Category Crime Count Resident Count Date Year Month long lat ID Community Center Point
63586 SOUTH SOUTHWOOD Break & Enter - Other Premises 2 NaN 2021/01 2021 JAN -114.083201 50.957774 2021-JAN-SOUTHWOOD-Break & Enter - Other Premises POINT (-114.08320089308144 50.9577735651428)
63587 CENTRE HIGHWOOD Break & Enter - Other Premises 2 NaN 2021/05 2021 MAY -114.079901 51.092356 2021-MAY-HIGHWOOD-Break & Enter - Other Premises POINT (-114.07990104743979 51.0923557749049)
63588 CENTRE CLIFF BUNGALOW Break & Enter - Dwelling 1 1895.0 2022/06 2022 JUN -114.074895 51.033930 2022-JUN-CLIFF BUNGALOW-Break & Enter - Dwelling POINT (-114.074894864056 51.033930231501)
63589 WEST ROSSCARROCK Assault (Non-domestic) 2 3625.0 2022/06 2022 JUN -114.145494 51.043277 2022-JUN-ROSSCARROCK-Assault (Non-domestic) POINT (-114.145493740992 51.043277392563)
63590 NORTHWEST COLLINGWOOD Street Robbery 1 2249.0 2022/05 2022 MAY -114.102401 51.083504 2022-MAY-COLLINGWOOD-Street Robbery POINT (-114.102401360913 51.083504232298)

Next, we’ll check for Missing Values

Below, we use .isna() together with .sum() method to check for missing values in each column.

# Check for nas using .isna() method. We can also use .isnull(); I prefer .isna().
# df.isnull().sum()
crime_raw.isna().sum()
Sector                      182
Community Name                0
Category                      0
Crime Count                   0
Resident Count            56986
Date                          0
Year                          0
Month                         0
long                        182
lat                         182
ID                            0
Community Center Point      182
dtype: int64

The above printout shows that our data is complete; there are no missing values.

Checking for Duplicates

Sometimes, it might be necessary to check for duplicates in your dataset. Here, we use the Sector column to illustrate how to do this in pandas. However, it’s crucial to better understand your dataset before performing this step because some duplicates might be necessary, just as in this dataset.

# Check for duplicated values
crime_raw['Sector'].duplicated().sum()
63582

Writing Custom Functions

It’s always a good practice not to repeat yourself. So, automating your data cleaning processes when possible is of paramount importance. Below are the functions we wrote for transforming our dataset.

Checking for Unique Values

We may also want to know the number of unique elements in each column, as shown below. However, checking each column at a time may become tedious and time-consuming for a dataset with too many columns. So, we will write a function to automate this task in the next section.

# Display the sector column values
crime_raw['Sector'].unique()
array(['NORTHWEST', 'CENTRE', 'NORTHEAST', 'EAST', 'NORTH', 'SOUTHEAST',
       nan, 'SOUTH', 'WEST'], dtype=object)
# Display the category column values
crime_raw['Category'].unique()
array(['Theft OF Vehicle', 'Theft FROM Vehicle', 'Assault (Non-domestic)',
       'Violence Other (Non-domestic)', 'Break & Enter - Commercial',
       'Break & Enter - Other Premises', 'Break & Enter - Dwelling',
       'Street Robbery', 'Commercial Robbery'], dtype=object)
# Inspect column names
crime_raw.columns
Index(['Sector', 'Community Name', 'Category', 'Crime Count', 'Resident Count',
       'Date', 'Year', 'Month', 'long', 'lat', 'ID', 'Community Center Point'],
      dtype='object')
# Investigate all the elements within each Feature 
def check_column_elements(df):
  """
    Investigate the number of elements in each column in a pandas data frame and print the column name, count, and values if a column contains less than 12. Otherwise, print the column name and the number of elements.
    
    Args:
      df (pandas.DataFrame): The data frame whose columns will be investigated. 
  """ 
  for column in df:
      unique_vals = np.unique(df[column])
      number_values = len(unique_vals)
      if number_values < 12:
          print(f'The number of values for feature {column} :{number_values} -- {unique_vals}')
      else:
          print(f'The number of values for feature {column} :{number_values}')
# Investigate column elements
# check_column_elements(crime_raw)

Wrangling the Data

We clean and transform our dataset in this section with a custom function.

# Write a function for transforming the data
def tweak_calgary_crime_data(df):
  """
  Tweak Calgary's Crime dataset.
  
  Args:
      df (pandas.DataFrame): The data frame to be tweaked.
  """
  return(df
        .drop(['Community Center Point', 'ID', 'Resident Count', 'Year', 'Month'], 
        axis = 'columns')
        .assign(
          Date = lambda df: pd.to_datetime(df['Date']),
          Category = lambda df_: df_['Category'].str.lower(),
          year = lambda df_: df_['Date'].dt.year,
          month = lambda df_: df_['Date'].dt.month_name()
         )
        .clean_names()
        .query('~ sector.isna()')
     )
# Clean the data with our new function
df = tweak_calgary_crime_data(crime_raw)
df.head()
sector community_name category crime_count date long lat year month
0 NORTHWEST BOWNESS theft of vehicle 4 2022-07-01 -114.187826 51.083359 2022 July
1 CENTRE INGLEWOOD theft of vehicle 1 2017-04-01 -114.019120 51.034141 2017 April
2 NORTHEAST MERIDIAN theft of vehicle 10 2022-07-01 -113.996513 51.056587 2022 July
3 EAST SOUTHVIEW theft from vehicle 3 2017-04-01 -113.997338 51.034156 2017 April
4 EAST SOUTHVIEW assault (non-domestic) 4 2022-07-01 -113.997338 51.034156 2022 July

Excluding Unwanted Column Values

After carefully examining our data, we noticed that the Community Name column contains elements that start with at least one digit. For example, '10D', '05B', '01C', '05D', '13G', etc. And these “substrings” do not make sense since the Community Name column represents various communities within each region or sector of Calgary. So, we will remove them using the below function (Credits: José Gonçalves).

# Credits: José Gonçalves
def exclude_substring(df, col, substr):    
  """ 
  Exclude a given list of substrings from the "community" column that start with at least one digit in a pandas data frame.        
  
  Args:        
    df (pandas.DataFrame): The data frame to be filtered. 
    
    col (str): The name of the column to be filtered.  
    
    substr (list): The list of substrings to be excluded.            
    
    Returns:        
      pandas.DataFrame: The filtered data frame.    
  """    
  return df[~df[col].str.contains('|'.join(substr))]
# Remove the unwanted values in the community name column
df = exclude_substring(df, 'community_name', ['^\d+'])

# Inspect the first 5 rows
df.head()
sector community_name category crime_count date long lat year month
0 NORTHWEST BOWNESS theft of vehicle 4 2022-07-01 -114.187826 51.083359 2022 July
1 CENTRE INGLEWOOD theft of vehicle 1 2017-04-01 -114.019120 51.034141 2017 April
2 NORTHEAST MERIDIAN theft of vehicle 10 2022-07-01 -113.996513 51.056587 2022 July
3 EAST SOUTHVIEW theft from vehicle 3 2017-04-01 -113.997338 51.034156 2017 April
4 EAST SOUTHVIEW assault (non-domestic) 4 2022-07-01 -113.997338 51.034156 2022 July
df.community_name.unique()
array(['BOWNESS', 'INGLEWOOD', 'MERIDIAN', 'SOUTHVIEW', 'WEST HILLHURST',
       'ABBEYDALE', 'ACADIA', 'ALBERT PARK/RADISSON HEIGHTS', 'ALTADORE',
       'ALYTH/BONNYBROOK', 'APPLEWOOD PARK', 'ARBOUR LAKE', 'ASPEN WOODS',
       'AUBURN BAY', 'BANFF TRAIL', 'BANKVIEW', 'BEDDINGTON HEIGHTS',
       'BELMONT', 'BELTLINE', 'BRAESIDE', 'BRENTWOOD',
       'BRIDGELAND/RIVERSIDE', 'BRIDLEWOOD', 'BRITANNIA',
       'BURNS INDUSTRIAL', 'CALGARY INTERNATIONAL AIRPORT',
       'CAMBRIAN HEIGHTS', 'CANYON MEADOWS', 'CAPITOL HILL', 'CARRINGTON',
       'CASTLERIDGE', 'CEDARBRAE', 'CHAPARRAL', 'CHARLESWOOD',
       'CHINATOWN', 'CHRISTIE PARK', 'CITADEL', 'CITYSCAPE',
       'CLIFF BUNGALOW', 'COACH HILL', 'COLLINGWOOD', 'COPPERFIELD',
       'CORAL SPRINGS', 'CORNERSTONE', 'COUNTRY HILLS VILLAGE',
       'COUNTRY HILLS', 'COVENTRY HILLS', 'CRANSTON', 'CRESCENT HEIGHTS',
       'CRESTMONT', 'DALHOUSIE', 'DEER RIDGE', 'DEER RUN',
       'DEERFOOT BUSINESS CENTRE', 'DISCOVERY RIDGE', 'DOUGLASDALE/GLEN',
       'DOVER', 'DOWNTOWN COMMERCIAL CORE', 'DOWNTOWN EAST VILLAGE',
       'DOWNTOWN WEST END', 'EAGLE RIDGE', 'EAST FAIRVIEW INDUSTRIAL',
       'EAST SHEPARD INDUSTRIAL', 'EASTFIELD', 'EAU CLAIRE', 'EDGEMONT',
       'ELBOW PARK', 'ELBOYA', 'ERIN WOODS', 'ERLTON', 'EVANSTON',
       'EVERGREEN', 'FAIRVIEW INDUSTRIAL', 'FAIRVIEW', 'FALCONRIDGE',
       'FISH CREEK PARK', 'FOOTHILLS', 'FOREST HEIGHTS',
       'FOREST LAWN INDUSTRIAL', 'FOREST LAWN', 'FRANKLIN',
       'GARRISON GREEN', 'GARRISON WOODS', 'GLAMORGAN', 'GLENBROOK',
       'GLENDALE', 'GLENMORE PARK', 'GOLDEN TRIANGLE', 'GREAT PLAINS',
       'GREENVIEW INDUSTRIAL PARK', 'GREENVIEW', 'GREENWOOD/GREENBRIAR',
       'HAMPTONS', 'HARVEST HILLS', 'HAWKWOOD', 'HAYSBORO',
       'HIDDEN VALLEY', 'HIGHFIELD', 'HIGHLAND PARK', 'HIGHWOOD',
       'HILLHURST', 'HOMESTEAD', 'HORIZON',
       'HOUNSFIELD HEIGHTS/BRIAR HILL', 'HUNTINGTON HILLS',
       'KELVIN GROVE', 'KILLARNEY/GLENGARRY', 'KINGSLAND',
       'LAKE BONAVISTA', 'LAKEVIEW', 'LEGACY', 'LINCOLN PARK',
       'LIVINGSTON', 'LOWER MOUNT ROYAL', 'MAHOGANY',
       'MANCHESTER INDUSTRIAL', 'MANCHESTER', 'MAPLE RIDGE',
       'MARLBOROUGH PARK', 'MARLBOROUGH', 'MARTINDALE', 'MAYLAND HEIGHTS',
       'MAYLAND', 'MCCALL', 'MCKENZIE LAKE', 'MCKENZIE TOWNE',
       'MEADOWLARK PARK', 'MIDNAPORE', 'MILLRISE', 'MISSION',
       'MONTEREY PARK', 'MONTGOMERY', 'MOUNT PLEASANT', 'NEW BRIGHTON',
       'NOLAN HILL', 'NORTH AIRWAYS', 'NORTH GLENMORE PARK',
       'NORTH HAVEN', 'NOSE HILL PARK', 'OAKRIDGE', 'OGDEN SHOPS',
       'OGDEN', 'PALLISER', 'PANORAMA HILLS', 'PARKDALE', 'PARKHILL',
       'PATTERSON', 'PEGASUS', 'PENBROOKE MEADOWS', 'PINERIDGE',
       'POINT MCKAY', 'PUMP HILL', 'QUEENSLAND', 'RAMSAY', 'RANCHLANDS',
       'RED CARPET', 'REDSTONE', 'RENFREW', 'RICHMOND', 'RIVERBEND',
       'ROCKY RIDGE', 'ROSEDALE', 'ROSEMONT', 'ROSSCARROCK', 'ROYAL OAK',
       'ROYAL VISTA', 'RUNDLE', 'RUTLAND PARK', 'SADDLE RIDGE INDUSTRIAL',
       'SADDLE RIDGE', 'SAGE HILL', 'SANDSTONE VALLEY', 'SCENIC ACRES',
       'SECTION 23', 'SETON', 'SHAGANAPPI', 'SHAWNEE SLOPES', 'SHAWNESSY',
       'SHEPARD INDUSTRIAL', 'SHERWOOD', 'SIGNAL HILL', 'SILVER SPRINGS',
       'SILVERADO', 'SKYLINE EAST', 'SKYLINE WEST', 'SKYVIEW RANCH',
       'SOMERSET', 'SOUTH AIRWAYS', 'SOUTH CALGARY', 'SOUTH FOOTHILLS',
       'SOUTHWOOD', 'SPRINGBANK HILL', 'SPRUCE CLIFF',
       'ST. ANDREWS HEIGHTS', 'STARFIELD', 'STONEGATE LANDING',
       'STONEY 1', 'STONEY 2', 'STONEY 3', 'STONEY 4', 'STRATHCONA PARK',
       'SUNALTA', 'SUNDANCE', 'SUNNYSIDE', 'SUNRIDGE', 'TARADALE',
       'TEMPLE', 'THORNCLIFFE', 'TUSCANY', 'TUXEDO PARK',
       'UNIVERSITY DISTRICT', 'UNIVERSITY HEIGHTS',
       'UNIVERSITY OF CALGARY', 'UPPER MOUNT ROYAL', 'VALLEY RIDGE',
       'VALLEYFIELD', 'VARSITY', 'VISTA HEIGHTS', 'WALDEN',
       'WEST SPRINGS', 'WESTGATE', 'WESTWINDS', 'WHITEHORN', 'WILDWOOD',
       'WILLOW PARK', 'WINDSOR PARK', 'WINSTON HEIGHTS/MOUNTVIEW',
       'WOLF WILLOW', 'WOODBINE', 'WOODLANDS', 'YORKVILLE',
       'AURORA BUSINESS PARK', 'BAYVIEW', 'BELVEDERE',
       'CANADA OLYMPIC PARK', 'CHINOOK PARK', 'COUGAR RIDGE',
       'DIAMOND COVE', 'KINCORA', 'MACEWAN GLEN', 'NORTH HAVEN UPPER',
       'PARKLAND', 'QUEENS PARK VILLAGE', 'SCARBORO', 'BEL-AIRE',
       'GLENDEER BUSINESS PARK', 'MAYFAIR', 'RIDEAU PARK',
       'CURRIE BARRACKS', 'HASKAYNE', 'RANGEVIEW', 'ROXBORO',
       'PINE CREEK', 'HOTCHKISS', 'MEDICINE HILL',
       'SCARBORO/ SUNALTA WEST', 'BONAVISTA DOWNS', 'KEYSTONE HILLS'],
      dtype=object)

Group by and Summarization

region_df = (df
            .groupby(['sector'])
            [['crime_count']]
            .sum()
            .sort_values('crime_count', ascending=False)
            .reset_index()
           )

# Inspect the first 5 rows
region_df.head(10)
sector crime_count
0 CENTRE 61926
1 NORTHEAST 34767
2 SOUTH 22623
3 NORTHWEST 16518
4 EAST 15262
5 NORTH 12649
6 WEST 11238
7 SOUTHEAST 10775
vehicle_df = (df
              .groupby(['sector', 'category'])
              .agg(crime_total = ('crime_count', 'sum'))
              .sort_values(['sector', 'crime_total'], ascending=(True, False))
              .reset_index()
             )

# Inspect the first 5 rows
vehicle_df.head(10)

# Save dataset as a CSV
vehicle_df.to_csv('theft of vehicle dataset.csv')
sector_by_year_df = (df
                    .groupby(['sector', 'year', 'category'])
                    .agg(crime_total = ('crime_count', 'sum'))
                    .sort_values(['sector', 'crime_total'], ascending=(True,False))
                    .reset_index()
                    )

# Inspect the first 5 rows
sector_by_year_df.head(10)
sector year category crime_total
0 CENTRE 2019 theft from vehicle 5550
1 CENTRE 2020 theft from vehicle 4273
2 CENTRE 2018 theft from vehicle 3898
3 CENTRE 2017 theft from vehicle 3615
4 CENTRE 2021 theft from vehicle 3438
5 CENTRE 2019 break & enter - commercial 2737
6 CENTRE 2018 break & enter - commercial 2290
7 CENTRE 2017 break & enter - commercial 2150
8 CENTRE 2022 theft from vehicle 1988
9 CENTRE 2020 break & enter - commercial 1896
sector_by_year_n_month_df = (df
                            .groupby(['sector', 'year', 'month', 'category'])
                            .agg(crime_total = ('crime_count', 'sum'))
                            .sort_values(
                              ['sector', 'crime_total'], 
                              ascending=(True,False)
                            )
                            .reset_index()
                            )

# Inspect the first 5 rows
sector_by_year_n_month_df.head(10)
sector year month category crime_total
0 CENTRE 2019 September theft from vehicle 614
1 CENTRE 2019 August theft from vehicle 566
2 CENTRE 2019 July theft from vehicle 551
3 CENTRE 2019 November theft from vehicle 524
4 CENTRE 2019 June theft from vehicle 519
5 CENTRE 2019 October theft from vehicle 495
6 CENTRE 2019 May theft from vehicle 472
7 CENTRE 2020 August theft from vehicle 458
8 CENTRE 2018 July theft from vehicle 457
9 CENTRE 2020 September theft from vehicle 452
bar_g = sns.catplot(
  x = 'year', 
  y = 'crime_total', 
  kind = 'bar',
  hue = 'sector',
  dodge = False,
  ci = None,
  data = sector_by_year_df
  )
  
sns.move_legend(bar_g, "lower center", ncol = 4, 
title = None, frameon = False, fontsize = 7
)

plt.tight_layout(pad=3)
plt.xlabel(xlabel = '')
plt.ylabel('Crime Events', fontsize=8)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.title('Calgary Crime Events by Region', fontsize = 15)
plt.show()

g = sns.catplot(x="sector", 
                y="crime_count", 
                kind="bar", 
                color='gray',
                ci='sd',
                data=region_df)

# Add the title
g.set(xlabel = '',
      ylabel = 'Number of Crime Events'
    )

g.set_xticklabels(rotation=30, horizontalalignment='right')
plt.tight_layout(pad=2)
plt.show()

Calgary Crime Events by Sector.