DEEP LEARNING/Deep Learning Project

DL(딥러닝) 실습 : Prophet을 활용하여 시카고 범죄율을 예측해 보자

신강희 2024. 5. 2. 12:31
728x90

STEP #0: PROBLEM STATEMENT

    • The Chicago Crime dataset : 2001 ~ 2017.
    • Datasource: 캐글 https://www.kaggle.com/currie32/crimes-in-*chicago*
    • Dataset contains the following columns:
      • ID: Unique identifier for the record.
      • Case Number: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
      • Date: Date when the incident occurred.
      • Block: address where the incident occurred
      • IUCR: The Illinois Unifrom Crime Reporting code.
      • Primary Type: The primary description of the IUCR code.
      • Description: The secondary description of the IUCR code, a subcategory of the primary description.
      • Location Description: Description of the location where the incident occurred.
      • Arrest: Indicates whether an arrest was made.
      • Domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
      • Beat: Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car.
      • District: Indicates the police district where the incident occurred.
      • Ward: The ward (City Council district) where the incident occurred.
      • Community Area: Indicates the community area where the incident occurred. Chicago has 77 community areas.
      • FBI Code: Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS).
      • X Coordinate: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
      • Y Coordinate: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
      • Year: Year the incident occurred.
      • Updated On: Date and time the record was last updated.
      • Latitude: The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
      • Longitude: The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
      • Location: The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

 

# 페이스북에서 만든 오픈소스 Prophet 라이브러리

 

Quick Start

Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.

facebook.github.io

 

# 코랩에는 자동으로 prophet이 설치되어 있다. 따라서 다른 환경에서 설치 되어있지 않다면, 다음처럼 설치하면 된다.

  • pip install fbprophet
  • 위의 pip 설치 시 에러가 나면 다음처럼 설치해 준다 : conda install -c conda-forge fbprophet

STEP #1: IMPORTING DATA

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import random
import seaborn as sns
from prophet import Prophet

 

# Chicago_Crimes_2005_to_2007.csv 
# Chicago_Crimes_2008_to_2011.csv
# Chicago_Crimes_2012_to_2017.csv 파일을 읽되, 
# 각각 파라미터 on_bad_lines='skip' 추가 해준다.

 

chicago_df_1 = pd.read_csv('../data/Chicago_Crimes_2005_to_2007.csv', on_bad_lines='skip')
chicago_df_2 = pd.read_csv('../data/Chicago_Crimes_2008_to_2011.csv', on_bad_lines='skip')
chicago_df_3 = pd.read_csv('../data/Chicago_Crimes_2012_to_2017.csv', on_bad_lines='skip')

 

# 데이터 모양을 보고, 이상한 부분은 처리해 준다.

 

chicago_df_1.drop('Unnamed: 0' , axis = 1, inplace=True)

chicago_df_2.drop('Unnamed: 0' , axis = 1, inplace=True)

chicago_df_3.drop('Unnamed: 0' , axis = 1, inplace=True)

 

# 위의 3개 데이터프레임을 하나로 합친다. 

chicago_df = pd.concat([chicago_df_1, chicago_df_2, chicago_df_3])

 

# 600 만건의 범죄 데이터 ㅎㄷㄷ
chicago_df

 

STEP #2: EXPLORING THE DATASET

# Let's view the head of the training dataset

# Let's view the last elements in the training dataset

 

# 비어있는 데이터가 얼마나 되는지 확인하시오.

chicago_df.isna().sum()

ID                          0
Case Number                 7
Date                        0
Block                       0
IUCR                        0
Primary Type                0
Description                 0
Location Description     1974
Arrest                      0
Domestic                    0
Beat                        0
District                   89
Ward                       92
Community Area           1844
FBI Code                    0
X Coordinate            74882
Y Coordinate            74882
Year                        0
Updated On                  0
Latitude                74882
Longitude               74882
Location                74882
dtype: int64

 

# 다음 컬럼들을 삭제하시오.

'Case Number', 'Case Number', 'IUCR', 'X Coordinate', 'Y Coordinate','Updated On','Year', 'FBI Code', 'Beat','Ward','Community Area', 'Location', 'District', 'Latitude' , 'Longitude'

 

# 컴럼 한번더 확인

chicago_df.head(1)

 

# 삭제

chicago_df.drop(['Case Number', 'IUCR', 'X Coordinate', 'Y Coordinate', 'Updated On', 'Year', 'FBI Code', 'Beat', 'Ward', 'Community Area', 'Location', 'District', 'Latitude' , 'Longitude'] , axis = 1, inplace=True)

 

# 삭제가 잘되었는지 재확인

chicago_df.head()

 

# Date 컬럼을 보니, 날짜 형식으로 되어있다. 이를 파이썬이 이해할 수 있는 날짜로 바꿔서 다시 Date 컬럼에 저장하시오.

 

# info로 컬럼정보를 보면 Date를 문자열로 인식하고 있는걸 알수 있다.
chicago_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6017767 entries, 0 to 1456713
Data columns (total 8 columns):
 #   Column                Dtype 
---  ------                ----- 
 0   ID                    int64 
 1   Date                  object
 2   Block                 object
 3   Primary Type          object
 4   Description           object
 5   Location Description  object
 6   Arrest                bool  
 7   Domestic              bool  
dtypes: bool(2), int64(1), object(5)
memory usage: 332.9+ MB

 

chicago_df['Date']

0          04/02/2006 01:00:00 PM
1          02/26/2006 01:40:48 PM
2          01/08/2006 11:16:00 PM
3          04/05/2006 06:45:00 PM
4          02/17/2006 09:03:14 PM
                    ...          
1456709    05/03/2016 11:33:00 PM
1456710    05/03/2016 11:30:00 PM
1456711    05/03/2016 12:15:00 AM
1456712    05/03/2016 09:07:00 PM
1456713    05/03/2016 11:38:00 PM
Name: Date, Length: 6017767, dtype: object

 

.to_datetime 함수 기억할것 !! python 형식의 날짜로 변경하는 함수
# 년/월/일/시 => iso 포멧 전세게 평균 표준시 표현법 전세계 누구든 시간을 알아볼수 있음


chicago_df['Date'] = pd.to_datetime(chicago_df['Date'], format='%m/%d/%Y %I:%M:%S %p' )

 

chicago_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6017767 entries, 0 to 1456713
Data columns (total 8 columns):
 #   Column                Dtype         
---  ------                -----         
 0   ID                    int64         
 1   Date                  datetime64[ns]
 2   Block                 object        
 3   Primary Type          object        
 4   Description           object        
 5   Location Description  object        
 6   Arrest                bool          
 7   Domestic              bool          
dtypes: bool(2), datetime64[ns](1), int64(1), object(4)
memory usage: 332.9+ MB

ㄴ datetime64[ns] 형식으로 바뀐거 확인

 

chicago_df.iloc[ 0 , 1 ]

Timestamp('2006-04-02 13:00:00')

 

# 요일 확인번 ( 0이 월요일 ~ 6 일요일)
chicago_df.iloc[ 0 , 1 ].weekday()

6

 

# 전체에 적용
# pandas.Series.dt.weekday => dt. 많이 쓰인다 레퍼런스 참고해서 알아둘것
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

 

chicago_df['weekday'] = chicago_df['Date'].dt.weekday

chicago_df

 

# 문자로 나타내고 싶을때 / 한글로 설정하는 방법 검색
chicago_df['day_name'] = chicago_df['Date'].dt.day_name()

 

chicago_df['day_name'].value_counts()

day_name
Friday       910373
Wednesday    870841
Tuesday      865340
Thursday     860425
Saturday     858153
Monday       843525
Sunday       809110
Name: count, dtype: int64

### .str , .dt 엄청 많이 사용하니까 꼭 알아둘것!!!

 

Date 컬럼을 인덱스로 만드시오.

chicago_df.index = chicago_df['Date']

chicago_df

 

chicago_df.index

DatetimeIndex(['2006-04-02 13:00:00', '2006-02-26 13:40:48',
               '2006-01-08 23:16:00', '2006-04-05 18:45:00',
               '2006-02-17 21:03:14', '2006-03-30 22:30:00',
               '2006-04-05 12:10:00', '2006-04-05 15:00:00',
               '2006-04-05 21:30:00', '2006-04-03 03:00:00',
               ...
               '2016-05-03 23:30:00', '2016-05-03 23:50:00',
               '2016-05-03 22:25:00', '2016-05-03 23:00:00',
               '2016-05-03 23:28:00', '2016-05-03 23:33:00',
               '2016-05-03 23:30:00', '2016-05-03 00:15:00',
               '2016-05-03 21:07:00', '2016-05-03 23:38:00'],
              dtype='datetime64[ns]', name='Date', length=6017767, freq=None)

 

# 범죄 유형의 갯수를 세고, 가장 많은것부터 내림차순으로 보여주세요.

chicago_df['Primary Type'].value_counts()

Primary Type
THEFT                                1245111
BATTERY                              1079178
CRIMINAL DAMAGE                       702702
NARCOTICS                             674831
BURGLARY                              369056
OTHER OFFENSE                         368169
ASSAULT                               360244
MOTOR VEHICLE THEFT                   271624
ROBBERY                               229467
DECEPTIVE PRACTICE                    225180
CRIMINAL TRESPASS                     171596
PROSTITUTION                           60735
WEAPONS VIOLATION                      60335
PUBLIC PEACE VIOLATION                 48403
OFFENSE INVOLVING CHILDREN             40260
CRIM SEXUAL ASSAULT                    22789
SEX OFFENSE                            20172
GAMBLING                               14755
INTERFERENCE WITH PUBLIC OFFICER       14009
LIQUOR LAW VIOLATION                   12129
ARSON                                   9269
HOMICIDE                                5879
KIDNAPPING                              4734
INTIMIDATION                            3324
STALKING                                2866
OBSCENITY                                422
PUBLIC INDECENCY                         134
OTHER NARCOTIC VIOLATION                 122
NON-CRIMINAL                              96
CONCEALED CARRY LICENSE VIOLATION         90
NON - CRIMINAL                            38
HUMAN TRAFFICKING                         28
RITUALISM                                 16
NON-CRIMINAL (SUBJECT SPECIFIED)           4
Name: count, dtype: int64

 

# 상위 15개까지만 보여주세요.

chicago_df['Primary Type'].value_counts().head(15)

Primary Type
THEFT                         1245111
BATTERY                       1079178
CRIMINAL DAMAGE                702702
NARCOTICS                      674831
BURGLARY                       369056
OTHER OFFENSE                  368169
ASSAULT                        360244
MOTOR VEHICLE THEFT            271624
ROBBERY                        229467
DECEPTIVE PRACTICE             225180
CRIMINAL TRESPASS              171596
PROSTITUTION                    60735
WEAPONS VIOLATION               60335
PUBLIC PEACE VIOLATION          48403
OFFENSE INVOLVING CHILDREN      40260
Name: count, dtype: int64

 

# 시각화 order 용으로 쓰기위해 인덱스를 사용
my_order = chicago_df['Primary Type'].value_counts().head(15).index

 

# 상위 15개의 범죄 유형(Primary Type)의 갯수를, 비주얼라리징 하시오.

import seaborn as sb
import matplotlib.pyplot as plt

 

sb.countplot(data = chicago_df, y = 'Primary Type', order = my_order)
plt.show()

 

# 어디에서 가장 범죄가 많이 발생했는지, 범죄 장소(Location Descripton) 로 비주얼라이징 하시오.

 

chicago_df['Location Description'].value_counts().head(15).plot(kind='bar')
plt.show()

 

데이터를 주기별로 분석해 보자

 

# resample  'Y' 는 년도다. 년도로 리샘플한 후, 각 년도별 몇개의 범죄 데이터를 가지고 있는지 확인한다.

# resample 함수를 사용하려면, 인덱스가 datetime 이어야 한다. (인덱스에 날짜가 있어야 한다!!!!)
chicago_df.resample('Y').size()

Date
2005-12-31    455811
2006-12-31    794684
2007-12-31    621848
2008-12-31    852053
2009-12-31    783900
2010-12-31    700691
2011-12-31    352066
2012-12-31    335670
2013-12-31    306703
2014-12-31    274527
2015-12-31    262995
2016-12-31    265462
2017-12-31     11357
Freq: A-DEC, dtype: int64

 

# Y 옆에 S 를 적어주면 연도별 시작일을 기준으로 세준다.
chicago_df.resample('YS').size()

Date
2005-01-01    455811
2006-01-01    794684
2007-01-01    621848
2008-01-01    852053
2009-01-01    783900
2010-01-01    700691
2011-01-01    352066
2012-01-01    335670
2013-01-01    306703
2014-01-01    274527
2015-01-01    262995
2016-01-01    265462
2017-01-01     11357
Freq: AS-JAN, dtype: int64

 

# 위의 데이터를 plot 으로 시각화 한다. 범죄횟수를 눈으로 확인하자.
chicago_df.resample('YS').size().plot()
plt.show()

 

# 월별 범죄 발생 건수를 확인하자.

chicago_df.resample('M').size()

Date
2005-01-31    33983
2005-02-28    32042
2005-03-31    36970
2005-04-30    38963
2005-05-31    40572
              ...  
2016-09-30    23235
2016-10-31    23314
2016-11-30    21140
2016-12-31    19580
2017-01-31    11357
Freq: M, Length: 145, dtype: int64

 

chicago_df.resample('MS').size()

Date
2005-01-01    33983
2005-02-01    32042
2005-03-01    36970
2005-04-01    38963
2005-05-01    40572
              ...  
2016-09-01    23235
2016-10-01    23314
2016-11-01    21140
2016-12-01    19580
2017-01-01    11357
Freq: MS, Length: 145, dtype: int64

 

# 월별 범죄 발생 건수도 plot 으로 시각화 하자.

chicago_df.resample('M').size().plot()
plt.show()

 

# 분기별 범죄 건수도 확인하자.

chicago_df.resample('Q').size()

Date
2005-03-31    102995
2005-06-30    119769
2005-09-30    123550
2005-12-31    109497
2006-03-31    115389
2006-06-30    225489
2006-09-30    238423
2006-12-31    215383
2007-03-31    192791
2007-06-30    204361
2007-09-30    119086
2007-12-31    105610
2008-03-31    191523
2008-06-30    222331
2008-09-30    236695
2008-12-31    201504
2009-03-31    184055
2009-06-30    203916
2009-09-30    210446
2009-12-31    185483
2010-03-31    171848
2010-06-30    194453
2010-09-30    197116
2010-12-31    137274
2011-03-31     78167
2011-06-30     93064
2011-09-30     95835
2011-12-31     85000
2012-03-31     78574
2012-06-30     88283
2012-09-30     89685
2012-12-31     79128
2013-03-31     71651
2013-06-30     80776
2013-09-30     83510
2013-12-31     70766
2014-03-31     59964
2014-06-30     72991
2014-09-30     76090
2014-12-31     65482
2015-03-31     58503
2015-06-30     68239
2015-09-30     71782
2015-12-31     64471
2016-03-31     60843
2016-06-30     68085
2016-09-30     72500
2016-12-31     64034
2017-03-31     11357
Freq: Q-DEC, dtype: int64

 

# 분기별 범죄 건수도 시각화 하자.

chicago_df.resample('Q').size().plot()
plt.show()

 

# 범죄 발생 건수를 예측하고 싶다!!

STEP #3: 데이터 준비

# 날짜별로 먼저 해보자

# 이 데이터가 각날짜별 건수


df_prophet = chicago_df.resample('D').size().to_frame().reset_index().rename(columns={'Date' : 'ds', 0 : 'y'})

df_prophet

 

prophet = Prophet()

prophet.fit(df_prophet)

15:56:16 - cmdstanpy - INFO - Chain [1] start processing
15:56:17 - cmdstanpy - INFO - Chain [1] done processing
[104]:
<prophet.forecaster.Prophet at 0x29664619cd0>

 

future = prophet.make_future_dataframe(periods=365, freq='D')

forecast = prophet.predict(future)

 

prophet.plot(forecast)
plt.show()

 

prophet.plot_components(forecast)
plt.show()

 

# 월별로(매달 말일) 주기로 하여 데이터프레임을 만들고, 인덱스를 리셋하시오.

chicago_prophet = chicago_df.resample('M').size().to_frame().reset_index().rename(columns={'Date' : 'ds', 0 : 'y'})

chicago_prophet

 

prophet = Prophet()

prophet.fit(chicago_prophet)

16:02:49 - cmdstanpy - INFO - Chain [1] start processing
16:02:49 - cmdstanpy - INFO - Chain [1] done processing
[117]:
<prophet.forecaster.Prophet at 0x29665f49490>

 

future1 = prophet.make_future_dataframe(periods=12, freq='M')

forecast = prophet.predict(future)

 

prophet.plot(forecast)
plt.show()

 

prophet.plot_components(forecast)
plt.show()

 

Prophet 실습까지 종료

반응형