
Abstract

This project is about classifying the type of crime committed within San Francisco, given the time and location of a criminal incident. We investigated classification models including Naive Bayes, Logistic Regression, Random Forest and KNN, and analyzed their pros and cons on this prediction task. We first tested these algorithms in Weka and then implemented them in Python. The best log-loss we obtained is 2.474, which ranks 697/2335 on the public leaderboard of the competition.

Task

The dataset is collected by SF OpenData, the central clearinghouse for data published by the City and County of San Francisco. It contains more than eight hundred thousand incidents from 2003 to 2015, derived from the SFPD Crime Incident Reporting System. Our task is to classify 39 types of crimes in San Francisco based on 8 attributes: date, description, day of week, police district, address of the crime scene, resolution, longitude and latitude. The table below shows an example of the original training data.

Fig 1: Example of the dataset that we have downloaded from Kaggle

Motivation

Safety is of utmost importance in our daily lives. With this in mind, we are interested in predicting crime categories given the date and location of past crime incidents. Hopefully, the results of this machine learning system can help people avoid traveling or commuting through certain locations at specific times, so that fewer people will be involved in future crime incidents.

Dataset

Fig 2: The distribution of the original dataset by features

From these graphs, we can see that the number of crimes for a given day, month or district can vary a lot, but the distribution of the types of crimes does not change much. Larceny/Theft is always the most likely crime, accounting for about 19% of the whole dataset. So, no matter what algorithm we use, the accuracy stays around 20%; that is why we use log-loss as the criterion to evaluate our models.
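To make this evaluation criterion concrete, the snippet below is a minimal sketch of how the multiclass log-loss is computed with scikit-learn; the three classes and the probabilities are toy values of our own, not results from our models.

    import numpy as np
    from sklearn.metrics import log_loss

    # Toy example with 3 incidents and 3 of the 39 classes.
    y_true = ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT"]
    classes = ["ASSAULT", "LARCENY/THEFT", "VANDALISM"]
    proba = np.array([[0.2, 0.6, 0.2],
                      [0.5, 0.3, 0.2],
                      [0.1, 0.8, 0.1]])

    # Log-loss averages -log p(true class) over all cases, so confident but
    # wrong probabilities are punished even if the accuracy stays around 20%.
    print(log_loss(y_true, proba, labels=classes))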

Data Preprocessing

We ignore “Descript” and “Resolution” as they are not present in the testing data, leaving only 6 other attributes: “DayOfWeek”, “PdDistrict”, “Address”, “Dates” and the X and Y coordinates. The raw “Dates” value can separate the training data, but on its own it does not help the system learn and generalize to unseen data.

To use the feature “Dates”, we split it into numeric features: “Day”, “Month”, “Year”, “hour” and “minute”. Next, we compute a numeric grid index, the index of one of the 10*10 area sections derived from the (X, Y) coordinates. As there are over 20,000 different addresses and they overlap with the (X, Y) coordinates, we consider it not useful to retain the feature “Address”.
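The snippet below is a minimal sketch of this preprocessing in Python with pandas (we actually did this step in Matlab); the column names follow the Kaggle train.csv shown in Fig 1, and the grid helper name is our own.

    import pandas as pd

    train = pd.read_csv("train.csv", parse_dates=["Dates"])

    # Split "Dates" into numeric features.
    train["Day"] = train["Dates"].dt.day
    train["Month"] = train["Dates"].dt.month
    train["Year"] = train["Dates"].dt.year
    train["hour"] = train["Dates"].dt.hour
    train["minute"] = train["Dates"].dt.minute

    # 10*10 grid index: bin X (longitude) and Y (latitude) into 10
    # equal-width intervals each and combine them into a single index.
    def grid_index(df, n=10):
        x_bin = pd.cut(df["X"], bins=n, labels=False)
        y_bin = pd.cut(df["Y"], bins=n, labels=False)
        return x_bin * n + y_bin

    train["Grid"] = grid_index(train)

    # Drop "Address" (over 20,000 distinct values, overlaps with X/Y)
    # along with "Descript" and "Resolution".
    train = train.drop(columns=["Dates", "Address", "Descript", "Resolution"])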

 

We also considered the following three questions when preparing the data:

  1. Whether or not to retain the feature “minute”. Since “minute” did not seem obviously useful, we thought it should be dropped, but adding this feature gave slightly higher accuracy, so we decided to retain it.

  2. We also split the 24 hours of the day into 48 half-hour bins and 72 quarter-hour bins, respectively. The results show that 48 half-hour bins give higher accuracy than 24 hourly bins or 72 quarter-hour bins (e.g., with half-hour bins indexed from 0: 1:29 -> 2, 13:19 -> 26; see the sketch after this list).

  3. We also tried to find the best grid size for splitting the SF area. After testing, we found that a 10*10 grid works better than an 8*8 grid or a 12*12 grid.
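The time-of-day binning from point 2 can be sketched as follows; the helper name and the 0-based indexing are our own choices.

    # n_bins = 24 keeps whole hours, 48 gives half-hour bins, 72 quarter-hours.
    def time_bin(hour, minute, n_bins=48):
        minutes_per_bin = 24 * 60 // n_bins
        return (hour * 60 + minute) // minutes_per_bin

    print(time_bin(1, 29))   # -> 2
    print(time_bin(13, 19))  # -> 26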

 

We use Matlab to pre-process the whole dataset and convert it to ARFF files.
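Purely as an illustration of the target format, the following Python sketch writes a pandas data frame of the engineered features to an ARFF file; it is our own helper, not the Matlab script we actually used.

    def write_arff(df, path):
        # df is a pandas DataFrame; string columns (e.g. Category) become
        # nominal ARFF attributes, everything else is written as numeric.
        with open(path, "w") as f:
            f.write("@relation sf_crime\n")
            for col in df.columns:
                if df[col].dtype == object:
                    values = ",".join(f"'{v}'" for v in sorted(df[col].unique()))
                    f.write(f"@attribute {col} {{{values}}}\n")
                else:
                    f.write(f"@attribute {col} numeric\n")
            f.write("@data\n")
            for _, row in df.iterrows():
                f.write(",".join(f"'{v}'" if isinstance(v, str) else str(v)
                                 for v in row) + "\n")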

 

At first, for development purposes, we randomly sampled 10,000 training cases to test some algorithms in Weka, using 10-fold cross-validation to train and validate our models. Then we trained the models in Python with the scikit-learn package. Because the original training data contains 877,982 cases, it is very hard for our computers to fit models on all of it, so we randomly selected 60,000 cases instead. This is appropriate because, with random sampling, the distributions of the different features stay almost the same.
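The following is a minimal sketch of this step with scikit-learn, continuing from the preprocessing sketch above; Random Forest is just one of the models we compared, and its parameters here are illustrative.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Randomly sample 60,000 of the 877,982 cases.
    sample = train.sample(n=60_000, random_state=0)

    # Encode the remaining categorical attributes as integer codes.
    for col in ["DayOfWeek", "PdDistrict"]:
        sample[col] = sample[col].astype("category").cat.codes

    features = ["DayOfWeek", "PdDistrict", "Day", "Month", "Year",
                "hour", "minute", "Grid"]
    X, y = sample[features], sample["Category"]

    # 10-fold cross-validation scored with log-loss (scikit-learn reports
    # it negated, so values closer to 0 are better).
    model = RandomForestClassifier(n_estimators=100, min_samples_leaf=50)
    scores = cross_val_score(model, X, y, cv=10, scoring="neg_log_loss")
    print(-scores.mean())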


Figure 3 shows the feature distributions for the full training data of 877,982 cases, and Figure 4 shows the same distributions for the 60,000-case sample:

Fig 3: Graphs for Training Data of 877,982 Cases (Weka)

Fig 4: Graphs for Training Data of 60,000 Cases (Weka)


From the two figures, we can see that the distributions are almost the same. Besides, when we used datasets of 30,000 and 60,000 cases, the log-loss on the testing data did not change much. This suggests that once the number of training cases is large enough, the log-loss stays roughly stable despite the randomness of the sampling.
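The comparison behind Figs 3 and 4 can be reproduced roughly as follows; this sketch compares the category proportions of the full data with a random 60,000-case sample (file and column names as in the Kaggle data).

    import pandas as pd

    full = pd.read_csv("train.csv")
    sample = full.sample(n=60_000, random_state=0)

    # Proportion of each crime category in the full data vs. the sample.
    comparison = pd.DataFrame({
        "full": full["Category"].value_counts(normalize=True),
        "sample": sample["Category"].value_counts(normalize=True),
    })
    print(comparison.head(10))  # the proportions are nearly identical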

