About the Project

All the code that went into scraping, cleaning, and building the app can be found on my GitHub.

Introduction

This project takes a user's input of city information and uses the average of multiple distance measurements to find the most similar cities in America.

The data was scraped from City Town Info. It is not the most complete data set, with multiple states only including data from one city. However, through a Python script, I was able to get fairly complete data on over 17 thousand US cities. From that data, I wanted a user to either choose an existing city or input features that they would like from a city and have a recommendation made for places like that. The end product is a user-friendly experience that I am satisfied with.

It is an interesting project right now because of the number of people who have lost jobs and need to relocate, or who have come to enjoy remote work and want to relocate.


Data Preparation

Using a Python script, I scraped data from City Town Info for over 17,000 US cities.

At first I wanted to find the right dataset online, cleaned and ready to go. I wanted demographic and local information on a city-by-city basis. This proved harder to find than I thought. Many sites offered close to what I was looking for, but at a steep price ($500+). Being a college student, I could not afford something like that, so I started to look at websites that offered that info directly. City Town Info offered a lot of data and seemed plausible to scrape.
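
A stripped-down sketch of the kind of scraping loop this involves is below. The URL and the table-based selectors are placeholders for illustration, not City Town Info's actual page structure; the real script lives in the GitHub repo.

            import requests
            from bs4 import BeautifulSoup

            def scrape_city(url):
                # fetch one city page and pull out its labeled statistics
                response = requests.get(url, timeout=10)
                soup = BeautifulSoup(response.text, 'html.parser')
                stats = {}
                # placeholder assumption: statistics live in simple label/value table rows
                for row in soup.find_all('tr'):
                    cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
                    if len(cells) == 2:
                        stats[cells[0]] = cells[1]
                return stats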

The data all came in as text. I cleaned up the HTML text and converted everything to its proper string, integer, or float. I then broke this out into proper rows and columns so the data was tidy. Finally, I ran through some feature selection and chose the top 20 most important features, down from roughly 80.
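
The conversion step is mostly stripping formatting characters and casting. A small sketch, with made-up column names and values:

            # hypothetical raw values straight from the scrape; everything is a string
            raw = {'population': '25,461', 'median_income': '$41,823', 'median_age': '36.2'}

            def clean_value(text):
                # drop formatting characters, then try int, then float, else keep the string
                stripped = text.replace(',', '').replace('$', '').strip()
                for cast in (int, float):
                    try:
                        return cast(stripped)
                    except ValueError:
                        continue
                return stripped

            tidy = {key: clean_value(value) for key, value in raw.items()}
            # -> {'population': 25461, 'median_income': 41823, 'median_age': 36.2}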


Mining / Learning from the Data

From the start I knew that the most obvious way to make recommendations with this dataset was K-Nearest Neighbors, but the normal, off-the-shelf packages would not work in my case.

My recommendation problem fell almost in between supervised and unsupervised learning. I had ordered data, but was not predicting a label. I just needed a distance measurement like in K-Nearest Neighbors. With some advice and some Googling, I found that the SciPy package in Python has prebuilt distance measurements. My approach is to use three of the most popular: Euclidean distance, Manhattan distance, and Chebyshev distance.

            import scipy.spatial

            # data: (n_cities, n_features) array of city features
            # obs:  (1, n_features) array holding the user's input
            euclidean = scipy.spatial.distance.cdist(data, obs, metric='euclidean')
            manhattan = scipy.spatial.distance.cdist(data, obs, metric='cityblock')
            chebyshev = scipy.spatial.distance.cdist(data, obs, metric='chebyshev')

            # average the three distances into one combined score per city
            combined = (euclidean + manhattan + chebyshev) / 3

By finding the distance between the user input and every other point in the dataset with each method, I could take an average of the three and then find the n closest points. There may be better ways to go about it, but this was the best solution I could find: implementing my own version of Nearest Neighbors.
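
That last step, going from the combined distances to actual city names, is just a sort. Something along these lines (the names here are illustrative, not the exact code in the repo):

            import numpy as np

            def closest_cities(combined, city_names, n=5):
                # rank every city by its averaged distance to the user's input
                order = np.argsort(combined.ravel())
                # return the n cities with the smallest combined distance
                return [city_names[i] for i in order[:n]]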


Results

The result is the app: a user either picks an existing city or enters their ideal features, and the app returns the most similar US cities.

Conclusions

With the data that I have access to, the final result works great.

There are some limitations with this project. First, the data that we are using was collected from an incomplete source. Many states are missing many cities and city data, which means those states are going to be underrepresented. However, there is data for every state and every type of city (i.e. old/young, large/small, rich/poor). Second is a problem with our comparison algorithm. The main problem with Nearest Neighbors is the curse of dimensionality: the more features we include when calculating distance, the harder it is to have confidence in the results. This is why I included a way to compare only one ideal feature in the app.
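
That single-feature comparison amounts to running the same distance calculation on one column. Roughly, assuming feature_idx is whatever column the user chose:

            # compare cities on a single column only
            single = scipy.spatial.distance.cdist(data[:, [feature_idx]],
                                                  obs[:, [feature_idx]],
                                                  metric='euclidean')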

Overall, this project was a success. The recommendation system works and can handle many different situations. It is interesting to see what cities have in common. Take note as you try out the app.


Lessons Learned

My two main takeaways concern data integrity and displaying results.

If I were to start this project over, I would spend more time trying to find more complete data. I did not do my due diligence in making sure all the data was there, and by the time I realized some states were missing many cities, I was deep into the project. The data that we use in machine learning determines the output; if the data is missing or incomplete, we will have a biased output. My other change would be to make an output that you can connect with. My output doesn't push me to do anything. Presenting findings is just as important as the findings themselves; the presentation is what leads to action.