Clustering European cities

Soumya De
Aug 11, 2019

IBM Data Science Capstone Project

An image signifying Europe

This project was developed as part of IBM’s Data Science Professional Certificate course on Coursera. It uses scikit-learn’s KMeans clustering algorithm to cluster the 500 most populous European cities. The dataset was built by collecting data from several sources. This blog describes each step used to develop the project.

1. Introduction

1.1 Background

Europe is a continent that leaves people awestruck with its natural beauty, epic history and dazzling artistic and culinary diversity. Europe’s cultural heritage is its biggest single draw: the birthplace of democracy in Athens, the Renaissance art of Florence, the graceful canals of Venice and a lot more. Despite its population density, Europe maintains spectacular natural scenery. Cheers! Salud! Prost! Europe has some of the best nightlife in the world. Globally famous DJs keep the party going in London, Berlin and Paris, all of which also offer top-class entertainment, especially theater and live music. After one has ticked off the great museums, panoramic vistas and energetic nightlife comes the fun part, the one that people like me enjoy most, the magnificent menus: pizza in Naples, souvlaki in Santorini or even haggis in Scotland. Europe’s diversity and global reach are its trump card.

1.2 Problem description

The rich diversity and magnificent history of Europe have always made me want to live there. I believe there are a lot of people like me out there trying to find a place, or rather a city, to live in Europe. This project is designed for these people. The problem is to cluster 500 cities in Europe on the basis of art, food and heritage, and then examine each cluster to find out which one offers the most diverse characteristics, so that we can select a city from that cluster as a potential place to live in Europe.

2. Data

2.1 Data requirements & collection

To address the problem described above, we need a dataset that contains city-country pairs along with their latitude and longitude, and data about venues across each of these cities in Europe. To prepare a dataset of this kind we need data from three sources:

  1. The names of the cities and their countries from http://worldpopulationreview.com/continents/cities-in-europe. The BeautifulSoup package was used to scrape this data directly from the site. The site lists the top 500 cities across Europe by population, together with each city’s country and population.
  2. The geographical coordinates of each city from the OpenCage Geocoder API
  3. Data about up to 100 venues within a 10 km radius of each city from the Foursquare API

The data_collection.ipynb notebook in the repository was used to obtain the data and store it on disk.
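As a rough illustration of the third data source, the sketch below builds a request URL for a venue search around one city. It assumes the Foursquare v2 `venues/explore` endpoint that was commonly used in the course at the time. The function name `build_explore_url`, the credentials and the version date are placeholders and are not taken from the project’s notebook. The radius is given in metres, so 10 km becomes 10000.

```python
from urllib.parse import urlencode

# Assumption: the Foursquare v2 "explore" endpoint; the project's notebook
# may have used a different endpoint or different parameters.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius_m=10_000, limit=100):
    """Return the request URL for venues near (lat, lng).

    client_id, client_secret and version are placeholders that must be
    replaced with real developer credentials and a valid version date.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, format YYYYMMDD
        "ll": f"{lat},{lng}",    # city latitude and longitude
        "radius": radius_m,      # search radius in metres (10 km)
        "limit": limit,          # at most 100 venues per city
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Example with hypothetical credentials and the coordinates of Paris
url = build_explore_url(48.8566, 2.3522, "MY_ID", "MY_SECRET")
print(url)
```

The URL would then be fetched once per city with an HTTP client, and the returned venues appended to the dataset together with the city name, coordinates and country.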

3. Methodology

3.1 Exploratory Data Analysis

The dataset contains information about 43658 venues across 499 cities in 36 countries of Europe. Each record in the dataset has the following eight attributes:

  1. Venue (Name of the venue)
  2. Venue category
  3. Venue latitude
  4. Venue longitude
  5. City
  6. City latitude
  7. City longitude
  8. Country

To visualize each city, a map of Europe has been plotted with the cities superimposed on top (as shown below). The names of the cities and countries are intentionally not shown on the map because they would have made the image look messy.

500 European cities superimposed on map of Europe

The number of cities in each country of the collected data is shown in the following bar graph.

Number of cities in each country of the dataset

The bar graph above shows that Germany, Spain, the United Kingdom and Russia contribute the most cities to the dataset, with a particularly large number in Russia. The number of venues in each country therefore produces a similar-looking bar graph, simply because more venues are retrieved for a country with more cities.

Number of venues in each country of the dataset

A total of 561 venue categories were found in the dataset. The word cloud below shows the venue categories that are popular among Europeans. The larger the word, the more frequently it occurs in the dataset.
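The word sizes come from how often each category occurs. A minimal sketch of that count is given here, using a toy table in place of the real data; the column name `Venue Category` is an assumption about how the dataset is laid out.

```python
import pandas as pd

# Toy stand-in for the venue dataset; the column name is assumed.
venues = pd.DataFrame({
    "Venue Category": ["Coffee Shop", "Park", "Coffee Shop",
                       "Pub", "Coffee Shop", "Park"],
})

# Count how many times each category occurs, most frequent first
category_counts = venues["Venue Category"].value_counts()

# A plain {category: count} mapping, the form a word-cloud library
# typically accepts as input
frequencies = category_counts.to_dict()
print(frequencies)
```

A mapping like this can be handed to a word-cloud library (for example the `generate_from_frequencies` method of the `wordcloud` package) so that larger counts are drawn as larger words.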

Word cloud representing popular venues in Europe

This word cloud supports the background section of the introduction. Europe certainly has variety: coffee shops to parks, bakeries to steakhouses, movie theaters to historic sites. It has a whole lot of options to explore.

This notebook has been used to perform Exploratory data analysis.

3.2 Feature Engineering

To address the problem, that is, clustering these cities on the basis of culture and lifestyle, we need to look at the common places visited by people in these cities. To accomplish this, we use the 10 most common venue categories in each city as features. The steps involved in feature engineering are listed below:

  1. Create a one-hot encoding of each venue, using the unique venue categories as the feature set
  2. Group the data by city, taking the mean of the frequency of occurrence of each category
  3. Create a new dataframe with the top 10 venues for each city
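The three steps above can be sketched with pandas as follows. This is a minimal illustration on a toy table, not the project’s actual notebook code. The column names `City` and `Venue Category` and the helper `top_venues` are assumptions, and only the top 2 categories are kept here because the toy data is tiny (the project keeps 10).

```python
import pandas as pd

# Toy stand-in for the venue dataset; column names are assumed.
venues = pd.DataFrame({
    "City": ["Paris", "Paris", "Paris", "London", "London", "London"],
    "Venue Category": ["Cafe", "Museum", "Cafe", "Pub", "Pub", "Park"],
})

# Step 1: one-hot encode the venue category of every venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "City", venues["City"])

# Step 2: group by city; the mean of each one-hot column is the
# relative frequency of that category in the city
city_profiles = onehot.groupby("City").mean()


# Step 3: keep the n most frequent categories for each city
def top_venues(row, n=10):
    return row.sort_values(ascending=False).head(n).index.tolist()


top_n = 2  # the project uses 10; 2 is enough for this toy data
top = city_profiles.apply(
    lambda row: pd.Series(top_venues(row, top_n)), axis=1
)
top.columns = [f"Most common venue {i + 1}" for i in range(top_n)]
print(top)
```

Each row of `city_profiles` sums to 1, so cities with very different numbers of venues remain comparable.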

3.3 K-Means clustering on the obtained data

After the data has been prepared, the cities are grouped into 7 clusters using scikit-learn’s KMeans. Once clustering is done, we obtain a dataset that contains the cluster label along with the 10 most common venues for each city. The clusters are plotted on a folium map centered on Europe, and the results are given in the results section of this blog.
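A minimal sketch of the clustering step is shown below. It assumes that KMeans is fitted on a per-city table of category frequencies, which is the usual approach in the course; the exact feature matrix, `random_state` and `n_init` used in the project are not stated in this post, so the values here are illustrative. Random data stands in for the real city profiles.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the per-city feature table: 30 "cities" described by
# the relative frequency of 5 venue categories (each row sums to 1).
rng = np.random.default_rng(0)
city_profiles = rng.dirichlet(np.ones(5), size=30)

# 7 clusters as in the post; n_init and random_state are assumptions
# chosen only to make the sketch reproducible.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0)
labels = kmeans.fit_predict(city_profiles)

# Number of cities that ended up in each cluster
cluster_sizes = np.bincount(labels, minlength=7)
print(cluster_sizes)
```

Checking the cluster sizes right after fitting is a quick way to spot clusters that contain only a single city.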

The notebook in the link has been used for feature engineering and clustering.

4. Results

The KMeans algorithm produced reasonably good clusters. However, two cities, Kalininskiy, Russia (light blue) and Arad, Romania (light green), did not join any of the larger clusters and instead formed single-city clusters of their own, Cluster 4 and Cluster 5 respectively. The remaining five clusters are formed well enough. Cluster 1 (red) and Cluster 3 (blue) consist mainly of cities in Spain and the United Kingdom respectively. The most common place in Cluster 3 is a pub or a bar, and in Cluster 1 it is a Spanish restaurant. Cafes and restaurants offering various cuisines are popular among the cities in Cluster 7 (orange). In Cluster 2 (purple) most of the cities are in Russia and Ukraine. Cafes, parks, restaurants and gyms are common places in the cities of Cluster 2. Interestingly, Cluster 6 (yellow) is the most diverse of all, with parks, theaters, historic sites, restaurants, art museums and a lot more to offer. The figure below shows the resulting clusters of European cities superimposed on a map of Europe.

Clustered cities superimposed on a map of Europe

5. Discussions

From the results we can draw the following inferences. The cities of Kalininskiy, Russia (light blue) and Arad, Romania (light green) have each formed a cluster of their own. Even so, the two are similar to each other: their 4th to 10th most common venues are the same for Cluster 4 and Cluster 5.

Cluster 1 (red markers) consists mostly of cities in Spain, and Spanish restaurants are the most prominent venues in these cities. Parks, art museums and plazas are also popular. The blue markers belong to Cluster 3, where most cities are in the United Kingdom. Pubs, bars, coffee shops and parks are the most common venues in these cities. Surprisingly, Birmingham has Indian restaurants as its third most common venue, even though India is on a different continent, so there must be a considerable population there who love Indian food.

The orange markers denote the cities of Cluster 7, where people have a knack for good food. Restaurants of various cuisines (Italian, Japanese, Mediterranean etc.), cafes, parks and gyms are some of the most common places in these cities. Cluster 2 (purple markers) consists of the European cities where the most common place is a coffee shop.

Now comes my favorite cluster, Cluster 6 (yellow markers). Moscow, London, Berlin, Rome, Paris and many other well-known cities are in Cluster 6. The cities of Cluster 6 are culturally and socially diverse. Historic sites, restaurants, museums, scenic lookouts, yoga studios and opera houses are some of the most common venues in these cities.

6. Conclusion

This is the final section of the report, and here I will try to bring together the whole story so far. To conclude, Europe has a great cultural heritage, glorious scenery, fascinating nightlife and magnificent menus. Among the most populous cities of Europe, the cities belonging to Cluster 6 are the most varied. The cluster includes the birthplace of democracy in Athens, the Renaissance art of Florence, the graceful canals of Venice, the Napoleonic splendor of Paris, the multilayered historical and cultural canvas of London, the imperial palaces of Russia’s former capital St. Petersburg and the ongoing project of Gaudí’s La Sagrada Família in Barcelona.

Despite this rich cultural heritage, the cities in Cluster 6 maintain spectacular natural scenery: the rugged Scottish Highlands with their glens and lochs, the steppe-like plains of central Spain, and the beaches along the Mediterranean’s northern coast, where beach holidays were practically invented. Mountain lovers should head to the Alps, which march across central Europe taking in France, Switzerland, Austria, northern Italy and tiny Liechtenstein. The nightlife of London, Berlin and Paris is breathtaking. World-famous DJs keep the party going in these cities. Other key locations for high-energy nightlife include Moscow, Belgrade, Budapest and Madrid. Continue to party on the continent’s streets at a multiplicity of festivals, from city parades attended by thousands to concerts in an ancient amphitheatre.

After one has gone through the great heritage, scenery and nightlife, what’s left? A chance to indulge in a culinary adventure to beat all others. Who wouldn’t want to snack on pizza in Naples, souvlaki in Santorini or even haggis in Scotland? But did you also know that Britain has some of the best Indian restaurants in the world, that Turkey’s doner kebab is a key part of contemporary German food culture, and that in the Netherlands you can gorge on an Indonesian rijsttafel (rice table)?
Europe is a fairy-tale castle, a voyage through history and a traveler’s paradise, and it will be a great experience to live in one of these great European cities.

7. References

  1. http://worldpopulationreview.com/continents/cities-in-europe
  2. https://opencagedata.com/api
  3. https://foursquare.com/developers/apps

GitHub Repository : https://github.com/soumyagamer/First-Repository
