Visualising economic data using Plotly

Since I am an economist by training and love programming and data science, I wanted to combine these passions and do some fun data analysis. This post makes use of a variety of Python libraries to scrape and visualise economic data. I hope this is useful to some of you and that you enjoy reading this as much as I enjoyed writing it.

The first thing I need to do is get some data. Since Wikipedia is the source of all internet knowledge (not!!) let’s start there. I decided to scrape a table from the following Wikipedia page: [Wiki]. I thought it would be interesting to look at some of the richest and poorest countries in the world. The table in question ranks countries by their GDP per capita.

Before I go any further it is probably a good idea to give a brief explanation of what GDP per capita is (it is easy to forget that a lot of people don’t really speak “economics”). Simply put, it is a measure of how wealthy a country is. It is essentially the value of all the goods and services produced within a country's borders in a year, divided by the population. This gives us a way of describing the average level of wealth per person in that country. It is quite an important economic variable and is often used to compare wealth levels across countries and across time.

In general, GDP per capita can increase for the following reasons:

1. GDP increases
2. Population decreases
3. A combination of both
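To make the arithmetic concrete, here is a minimal sketch. The function name and all of the figures are made up purely for illustration; none of them come from the Wikipedia table.

```python
def gdp_per_capita(gdp: float, population: int) -> float:
    """Return GDP divided by population (average output per person)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return gdp / population


# Hypothetical economy: 500 billion dollars of output, 10 million people.
base = gdp_per_capita(500e9, 10_000_000)

# 1. GDP increases while the population stays the same.
higher_gdp = gdp_per_capita(550e9, 10_000_000)

# 2. The population decreases while GDP stays the same.
fewer_people = gdp_per_capita(500e9, 8_000_000)

print(base, higher_gdp, fewer_people)
```

Both changes push the figure above the base value, which is why the measure can rise even when the economy itself is not growing.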

This measure is generally thought to be a better indication of a country's wealth than GDP alone. Now that we have the brief econ primer out of the way, let's dig into the analysis. As I mentioned before, I will be scraping data from Wikipedia, so using Beautiful Soup seems like a no-brainer. This library greatly simplifies extracting data from web pages and is pretty much the go-to library for web scraping in Python.

What is BeautifulSoup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. This makes the library extremely useful for extracting info from web pages. If you want more info on how exactly the library works and the various tasks you can perform with Beautiful Soup, feel free to read the [Documentation].

In order to use beautiful soup, it is worth knowing some simple html tags. Having a little bit of knowledge of html will make it a great deal easier to search for the data we want. For example, Wikipedia uses a table tag for the tables it displays on its web pages. Knowing this, we can simply parse the html and look only for information contained within these tags.

First off I need to import all the necessary libraries for the analysis.

import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from bubbly.bubbly import bubbleplot
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
init_notebook_mode(connected=True) #do not miss this line
from plotly import tools

Now that I have loaded the libraries, we are ready to start the analysis. The code below loads the web page into our Jupyter notebook and passes it to the BeautifulSoup class to create a soup object.

# web_page holds the URL of the Wikipedia page linked above
req = requests.get(web_page)
page = req.text
soup = BeautifulSoup(page, 'html.parser')
soup.title

Ok so it looks like it worked and we can use some functions on the soup object and extract the data we want. I mentioned before that we were interested in the table tag. Below is the code to extract all of the tables from the wiki page.

table = soup.find_all("table", "wikitable")
len(table)

from IPython.display import IFrame, HTML
HTML(str(table))

The code above returns a list where each entry contains one of the tables on the page. This page only has five tables, so it is pretty easy to find the one we need, which happens to be the first entry in the list. We can confirm this with the HTML command from IPython.display, which renders the table as it appears on Wikipedia.

Now that we have the table it is just a matter of getting the country names and the GDP per capita. To do this, we need to know a bit more about the structure of HTML tables. In particular, we should know about the <th>, <tr> and <td> tags. These stand for table header, table row, and cell respectively. Ok so let’s try extracting some of the data.

GDP_PC = table[0]
table_rows = GDP_PC.find_all('tr')
header = table_rows[0]  # the first row holds the column headers
table_rows[1].a.get_text()  # name of the first country

This code finds all the tr tags, which mark the rows of the table. The first row is the header, and every row after it corresponds to a country. The country name is located in that row's <a> tag, so a.get_text() extracts it in the same way for each value of the index.
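To see how the <tr>, <th>, <td> and <a> tags fit together without fetching anything from the web, here is a small sketch that runs the same calls on a hand-written table. The table, its country names and the variable names are stand-ins I made up for illustration; the real Wikipedia table has more columns and may be laid out differently.

```python
from bs4 import BeautifulSoup

# A tiny hand-written table standing in for the Wikipedia one (hypothetical rows).
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Country</th><th>Int$</th></tr>
  <tr><td>1</td><td><a href="/wiki/A">Country A</a></td><td>114,430</td></tr>
  <tr><td>2</td><td><a href="/wiki/B">Country B</a></td><td>98,760</td></tr>
</table>
"""

toy_soup = BeautifulSoup(html, "html.parser")
toy_table = toy_soup.find_all("table", "wikitable")[0]
toy_rows = toy_table.find_all("tr")

# Row 0 is the header: its cells are <th> tags.
columns = [th.get_text() for th in toy_rows[0].find_all("th")]

# Every later row is a country: the name sits inside the row's first <a> tag.
names = [row.a.get_text() for row in toy_rows[1:]]

print(columns)
print(names)
```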

Now, all we need to do to get all the country names is to loop through table_rows, extract the data and append to a list.

countries = [table_rows[i].a.get_text() for i in range(len(table_rows))[1:]]
cols = [col.get_text() for col in header.find_all('th')]

Python has a really nice succinct way of coding these kinds of loops using list comprehensions. Note I skip the first entry of table_rows as it does not correspond to a country. We also use a list comprehension to extract the column headers which will be useful later on. The above code is equivalent to the for loop below.

country = []
for i in range(len(table_rows))[1:]:
    country.append(table_rows[i].a.get_text())

Next up is the <td> tag. This is where our GDP per capita data is stored in the table. The data is pretty messy, however, and there are a number of workarounds we can use to get the correct data out and into the right format. Let’s take a quick look at one of the data points.

temp = GDP_PC.find_all('td')
temp[5].get_text()

This gives us ‘114,430\n’. We can see that all the data is stored as strings, and there are commas and line breaks in each cell, so we will need to fix this later. First, let’s concentrate on getting the data into a list.

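temp = GDP_PC.find_all('td')
# Keep the cells that contain a comma (values of 1,000 or more)...
GDP_per_capita = [temp[i].get_text() for i in range(len(temp)) if "," in temp[i].get_text()]
# ...and drop any that contain a non-breaking space
GDP_per_capita = [i for i in GDP_per_capita if '\xa0' not in i]

# Workaround for the values below 1,000 at the end of the table
temp_list = []
for i in range(len(temp)):
    temp_list.append(temp[i].get_text())
new_list = temp_list[-11:]
numbers = [i for i in new_list if "\n" in i]
for i in numbers:
    GDP_per_capita.append(i)

# Rank runs from 1 up to the number of countries
rank = list(range(1, len(countries) + 1))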

There is a lot going on in the code above so let’s go through it step by step. The first thing I do is find all the cells in GDP_PC and store them in a temp variable. The next line loops through this variable and grabs the text if it contains a comma. I did this since most of the entries are in the thousands and therefore contain a comma. This approach does, however, miss the last four entries, as they are only hundreds of dollars, so I have to create a workaround for that, which is what the new_list and numbers variables are doing. Finally, I append these entries onto the GDP_per_capita list and also generate a rank column, which is just the numbers from 1 to 192. This may not be the most efficient way of doing this and there is probably a better way, but hey, it worked so I am happy with it.
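One possible alternative to the comma check is to read the value by its position in each row, which handles figures below 1,000 without a separate workaround. This is only a sketch: the helper name to_int is hypothetical, the rows are invented, and it assumes the GDP figure is the last cell in every row, which may not hold for the real page.

```python
def to_int(cell_text: str) -> int:
    """Turn a cell such as '114,430\\n' into the integer 114430."""
    return int(cell_text.strip().replace(",", ""))


# Hypothetical rows: each inner list holds the text of the <td> cells in one <tr>,
# as would be produced by [td.get_text() for td in row.find_all("td")].
rows_as_text = [
    ["1\n", "Country A\n", "114,430\n"],
    ["2\n", "Country B\n", "9,870\n"],
    ["3\n", "Country C\n", "661\n"],  # under 1,000, so there is no comma to search for
]

# Assumption: the GDP figure is the last cell in every row.
gdp_values = [to_int(row[-1]) for row in rows_as_text]
print(gdp_values)
```

Because the values are converted as they are collected, the later string-cleaning step would not be needed with this approach.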

After extracting the three columns (rank, country and GDP per capita) as lists, we need to merge them together and create a pandas data frame. This will make plotting and analysing the data much simpler. The handy built-in function zip lets us do this, and I use it to create two separate data frames: one for the 20 richest countries and one for the 20 poorest. The code below implements this.

import pandas as pd

cols = ['Rank', 'Country', 'GDP Per Capita']

# Top 20 richest countries
data = zip(rank[:20], countries[:20], GDP_per_capita[:20])
data1 = pd.DataFrame(list(data), columns=cols)

# Bottom 20 poorest countries
data2 = zip(rank[-20:], countries[-20:], GDP_per_capita[-20:])
data2 = pd.DataFrame(list(data2), columns=cols)
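If zip is new to you, the short sketch below shows what it contributes. The lists are toy stand-ins with made-up country names; only the string format mirrors the scraped cells.

```python
toy_rank = [1, 2, 3]
toy_countries = ["Country A", "Country B", "Country C"]  # hypothetical names
toy_gdp = ["114,430\n", "98,760\n", "87,100\n"]  # still raw strings at this stage

# zip pairs the i-th element of each list into one tuple, i.e. one table row.
toy_rows = list(zip(toy_rank, toy_countries, toy_gdp))
print(toy_rows[0])

# Slicing each list before zipping keeps only the first two rows.
top_two = list(zip(toy_rank[:2], toy_countries[:2], toy_gdp[:2]))
print(len(top_two))
```

A list of row tuples like this is exactly the shape that pd.DataFrame accepts together with a list of column names.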

We now have our top and bottom 20 countries in pandas data frames. Before we can plot the data we need to do a little bit more cleaning. The data are currently stored as strings, so we need to fix this in order to use certain pandas functions. The code below removes the newline characters (“\n”) and commas, and converts the column to int.

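# Strip both the newline and the thousands separator before converting to int
data1['GDP Per Capita'] = data1['GDP Per Capita'].apply(lambda x: x.replace('\n', '').replace(',', '')).astype(int)
data2['GDP Per Capita'] = data2['GDP Per Capita'].apply(lambda x: x.replace('\n', '').replace(',', '')).astype(int)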

We finally have our data ready to create some nice looking visualisations.

Introduction to Plotly

Now we can move onto Plotly to create what I think are really nice visualisations. I really like this library and it is simple enough to make pretty interactive plots. If you want to know about what kind of graphs you can create I encourage you to read the documentation [Website]

Below is the code to create a simple bar chart of the top 20 richest countries in the world. First, we pass the data to go.Bar to create a bar chart with the country names on the x-axis and the GDP per capita on the y-axis. We then store this in a list and pass it to go.Figure. The same steps apply to all the different types of plots in Plotly. Some of the results may or may not surprise you. For example, the top 10 is littered with countries heavily focused on producing oil, such as Qatar and Kuwait, which get approximately 70 and 94 per cent of government revenue from oil. A lot of these countries tend to have relatively small populations and large economies, so it is not really surprising that they are very rich based on this measure (a lot of wealth to share out among a relatively small population).

trace1 = go.Bar(x=data1.Country, y=data1['GDP Per Capita'])
data = [trace1]
layout = go.Layout(title='Top 20 countries ranked by GDP per Capita')
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)

Pretty easy right? Now for the poorest countries. Not surprisingly, these countries tend to be concentrated in Africa where populations tend to grow rapidly and the economies lag behind the more developed nations.

After getting a quick overview of the richest and poorest countries, let's try to get a broader overview by looking at the world as a whole. A good way of doing this is by using a map. In Plotly you can create choropleth maps, which shade the different regions based on some variable. In our case that is GDP per capita. Countries with a higher GDP per capita will have a darker shade of red. The most important things to note about this code are the country names passed into the locations argument and the locationmode argument. These must match for the plot to work. You can also use country codes and even longitude and latitude to achieve the same plot, but I think this is probably the easiest way. Notice that Plotly allows you to zoom in to particular regions for a closer look, which is a really nice feature.

We can see that the richest countries tend to be centred in North America, Europe, and the oil producing nations, while the poorest countries are in Africa, denoted by the lighter colour.

# data_all is a DataFrame of all countries with 'Country' and 'GDP Per Capita' columns
# (its construction is not shown in this post)
data = [dict(
    type='choropleth',
    locations=data_all['Country'],
    locationmode='country names',  # must describe what is passed in locations
    autocolorscale=True,
    z=data_all['GDP Per Capita'],
    marker=dict(line=dict(color='rgb(255,255,255)', width=2)),
    colorbar=dict(title="USD")  # GDP per capita is in dollars, not millions
)]
layout = dict(title='Top Countries by GDP per capita')
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
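For anyone who prefers the country-code route mentioned above, here is a rough sketch of the same kind of trace keyed on three-letter ISO codes, which Plotly supports through locationmode 'ISO-3'. The lookup dictionary, the helper name and the z values are placeholders of my own, not part of the original analysis.

```python
# Hypothetical lookup from country name to ISO 3166-1 alpha-3 code.
ISO3 = {"Qatar": "QAT", "Kuwait": "KWT"}


def choropleth_trace(names, values):
    """Build a choropleth trace dict keyed on ISO-3 codes instead of country names."""
    return dict(
        type="choropleth",
        locations=[ISO3[name] for name in names],
        locationmode="ISO-3",  # must describe the kind of value in locations
        z=list(values),
        colorbar=dict(title="USD"),
    )


# Placeholder z values, not real GDP figures.
trace = choropleth_trace(["Qatar", "Kuwait"], [2.0, 1.0])
fig_spec = dict(data=[trace], layout=dict(title="GDP per capita"))
print(trace["locations"])
# In a notebook, passing fig_spec to py.offline.iplot should render the map.
```

Codes avoid the spelling mismatches that country names are prone to, at the cost of maintaining the lookup.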

Do People from Rich Nations Live Longer?

Ok, now that I have shown you some simple plots using Plotly I want to go a step further and create something really cool. There is a really nice library called bubbly which creates bubble charts and has some interesting features to enhance the level of interactivity you can have with your charts. You can do this with Plotly alone, but there is quite a bit of coding involved to achieve the desired effect and bubbly makes it super easy. Credit to [Aashitak] for this library. There is also a nice [kaggle kernel] showing how the library works under the hood, which is definitely worth checking out.

What I want to do here is create a bubble chart looking at GDP per capita vs life expectancy. The chart also takes into account the population of each country and which continent the country is in. I obtained all of the data from the World Bank website. The code below reads the data in using pandas and creates lists of the unique countries, continents and years, which will be useful for manipulating the data. As it turns out, this is a pretty famous visualisation created by the Gapminder Foundation. They have a really nice tool for plotting this and other charts, available [Here] if anyone wants to check it out.

The World Bank data that I use here is in a completely different format than the gapminder_indicators dataset on Kaggle (which this plot is originally based on). To use the bubbly library we need the data to be in the format of the latter, so there is a bit of data manipulation required. The reason I used the World Bank data is that it has a slightly longer time series and I wanted to get a view of more recent developments. The code below loads the datasets in and extracts the same countries that are used in the gapminder dataset.

gdp = pd.read_csv("gdp_per_capota.csv", engine="python")
life = pd.read_csv("LifeExp.csv", engine="python")
pop = pd.read_csv("population.csv", engine="python")
gapminder_indicators = pd.read_csv("gapminder_indicators.csv", engine="python")

countries = gapminder_indicators.country.unique()
continents = gapminder_indicators.continent.unique()
years = gapminder_indicators.year.unique()

# Filter countries first
gdp_new = gdp[gdp['Country Name'].isin(countries)]
life_new = life[life['Country Name'].isin(countries)]
pop_new = pop[pop['Country Name'].isin(countries)]

# Now filter years
years = [str(year) for year in years]
years = years[6:]
for i in ['2010', '2013', '2016']:
    years.append(i)
years.insert(0, "Country Name")
# years is now:
# ['Country Name', '1982', '1987', '1992', '1997', '2002', '2007', '2010', '2013', '2016']

gdp_new = gdp_new[years]
life_new = life_new[years]
pop_new = pop_new[years]

The gapminder_indicators dataset has the data in the correct format for plotting (long format, see below), so essentially we need to manipulate the three datasets into the same format and merge them together before we can plot them using bubbly.

country      continent  year  lifeExp  pop       gdpPercap
Afghanistan  Asia       1952  28.801   8425333   779.445314
Afghanistan  Asia       1957  30.332   9240934   820.853030
Afghanistan  Asia       1962  31.997   10267083  853.100710
Afghanistan  Asia       1967  34.020   11537966  836.197138
Afghanistan  Asia       1972  36.088   13079460  739.981106

The World Bank dataset is formatted differently, with each year allocated its own column (wide format), as the population sample below shows. The code that follows manipulates the World Bank data into the correct format.

Country Name  1960       1961       1962
Aruba         54211.0    55438.0    56225.0
Afghanistan   8996351.0  9166764.0  9345868.0
Angola        5643182.0  5753024.0  5866061.0
Albania       1608800.0  1659800.0  1711319.0
Andorra       13411.0    14375.0    15370.0

melted_gdp = pd.melt(gdp_new, id_vars=["Country Name"], var_name="Year", value_name="Data")
grouped_gdp = melted_gdp.groupby(["Country Name"]).apply(lambda x: x.sort_values(["Year"], ascending=True)).reset_index(drop=True)

melted_life = pd.melt(life_new, id_vars=["Country Name"], var_name="Year", value_name="Data")
grouped_life = melted_life.groupby(["Country Name"]).apply(lambda x: x.sort_values(["Year"], ascending=True)).reset_index(drop=True)

melted_pop = pd.melt(pop_new, id_vars=["Country Name"], var_name="Year", value_name="Data")
grouped_pop = melted_pop.groupby(["Country Name"]).apply(lambda x: x.sort_values(["Year"], ascending=True)).reset_index(drop=True)

# Merge the three long tables on country and year
temp = pd.merge(grouped_gdp, grouped_life, on=['Country Name', 'Year'], how='inner')
temp = pd.merge(temp, grouped_pop, on=['Country Name', 'Year'], how='inner')
cols = ['Country Name', 'Year', 'Data_x', 'Data_y', 'Data']
temp = temp[cols]
data = temp.copy()

Let me explain what I am doing here. The melt function collapses all the year columns into a single Year column, with the corresponding values alongside in a single Data column. I then group by country name and sort each group by year, so I am left with a dataset that has the countries sorted alphabetically and the years sorted chronologically. This is the same layout as gapminder_indicators. The datasets are then merged on the country name and year. I think you may be able to do this in one pandas function, but I decided to do it in a more manual way as it is good practice for manipulating data step by step.
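The wide-to-long step is easier to follow on a tiny frame. The sketch below uses two of the population rows from the sample above; the variable names are my own. It sorts on both keys in one call instead of grouping first, which should give the same ordering.

```python
import pandas as pd

# Wide format: one row per country, one column per year.
wide = pd.DataFrame(
    {
        "Country Name": ["Aruba", "Afghanistan"],
        "1960": [54211.0, 8996351.0],
        "1961": [55438.0, 9166764.0],
    }
)

# melt stacks the year columns into a single "Year" column,
# with the matching figures in a single "Data" column.
long_df = pd.melt(wide, id_vars=["Country Name"], var_name="Year", value_name="Data")

# Sort countries alphabetically and years chronologically within each country.
long_df = long_df.sort_values(["Country Name", "Year"]).reset_index(drop=True)
print(long_df)
```

Two countries with two year columns each become four rows, one per country and year.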

The other thing we need to do now is to create a continent column which maps the country to the correct continent as this information will be used when plotting. To do this we create a dictionary using the gapminder dataset and then map this dictionary to a new column in my merged dataset.

dictionary = dict(zip(gapminder_indicators['country'], gapminder_indicators['continent']))
data["continent"] = data["Country Name"].map(dictionary)
data.rename(columns={'Data_x': 'GDP_pc', 'Data_y': 'Life Expectancy', 'Data': 'Population'}, inplace=True)
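Here is a small sketch of how the dictionary lookup behaves, using a toy frame in the shape of gapminder_indicators. The frame and the deliberately unmatched name "Atlantis" are invented for illustration.

```python
import pandas as pd

# A few rows in the shape of gapminder_indicators (one row per country and year).
lookup_source = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan", "Albania"],
        "continent": ["Asia", "Asia", "Europe"],
    }
)

# Repeated countries simply overwrite the same key with the same value.
country_to_continent = dict(zip(lookup_source["country"], lookup_source["continent"]))

merged = pd.DataFrame({"Country Name": ["Afghanistan", "Albania", "Atlantis"]})
merged["continent"] = merged["Country Name"].map(country_to_continent)

# Names absent from the dictionary come back as NaN, which makes mismatches easy to spot.
unmatched = merged[merged["continent"].isna()]["Country Name"].tolist()
print(unmatched)
```

Checking for missing continents like this is a cheap sanity test after mapping, since country spellings often differ between data sources.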

Finally, we have a finished dataset and we can create our plot. We use the bubbleplot function in the bubbly library to do this. The function creates a beautiful interactive plot of life expectancy vs GDP per capita and sizes each bubble according to the population of the country. The bubbles are also coloured by continent, and we are able to plot all of this information across time, which is really nice. The most notable changes are China and India, indicated by the largest purple bubbles. At the start of the sample, they were among the poorest countries and had a relatively low life expectancy. Over time, however, they made a substantial move towards the upper right of the chart, indicating large increases in both GDP per capita and life expectancy. This pretty much mirrors what we have seen with China becoming an economic powerhouse over the last 20 years or so.

What is also clear from the chart is that there is a positive correlation between GDP per capita and Life Expectancy. As one increases the other also tends to increase. Of course, this tells us nothing about any causal relationship and it is unclear whether countries have a higher life expectancy because they are rich or countries are rich because they have a higher life expectancy. That is perhaps a question for an economics research paper and not this particular blog post.
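If you want to put a number on that relationship, a Pearson correlation coefficient is one option. The sketch below implements it by hand; the four data points are made up for illustration and are not World Bank figures, so only the mechanics carry over to the real dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values, not World Bank data.
toy_gdp_pc = [800, 3_000, 12_000, 45_000]
toy_life_exp = [55.0, 66.0, 74.0, 81.0]

# The chart uses a log scale for GDP per capita, so correlate against log(GDP).
r = pearson([math.log(g) for g in toy_gdp_pc], toy_life_exp)
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear association, but, as noted above, it says nothing about which way any causation runs.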

So that is how you can extract data from the internet using beautiful soup and also how to use data visualisations to interpret and uncover trends in data which might not be immediately obvious looking at the raw data.

from bubbly.bubbly import bubbleplot
figure = bubbleplot(dataset=data, x_column='GDP_pc', y_column='Life Expectancy',
    bubble_column='Country Name', time_column='Year', size_column='Population',
    color_column='continent', x_title="GDP per Capita", y_title="Life Expectancy",
    title='Gapminder Global Indicators', x_logscale=True, scale_bubble=3, height=650)
iplot(figure, config={'scrollzoom': True})