Hey everyone,
A couple days back I learned a new technique with CRAN for visualizing geographic data via mapping. This is something I’ve always wanted to do, and I’m glad I’ve received a project where this skill would be of some use (what you see here is not the actual project, fyi). Go ahead and click on the picture above. This is something that I created using a combination of Excel, CRAN, data from China NBS, and shapefiles from GADM and the U.S. Census Bureau. Basically, I gathered statistical data from China’s National Bureau of Statistics, manipulated it in Excel, linked it to information contained in the cartographic boundary files from GADM, and then composed the images you see here using a programming language called CRAN. What you see above is a heatmap of 2011 GDP per Capita by province, with the shades of color linked to GDP/Capita. The darker shades of blue represent regions with higher nominal average wealth and the lighter shades represent regions with a lower nominal average wealth. You can see that in this country, wealth is concentrated on the east coast in large metropolitan areas (Like Beijing, Tianjin, Shanghai, Hong Kong, and Ordos City), as well as in Inner Mongolia (represented as Nei Mongol on the map).
Data Manipulation in Excel
For this project, I extracted historical GDP per capita from China NBS using Excel by directly linking the cells to the website. My main task in this step was to reconcile the listing of the administrative regions from the cartographic boundary files with those listed from China NBS. I found that there were 5 differences in nomenclature, namely that Inner Mongolia, Ningxia, Xinjiang, and Tibet from China NBS were called Nei Mongol, Ningxia Hui, Xinjiang Uygur, and Xizang in the shapefiles, respectively. Furthermore, I couldn’t find the region listed as “Paracel Islands” from China NBS. In this case there were a small enough number of mismatches for me to correct them all manually.
I ran into another problem when I learned that Hainan used to be a part of Guangdong, and that Chongqing used to be part of Sichuan. This gave me headaches when I was trying to figure out how to divide the provinces into octiles for the graphics. I used the formula that you see in the above picture, which for the most part, kept the separated provinces together (and in the same quantile!) for older years. I did however, have to adjust one of the data points manually. I then created modified vectors to export into RStudio via csv, and in this case each vector had a nice, convenient set of 32 points.
Graphic Composition
My next task involved importing the data into RStudio, and then using CRAN code to generate the graphics. I looked around on Stack Overflow and found threads here and here, proposing ideas on how to link a numeric data vector (in my case, GDP per Capita for each province) to regions within an image. Neither of these proposals worked that well for me, so I had to combine the two and then make some modifications to the code for my purposes. Here’s the end result (if collapsed, click “show source” to see the code):
[code language=”r” toolbar=”TRUE” wraplines=”FALSE” collapse=”FALSE”]
setwd(“./China/”)
library(maptools)
library(RColorBrewer)
##substitute your shapefiles here
state.map <- readShapeSpatial(“CHN_adm0.shp”)
counties.map <- readShapeSpatial(“CHN_adm1.shp”)
## these are the variables we will be plotting
chinagdp <- read.csv(“chinagdp.csv”,header=TRUE)
counties.map@data$x1978 <- chinagdp$X1978M
counties.map@data$x1980 <- chinagdp$X1980M
counties.map@data$x1985 <- chinagdp$X1985M
counties.map@data$x1990 <- chinagdp$X1990M
counties.map@data$x1995 <- chinagdp$X1995M
counties.map@data$x2000 <- chinagdp$X2000
counties.map@data$x2005 <- chinagdp$X2005
counties.map@data$x2010 <- chinagdp$X2010
## put the lab point x y locations of the zip codes in the data frame for easy retrieval
labelpos <- data.frame(do.call(rbind, lapply(counties.map@polygons, function(x) x@labpt)))
names(labelpos) <- c(“x”,”y”)
counties.map@data <- data.frame(counties.map@data, labelpos)
plot.heat <- function(counties.map,state.map,z,title=NULL,breaks=NULL,reverse=FALSE,cex.legend=1,bw=.2,col.vec=NULL,plot.legend=TRUE) {
##Break down the value variable
if (is.null(breaks)) {
breaks=
seq(
floor(min(counties.map@data[,z],na.rm=TRUE)*10)/10
,
ceiling(max(counties.map@data[,z],na.rm=TRUE)*10)/10
,.1)
}
counties.map@data$zCat <- cut(counties.map@data[,z],breaks,include.lowest=TRUE)
cutpoints <- levels(counties.map@data$zCat)
if (is.null(col.vec)) col.vec <- heat.colors(length(levels(counties.map@data$zCat)))
if (reverse) {
cutpointsColors <- rev(col.vec)
} else {
cutpointsColors <- col.vec
}
levels(counties.map@data$zCat) <- cutpointsColors
plot(counties.map,border=gray(.8), lwd=bw,axes = FALSE, las = 1,col=as.character(counties.map@data$zCat))
if (!is.null(state.map)) {
plot(state.map,add=TRUE,lwd=1)
}
##Edit the legend information here
if (plot.legend) legend(“bottomleft”,c(“1st”,”2nd”,”3rd”,”4th”,”5th”,”6th”,”7th”,”8th”), fill = rev(cutpointsColors),bty=”n”,title=title,cex=cex.legend)
##title(“Cartogram”)
}
# plot.heat(counties.map,state.map,z=”x1978″,breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,”Blues”),plot.legend=TRUE,reverse=TRUE,title=”Octile n (8th is the highest)”)
# title(main=”China: GDP per Capita by Administrative Region, 1978″)
# plot.heat(counties.map,state.map,z=”x1980″,breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,”Blues”),plot.legend=TRUE,reverse=TRUE,title=”Octile n (8th is the highest)”)
# title(main=”China: GDP per Capita by Administrative Region, 1980″)
# plot.heat(counties.map,state.map,z=”x1985″,breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,”Blues”),plot.legend=TRUE,reverse=TRUE,title=”Octile n (8th is the highest)”)
# title(main=”China: GDP per Capita by Administrative Region, 1985″)
# plot.heat(counties.map,state.map,z=”x1990″,breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,”Blues”),plot.legend=TRUE,reverse=TRUE,title=”Octile n (8th is the highest)”)
# title(main=”China: GDP per Capita by Administrative Region, 1990″)
# plot.heat(counties.map,state.map,z=”x1995″,breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,”Blues”),plot.legend=TRUE,reverse=TRUE,title=”Octile n (8th is the highest)”)
# title(main=”China: GDP per Capita by Administrative Region, 1995″)
# plot.heat(counties.map,state.map,z=”x2000″,breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,”Blues”),plot.legend=TRUE,reverse=TRUE,title=”Octile n (8th is the highest)”)
# title(main=”China: GDP per Capita by Administrative Region, 2000″)
plot.heat(counties.map,state.map,z=”x2005″,breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,”Blues”),plot.legend=TRUE,reverse=TRUE,title=”Octile n (8th is the highest)”)
title(main=”China: GDP per Capita by Administrative Region, 2005″)
# plot.heat(counties.map,state.map,z=”x2010″,breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,”Blues”),plot.legend=TRUE,reverse=TRUE,title=”Octile n (8th is the highest)”)
# title(main=”China: GDP per Capita by Administrative Region, 2010″)
## plot text
with(counties.map@data[c(-2,-10,-25,-28,-32),], text(x,y,NAME_1,cex=.7,font=2))
with(counties.map@data[2,],text(x,y+1.1,NAME_1,cex=.7,font=2))
with(counties.map@data[28,],text(x+1.7,y+.7,NAME_1,cex=.7,font=2))
with(counties.map@data[10,],text(x-.5,y-1,NAME_1,cex=.7,font=2))
with(counties.map@data[25,],text(x+3.3,y,NAME_1,cex=.7,font=2))
with(counties.map@data[32,],text(x-.7,y,NAME_1,cex=.7,font=2))
[/code]
When I first created the graphics, the only color option I had was red because the function heat.colors() only returned red colors. To fix that, I loaded the package RColorBrewer which allowed me to choose contiguous colors of any “typical” shade, like red, blue, green, etc. When I tried to run the code in the first Stack Overflow link, the script halted at the function plot.heat(), and it took me a while to figure out that it wouldn’t run because I didn’t have the function defined. Luckily, the second Stack Overflow link had the function defined (but at the same time, it didn’t have any code for labels). I thus combined the two to get the features I wanted.
I then changed the source files (the shapefiles along with the manipulated data from Excel), and assigned vectors for each year of data to separate variables. Some of the other changes I made were within the plot.heat() function itself (lines 23-49), mostly dealing with aesthetic features like the legend format. After doing this, I ran the code a couple times more and noticed that the labels were all bunched up together, so I decided to manually change the label coordinates (lines 78-83) for regions that were located closely together.
You’ll notice that I commented out a lot of plots (lines 52-74), which keeps the code from executing – in this case I created the plots one at a time by commenting/uncommenting these blocks of code. I should make some further improvements later to get all the plots created as a batch.
Result
You can see from the above that in recent decades, much of the wealth has shifted eastward. Most strikingly, Tibet and Qinghai were once the richest regions of China, and have since then become some of the poorest (see the earlier image from Excel for the actual rankings. Since GDP per Capita is just a single variable, I don’t want to make any hasty conclusions about income inequality amongst the provinces yet, as I’ll have to take a deeper look into the data for further analysis. What’s cool about this procedure is that it can be used with several types of data by just changing the vectors in the source code. I combined all of the images and created an animated .gif, which allows you to see the how the distribution of wealth has changed over time:
That’s it for today. I hope you enjoyed it!