Gene Dan's Blog

Category Archives: Mathematics

No. 81: A Brief Introduction to Sweave

22 January, 2013 3:08 AM / Leave a Comment / Gene Dan

Hey everyone,

I’ve been using RStudio more regularly at work, and last week I discovered a useful feature called Sweave that allows me to embed R code within a LaTeX document. As the PDF is being compiled, the R code is executed and the results are inserted into the document, creating publication-quality reports. To see what I mean, take a look at the following code:

[code language="R"]
\documentclass{article}
\usepackage{parskip}
\begin{document}
\SweaveOpts{concordance=TRUE}

Hello,\\
Let me demonstrate some of the capabilities of Sweave. Here are the first 20 rows of a data frame depicting temperatures in New York City. I can first choose to output the code without evaluating it:

<<eval=false>>=
library('UsingR')
five.yr.temperature[1:20,]
@

and then evaluate the preceding lines with the output following this sentence:

<<echo=false>>=
library('UsingR')
five.yr.temperature[1:20,]
@
\end{document}
[/code]

After compilation, the resulting PDF looks like this:

[Figure: the compiled PDF] View PDF

Within a Sweave document, the embedded R code is nested within sections called "code chunks", each of which begins with the characters <<>>= and ends with the character @. The above example contains two code chunks: one to print the R input onto the document without evaluating it, and a second to print the R output without printing the R input. This is achieved with the options "eval=false" and "echo=false". The option eval specifies whether or not the R code should be evaluated, and the option echo specifies whether the R input should be displayed in the PDF.
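For reference, here's a minimal sketch of how these two chunk options combine (the chunk labels and the toy expression are my own):

[code language="R"]
<<both, eval=true, echo=true>>=
# default behavior: the input is displayed, evaluated, and followed by its output
mean(1:10)
@

<<inputonly, eval=false, echo=true>>=
# the input is displayed but never run, so no output appears
mean(1:10)
@

<<outputonly, eval=true, echo=false>>=
# the input is hidden; only the output appears in the PDF
mean(1:10)
@
[/code]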

Sweave also has the capability to print graphics onto your PDF. The following example applies three different smoothing techniques to a dataset containing temperatures in New York City, and then plots the results in a scatter plot:

[code language="R"]
\documentclass{article}
\usepackage{parskip}
\begin{document}
\SweaveOpts{concordance=TRUE}

Here's a chart depicting three different smoothing techniques on a dataset. Below, you'll see some R input, along with the resulting diagram:

<<fig=true>>=
library('UsingR')
attach(five.yr.temperature)
scatter.smooth(temps~days,col="light blue",bty="n")
lines(smooth.spline(temps~days),lty=2,lwd=2)
lines(supsmu(days, temps),lty=3,lwd=2)
legend(x=110,y=40,lty=c(1,2,3),lwd=c(1,2,2),
       legend=c("scatter.smooth","smooth.spline","supsmu"))
detach(five.yr.temperature)
@

\end{document}
[/code]

[Figure: scatter plot of the temperature data with the three smoothers] View PDF

Pretty neat, right? I’d have to say that I’m extremely impressed with RStudio’s team, and their platform has made both R and LaTeX much more enjoyable for me to use. From the above examples, we can conclude that there are at least two benefits from using Sweave:

  1. There's no need to save images, or copy and paste output into a separate file. A novice R user would likely generate the output in a separate instance of R, copy both the input and output into a text file, and then copy those pieces into a final report. That process is both time-consuming and error-prone.
  2. The R code is evaluated when the LaTeX document is compiled, which means the R input and output in the final report always correspond to each other. This greatly reduces the frequency of errors and increases the consistency of the code you see in the final report.

Because of this, I’ve found Sweave to be extremely useful on the job, especially in the documentation of code.

Additional Resources
The code examples above use data provided with a book I'm currently working through, Using R for Introductory Statistics. The book comes with its own package, 'UsingR', which contains several data sets used in its exercises. Sweave has an official instruction manual, which can be found on its official home page, here. I found the manual to be quite technical, and I believe it might be difficult for people who are not thoroughly familiar with the workings of LaTeX. I believe the key to learning Sweave is simply to learn the noweb syntax and to experiment with adjusting the code-chunk options yourself.

noweb
An article on Sweave from RNews
A tutorial by Nicola Sartori
The Joy of Sweave by Mario Pineda-Krch
More links from UMN
An article from Revolution Analytics


Posted in: Logs, Mathematics / Tagged: LaTeX, R, R LaTeX integration, RStudio, Statistics, Sweave

No. 78: Project Euler #1

30 November, 2012 3:45 AM / 1 Comment / Gene Dan

Hey everyone,

I took a few weeks off posting in order to study for Exam MFE, which I took 4 weeks ago. I’m still waiting for the result, so in the meantime I’ve decided to knock out the CAS modules and finish up my SQL course before I have to start studying hard again. Anyway, I’ve finally gotten around to attempting the Project Euler problems that my friends, along with other actuaries, recommended to me as a good way to improve my programming skills.

Basically, Project Euler is a website that hosts mathematical problems that anyone can attempt to solve with whatever method they wish. Most of the problems are simply stated and can be understood by most people who have taken a basic course in number theory. However, they are often very computationally intensive if attempted directly so most people try to solve them by writing computer programs. Oftentimes students will use the site in an iterative fashion to practice their programming skills when picking up a new language. For example, one of my colleagues who has solved 80 problems first used the site to learn VBA, and then did the same problems again in R, and then again in C++.

For my first attempt, I decided to use VBA because I already know most of the basic syntax. Before starting, I decided that I would not seek any outside help or look up any solutions when trying to solve these problems, because I want to work them out on my own. So please don't give me any tips! (although I do appreciate any advice for any of my other projects).

The first problem is stated as follows:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Solution:

[code language="vb"]
Sub euler1()

    Dim x As Long 'natural number to be tested
    Dim y As Long 'current sum of the multiples of 3 or 5

    y = 0
    For x = 1 To 999
        If x Mod 3 = 0 Or x Mod 5 = 0 Then
            y = y + x
        End If
    Next x

    Debug.Print y

End Sub
[/code]

The final output is 233168. As you can see, the problem is quite easy to understand (and this one was very easy to solve as well) for anyone who has passed grade school math. You could probably solve it by sitting at your desk and adding every multiple of 3 or 5 below 1000 by hand, but that would be extremely boring and you probably have better things to do. That's why you write a program to solve the problem. The solution above is a simple brute-force method (I tested every natural number below 1000 to see if it was evenly divisible by 3 or 5) that took less than a second for my computer to run. The key to understanding the solution is the Mod operator, which returns the remainder of a division operation (for example, 5 Mod 3 equals 2). The statement "If x Mod 3 = 0 Or x Mod 5 = 0" tells the computer to execute the subsequent line, which adds x to the cumulative sum, whenever the remainder after dividing x by 3 or by 5 is 0.
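As an aside, the same brute-force check is a one-liner in R (my own translation, not part of the VBA solution above):

[code language="r"]
x <- 1:999                         # natural numbers below 1000
sum(x[x %% 3 == 0 | x %% 5 == 0])  # returns 233168
[/code]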

So far I’ve solved 17 of the problems, mostly through brute force although I was able to solve one of them by hand. However, as the problems get progressively harder and more computationally intensive, brute force methods become less feasible and more creative thought is required. For example, problem 12 took my computer at least an hour to solve (I’m not sure how long it actually took because I left it on after I went to sleep). This means my solution wasn’t efficient, and that I ought to give it another shot (reduce the amount of iterative code, perhaps).
Posted in: Logs, Mathematics / Tagged: project euler, project euler #1, project euler vba, VBA

No. 76: Exploring Financial Data with Quantmod

6 October, 2012 1:36 AM / 1 Comment / Gene Dan

Hey everyone,

Things have been extremely busy lately – I feel like the quality of my posts will suffer, but I'll try to put anything I find interesting here. I'm currently studying for my fifth actuarial exam – 3F/MFE – which covers models for financial economics. The exam covers the latter half of McDonald's Derivatives Markets, a graduate textbook that covers the basics of financial derivatives, such as forwards and options, and introduces the reader to the elementary stochastic processes, like random walks and Brownian motion, used to price these derivatives.

Anyway, some snooping around on the internet for open-source financial software led me to some quantitative finance packages in R – namely Quantmod, which I'll be discussing here. Unfortunately, it seems that (correct me if I'm wrong) the Quantmod project has been abandoned for some time, but its ease of use is why I chose to showcase it here. Other packages, such as QuantLib, appear to have more active involvement, but I currently lack the technical skills and financial knowledge to work with them – I may revisit them later.

The example above may be a familiar image to some of you – it is a chart depicting the CBOE VIX (volatility index), which measures the implied volatility of S&P 500 index options – in other words, an estimate of the short-term volatility of the stock market. Notice that 2008 saw high volatility – this makes sense because the stock market plunged that year. This actually isn't the highest value seen in the index – to my knowledge, that occurred on Black Monday in 1987, when the DJIA lost a quarter of its value in less than a day. The above chart was generated using Quantmod with the following code:

[sourcecode language="r"]
library("quantmod")
getSymbols("^VIX",src="yahoo")
barChart(VIX)
[/sourcecode]

Well, that was simple. Quantmod has a function called getSymbols() that extracts the desired data. The two arguments above (there are more arguments than that; I chose to use the defaults for the rest) specify the ticker symbol to extract (^VIX) and the source of the financial data (yahoo.com). The next function, barChart(), plots the data.
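As a sketch of two of those other arguments: from and to bound the download window when the source is Yahoo (the dates below are my own illustration):

[sourcecode language="r"]
library("quantmod")
# download only 2010 onward instead of the full history
getSymbols("^VIX", src = "yahoo", from = "2010-01-01", to = "2012-10-05")
barChart(VIX)
[/sourcecode]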

> str(VIX)
An ‘xts’ object from 2007-01-03 to 2012-10-05 containing:
  Data: num [1:1453, 1:6] 12.2 12.4 11.8 12.5 11.9 ...
- attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:6] "VIX.Open" "VIX.High" "VIX.Low" "VIX.Close" ...
  Indexed by objects of class: [Date] TZ:
  xts Attributes:  
List of 4
$ tclass : chr [1:2] "POSIXct" "POSIXt"
$ tzone  : chr ""
$ src    : chr "yahoo"
$ updated: POSIXct[1:1], format: "2012-10-05 20:02:43"

The output above tells us that the variable VIX is a time-series object (xts) running from January 3rd, 2007 to today, October 5th, 2012. The following is the output of the first 10 rows of daily data:

> VIX[1:10]
           VIX.Open VIX.High VIX.Low VIX.Close VIX.Volume VIX.Adjusted
2007-01-03    12.16    12.75   11.53     12.04          0        12.04
2007-01-04    12.40    12.42   11.28     11.51          0        11.51
2007-01-05    11.84    12.25   11.68     12.14          0        12.14
2007-01-08    12.48    12.83   11.78     12.00          0        12.00
2007-01-09    11.86    12.47   11.69     11.91          0        11.91
2007-01-10    12.34    12.50   11.43     11.47          0        11.47
2007-01-11    11.42    11.48   10.50     10.87          0        10.87
2007-01-12    10.93    10.93   10.14     10.15          0        10.15
2007-01-16    10.64    10.89   10.40     10.74          0        10.74
2007-01-17    10.90    10.90   10.35     10.59          0        10.59
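Because VIX is an xts object, it can also be subset with ISO-8601 date strings (a feature of xts itself rather than Quantmod; the ranges below are illustrative):

[sourcecode language="r"]
VIX["2008-09"]            # all trading days in September 2008
VIX["2008-09/2008-12"]    # September through December 2008
[/sourcecode]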

Here’s an example of a candle chart covering the last 90 days of the index:

A candle chart allows you to visualize the daily spread in addition to the trend over a longer period of time. In the above chart, the colored rectangles are bounded by the open/close values of the day, and the endpoints of the vertical line segments represent the high/low values of the day.

The function used to generate this chart is:

[sourcecode language="r"]
candleChart(VIX,multi.col=TRUE,theme="white",subset="last 90 days")
[/sourcecode]

The argument "subset" specifies the slice of time to be plotted in the chart. It's nice because it accepts natural language: all I had to do was type in "last 90 days" to specify the time interval.
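To my knowledge, the same parser also accepts phrases like these (untested sketches):

[sourcecode language="r"]
candleChart(VIX,multi.col=TRUE,theme="white",subset="last 3 months")
candleChart(VIX,multi.col=TRUE,theme="white",subset="2012")
[/sourcecode]

Anyway, that's it for now. In the meantime I'll be spending up to 4-5 hours a day preparing for my test at the beginning of next month.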

Posted in: Logs, Mathematics

No. 74: Population Decline in Japan – Visualizing Data by Prefecture Using CRAN

3 September, 2012 11:37 PM / 1 Comment / Gene Dan

Oh geez…I hope I got the labels right

Hey everyone,

Last week, I wrote about creating map graphics with R, using Chinese GDP per capita as an example. This week I'm writing about the same thing, except this time the data cover the population of Japan. I made a few minor changes to the way I created the animated .gif files. Instead of using the first websites I could find, I installed GIMP, a free graphics editor similar to Photoshop. It gave me much more control over the size of the images, along with the frame rate. So, if you click on any of the images in this post, you'll notice that the image quality is a little higher.

Population Decline in Japan

The aging of Japan has been an important issue for several decades now, with the population set to decrease at an alarming rate over the next few decades. Census data show that the nationwide population peaked sometime within the last 5 years and has already begun to decline; it is projected to fall by up to 50% by the end of the century, should current trends continue. Now, why shouldn't that be a good thing? Isn't Japan (along with the rest of the world) overpopulated as it is? I think over the long term, should the population stabilize at a lower level, we would see more sustainability and less depletion of natural resources. However, the transition to such a state isn't pretty. If you look more closely at the census data, you'll see that a dramatic shift in the age distribution has occurred over the last few decades.

In the image below, you’ll see some Excel files that I extracted from the Japanese Statistics Bureau depicting the median age by prefecture:

As you can see, most of the spreadsheet is in Japanese, but the prefectures are ordered by ISO 3166-2 territory code, which enabled me to figure out which prefecture was which. You can view the complete dataset here if you have Google Docs. From this table you'll find that the median age has more than doubled in several prefectures. I extracted the figures into a .csv file and imported it into RStudio to generate the images used to create the .gif in the next image. The image below is an animated figure depicting the shift in median age since 1920 (click for full resolution):

Median Age by Prefecture, 1920 – 2005
(Click on Image for Full Resolution)

The above image was generated using the code below:

[sourcecode language="r" toolbar="TRUE" wraplines="FALSE" collapse="TRUE"]
setwd("./Japan/")
library(maptools)
library(RColorBrewer)
##substitute your shapefiles here
state.map <- readShapeSpatial("JPN_adm0.shp")
counties.map <- readShapeSpatial("JPN_adm1.shp")
## these are the variables we will be plotting
jpnmedage <- read.csv("JpnMedAge1920-2005.csv",header=TRUE)
str(jpnmedage)
counties.map@data$x1920 <- jpnmedage$X1920
counties.map@data$x1930 <- jpnmedage$X1930
counties.map@data$x1940 <- jpnmedage$X1940
counties.map@data$x1950 <- jpnmedage$X1950
counties.map@data$x1960 <- jpnmedage$X1960
counties.map@data$x1970 <- jpnmedage$X1970
counties.map@data$x1980 <- jpnmedage$X1980
counties.map@data$x1990 <- jpnmedage$X1990
counties.map@data$x2000 <- jpnmedage$X2000
counties.map@data$x2005 <-jpnmedage$X2005
## put the label point x y locations of the prefectures in the data frame for easy retrieval
labelpos <- data.frame(do.call(rbind, lapply(counties.map@polygons, function(x) x@labpt)))
names(labelpos) <- c("x","y")
counties.map@data <- data.frame(counties.map@data, labelpos)

plot.heat <- function(counties.map,state.map,z,title=NULL,breaks=NULL,reverse=FALSE,cex.legend=1,bw=.2,col.vec=NULL,plot.legend=TRUE) {
  ##Break down the value variable
  if (is.null(breaks)) {
    breaks=
      seq(
        floor(min(counties.map@data[,z],na.rm=TRUE)*10)/10
        ,
        ceiling(max(counties.map@data[,z],na.rm=TRUE)*10)/10
        ,.1)
  }
  counties.map@data$zCat <- cut(counties.map@data[,z],breaks,include.lowest=TRUE)
  cutpoints <- levels(counties.map@data$zCat)
  if (is.null(col.vec)) col.vec <- heat.colors(length(levels(counties.map@data$zCat)))
  if (reverse) {
    cutpointsColors <- rev(col.vec)
  } else {
    cutpointsColors <- col.vec
  }
  levels(counties.map@data$zCat) <- cutpointsColors
  plot(counties.map,border=gray(.8), lwd=bw,axes = FALSE, las = 1,col=as.character(counties.map@data$zCat))
  if (!is.null(state.map)) {
    plot(state.map,add=TRUE,lwd=1)
  }
  ##Edit the legend information here
  if (plot.legend) legend("bottomleft",inset=c(0.03,0),c("20-25","25-30","30-35","35-40","40-45","45+"), fill = cutpointsColors,bty="n",title=title,cex=cex.legend)
  ##title("Cartogram")
}

plot.heat(counties.map,state.map,z="x1920",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 1920")

plot.heat(counties.map,state.map,z="x1930",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 1930")

plot.heat(counties.map,state.map,z="x1940",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 1940")

plot.heat(counties.map,state.map,z="x1950",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 1950")

plot.heat(counties.map,state.map,z="x1960",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 1960")

plot.heat(counties.map,state.map,z="x1970",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 1970")

plot.heat(counties.map,state.map,z="x1980",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 1980")

plot.heat(counties.map,state.map,z="x1990",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 1990")

plot.heat(counties.map,state.map,z="x2000",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 2000")

plot.heat(counties.map,state.map,z="x2005",breaks=c(20,25,30,35,40,45,Inf),col.vec=brewer.pal(6,"Purples"),plot.legend=TRUE,reverse=FALSE,title="Median Age")
title(main="Japan: Median Age by Prefecture, 2005")

## plot text
# with(counties.map@data[c(-2,-10,-25,-28,-32),], text(x,y,NAME_1,cex=.7,font=2))
# with(counties.map@data[2,],text(x,y+1.1,NAME_1,cex=.7,font=2))
# with(counties.map@data[28,],text(x+1.7,y+.7,NAME_1,cex=.7,font=2))
# with(counties.map@data[10,],text(x-.5,y-1,NAME_1,cex=.7,font=2))
# with(counties.map@data[25,],text(x+3.3,y,NAME_1,cex=.7,font=2))
# with(counties.map@data[32,],text(x-.7,y,NAME_1,cex=.7,font=2))
[/sourcecode]

To get a better idea of what the data mean, I should have included a graphic of the ratio of working to retired population (and maybe birth rates as well), but I didn't have the time. Anyway, when the population of Japan declines, it won't decline uniformly across the age groups. As the population falls, the ratio of old to young people will increase dramatically, putting a strain on younger workers to care for the old. This will result in a labor shortage, and possibly a pensions crisis, as the income of the younger workers won't be enough to fund the pensions of the retired population. Japan also has a very high ratio of debt to GDP – about 200%, twice that of the United States – and I suspect the dual strains of having to care for the older generation and make interest payments on the debt will leave the next generation of Japanese without enough funds to provide for their own children, leading to a further decline in the birthrate, and even fewer people to care for this generation when it retires – it's a vicious cycle.

For my next example, I calculated 5-year population growth rates by prefecture using the census data below, also obtained from the Japanese Statistics Bureau:

You can view the full dataset here with Google Docs. I used R code similar to the code shown previously to generate the animated image below (click on the picture for full resolution), which shows the change in growth rate by prefecture over time:

Projected Population Change, 1925-2025
(Click on Image for Full Resolution)

The graph below shows the national annual growth rate from 1873 to 2009:

In the graph and the animated picture above, you'll see the dramatic shift in growth rate during the WWII years.

Final Notes

It’s clear to me that Japan will have to increase immigration or implement policies to stabilize the population. The nation will also have to implement policy reform to get its debt under control so as to not cripple the economy in the next few decades.

In this post I've also shown some of the other capabilities of the RColorBrewer package – you can see in the very top picture, along with the last map, that I used divergent color palettes instead of sequential ones. That was especially helpful in depicting negative growth rates.
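For reference, switching from a sequential to a divergent palette only requires a different palette name in the same brewer.pal() call ("RdBu" is one of RColorBrewer's divergent palettes):

[sourcecode language="r"]
library(RColorBrewer)
brewer.pal(6, "Purples")  # sequential: light to dark in one direction
brewer.pal(6, "RdBu")     # divergent: two hues around a neutral midpoint,
                          # suited to values that straddle zero, like growth rates
[/sourcecode]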

Posted in: Mathematics / Tagged: japan median age by prefecture, map graphics with R, population of japan by prefecture, population decline in Japan, population of japan

No. 73: Geographical Mapping Graphics using CRAN – China GDP per Capita by Province

30 August, 2012 1:24 AM / 6 Comments / Gene Dan

Hey everyone,

A couple of days back I learned a new technique in R for visualizing geographic data via mapping. This is something I've always wanted to do, and I'm glad I've received a project where this skill would be of some use (what you see here is not the actual project, fyi). Go ahead and click on the picture above. This is something I created using a combination of Excel, R, data from China NBS, and shapefiles from GADM and the U.S. Census Bureau. Basically, I gathered statistical data from China's National Bureau of Statistics, manipulated it in Excel, linked it to information contained in the cartographic boundary files from GADM, and then composed the images you see here in R. What you see above is a heatmap of 2011 GDP per capita by province, with the shades of color linked to GDP per capita. The darker shades of blue represent regions with higher nominal average wealth, and the lighter shades represent regions with lower nominal average wealth. You can see that in this country, wealth is concentrated on the east coast in large metropolitan areas (like Beijing, Tianjin, Shanghai, Hong Kong, and Ordos City), as well as in Inner Mongolia (represented as Nei Mongol on the map).

Data Manipulation in Excel

For this project, I extracted historical GDP per capita from China NBS using Excel by directly linking the cells to the website. My main task in this step was to reconcile the listing of the administrative regions from the cartographic boundary files with those listed by China NBS. I found five differences in nomenclature: Inner Mongolia, Ningxia, Xinjiang, and Tibet from China NBS are called Nei Mongol, Ningxia Hui, Xinjiang Uygur, and Xizang in the shapefiles, respectively, and I couldn't find the region listed as "Paracel Islands" in the China NBS data at all. In this case there was a small enough number of mismatches for me to correct them all manually.
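Had I done the reconciliation in R rather than Excel, a named lookup vector would handle the four renames (a sketch; nbs is a hypothetical data frame holding the China NBS figures):

[code language="r"]
# map the China NBS names onto the GADM shapefile names
rename <- c("Inner Mongolia" = "Nei Mongol",
            "Ningxia"        = "Ningxia Hui",
            "Xinjiang"       = "Xinjiang Uygur",
            "Tibet"          = "Xizang")
matched <- nbs$province %in% names(rename)             # nbs is hypothetical
nbs$province[matched] <- rename[nbs$province[matched]]
[/code]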

I ran into another problem when I learned that Hainan used to be a part of Guangdong, and that Chongqing used to be part of Sichuan. This gave me headaches when I was trying to figure out how to divide the provinces into octiles for the graphics. I used the formula you see in the above picture, which, for the most part, kept the separated provinces together (and in the same quantile!) for older years. I did, however, have to adjust one of the data points manually. I then created modified vectors to export into RStudio via csv; in this case each vector had a nice, convenient set of 32 points.
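The equivalent octile assignment in R would look something like this (a sketch, not the Excel formula from the picture; gdp_per_capita stands in for one year's vector of 32 data points):

[code language="r"]
# rank each province into an octile (1 = lowest, 8 = highest)
octile <- cut(gdp_per_capita,   # hypothetical numeric vector
              breaks = quantile(gdp_per_capita, probs = seq(0, 1, 1/8)),
              labels = 1:8, include.lowest = TRUE)
[/code]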

Graphic Composition

My next task involved importing the data into RStudio and using R code to generate the graphics. I looked around on Stack Overflow and found threads here and here proposing ideas on how to link a numeric data vector (in my case, GDP per capita for each province) to regions within an image. Neither of these proposals worked that well for me, so I had to combine the two and then make some modifications to the code for my purposes. Here's the end result:

[code language="r" toolbar="TRUE" wraplines="FALSE" collapse="FALSE"]
setwd("./China/")
library(maptools)
library(RColorBrewer)
##substitute your shapefiles here
state.map <- readShapeSpatial("CHN_adm0.shp")
counties.map <- readShapeSpatial("CHN_adm1.shp")
## these are the variables we will be plotting
chinagdp <- read.csv("chinagdp.csv",header=TRUE)
counties.map@data$x1978 <- chinagdp$X1978M
counties.map@data$x1980 <- chinagdp$X1980M
counties.map@data$x1985 <- chinagdp$X1985M
counties.map@data$x1990 <- chinagdp$X1990M
counties.map@data$x1995 <- chinagdp$X1995M
counties.map@data$x2000 <- chinagdp$X2000
counties.map@data$x2005 <- chinagdp$X2005
counties.map@data$x2010 <- chinagdp$X2010

## put the label point x y locations of the provinces in the data frame for easy retrieval
labelpos <- data.frame(do.call(rbind, lapply(counties.map@polygons, function(x) x@labpt)))
names(labelpos) <- c("x","y")
counties.map@data <- data.frame(counties.map@data, labelpos)

plot.heat <- function(counties.map,state.map,z,title=NULL,breaks=NULL,reverse=FALSE,cex.legend=1,bw=.2,col.vec=NULL,plot.legend=TRUE) {
  ##Break down the value variable
  if (is.null(breaks)) {
    breaks=
      seq(
        floor(min(counties.map@data[,z],na.rm=TRUE)*10)/10
        ,
        ceiling(max(counties.map@data[,z],na.rm=TRUE)*10)/10
        ,.1)
  }
  counties.map@data$zCat <- cut(counties.map@data[,z],breaks,include.lowest=TRUE)
  cutpoints <- levels(counties.map@data$zCat)
  if (is.null(col.vec)) col.vec <- heat.colors(length(levels(counties.map@data$zCat)))
  if (reverse) {
    cutpointsColors <- rev(col.vec)
  } else {
    cutpointsColors <- col.vec
  }
  levels(counties.map@data$zCat) <- cutpointsColors
  plot(counties.map,border=gray(.8), lwd=bw,axes = FALSE, las = 1,col=as.character(counties.map@data$zCat))
  if (!is.null(state.map)) {
    plot(state.map,add=TRUE,lwd=1)
  }
  ##Edit the legend information here
  if (plot.legend) legend("bottomleft",c("1st","2nd","3rd","4th","5th","6th","7th","8th"), fill = rev(cutpointsColors),bty="n",title=title,cex=cex.legend)
  ##title("Cartogram")
}

# plot.heat(counties.map,state.map,z="x1978",breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,"Blues"),plot.legend=TRUE,reverse=TRUE,title="Octile \n (8th is the highest)")
# title(main="China: GDP per Capita by Administrative Region, 1978")

# plot.heat(counties.map,state.map,z="x1980",breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,"Blues"),plot.legend=TRUE,reverse=TRUE,title="Octile \n (8th is the highest)")
# title(main="China: GDP per Capita by Administrative Region, 1980")

# plot.heat(counties.map,state.map,z="x1985",breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,"Blues"),plot.legend=TRUE,reverse=TRUE,title="Octile \n (8th is the highest)")
# title(main="China: GDP per Capita by Administrative Region, 1985")

# plot.heat(counties.map,state.map,z="x1990",breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,"Blues"),plot.legend=TRUE,reverse=TRUE,title="Octile \n (8th is the highest)")
# title(main="China: GDP per Capita by Administrative Region, 1990")

# plot.heat(counties.map,state.map,z="x1995",breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,"Blues"),plot.legend=TRUE,reverse=TRUE,title="Octile \n (8th is the highest)")
# title(main="China: GDP per Capita by Administrative Region, 1995")

# plot.heat(counties.map,state.map,z="x2000",breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,"Blues"),plot.legend=TRUE,reverse=TRUE,title="Octile \n (8th is the highest)")
# title(main="China: GDP per Capita by Administrative Region, 2000")

plot.heat(counties.map,state.map,z="x2005",breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,"Blues"),plot.legend=TRUE,reverse=TRUE,title="Octile \n (8th is the highest)")
title(main="China: GDP per Capita by Administrative Region, 2005")

# plot.heat(counties.map,state.map,z="x2010",breaks=c(0,4,8,12,16,20,24,28,32),col.vec=brewer.pal(8,"Blues"),plot.legend=TRUE,reverse=TRUE,title="Octile \n (8th is the highest)")
# title(main="China: GDP per Capita by Administrative Region, 2010")

## plot text
with(counties.map@data[c(-2,-10,-25,-28,-32),], text(x,y,NAME_1,cex=.7,font=2))
with(counties.map@data[2,],text(x,y+1.1,NAME_1,cex=.7,font=2))
with(counties.map@data[28,],text(x+1.7,y+.7,NAME_1,cex=.7,font=2))
with(counties.map@data[10,],text(x-.5,y-1,NAME_1,cex=.7,font=2))
with(counties.map@data[25,],text(x+3.3,y,NAME_1,cex=.7,font=2))
with(counties.map@data[32,],text(x-.7,y,NAME_1,cex=.7,font=2))
[/code]

When I first created the graphics, the only color option I had was red, because the function heat.colors() only returns red-based colors. To fix that, I loaded the package RColorBrewer, which allowed me to choose sequential palettes of any "typical" hue, like red, blue, or green. When I tried to run the code in the first Stack Overflow link, the script halted at the function plot.heat(), and it took me a while to figure out that it wouldn't run because I didn't have the function defined. Luckily, the second Stack Overflow link had the function defined (but at the same time, it didn't have any code for labels). I thus combined the two to get the features I wanted.

I then changed the source files (the shapefiles along with the manipulated data from Excel), and assigned the vector for each year of data to a separate variable. Some of the other changes I made were within the plot.heat() function itself (lines 23-49), mostly dealing with aesthetic features like the legend format. After doing this, I ran the code a couple more times and noticed that the labels were all bunched up together, so I manually changed the label coordinates (lines 78-83) for regions located close together.

You’ll notice that I commented out a lot of plots (lines 52-74), which keeps the code from executing – in this case I created the plots one at a time by commenting/uncommenting these blocks of code. I should make some further improvements later to get all the plots created as a batch.
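That batch improvement would presumably be a loop over the years, something like this (an untested sketch reusing the plot.heat() call from above):

[code language="r"]
years <- c(1978, 1980, 1985, 1990, 1995, 2000, 2005, 2010)
for (yr in years) {
  plot.heat(counties.map, state.map, z = paste("x", yr, sep = ""),
            breaks = c(0,4,8,12,16,20,24,28,32),
            col.vec = brewer.pal(8, "Blues"), plot.legend = TRUE,
            reverse = TRUE, title = "Octile \n (8th is the highest)")
  title(main = paste("China: GDP per Capita by Administrative Region,", yr))
}
[/code]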

Result

You can see from the above that in recent decades, much of the wealth has shifted eastward. Most strikingly, Tibet and Qinghai were once among the richest regions of China and have since become some of the poorest (see the earlier image from Excel for the actual rankings). Since GDP per capita is just a single variable, I don't want to make any hasty conclusions about income inequality amongst the provinces yet, as I'll have to take a deeper look into the data for further analysis. What's cool about this procedure is that it can be used with several types of data by just changing the vectors in the source code. I combined all of the images into an animated .gif, which allows you to see how the distribution of wealth has changed over time:

That’s it for today. I hope you enjoyed it!

Posted in: Mathematics / Tagged: china gdp per capita by province, geographic mapping graphics, mapping graphics, mapping graphics from CRAN, mapping graphics with r, plot.heat, plot.heat r, RColorBrewer, stack overflow
