Monthly Archives: March 2013

You are browsing the site archives by month.

No. 83: Basic Simulation Using R

13 March, 2013 2:00 AM / Leave a Comment / Gene Dan

Today I’d like to demonstrate a few examples of simulation by using R’s built-in pseudorandom number generator. We’ll start by calling the function runif(n), which returns a vector of n draws from the uniform distribution on the interval [0,1]. To see what I mean, runif(50) will return 50 random numbers between 0 and 1 (inclusive):

[code language=”r” wraplines=”FALSE”]> runif(50)
[1] 0.79380213 0.02640186 0.48848994 0.50689348 0.27242565 0.37866590 0.50134423 0.04855088 0.35709235 0.06587394 0.04107046 0.52542577 0.31302174
[14] 0.65262709 0.60967237 0.45131387 0.55305078 0.83903314 0.72698109 0.06292518 0.47579002 0.15186000 0.71345801 0.71252703 0.22304757 0.20179550
[27] 0.57375115 0.06144426 0.87460214 0.87085905 0.52197596 0.79827053 0.35533929 0.23212775 0.30441290 0.29824819 0.59430450 0.92366848 0.63523013
[40] 0.59757710 0.67266388 0.06165364 0.12924342 0.10372910 0.49521401 0.31687057 0.08331765 0.51155404 0.35502189 0.65212223
[/code]

Interestingly, the numbers generated above aren’t actually random. R uses a process called pseudorandom number generation, which uses an algorithm to generate a long string of deterministic digits that appear to be random to most people, unless they have godlike powers of pattern recognition. The algorithm acts upon an initial value, called a seed, and for each seed the algorithm will return the same sequence of numbers. The term period refers to how long the sequence can go before it repeats itself. For example, Microsoft Excel’s PRNG (pseudorandom number generator) has a relatively short period, as (depending on the application) the sequence of numbers will repeat itself unless you frequently re-seed the algorithm. That is, if you generate a sequence 938745…, you’ll see 938745… again without too many draws.

The default PRNG used by R is called the Mersenne Twister, an algorithm developed in 1998 by Matsumoto and Nishimura. Other choices are available, such as Wichman-Hill, Marsaglia-Multicarry, Super-Duper, Knuth-TAOCP, and L’Ecuyer-CMRG. You can even supply your own PRNG, if you wish.

We can plot a histogram of a vector of generated numbers in order to observe the distribution of our sample. Below, you’ll see a 4-plot panel depicting samples from a uniform distribution on [0,1], with different draws per sample:

[code language=”r”]
#Uniform Sampling
par(mfrow=c(2,2))
for(i in 1:4){
x <- runif(10**i)
hist(x,prob=TRUE, col=”grey”,ylim=c(0,2),main = paste(10**i,” Draws”))
curve(dunif(x),add=TRUE,col=”red”,lwd=2)}
[/code]

As you can see, the sample approaches the uniform distribution as the number of draws becomes larger.

Similarly, we can simulate observations from the normal distribution by calling the function rnorm(n,mean,sd), which returns a vector of n draws from the normal distribution with mean=mean and standard deviation = sd:

[code language=”r”]
#Normal Sampling
par(mfrow=c(2,2))
for(i in 1:4){
x <- rnorm(10**i)
hist(x,prob=TRUE,col=”grey”,ylim=c(0,.6),xlim=c(-4,4),main=paste(10**i,” Draws”))
curve(dnorm(x),add=TRUE,col=”red”,lwd=2)}
[/code]

Likewise, as the number of draws gets bigger, the sample approaches the normal distribution.

We can use R to demonstrate the binomial approximation to the normal distribution. The binomial distribution with parameters n and p is approximately normal with mean np and variance np(1-p), with n large and p not too small. We’ll draw from the binomial distribution with n = 50 and p = .5, and then plot a normal curve with mean = 25 and variance = 12.5. Notice that as we increase the number of draws, the histrogram looks more and more like the normal distribution:

[code language=”r”]
#Binomial approximation to Normal
par(mfrow=c(2,2))
n <- 50
p <- .5
for(i in 1:4){
x <- rbinom(10**i,n,p)
hist(x,prob=TRUE,col=”grey”,ylim=c(0,.2),xlim=c(10,40),main=paste(10**i,” Draws”))
curve(dnorm(x,n*p,sqrt(n*p*(1-p))),add=TRUE,col=”red”,lwd=2)}
[/code]

For fun, I decided to see how many simulated values my computer could handle. I created a vector of 1 billion draws from the standard normal distribution:

[code language=”r”]
x<-rnorm(1000000000)
hist(x,prob=TRUE,col=”grey”,main=”1000000000 Draws”)
curve(dnorm(x),add=TRUE,col=”red”,lwd=2)
[/code]

Which took about 20 minutes to execute, using almost all of my computer’s memory (16 GB). This was unnecessary, as I could have reproduced the image above without as many draws. Nevertheless, I’m very impressed with R’s capabilities, as a similar script would have been impossible in Excel if I wanted to store the numbers in memory, or it would have taken much longer if I had even decided to clear the memory throughout the routine.

Posted in: Logs, Mathematics / Tagged: binomial approximation to normal, prng, pseudo random number generator, pseudorandom number generation, R, random number generator, simulation

No. 82: Plotting Normal Distributions with R

12 March, 2013 3:03 AM / 1 Comment / Gene Dan

Hey everyone,

I’ve got some good news – I passed CA2 a few weeks ago and I’m glad I was able to knock out that requirement shortly after I passed CA1. The bad news is that I’ve only written three posts this year when I should have had ten, so I’ve got some catching up to do. Over the past couple of months, I’ve mostly been studying material related to the insurance industry, but I try to squeeze in some math or programming whenever I have time. Lately, I’ve been learning how to work with the SQL Server Management Studio interface to aggregate large datasets at work. For statistics, I’ve continued my studies with Verzani’s Using R for Introductory Statistics, which I started reading last year, but put off until this year due to exams. Today, I’d like to show you some of R’s plotting capabilities – we’ll start off with a plot of the standard normal distribution, and I’ll demonstrate how you can change the shape of the plotted distribution by adjusting its parameters.

If you’ve taken statistics, you’re most likely familiar with the normal distribution:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

One of the nice things about this distribution is that its two parameters are the mean and variance, which are common statistics used in everyday language. The mean is the average, a measure of central tendency that describes the center of the distribution, and the variance is a statistic that describes the spread of the distribution – how widely the data points deviate from the mean. The following code generates a plot of the density function of a standard normal random variable, and then adds two curves that depict the same distribution shifted to the left:

[code language=”r”]
#Standard normal, then shifted to the left
x <- seq(-6,6,length=500)
plot(x,dnorm(x,mean=0,sd=1),type = “l”,lty=1,lwd=3,col=”blue”,main=”Normal Distribution”,ylim=c(0,0.5),xlim=c(-6,6),ylab=”Density”)
curve(dnorm(x,-1,1),add=TRUE,lty=2,col=”blue”)
curve(dnorm(x,-2,1),add=TRUE,lty=3,col=”blue”)
legend(2,.5,legend=c(“N ~ (0, 1)”,”N ~ (-1, 1)”,”N ~ (-2, 1)”),lty=1:3,col=”blue”)
[/code]

The code first generates a vector of length 500. This vector is then used as an argument to the dnorm() function, which returns the normal density of each element of the input vector. Notice that in line 2, dnorm(x,mean=0,sd=1) is a function with 3 arguments – the first specifies the input vector, the second specifies that the mean of the distribution equals 0, and the third argument specifies that the standard deviation of the distribution equals 1. The function returns a vector of densities which are in turn used as an input to the plot() function, which generates the solid blue line in the above figure. The next two lines of the script add the same distribution shifted 1 and 2 units to the left. You can see that in these two lines, the 2nd argument of the dnorm() function is -1 and -2, respectively – this means that I changed the mean of the distribution to -1 and -2, from 0, causing the leftward shift that you see above.

Similarly, I can shift the distribution to the right by increasing the mean:

[code language=”r”]
#Standard normal, then shifted to the right
x <- seq(-6,6,length=500)
plot(x,dnorm(x,mean=0,sd=1),type = “l”,lty=1,lwd=3,col=”purple”,main=”Normal Distribution”,ylim=c(0,0.5),xlim=c(-6,6),ylab=”Density”)
curve(dnorm(x,1,1),add=TRUE,lty=2,col=”purple”)
curve(dnorm(x,2,1),add=TRUE,lty=3,col=”purple”)
legend(-5.5,.5,legend=c(“N ~ (0, 1)”,”N ~ (1, 1)”,”N ~ (2, 1)”),lty=1:3,col=”purple”)
[/code]

Notice that I can change the position of the legend by specifying the x and y coordinates in the first two arguments of the legend() function.

The next script keeps the mean at 0, but adds two curves with the standard deviation increased to 1 and 2:

[code language=”r”]
#Standard normal, then increased variance
x <- seq(-6,6,length=500)
plot(x,dnorm(x,mean=0,sd=1),type = “l”,lty=1,lwd=3,col=”black”,main=”Normal Distribution”,ylim=c(0,0.5),xlim=c(-6,6),ylab=”Density”)
curve(dnorm(x,0,1.5),add=TRUE,lty=2,col=”red”)
curve(dnorm(x,0,2),add=TRUE,lty=3,col=”black”)
legend(-5.5,.5,legend=c(“N ~ (0, 1)”,”N ~ (0, 2.25)”,”N ~ (0, 4)”),lty=1:3,col=c(“black”,”red”,”black”))
[/code]

Here, I made the middle curve red by using the “col” argument in the plot() function. Personally, plotting is one of my favorite things to do with R. I feel that visualizing data helps you gain an intuitive grasp on the subject, and reveals patterns that you might not otherwise see with aggregated tables or simple summary statistics. Later on this week (hopefully tomorrow), I’ll demonstrate some simple simulations with the normal distribution.

Posted in: Logs, Mathematics / Tagged: cran, normal plot r, R

Monthly Archives: March 2013

No. 83: Basic Simulation Using R

No. 82: Plotting Normal Distributions with R

Archives

Categories

Links

Texas Cycling