Gene Dan's Blog


No. 86: Mobile App!

11 June, 2013 8:40 PM / Leave a Comment / Gene Dan

I noticed that a lot of inbound traffic came from mobile devices, so I decided to take a look at the front page using my phone. Unfortunately, I couldn’t read a single thing, since the site wasn’t mobile-ready. Therefore, I installed a mobile plugin that lets people view the blog on their portable devices.

I’ve only been using a smartphone for a few months myself, so I am not all that familiar with the technology. Maybe I’ll make some adjustments as I get used to it, but for now this will do. One of the cool things about it is that it displays LaTeX quite well. For example, here’s an unimportant, meaningless matrix:

\[A_{m,n} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}\]

Posted in: Logs

No. 85: Things I’ve Been Doing Lately

10 June, 2013 9:40 PM / Leave a Comment / Gene Dan

I haven’t written in here for a long time, so I’d like to list what I’ve been doing these past few months to get the ball rolling:

Databases:

I took an interest in the study of databases late last year, and I’ve since made some progress by studying the underlying concepts of database design and data modeling. I took the initiative to learn the subject after struggling with the technical aspects of my job. Last year, when I was first working with generalized linear models, I quickly found out that I couldn’t design the queries necessary to get the data I needed, and that I couldn’t communicate effectively with the IT personnel responsible for implementing my pricing models (here, implementation means programming the database systems that the business uses to capture and store information). Whenever I asked them something, they would respond in their language of choice, T-SQL, which I had glimpsed during my internship days but couldn’t understand. At the time, I had only a vague notion of what a relational database was, and that it was somehow fundamentally different from an Excel spreadsheet, although I didn’t know why. If you asked me today, I could tell you some of the fundamental differences – for example, a database has strictly defined fields and entity relationships, whereas Excel does not (you could design an Excel spreadsheet to have those features, but it would be impractical).

This was perhaps the first time I thought to myself that learning about databases might be important. I looked at some books on SQL, along with another one that a co-worker checked out of the library, and determined that it would take many months, if not a few years, of investment to acquire proficiency in the subject. I encounter unfamiliar software and bits of code on a daily basis, so I often have to decide whether to spend a significant amount of time studying a software suite or language, or to learn just enough of it to complete my next assignment. For example, sometimes (about once every six months or so) I have to run obsolete legacy software that I know will be replaced in the near future, so I’ll learn just enough to get what I need. On the other hand, I see articles pertaining to SQL or relational databases on Hacker News almost every single week, and several of my friends at coder meetup groups (which I’ll explain in the next section) use databases or write code pertaining to them, so I took this as another sign that studying the subject would be worthwhile.

If you are not familiar at all with what I’m talking about, you may have seen stories in the news about cybercriminals (ranging from teenage misfits to professional cybersoldiers) infiltrating corporate and government databases, and you may have seen the phrase “SQL injection” thrown around in the technology section of newspapers. These stories pertain to databases, which serve as centralized repositories for storing important information for businesses and governments. Prior to working in the corporate sector, news stories like these were the only exposure I had to databases, and perhaps Americans working in non-technical jobs have a similar level of exposure, if any at all. I think the media sensationalizes cybercrime because it deals with technology and concepts that the average citizen finds esoteric. Stories like these, along with the revelations of the past week, should alert you to the importance not only of databases but of computer and technological literacy itself. I have opinions on the matter that aren’t yet mature enough for full disclosure (maybe later), other than my belief that our nation is in a very precarious situation with respect to civil liberties and privacy – and that our technological literacy is crucial to protecting ourselves from oppression. Perhaps that was what influenced me to study databases seriously, or maybe it really was just because I thought it would be important for work – I’m not so sure.

Anyway, I started getting some assignments done at work by first learning basic SELECT queries and joins, which allowed me to manipulate the data into the format I needed for my models. Lately, I’ve been looking into theories of database design, which, at least in my impression, seems like a kind of applied set theory. I’ve also gotten involved in an open-source project dealing with databases, which I won’t reveal at this moment (maybe later). All I have to say is that I’m pretty excited to be working on something worthwhile.

Computer User Groups

I joined about 10 computer user groups on meetup.com. These user groups are weekly or monthly gatherings of people who want to talk about interests they have in common – music, sports, and various other hobbies. In my case, I’ve joined groups that discuss Python, Big Data, Perl, Linux, and Machine Learning, each meeting on a monthly basis. I’ve only been to the Python one so far, but I’m planning on going to a machine learning gathering in a few weeks. I’ve met some really cool people and learned a lot through these meetups.

Linear Algebra

A couple of years ago, I wrote about reviewing the mathematics I learned in high school, and you may have noticed that on my “Readings” page I’ve been stuck on my College Algebra book at 18.9% for the past 2 years. That doesn’t mean I stopped learning mathematics – I passed 3 actuarial exams over that span. However, now that I’m done with the preliminary exams, I’m reluctant to go back to the review: going over things I had already learned bored me, and I wasn’t patient enough to sit through it and make steady progress. Furthermore, I felt like I was spending so much time reviewing that I couldn’t devote enough effort to the new mathematics I’m interested in. Therefore, I’ve decided to limit my “reviewing” to about 30 minutes a night, but make it mandatory. The rest of my time will be devoted to databases and linear algebra.

I took an intense course in Linear Algebra over a five-week span and did well, but due to the short time over which I learned the material (and the fact that it’s been 5 years since I took the course), I’ve forgotten most of it. However, I’ve seen matrices appear more and more often in papers and in the work I’ve been doing (a computer uses matrix calculations to perform linear regression), so I decided to brush up on my linear algebra. This is technically a review, but since I spent such a short time on the course the first time around, I consider the material “new” enough to count as furthering my studies of mathematics. I also found the course very enjoyable, so I think it will give me a fresh perspective as I fill the gaps in my early education (after college algebra, I plan to review geometry, trig, and calculus) alongside the study of this subject.

Advances in computing power have only recently allowed us to realize the potential of statistical matrix computations on large data sets. Matrices have been around for centuries, and so has statistics, but until a few years ago large organizations couldn’t practically perform statistical calculations on their data sets, nor did they have the technological capability to store the data digitally to make such calculations possible. Now that we can, a new field called data science is emerging, and it’ll play a crucial role in society in the near future (perhaps “data science” and “big data” are just buzzwords – it’s really just applied statistics).
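As a small aside on the regression remark above, here’s a minimal sketch in base R (with made-up data) showing that ordinary least squares really is just matrix arithmetic – the coefficients obtained by solving the normal equations (X'X)b = X'y match the ones returned by lm():

[code language="r"]
# Least squares two ways: matrix algebra vs. lm()
# (simulated data, purely for illustration)
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

X <- cbind(1, x)                       # design matrix: intercept column plus x
beta <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations X'X b = X'y
beta                                   # estimates near the true values (2, 3)

coef(lm(y ~ x))                        # lm() arrives at the same estimates
[/code]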

I’d like to close with a simple demonstration of how a system of linear equations can be represented as a matrix:

\[ \begin{array}{rrrrrrr} x_1&-&2x_2&+&x_3&=&0 \\ &&2x_2&-&8x_3&=&8 \\ -4x_1&+&5x_2&+&9x_3&=&-9 \end{array}\]
can be represented as the augmented matrix:

\[ \left[\begin{array}{rrrr}1&-2&1&0\\ 0&2&-8&8 \\-4&5&9&-9\end{array}\right] \]
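In R, we can recover the solution directly from this matrix form – a quick sketch using base R’s solve():

[code language="r"]
# Coefficient matrix and right-hand side of the system above
A <- matrix(c( 1, -2,  1,
               0,  2, -8,
              -4,  5,  9), nrow = 3, byrow = TRUE)
b <- c(0, 8, -9)

solve(A, b)   # solution: x1 = 29, x2 = 16, x3 = 3
[/code]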

The properties of matrices allow us to solve systems of equations like these very efficiently. This particular example isn’t anything special, but I just wanted to show off the new LaTeX package I installed after switching to self-hosting. I think it’s much better than the default plugin used by WordPress.com – and I can even choose which renderer to use. This package is powered by MathJax, which I think displays the formulas much more elegantly and cleanly than before. In addition, you can highlight each component of each formula, which is an improvement over what I had been using. I’m satisfied with this plugin, although typing the above example was kind of a pain because the WordPress syntax for LaTeX expressions is a little different from what you would use with TeXLive. Actually, I think the terms in the equations above are too spaced out for my taste. I tried using the {align} environment but it came out weird, so I settled for the {array} environment. Maybe I’ll change it later.

I also have a new theme installed, which I like better than the temporary substitute I used last week. I think I’ll keep this one for now.

Posted in: Logs, Mathematics / Tagged: databases, python, relational databases, SQL

No. 84: New Hosting!

3 June, 2013 4:33 AM / Leave a Comment / Gene Dan

Hey everyone,

I migrated my blog to HostGator with the help of my web-designer friend, Jason Lee. You can check out his portfolio here – he’s pretty good.

I purchased a 3-year hosting plan about a year ago, but I didn’t get around to migrating my blog until this weekend. I wanted more control over my website, and I’m also planning on adding more content/features in the near future. Here’s the basic layout I had in mind:

genedan.com – a basic homepage linking to other areas of the site
genedan.com/blog – this blog
genedan.com/repositories – repositories for programs, code
genedan.com/scratchpad – a place for brainstorming, incomplete ideas

I didn’t really like the new MistyLook theme (an updated version of the green one that I had been using earlier) – the posts ended up looking too wide, so I installed a different theme as a temporary substitute until I find a better template or design one myself.

This was Jason’s first port from WordPress.com to a self-hosted site, so not everything went as expected. For example, my old site relied on WordPress.com’s built-in features for displaying source code and LaTeX, whereas a WordPress installation on a self-hosted site requires its own plugins to display mathematical notation. I’ll have to make some adjustments myself in the meantime.

To-do list:

– Fix the code blocks for each post containing code
– Fix the LaTeX code for each post containing LaTeX formulas
– Get a new theme or design one myself
– Make sure all the images are intact
– Make sure other, miscellaneous mishaps didn’t happen (like large chunks of text disappearing)

Well, that’s it for today, thanks for reading!

Posted in: Logs

No. 83: Basic Simulation Using R

13 March, 2013 2:00 AM / Leave a Comment / Gene Dan

Today I’d like to demonstrate a few examples of simulation using R’s built-in pseudorandom number generator. We’ll start by calling the function runif(n), which returns a vector of n draws from the uniform distribution on the interval [0,1]. To see what I mean, runif(50) will return 50 random numbers between 0 and 1:

[code language="r" wraplines="FALSE"]
> runif(50)
[1] 0.79380213 0.02640186 0.48848994 0.50689348 0.27242565 0.37866590 0.50134423 0.04855088 0.35709235 0.06587394 0.04107046 0.52542577 0.31302174
[14] 0.65262709 0.60967237 0.45131387 0.55305078 0.83903314 0.72698109 0.06292518 0.47579002 0.15186000 0.71345801 0.71252703 0.22304757 0.20179550
[27] 0.57375115 0.06144426 0.87460214 0.87085905 0.52197596 0.79827053 0.35533929 0.23212775 0.30441290 0.29824819 0.59430450 0.92366848 0.63523013
[40] 0.59757710 0.67266388 0.06165364 0.12924342 0.10372910 0.49521401 0.31687057 0.08331765 0.51155404 0.35502189 0.65212223
[/code]

Interestingly, the numbers generated above aren’t actually random. R uses a process called pseudorandom number generation, in which an algorithm generates a long sequence of deterministic numbers that appear random to most people (unless they have godlike powers of pattern recognition). The algorithm acts on an initial value, called a seed, and for each seed the algorithm returns the same sequence of numbers. The term period refers to how long the sequence can go before it repeats itself. For example, Microsoft Excel’s PRNG (pseudorandom number generator) has historically had a relatively short period, so (depending on the application) the sequence of numbers will repeat itself unless you frequently re-seed the algorithm. That is, if you generate a sequence 938745…, you’ll see 938745… again without too many draws.
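You can see this determinism for yourself by re-seeding the generator with set.seed() – the same seed always reproduces the same draws:

[code language="r"]
set.seed(12345)   # seed the generator
runif(3)          # three draws that look random

set.seed(12345)   # reset to the same seed...
runif(3)          # ...and the exact same three draws come back
[/code]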

The default PRNG used by R is called the Mersenne Twister, an algorithm developed in 1998 by Matsumoto and Nishimura. Other choices are available, such as Wichmann-Hill, Marsaglia-Multicarry, Super-Duper, Knuth-TAOCP, and L’Ecuyer-CMRG. You can even supply your own PRNG, if you wish.
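If you’d like to experiment, RNGkind() reports the generator currently in use, and passing it a name switches generators – a quick sketch:

[code language="r"]
RNGkind()                     # report the generators currently in use
RNGkind("Wichmann-Hill")      # switch the uniform generator
set.seed(1); runif(3)         # draws now come from Wichmann-Hill
RNGkind("Mersenne-Twister")   # restore the default
[/code]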

We can plot a histogram of a vector of generated numbers in order to observe the distribution of our sample. Below, you’ll see a four-panel plot depicting samples from a uniform distribution on [0,1], with a different number of draws per sample:

[code language="r"]
# Uniform sampling: 10, 100, 1000, and 10000 draws
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- runif(10^i)
  hist(x, prob = TRUE, col = "grey", ylim = c(0, 2), main = paste(10^i, " Draws"))
  curve(dunif(x), add = TRUE, col = "red", lwd = 2)  # overlay the uniform density
}
[/code]

[Figure: histograms of the four uniform samples, each overlaid with the uniform density]

As you can see, the sample approaches the uniform distribution as the number of draws becomes larger.

Similarly, we can simulate observations from the normal distribution by calling the function rnorm(n, mean, sd), which returns a vector of n draws from the normal distribution with the specified mean and standard deviation:

[code language="r"]
# Normal sampling: 10, 100, 1000, and 10000 draws
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- rnorm(10^i)
  hist(x, prob = TRUE, col = "grey", ylim = c(0, .6), xlim = c(-4, 4), main = paste(10^i, " Draws"))
  curve(dnorm(x), add = TRUE, col = "red", lwd = 2)  # overlay the standard normal density
}
[/code]

[Figure: histograms of the four normal samples, each overlaid with the standard normal density]

Likewise, as the number of draws gets larger, the sample approaches the normal distribution.

We can use R to demonstrate the normal approximation to the binomial distribution. The binomial distribution with parameters n and p is approximately normal with mean np and variance np(1-p), provided n is large and p is not too close to 0 or 1. We’ll draw from the binomial distribution with n = 50 and p = .5, and then plot a normal curve with mean = 25 and variance = 12.5. Notice that as we increase the number of draws, the histogram looks more and more like the normal distribution:

[code language="r"]
# Normal approximation to the binomial
par(mfrow = c(2, 2))
n <- 50
p <- .5
for (i in 1:4) {
  x <- rbinom(10^i, n, p)
  hist(x, prob = TRUE, col = "grey", ylim = c(0, .2), xlim = c(10, 40), main = paste(10^i, " Draws"))
  # normal curve with mean np and sd sqrt(np(1-p))
  curve(dnorm(x, n * p, sqrt(n * p * (1 - p))), add = TRUE, col = "red", lwd = 2)
}
[/code]

[Figure: histograms of binomial samples (n = 50, p = .5), each overlaid with the approximating normal curve]

For fun, I decided to see how many simulated values my computer could handle. I created a vector of 1 billion draws from the standard normal distribution:

[code language="r"]
# One billion standard normal draws (the vector alone occupies ~8 GB)
x <- rnorm(1000000000)
hist(x, prob = TRUE, col = "grey", main = "1000000000 Draws")
curve(dnorm(x), add = TRUE, col = "red", lwd = 2)
[/code]

[Figure: histogram of one billion standard normal draws with the normal density overlaid]

The script took about 20 minutes to execute and used almost all of my computer’s memory (16 GB) – a vector of one billion double-precision numbers occupies about 8 GB on its own. This was unnecessary, as I could have reproduced the image above with far fewer draws. Nevertheless, I’m very impressed with R’s capabilities – a similar script would have been impossible in Excel if I wanted to keep all the numbers in memory, and much slower even if I had cleared memory throughout the routine.

Posted in: Logs, Mathematics / Tagged: binomial approximation to normal, prng, pseudo random number generator, pseudorandom number generation, R, random number generator, simulation

No. 82: Plotting Normal Distributions with R

12 March, 2013 3:03 AM / 1 Comment / Gene Dan

Hey everyone,

I’ve got some good news – I passed CA2 a few weeks ago, and I’m glad I was able to knock out that requirement shortly after passing CA1. The bad news is that I’ve only written three posts this year when I should have had ten, so I’ve got some catching up to do. Over the past couple of months, I’ve mostly been studying material related to the insurance industry, but I try to squeeze in some math or programming whenever I have time. Lately, I’ve been learning how to use SQL Server Management Studio to aggregate large datasets at work. For statistics, I’ve continued with Verzani’s Using R for Introductory Statistics, which I started reading last year but put off until this year due to exams. Today, I’d like to show you some of R’s plotting capabilities – we’ll start off with a plot of the standard normal distribution, and I’ll demonstrate how you can change the shape of the plotted distribution by adjusting its parameters.

If you’ve taken statistics, you’re most likely familiar with the normal distribution:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

One of the nice things about this distribution is that its two parameters are the mean and variance, which are common statistics used in everyday language. The mean is the average, a measure of central tendency that describes the center of the distribution, and the variance is a statistic that describes the spread of the distribution – how widely the data points deviate from the mean. The following code generates a plot of the density function of a standard normal random variable, and then adds two curves that depict the same distribution shifted to the left:

[code language="r"]
# Standard normal, then shifted to the left
x <- seq(-6, 6, length = 500)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l", lty = 1, lwd = 3, col = "blue",
     main = "Normal Distribution", ylim = c(0, 0.5), xlim = c(-6, 6), ylab = "Density")
curve(dnorm(x, -1, 1), add = TRUE, lty = 2, col = "blue")  # mean shifted to -1
curve(dnorm(x, -2, 1), add = TRUE, lty = 3, col = "blue")  # mean shifted to -2
legend(2, .5, legend = c("N ~ (0, 1)", "N ~ (-1, 1)", "N ~ (-2, 1)"), lty = 1:3, col = "blue")
[/code]

[Figure: standard normal density with two leftward-shifted curves]

The code first generates a vector of 500 evenly spaced points between -6 and 6. This vector is then used as an argument to the dnorm() function, which returns the normal density at each element of the input vector. Notice that dnorm(x, mean = 0, sd = 1) is a call with three arguments – the first specifies the input vector, the second specifies that the mean of the distribution equals 0, and the third specifies that the standard deviation equals 1. The function returns a vector of densities, which is in turn used as an input to the plot() function, generating the solid blue line in the figure above. The next two lines of the script add the same distribution shifted 1 and 2 units to the left. In these two lines, the second argument of dnorm() is -1 and -2, respectively – that is, I changed the mean of the distribution from 0 to -1 and -2, causing the leftward shift you see above.

Similarly, I can shift the distribution to the right by increasing the mean:

[code language="r"]
# Standard normal, then shifted to the right
x <- seq(-6, 6, length = 500)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l", lty = 1, lwd = 3, col = "purple",
     main = "Normal Distribution", ylim = c(0, 0.5), xlim = c(-6, 6), ylab = "Density")
curve(dnorm(x, 1, 1), add = TRUE, lty = 2, col = "purple")  # mean shifted to 1
curve(dnorm(x, 2, 1), add = TRUE, lty = 3, col = "purple")  # mean shifted to 2
legend(-5.5, .5, legend = c("N ~ (0, 1)", "N ~ (1, 1)", "N ~ (2, 1)"), lty = 1:3, col = "purple")
[/code]

[Figure: standard normal density with two rightward-shifted curves]

Notice that I can change the position of the legend by specifying the x and y coordinates in the first two arguments of the legend() function.

The next script keeps the mean at 0, but adds two curves with the standard deviation increased to 1.5 and 2:

[code language="r"]
# Standard normal, then increased variance
x <- seq(-6, 6, length = 500)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l", lty = 1, lwd = 3, col = "black",
     main = "Normal Distribution", ylim = c(0, 0.5), xlim = c(-6, 6), ylab = "Density")
curve(dnorm(x, 0, 1.5), add = TRUE, lty = 2, col = "red")  # sd = 1.5 (variance 2.25)
curve(dnorm(x, 0, 2), add = TRUE, lty = 3, col = "black")  # sd = 2 (variance 4)
legend(-5.5, .5, legend = c("N ~ (0, 1)", "N ~ (0, 2.25)", "N ~ (0, 4)"), lty = 1:3, col = c("black", "red", "black"))
[/code]

[Figure: standard normal density with variance increased to 2.25 and 4]

Here, I made the middle curve red by using the col argument in the curve() function. Plotting is one of my favorite things to do with R. I feel that visualizing data helps you gain an intuitive grasp of a subject, and reveals patterns that you might not otherwise see in aggregated tables or simple summary statistics. Later this week (hopefully tomorrow), I’ll demonstrate some simple simulations with the normal distribution.

Posted in: Logs, Mathematics / Tagged: cran, normal plot r, R
