Monthly Archives: June 2013


No. 89: Skipping Ahead

24 June, 2013 7:46 PM / 2 Comments / Gene Dan

I mentioned a few weeks ago that I’m spending about 30 minutes each day reading material that’s beyond my technical level of understanding. I’ve been reading Data Mining at a rate of about 10 pages per day, with a strict limit of no more than 3 minutes per page. The first couple of chapters were pretty easy – they covered some of the basic goals of machine learning, such as pattern recognition and classification. The third chapter was a bit more technical than the first two, and now that I’m on the fourth chapter, I feel almost completely lost, especially when trying to understand the examples covering unfamiliar concepts like information theory and entropy.

There are a few benefits to reading unfamiliar material – I think by doing so you’ll get a good idea of the subject’s prerequisites. In the case of data mining, I’ve learned that I’ll need to look into algorithms and information theory. There were also some paragraphs that covered Bayesian statistics, which I studied last year but have gotten a bit rusty at – however, the inclusion of the subject indicates that I ought to review it if I want to have a deep understanding of how data mining works.

Data mining has some direct applications to actuarial work. For example, insurance companies need to divide policyholders into subsets and price the policies based on shared attributes amongst the policyholders in each subset. An example of this practice would be to charge the owner of a commercial supertanker a higher premium for a commercial shipping policy than the owner of a small tugboat. This is because the supertanker has a larger expected loss than the tugboat. This type of segmentation is currently done using a combination of basic statistics and business judgement (marketing, operating costs, etc.). Segmentation is easy when you think of extreme examples, like the one I just pointed out (based on size). However, it becomes more difficult when you include other attributes such as age, manufacturer, where the ship is predominantly moored, and the climate of the geographic region where the ship travels on business. There might be significant overlap amongst individual ships regarding these categories, which makes classification difficult. In addition to the problem of how to classify ships, you also have the dilemma of choosing the optimal number of categories, the optimal size of each category, and the boundaries of each category.

Data mining aims to resolve this issue by quickly scanning large datasets and looking for similarities amongst ships that an actuary or underwriter would otherwise miss via traditional methods. Data mining isn’t widely used in the industry, but it’s quickly gaining ground as the rapid expansion in data-collecting technology has made such efforts feasible. I think it has the potential to significantly change the industry with respect to the way insurers price policies, and this is why I’ve become interested in the field.

I think data mining also has some very interesting applications in biology – for instance, you might remember learning about animal kingdoms back in grade school. I myself was taught that there were five, but nowadays a 6-kingdom classification system is popular in U.S. biology classrooms (some advocate for even more kingdoms). Taxonomy, which includes grouping organisms into kingdoms, as well as into separate species, is a problem that can be addressed with data mining. The following diagram looks as if the person who made it was trying to divide a dataset into two categories:

This reminded me of a case example in the Data Mining book on whether or not a dataset of flowers should be divided into two separate categories. The above image isn’t the same thing (although it looks similar to the graphic in the book); I just thought it was a pretty image that would look nice with this blog post. Apparently, the image is the result of a failed attempt at k-means clustering (I’m not going to pretend that I know what it means).
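
For the curious, the basic k-means loop can be sketched in a few lines of Python. This is only a minimal illustration, not the method that produced the image: the points are made up, the function name k_means is hypothetical, and starting from the first k points is a simplifying assumption (implementations typically choose random starting centers).

```python
def k_means(points, k, iterations=10):
    """Group 2-D points into k clusters with the basic k-means loop."""
    # Assumption for reproducibility: start from the first k points
    # (real implementations usually pick random starting centers).
    centers = list(points[:k])
    clusters = [[] for _ in range(k)]

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            clusters[distances.index(min(distances))].append((x, y))

        # Update step: move each center to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old center if a cluster ends up empty
                centers[i] = (
                    sum(x for x, _ in cluster) / len(cluster),
                    sum(y for _, y in cluster) / len(cluster),
                )

    return centers, clusters


# Made-up data: two loose groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centers, clusters = k_means(points, k=2)
print(clusters)
```

The algorithm alternates between assigning points to the nearest center and recomputing each center, which is why a poor choice of starting centers or of k can produce a "failed" clustering.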

In the meantime I’ve been making some good progress in learning linear algebra, which is well within reach for me in terms of difficulty. I’ll also be going to a Houston Machine Learning Group meeting later this week, and if I find it interesting I’ll be writing about it here.

Posted in: Logs / Tagged: data mining

No. 88: Communicating Mathematics via LaTeX

17 June, 2013 8:40 PM / Leave a Comment / Gene Dan

I started learning LaTeX a couple of years ago, but it wasn’t until last year when I started studying for actuarial exams 4/C and 3/MFE that I really started to become proficient at using it. If you are not familiar with what LaTeX is, it’s a markup language that lets you easily (although it takes some effort to learn) write mathematical formulas on a computer screen by using the keyboard. For example, \[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}} \], which is the code written in the WordPress editor, produces the formula for the Normal Distribution:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

You can try to write the above formula by using Microsoft Equation editor (like many of us did as high schoolers), but you’ll quickly realize that it takes an extremely long time, and you’ll be wishing that you had a faster, more efficient way of writing mathematical notation – this is where the usefulness of LaTeX becomes apparent.

I started using LaTeX while posting in the actuarial message boards, which are popular amongst candidates who are trying to study for exams. The bulletin board system has a LaTeX compiler installed, so you can easily consult other students from all across the world. For example, if I’m studying at 2:00 AM, I can post a question on the message board and there will most likely be someone who is awake at that time in Europe or China who would be willing to answer that question.

There are some more well-known message boards as well, such as StackOverflow and MathOverflow, where people (mostly from technical backgrounds) ask each other questions. Oftentimes they’ll use LaTeX to write technical notation – which greatly facilitates communication. I’ve found StackOverflow to be very helpful from time to time. On the other hand, I can’t even understand most of the questions being asked in MathOverflow, which is an online community of mathematicians asking each other research-level questions pertaining to mathematics. Fortunately, there’s another site under the StackExchange umbrella called Mathematics Stack Exchange, which caters to undergraduate and early graduate-level students, and is much more accessible. These websites are only 4 years old and have already made a huge impact on the way people collaborate on technical projects. I’m not sure if Don Knuth imagined this when he invented TeX way back in 1978, but if he did, he had tremendous foresight.

I’ll close by demonstrating a problem on matrices, which I started studying last week. I covered the basic row operations on matrices and today I just went over spanning and matrix equations in the Ax=b form. It’s been very interesting going back to material that I first learned 5 years ago – I have a different perspective now and it’s much like watching a movie – you always pick up something new the second time around.

Problem:

Solve the system of equations:

\[\begin{aligned} x_1 - 3x_3 &= 8 \\ 2x_1 + 2x_2 + 9x_3 &= 7 \\ x_2 + 5x_3 &= -2 \end{aligned}\]

Solution:

We’ll start by writing the augmented matrix of this system of equations:

\[\left[ \begin{array}{rrrr} 1 & 0 & -3 & 8 \\ 2 & 2 & 9 & 7 \\ 0 & 1 & 5 & -2 \end{array} \right] \]

Replace row 2 with the sum of row 2 and negative 2 times row 1:

\[\left[ \begin{array}{rrrr} 1 & 0 & -3 & 8 \\ 0 & 2 & 15 & -9 \\ 0 & 1 & 5 & -2 \end{array} \right] \]

Interchange rows 2 and 3:

\[\left[ \begin{array}{rrrr} 1 & 0 & -3 & 8 \\ 0 & 1 & 5 & -2 \\ 0 & 2 & 15 & -9 \end{array} \right] \]

Replace row 3 with the sum of row 3 and negative 2 times row 2:

\[\left[ \begin{array}{rrrr} 1 & 0 & -3 & 8 \\ 0 & 1 & 5 & -2 \\ 0 & 0 & 5 & -5 \end{array} \right] \]

Scale row 3 by 1/5:

\[\left[\begin{array}{rrrr} 1 & 0 & -3 & 8 \\ 0 & 1 & 5 & -2 \\ 0 & 0 & 1 & -1 \end{array} \right] \]

Replace row 2 with the sum of row 2 and negative 5 times row 3:

\[\left[\begin{array}{rrrr} 1 & 0 & -3 & 8 \\ 0 & 1 & 0 & 3 \\ 0 & 0 & 1 & -1 \end{array} \right] \]

Replace row 1 with the sum of row 1 and three times row 3:

\[\left[\begin{array}{rrrr} 1 & 0 & 0 & 5 \\ 0 & 1 & 0 & 3 \\ 0 & 0 & 1 & -1 \end{array} \right] \]

This final matrix is equivalent to the following system of equations:

\[\begin{aligned} x_1 &= 5 \\ x_2 &= 3 \\ x_3 &= -1 \end{aligned} \]

Thus the solution is \((x_1, x_2, x_3) = (5,3,-1)\).
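
The same row operations (interchange, scale, replace) can be automated. Below is a minimal Python sketch of Gauss-Jordan elimination, assuming the system has exactly one solution; the function name gauss_jordan is hypothetical, and the code may apply the operations in a different order than the hand calculation above. It uses exact fractions so no rounding creeps in.

```python
from fractions import Fraction


def gauss_jordan(augmented):
    """Reduce an augmented matrix [A | b] to reduced row echelon form.

    Assumes the system has exactly one solution (A is square and invertible).
    """
    # Work on a copy with exact rational arithmetic to avoid rounding error.
    m = [[Fraction(value) for value in row] for row in augmented]
    n = len(m)

    for col in range(n):
        # Interchange: find a row at or below the diagonal with a nonzero pivot.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]

        # Scale: make the pivot entry equal to 1.
        pivot_value = m[col][col]
        m[col] = [entry / pivot_value for entry in m[col]]

        # Replace: clear the pivot column in every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]

    return m


augmented = [
    [1, 0, -3, 8],
    [2, 2, 9, 7],
    [0, 1, 5, -2],
]

reduced = gauss_jordan(augmented)
solution = [int(row[-1]) for row in reduced]
print(solution)  # prints [5, 3, -1]
```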

The example above is how I would typically ask or answer a question posted on a bulletin board. Actually, giving out the entire solution (as above) is typically frowned upon and most people just give enough hints so that the person who asked the question can figure it out on their own. However, you can definitely see from the example that LaTeX allows you to cleanly print the matrices, which makes it much easier to understand. I remember back in high school when my friends and I would struggle trying to help each other out via AIM or some other chat client. I only wish I’d found out about LaTeX sooner.

Posted in: Logs, Mathematics / Tagged: LaTeX, mathoverflow, matrices, matrix, row operations, stackoverflow

No. 87: Books I’ve Been Reading Lately

13 June, 2013 9:29 PM / 1 Comment / Gene Dan

I’ve been reading some books:

1984 and Starship Troopers

I don’t have a lot of time to read fiction due to work and study, so I started listening to audiobooks during my commute, and whenever I’m driving, in general. I’ve discovered that I can get through books surprisingly quickly this way – 1984 and Starship Troopers, both of which are about 300 pages long in print, took me roughly 2 weeks each to finish. I started first with Starship Troopers since the title was recognizable – I had seen the film adaptation first, but I found the book much more enjoyable – the movie was mostly an action-packed bloodbath whereas the book focused more on Juan Rico’s development as a soldier living under an authoritarian regime; training as a recruit and then as an officer. In this sense, the plots were almost completely different, and in my opinion the film really butchered the book. For example, the main character’s nationality was Filipino, which was actually an important point as the book was written during a particularly sensitive time with respect to race in the United States. This aspect was completely absent in the film adaptation.

I picked up 1984 so I could understand all the cultural references that I’ve seen in the media. It read like an adult version of Animal Farm, and touched upon many of the same subjects with respect to totalitarianism and the Communist revolution. It even had two characters representing Stalin (Big Brother) and Trotsky (Goldstein), just like Animal Farm did (Napoleon and Snowball, respectively). I thought the plot was okay, but it was really the ideas on state censorship, surveillance, and historical revisionism that struck me as important, and these ideas were probably what made the book so culturally important. It was also one of the most quotable books I’ve ever read. I recognized some references that I already knew came from the book, but also some new ones that I didn’t realize were inspired by 1984:

Big Brother is watching you
We have always been at war with Eastasia
Hate Week
The chocolate rations have been reduced
Room 101
Newspeak, Doublethink
Thoughtcrime
Thought Police

There were a lot more, but that’s what I could think of off the top of my head. After 1984, I started reading Asimov’s Foundation, which I thought contained some pretty neat ideas on what society would be like in an age where humans have mastered interstellar travel. The book is mostly dialogue, which I found dry after the first few chapters. I’m currently reading it, although I find that I have to switch between listening to the radio and the audiobook due to the lack of plot.

Modern Database Management

I’m reading Modern Database Management to get a better understanding of database design and structure. Several people have asked me why I’ve spent the time to do so and I’ve responded that it’s to better understand how data are stored in an organization, as well as how IT departments are structured so as to facilitate communication between myself and my coworkers in IT. I also think that learning how to write queries efficiently and effectively in SQL will increase my productivity and facility when it comes to manipulating data. This is often the most time-consuming task when it comes to performing statistical analyses, so I think these gains will be well worth the investment (about 200 hours or so).

Data Mining: Practical Machine Learning Tools and Techniques

I’ve decided to spend a little bit of time each day (30 minutes or so) reading advanced technical material that is beyond the reach of my current level of proficiency. This is not so much for understanding but more so for exposure to new subjects that I might want to study in the near future. I picked up Data Mining because it’s been heavily touted by the media as “the next big thing,” so I wanted to see what it was all about (although we must approach such claims with caution). I actually use some of these techniques at work – such as building regression models and using cross validation. The book is surprisingly accessible and not very technical. I found the first few chapters enjoyable, and didn’t struggle like I thought I would. It mostly covers the purpose, motivations, and importance of data mining techniques and points the reader to external material should they want to explore the topic further. In my opinion it’s actually important to have a non-technical text on the subject, as many data mining tools are implemented via different computer languages and software packages – for example, it might not benefit the reader (or at least it would inconvenience him) if the book focused on the nuts and bolts of one computer language whereas his employer used another.

For this type of reading I’m not aiming at 100% understanding; I’m mostly reading these to see where I should go next, and to gain familiarity with the vocabulary and terminology used in certain subject areas. Should I ever decide to look into a subject more deeply, I’ll come back to a text for a second reading once I master the subject’s prerequisites.

Posted in: Logs / Tagged: 1984, data mining, modern database management, starship troopers

No. 86: Mobile App!

11 June, 2013 8:40 PM / Leave a Comment / Gene Dan

I noticed that a lot of inbound traffic came from mobile devices, so I decided to take a look at the front page using my phone. Unfortunately, I couldn’t read a single thing since the site wasn’t mobile ready. Therefore, I installed a mobile plugin that lets people view the blog via their portable devices.

I’ve only been using a smartphone for a few months myself, so I am not all that familiar with the technology. Maybe I’ll make some adjustments as I get used to it, but for now this will do. One of the cool things about it is that it displays LaTeX quite well. For example, here’s an unimportant, meaningless matrix:

\[A_{m,n} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}\]

Posted in: Logs

No. 85: Things I’ve Been Doing Lately

10 June, 2013 9:40 PM / Leave a Comment / Gene Dan

I haven’t written in here for a long time, so I’d like to list what I’ve been doing these past few months to get the ball rolling:

Databases:

I took an interest in the study of databases late last year, and I’ve gotten a little further by studying the underlying concepts of database design and data modeling. I took the initiative to learn the subject after struggling with the technical aspects of my job. Last year, when I was first working with generalized linear models, I quickly found out that I couldn’t design the queries necessary to get the data I needed, and that I couldn’t communicate effectively with the IT personnel who were responsible for implementing my pricing models (in this case, implementation means to program the database systems that the business uses to store and capture information). Whenever I would ask them something, they would respond via their language of choice, T-SQL, which I had seen glimpses of during my internship days but couldn’t understand. At that time, I only had a vague notion of what a relational database was, and that it was somehow fundamentally different from an Excel spreadsheet, although I didn’t know why; if you ask me today I’d be able to tell you some of the fundamental differences – such as how a database has strictly defined fields and entity relationships, whereas Excel does not, etc. (well you could design an Excel spreadsheet to have those features, but it’s impractical).
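
To make the difference from a spreadsheet concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The tables policyholder and policy and their columns are hypothetical, invented only for illustration. The database rejects rows that break the declared rules, which a plain spreadsheet would happily accept. One caveat: SQLite only enforces foreign keys after they are switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

# Hypothetical schema: every policy must point at an existing policyholder.
conn.executescript(
    """
    CREATE TABLE policyholder (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE policy (
        id              INTEGER PRIMARY KEY,
        policyholder_id INTEGER NOT NULL REFERENCES policyholder(id),
        premium         REAL NOT NULL CHECK (premium >= 0)
    );
    """
)

conn.execute("INSERT INTO policyholder (id, name) VALUES (1, 'Tugboat Co.')")
conn.execute("INSERT INTO policy (policyholder_id, premium) VALUES (1, 1200.0)")

# Rows that violate the declared fields or relationships are refused.
rejected = []
bad_rows = [
    ("missing owner", "INSERT INTO policy (policyholder_id, premium) VALUES (99, 500.0)"),
    ("negative premium", "INSERT INTO policy (policyholder_id, premium) VALUES (1, -5.0)"),
]
for label, statement in bad_rows:
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError as error:
        rejected.append(label)
        print(f"{label}: {error}")
```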

This was perhaps the first time I thought to myself that maybe learning about databases would be important. I took a look at some books on SQL and another one that a co-worker checked out of the library, and determined that it would take me many months, if not a few years of investment to acquire proficiency in the subject. I encounter unfamiliar software and bits of code on a daily basis, so I often have to make the decision on whether or not I should spend a significant amount of time studying a software suite or language, or if I should just learn enough of it to complete my next assignment. For example, sometimes (about once every six months or so) I have to run some obsolete legacy software that I know will soon be replaced, so I’ll decide to learn just enough to get what I need. On the other hand, I see articles almost every single week pertaining to either SQL or relational databases on Hacker News, and several of my friends at coder meetup groups (which I’ll explain in the next section) use databases or write code pertaining to them, so I saw this as another sign that studying the subject would be worthwhile.

If you are not familiar at all with what I’m talking about, you may have seen stories in the news about cybercriminals (ranging from teenage misfits to professional cybersoldiers) infiltrating corporate and government databases, and you may have heard or seen the phrase “SQL Injection” thrown around in the technology section of newspapers. These concepts pertain to databases, which serve as centralized repositories for storing important information for businesses and governments. Prior to working in the corporate sector, news stories such as these were the only exposure I had to databases, and perhaps Americans who are working in non-technical jobs have a similar level of exposure, if any, at all. I think the media sensationalizes cybercrime because it deals with technology and concepts that the average citizen finds esoteric. I think perhaps stories like these, along with recent revelations during the past week should alert you to the importance of not only databases but also computer and technological literacy itself. I have some opinions on the matter but I believe that they are not yet mature enough for full disclosure now (maybe later), other than my belief that our nation is in a very precarious situation pertaining to civil liberties and privacy – and that our technological literacy is crucial to protecting ourselves from oppression. Perhaps this was the reason that influenced me to study databases seriously, or maybe it was really because I thought it would be important for work, I’m not so sure.

Anyway, I started getting some assignments at work done by first learning basic SELECT queries and joins, which allowed me to manipulate the data into the format I needed for my models. Lately, I’ve been looking into theories on database design, which at least in my impression seems like some kind of applied set theory. I’ve also gotten involved in an open-source project dealing with databases, which I won’t reveal at this moment (maybe later). All I have to say is that I’m pretty excited to be working on something worthwhile.
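
As a rough idea of what a basic SELECT with a join looks like, here is a small sketch using Python’s built-in sqlite3 module. The policy and claim tables and their contents are made up for illustration and are not my actual work data; the query combines the two tables and summarizes losses per line of business, which is the kind of reshaping a model might need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, invented for this example only.
conn.executescript(
    """
    CREATE TABLE policy (id INTEGER PRIMARY KEY, line TEXT NOT NULL);
    CREATE TABLE claim (
        id        INTEGER PRIMARY KEY,
        policy_id INTEGER NOT NULL REFERENCES policy(id),
        loss      REAL NOT NULL
    );
    INSERT INTO policy (id, line) VALUES (1, 'marine'), (2, 'property');
    INSERT INTO claim (policy_id, loss) VALUES (1, 900.0), (1, 600.0), (2, 50.0);
    """
)

# Join each claim to its policy, then total the losses per line of business.
query = """
    SELECT p.line, COUNT(c.id) AS claim_count, SUM(c.loss) AS total_loss
    FROM policy AS p
    JOIN claim AS c ON c.policy_id = p.id
    GROUP BY p.line
    ORDER BY p.line
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```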

Computer User Groups

I joined about 10 computer user groups on meetup.com. These user groups are weekly or monthly gatherings of people who want to talk about interests they have in common – such as music, sports, and various other hobbies. In my case, I’ve joined groups that discuss Python, Big Data, Perl, Linux, and Machine Learning that meet on a monthly basis. I’ve only been to the Python one so far, but I’m planning on going to a machine learning gathering in a few weeks. I’ve met some really cool people and learned a lot via these meetups.

Linear Algebra

I think a couple years ago I wrote about reviewing the mathematics that I learned in high school, and you may have noticed that on my “Readings” page, I’ve been stuck on my College Algebra book at 18.9% over the past 2 years. In that time I hadn’t stopped learning mathematics – I passed 3 actuarial exams in that time. However, now that I’m done with the preliminary exams I’m reluctant to go back to it, as when I was reviewing basic algebra, I was so bored going over things I had learned in the past and wasn’t patient enough to sit through it to make steady progress. Furthermore, I felt like I was spending so much time reviewing that I couldn’t devote enough effort to learning the new mathematics that I’m interested in. Therefore, I’ve decided to limit my “reviewing” to about 30 minutes a night, but make it mandatory. The rest of my time will be devoted to databases and linear algebra.

I took an intense course in Linear Algebra over a five-week span and did well, but due to the short time over which I learned the material (and the fact that it’s been 5 years since I took the course), I’ve forgotten most of it. However, I’ve seen matrices appear more and more often in papers and in the work I’ve been doing (a computer uses matrix calculations to perform linear regression), so I decided to brush up on my linear algebra. This is technically a review, but since I spent such a short time on the course the first time, I consider the material “new” enough to count as furthering my studies of mathematics. I also found the course very enjoyable, so I think this will give me a fresh perspective as I fill the gaps in my early education (after college algebra, I plan to review geometry, trig, and calculus) alongside the study of this subject.
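
For reference, the matrix calculation behind ordinary least squares regression is commonly written as the normal equations. The symbols here are notation introduced for illustration: \(X\) is the matrix of predictor values (one row per observation), \(y\) is the vector of observed responses, and \(\hat{\beta}\) is the vector of fitted coefficients. The formula assumes \(X^{T}X\) is invertible:

\[ \hat{\beta} = (X^{T} X)^{-1} X^{T} y \]

In practice, software usually solves this with a matrix factorization rather than by computing the inverse directly, but the idea is the same.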

Advances in computing power have only recently allowed us to realize the power of statistical matrix computations on large data sets. Matrices have been around for centuries, and Statistics has been around for centuries, but large organizations were not, at least until a few years ago, able to practically perform statistical calculations on data sets, nor did they have the technological capability to digitally store the data in their systems to make such calculations possible. Now that we can, a new field called data science is emerging, and it’ll play a crucial role in society in the near future (perhaps “data science” and “big data” are just buzzwords – it’s really just applied statistics).

I’d like to close with a simple demonstration on how a system of linear equations can be represented as a matrix:

\[ \begin{array}{rrrrrrr} x_1&-&2x_2&+&x_3&=&0 \\ &&2x_2&-&8x_3&=&8 \\ -4x_1&+&5x_2&+&9x_3&=&-9 \end{array}\]
can be represented as the augmented matrix:

\[ \left[\begin{array}{rrrr}1&-2&1&0\\ 0&2&-8&8 \\-4&5&9&-9\end{array}\right] \]
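
The same bookkeeping can be sketched in Python: keep the coefficients in a matrix A and the right-hand sides in a vector b, then append b as an extra column. The names A, b, and satisfies are my own choices for illustration, and the candidate solution was worked out by hand for this example rather than taken from a textbook.

```python
# Coefficients of x_1, x_2, x_3 in each equation (a missing variable gets 0).
A = [
    [1, -2, 1],
    [0, 2, -8],
    [-4, 5, 9],
]
# Right-hand sides, taken from the last column of the augmented matrix above.
b = [0, 8, -9]

# The augmented matrix is A with b appended as an extra column.
augmented = [row + [rhs] for row, rhs in zip(A, b)]


def satisfies(A, b, x):
    """Return True if the candidate x solves every equation in the system."""
    return all(
        sum(coef * value for coef, value in zip(row, x)) == rhs
        for row, rhs in zip(A, b)
    )


# Candidate found by hand for illustration; it is not taken from the post.
candidate = [29, 16, 3]
print(augmented)
print(satisfies(A, b, candidate))  # prints True
```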

The properties of matrices allow us to solve systems of equations like these very efficiently. This particular example isn’t anything special, but I just wanted to show off the new LaTeX package I installed after switching to self-hosting. I think it’s much better than the default plugin used by wordpress.com – and I can even choose which one to use. This package is powered by MathJax, which I think displays the formulas much more elegantly and cleanly than before. In addition, you can highlight each component of each formula, which is an improvement over what I had been using before. I’m satisfied with this plugin, although typing the above example was kind of a pain because the WordPress syntax for LaTeX expressions is a little different than what you would use for TeXLive. Actually, I think the terms in the equations above are too spaced out for my taste. I tried using the {align} environment but it came out weird, so I settled for the {array} environment. Maybe I’ll change it later.

I also have a new theme installed, which I like better than the temporary substitute I used last week. I think I’ll keep this one for now.

Posted in: Logs, Mathematics / Tagged: databases, python, relational databses, SQL
