Tag Archives: Data Mining

No. 89: Skipping Ahead

24 June, 2013 7:46 PM / 2 Comments / Gene Dan

I mentioned a few weeks ago that I’m spending about 30 minutes each day reading material that’s beyond my technical level of understanding. I’ve been reading Data Mining at a rate of about 10 pages per day, with a strict limit of no more than 3 minutes per page. The first couple of chapters were pretty easy – they covered some of the basic goals of machine learning, such as pattern recognition and classification. The third chapter was a bit more technical than the first two, and now that I’m on the fourth chapter, I feel almost completely lost, especially when trying to understand the examples covering unfamiliar concepts like information theory and entropy.

There are a few benefits to reading unfamiliar material – I think by doing so you’ll get a good idea of the subject’s prerequisites. In the case of data mining, I’ve learned that I’ll need to look into algorithms and information theory. There were also some paragraphs that covered Bayesian statistics, which I studied last year but have gotten a bit rusty at – however, the inclusion of the subject indicates that I ought to review it if I want to have a deep understanding of how data mining works.

Data mining has some direct applications to actuarial work. For example, insurance companies need to divide policyholders into subsets and price the policies based on shared attributes amongst the policyholders in each subset. An example of this practice would be to charge the owner of a commercial supertanker a higher premium for a commercial shipping policy than the owner of a small tugboat. This is because the supertanker has a larger expected loss than the tugboat. This type of segmentation is currently done using a combination of basic statistics and business judgement (marketing, operating costs, etc.). Segmentation is easy when you think of extreme examples, like the one I just pointed out (based on size); however, it becomes more difficult when you include other attributes such as age, manufacturer, where the ship is predominantly moored, and the climate of the geographic region where the ship travels on business. There might be significant overlap amongst individual ships regarding these categories, which makes classification difficult. In addition to the problem of how to classify ships, you also have the dilemma of choosing the optimal number of categories, the optimal size of each category, and the boundaries of each category.
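As a rough sketch of the segmentation idea above, the snippet below groups policies by vessel type and averages the expected loss (claim frequency times claim size) within each group. The records, the field layout, and the function name `expected_loss_by_segment` are all invented for illustration; real pricing would also involve the business judgement mentioned above.

```python
from collections import defaultdict

# Hypothetical policy records: (vessel_type, claim_frequency, average_claim_size).
# Every figure here is made up purely for illustration.
policies = [
    ("supertanker", 0.05, 2_000_000),
    ("supertanker", 0.04, 2_500_000),
    ("tugboat", 0.10, 50_000),
    ("tugboat", 0.08, 40_000),
]

def expected_loss_by_segment(records):
    """Average expected loss (frequency x severity) for each segment."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for segment, frequency, severity in records:
        totals[segment] += frequency * severity
        counts[segment] += 1
    return {segment: totals[segment] / counts[segment] for segment in totals}

segment_losses = expected_loss_by_segment(policies)
for segment, loss in segment_losses.items():
    print(f"{segment}: expected loss per policy = {loss:,.0f}")
```

With only one attribute (vessel type) the grouping is trivial; the hard part described above is deciding how to group when several overlapping attributes are involved.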

Data mining aims to resolve this issue by quickly scanning large datasets and looking for similarities amongst ships that an actuary or underwriter would otherwise miss via traditional methods. Data mining isn’t widely used in the industry, but it’s quickly gaining ground as the rapid expansion in data-collecting technology has made such efforts feasible. I think it has the potential to significantly change the industry with respect to the way insurers price policies, and this is why I’ve become interested in the field.

I think data mining also has some very interesting applications in biology – for instance, you might remember learning about animal kingdoms back in grade school. I myself was taught that there were five, but nowadays a 6-kingdom classification system is popular in U.S. biology classrooms (some advocate for even more kingdoms). Taxonomy, which includes grouping organisms into kingdoms, as well as into separate species, is a problem that can be addressed with data mining. The following diagram looks as if the person who made it was trying to divide a dataset into two categories:

This reminded me of a case example in the Data Mining book on whether or not a dataset of flowers should be divided into two separate categories. The above image isn’t the same thing (although it looks similar to the graphic in the book); I just thought it was a pretty image that would look nice with this blog post. Apparently, the image is the result of a failed attempt at k-means clustering (I’m not going to pretend that I know what it means).
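For the curious, here is a minimal sketch of what k-means clustering does when asked to split a dataset into two categories: it repeatedly assigns each point to its nearest centre, then moves each centre to the average of its points. This is not the example from the book; the flower measurements, the choice of starting centres, and the function names are assumptions made for illustration.

```python
import math

def assign(points, centroids):
    """Group each point with its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for point in points:
        distances = [math.dist(point, c) for c in centroids]
        clusters[distances.index(min(distances))].append(point)
    return clusters

def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points and recomputing centroids until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        updated = [
            # An empty cluster keeps its old centroid.
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical (petal length, petal width) measurements, invented for this sketch.
small_flowers = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3)]
large_flowers = [(4.7, 1.4), (4.5, 1.5), (5.0, 1.7)]
flowers = small_flowers + large_flowers

# Start from the first and last points as the two initial centres (k = 2).
centroids, clusters = k_means(flowers, [flowers[0], flowers[-1]])
print(centroids)
print(clusters)
```

On cleanly separated data like this the two groups fall out immediately. With overlapping groups or poorly chosen starting centres, the result can depend heavily on where the centres begin, which is one way an attempt can fail.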

In the meantime I’ve been making some good progress in learning linear algebra, which is well within reach for me in terms of difficulty. I’ll also be going to a Houston Machine Learning Group meeting later this week, and if I find it interesting I’ll be writing about it here.

Posted in: Logs / Tagged: data mining

No. 87: Books I’ve Been Reading Lately

13 June, 2013 9:29 PM / 1 Comment / Gene Dan

I’ve been reading some books:

1984 and Starship Troopers

I don’t have a lot of time to read fiction due to work and study, so I started listening to audiobooks during my commute, and whenever I’m driving in general. I’ve discovered that I can get through books surprisingly quickly this way – 1984 and Starship Troopers, both of which are about 300 pages long in print, took me roughly 2 weeks each to finish. I started first with Starship Troopers since the title was recognizable – I had seen the film adaptation first, but I found the book much more enjoyable – the movie was mostly an action-packed bloodbath whereas the book focused more on Juan Rico’s development as a soldier living under an authoritarian regime, training as a recruit and then as an officer. In this sense, the plots were almost completely different, and in my opinion the film really butchered the book. For example, the main character’s nationality was Filipino, which was actually an important point as the book was written during a particularly sensitive time with respect to race in the United States. This aspect was completely absent in the film adaptation.

I picked up 1984 so I could understand all the cultural references that I’ve seen in the media. It read like an adult version of Animal Farm, and touched upon many of the same subjects with respect to totalitarianism and the Communist revolution. It even had two characters representing Stalin (Big Brother) and Trotsky (Goldstein), just like Animal Farm did (Napoleon and Snowball, respectively). I thought the plot was okay, but it was really the ideas on state censorship, surveillance, and historical revisionism that struck me as important, and these ideas were probably what made the book so culturally important. It was also one of the most quotable books I’ve ever read. I recognized some references that I already knew came from the book, but also some new ones that I didn’t realize were inspired by 1984:

Big Brother is watching you
We have always been at war with Eastasia
Hate Week
The chocolate rations have been reduced
Room 101
Newspeak, Doublethink
Thoughtcrime
Thought Police

There were a lot more, but that’s what I could think of off the top of my head. After 1984, I started reading Asimov’s Foundation, which I thought contained some pretty neat ideas on what society would be like in an age where humans have mastered interstellar travel. The book is mostly dialogue, which I found dry after the first few chapters. I’m currently reading it, although I find that I have to switch between listening to the radio and the audiobook due to the lack of plot.

Modern Database Management

I’m reading Modern Database Management to get a better understanding of database design and structure. Several people have asked me why I’ve spent the time to do so and I’ve responded that it’s to better understand how data are stored in an organization, as well as how IT departments are structured so as to facilitate communication between myself and my coworkers in IT. I also think that learning how to write queries efficiently and effectively in SQL will increase my productivity and facility when it comes to manipulating data. This is often the most time-consuming task when it comes to performing statistical analyses, so I think these gains will be well worth the investment (about 200 hours or so).

Data Mining: Practical Machine Learning Tools and Techniques

I’ve decided to spend a little bit of time each day (30 minutes or so) reading advanced technical material that is beyond the reach of my current level of proficiency. This is not so much for understanding but more so for exposure to new subjects that I might want to study in the near future. I picked up Data Mining because it’s been heavily touted by the media as “the next big thing,” so I wanted to see what it was all about (although we must approach such claims with caution). I actually use some of these techniques at work – such as building regression models and using cross validation. The book is surprisingly accessible and not very technical. I found the first few chapters enjoyable, and didn’t struggle like I thought I would. It mostly covers the purpose, motivations, and importance of data mining techniques and points the reader to external material should they want to explore the topic further. In my opinion it’s actually important to have a non-technical text on the subject, as many data mining tools are implemented via different computer languages and software packages – for example, it might not benefit the reader (or at least it would inconvenience him) if the book focused on the nuts and bolts of one computer language whereas his employer used another.

For this type of reading I’m not aiming at 100% understanding; I’m mostly reading these books to see where I should go next, and to gain familiarity with the vocabulary and terminology used in certain subject areas. Should I ever decide to look into a subject more deeply, I’ll come back to a text for a second reading once I master the subject’s prerequisites.

Posted in: Logs / Tagged: 1984, data mining, modern database management, starship troopers
