
Topic modeling is a type of unsupervised learning. A commonly used algorithm is LDA (Latent Dirichlet Allocation). It takes in a large collection of articles and extracts several “topics” hidden in the documents.

Each topic is a multinomial distribution over words. For example, the algorithm may find a topic such as “1.8% restaurant, 0.7% menu, 0.6% dining, 0.06% chef, …”, which we can roughly interpret as “food”. Moreover, a document about food prices may be a mixture of different topics, such as 60% food topic, 30% financial topic, and so on. In this way, one can get a bird’s-eye view of the entire corpus without reading an overwhelmingly large number of articles.
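
To make the "topics are word distributions, documents are topic mixtures" idea concrete, here is a minimal sketch in plain Python. It does not run LDA itself. The four-word vocabulary, the three topics, and every probability are made-up assumptions for illustration; only the 60% food / 30% financial split echoes the example above. Real LDA would learn these numbers from a corpus.

```python
# Toy illustration only: all topics and probabilities below are invented.
# Each topic is a probability distribution over the (tiny) vocabulary.
topics = {
    "food":    {"restaurant": 0.50, "menu": 0.30, "price": 0.10, "market": 0.10},
    "finance": {"restaurant": 0.00, "menu": 0.00, "price": 0.60, "market": 0.40},
    "other":   {"restaurant": 0.25, "menu": 0.25, "price": 0.25, "market": 0.25},
}

# A document is a mixture of topics (weights sum to 1).
doc_mixture = {"food": 0.6, "finance": 0.3, "other": 0.1}


def word_probability(word, mixture, topic_table):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_table[topic].get(word, 0.0)
        for topic, weight in mixture.items()
    )


# Probability of seeing each vocabulary word in this document.
vocab = sorted({word for dist in topics.values() for word in dist})
doc_word_probs = {w: word_probability(w, doc_mixture, topics) for w in vocab}

for word, prob in sorted(doc_word_probs.items(), key=lambda kv: -kv[1]):
    print(f"{word:<12}{prob:.3f}")
```

Because each topic and the mixture both sum to 1, the resulting word probabilities for the document also sum to 1. LDA works in the opposite direction: it only sees the words in each document and infers both the topic table and the per-document mixtures.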

Inspired by these topic modeling results from 100K Wikipedia articles, and by several other works such as the “Automated Biography for a Nation”, I wanted to apply this technique to articles related to the oil industry. At first, I considered using Petrowiki, the wiki for the oil industry. However, after some testing, I found that it has fewer than 3,000 articles, so topic modeling would add little value on a corpus that small. I therefore started looking for larger datasets.

Fortunately, I enrolled in the CS229 Machine Learning course taught by Andrew Ng, and I found two classmates who are also interested in topic modeling. We found it more interesting to study the relationship between news articles and the oil price, so we chose this subject as our starting point.

We have been working on this project for about five hours per week since October 2015, and our midterm Milestone Report is already available here. Thanks to every team member for their effort. More interesting results are yet to come!