In early 2015, I got an invitation from SPE saying that I had been selected as a contributor for PetroWiki, a Wikipedia-style open encyclopedia for petroleum engineering. The first batch of articles was created from the Petroleum Engineering Handbook published by SPE.
Later, I had the idea of working on topic modeling, a subfield of natural language processing (NLP), to play around with categories and concepts. A topic model takes in lots of articles and tries to find the topics hidden in the large corpus. Refer to my earlier post here.
I came across an interesting piece of work in which 50 topics were discovered from 100k Wikipedia articles; there is an online topic visualization website for it. I immediately realized that PetroWiki is an excellent place for me to try topic modeling. I want to summarize PetroWiki and see which topics are involved in petroleum engineering, and how they are arranged. (You can jump to the last diagram for a quick summary.)
I used gensim to run the LDA (latent Dirichlet allocation) algorithm, together with the LDAvis package for visualization. I made 3200 retrievals and ended up with 765 articles. A fascinating set of 10 topics was revealed. The topics are ordered by their weights in the corpus (high to low).
I interpreted the topics manually:
- 1: Reservoir, reservoir engineering
- 2: Phase behavior, and applications (wax precipitation, etc.)
- 3: Well logging, petrophysics and core analysis
- 4: Drilling & completion
- 5: Pumping, artificial lifting
- 6: Polymer, foam, etc.
- 7: Pipeline & corrosion
- 8: (Less reliable) decision analysis & fracturing
- 9: (Less reliable) gas lift
- 10: Drilling bit technology
Below is a visualization of the topic relationships. The area represents the weight of a topic, and the distance between topics represents the similarity between them.
How did I interpret all the topics? That is the even more fascinating part: examining the document-topic relationship and the topic-word relationship. The following plot shows the word distribution for topic 5; it is easy to interpret it as the pumping topic. Note that the words are stemmed using the Porter stemmer from NLTK, so words like “separate”, “separator”, “separates”, and “separators” are all stemmed into “separ”. One confusing side effect is that “gas” is stemmed into “ga”.
The following two screenshots show the articles sorted by topic weight, so we are looking at the articles most relevant to each topic. After looking at them, one can conclude that topic A (left) is about drilling bit technology. I will leave it to the reader to guess: what is topic B?
It is interesting to examine the topic that I am most familiar with, reservoir engineering, shown below. One can see that “reservoir” and “gas” are standard words; “seismic” refers to seismic imaging, which is an important source of information for building a reservoir model; “SPE” is another common acronym; and “permeability”, “model”, and “simulation” are all typical reservoir engineering terms. However, some words are not informative: “eq” (equation), “fig” (figure), and even “rtenotitle” (the default title for an untitled image, a pure artifact).
Unfortunately, I didn’t have time to put it into a web browser view as the Wikipedia viewer does. But I am thinking of uploading the LDAvis result to my Stanford website later. It is fun to explore stuff like this!
Finally, a graph summarizes the main points.
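The stemming behavior is easy to check with a few lines of NLTK. This is a small sketch using `PorterStemmer` with its default settings; I am assuming the defaults match what was used for the plots.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The example words from the text above.
words = ["separate", "separator", "separates", "separators", "gas"]

# Map every word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems are not meant to be real words; they only need to be consistent, so that all variants of a word are counted as the same token.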
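Sorting articles by their weight on a topic can be sketched in plain Python. The article titles and weights below are invented placeholders, and `top_articles` is a hypothetical helper, not part of gensim or LDAvis.

```python
# Hypothetical document-topic weights: article title -> weight per topic.
# Titles and numbers are invented for illustration only.
doc_topic_weights = {
    "Article A": [0.70, 0.20, 0.10],
    "Article B": [0.05, 0.90, 0.05],
    "Article C": [0.30, 0.60, 0.10],
    "Article D": [0.10, 0.10, 0.80],
}


def top_articles(weights, topic_id, n=3):
    """Return the n (title, weight) pairs with the highest weight for topic_id."""
    ranked = sorted(
        weights.items(),
        key=lambda item: item[1][topic_id],
        reverse=True,
    )
    return [(title, topic_weights[topic_id]) for title, topic_weights in ranked[:n]]


# The two articles most relevant to topic 1 (0-based index).
top_for_topic_1 = top_articles(doc_topic_weights, topic_id=1, n=2)
print(top_for_topic_1)
```

Reading the titles at the top of such a ranking is usually the quickest way to put a human label on a topic.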
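One way to get rid of such uninformative tokens would be to drop them before fitting the model. I did not do this in the run shown here; the sketch below only illustrates the idea, and `extra_stopwords` and `drop_artifacts` are hypothetical names.

```python
# Tokens that carry no topical meaning in this corpus, taken from the
# examples mentioned above. The set name is a made-up placeholder.
extra_stopwords = {"eq", "fig", "rtenotitle"}


def drop_artifacts(tokens, stopwords=extra_stopwords):
    """Return tokens with the artifact words removed (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stopwords]


# Illustrative token list, not taken from a real article.
sample = ["reservoir", "fig", "permeability", "eq", "RTENoTitle", "simulation"]
cleaned = drop_artifacts(sample)
print(cleaned)
```

If the filter runs after stemming, the set would need to contain the stemmed forms of these words instead.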