# An Introduction to Statistical Text Mining

tags: text mining python nltk

*Update* 2015-07-24: Please check out the latest slides for this course, which now include a quick introduction to dependency parsing, a terser start, and several corrections and improvements throughout.

Last week we had a really great time at the first hands-on text mining workshop held in the context of the Advanced Statistics and Data Mining Summer School. This one-week course introduces a group of motivated summer students to text mining from the "bottom up", with the practical parts based on Python and the NLTK. The workshop was part of the 9th iteration of the Summer School, located in the sunniest capital of Europe: Madrid (well, *ex aequo* with Athens, at least). There, I presented 15 hours' worth of practical and theoretical background on machine learning for text mining to fourteen participants from all over the world. With this post I am sharing the slides and tutorial files (IPython Notebooks) with the world for free (see the Creative Commons BY-SA licensing details). While much of the material should speak for itself, it might not "save" you from visiting a text mining class (maybe mine, next summer?).

## Overview

The theory mostly focuses on explaining the methods and statistics behind machine learning *for text mining*, without requiring any particular background beyond sharp high-school maths. The practicals, on the other hand, make use of Python, particularly online IPython Notebooks. Prior contact with Python is therefore useful to benefit the most from the practicals, but it is not required to follow the course. IPython Notebooks act much like MATLAB notebooks, but run online in any modern browser, are based on open-source software that requires no license fee, and provide facilities to place documentation and mathematical formulas between the code snippets, evaluation results, and graphical plots. These notebooks thus serve as take-home material for the students to study and play around with, even well after the course. To illustrate the power of IPython, learning how to work in this environment immediately enables you to run distributed, high-performance computations "off the shelf" (e.g., via StarCluster on Amazon EC2 clusters).

## Day 1 - Introduction

Lecture 1 starts with a light-weight introduction to text mining, natural language understanding and generation, and the statistical approach to artificial intelligence in general. It provides a taste both of what is to come in the course and of what might be of interest to look into afterwards. It closes with a brief reprise of elementary Bayesian statistics and conditional probability, using the Monty Hall problem as a practical example of *Bayes' Rule*. Practical #1 introduces the use of IPython and its online Notebook, as well as some aspects of NumPy (particularly, contrasting NumPy/SciPy with R). The most important aspects of the **Natural Language ToolKit** (NLTK) Python library are presented during the second part of the practical; the NLTK is used frequently throughout the course. As an exercise to become familiar with Python and the NLTK, participants are encouraged to implement a function that lets two NLTK chat-bots converse with each other.
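The chat-bot exercise can be sketched in a few lines. This version is my own illustration, not the official solution: it pairs the Eliza and Iesha bots that ship with the NLTK and feeds each one the other's reply.

```python
from nltk.chat.eliza import eliza_chatbot
from nltk.chat.iesha import iesha_chatbot

def converse(bot_a, bot_b, opening="Hello", turns=5):
    """Let two NLTK chat-bots talk by feeding each the other's reply."""
    utterance = opening
    transcript = []
    for _ in range(turns):
        utterance = bot_a.respond(utterance)   # bot A answers
        transcript.append(("A", utterance))
        utterance = bot_b.respond(utterance)   # bot B answers A
        transcript.append(("B", utterance))
    return transcript

for speaker, line in converse(eliza_chatbot, iesha_chatbot):
    print(f"{speaker}: {line}")
```

Both bots end with a catch-all pattern, so `respond()` always returns a string and the conversation never stalls, however nonsensical it becomes.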

## Day 2 - Language Modeling

On the second day, lecture 2 introduces the chain rule of conditional probability and builds on it to take you from the *Markov property* over *language modeling* to *smoothing techniques*. In class, you learn how to operate on n-gram frequency tables, and we work out the conditional probability distributions of bi- and trigram models. This should give students a well-grounded foundation in statistical language processing. After a quick overview, the accompanying exercise #2 demonstrates how to build language models with the NLTK, and participants are tasked with generating their own models using advanced smoothing techniques.
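As a minimal sketch of what the exercise covers (the toy corpus and variable names are mine, not taken from the course materials), a bigram model can be derived from an n-gram frequency table with the NLTK's conditional frequency and probability distributions:

```python
from nltk import bigrams
from nltk.probability import (ConditionalFreqDist, ConditionalProbDist,
                              MLEProbDist, LaplaceProbDist)

# Toy corpus, already tokenized
corpus = "the cat sat on the mat the cat ate the rat".split()

# Bigram frequency table: counts of w2 following w1
cfd = ConditionalFreqDist(bigrams(corpus))

# Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
cpd = ConditionalProbDist(cfd, MLEProbDist)
print(cpd["the"].prob("cat"))            # 2 of the 4 bigrams starting with "the"

# Laplace (add-one) smoothing reserves probability mass for unseen bigrams
vocab_size = len(set(corpus))
cpd_smooth = ConditionalProbDist(cfd, LaplaceProbDist, vocab_size)
print(cpd_smooth["the"].prob("dog"))     # non-zero despite never being observed
```

The same pattern extends to trigrams by conditioning on word pairs instead of single words.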

## Day 3 - String Processing

The third-day lecture focuses on *string processing* and the algorithmic aspects of text mining. The slides give an overview of state machines and related structures: regular expressions, PATRICIA tries, and *Minimal Acyclic Deterministic Finite [State] Automata* (MADFA). Particular attention is paid to string metrics and similarity measures. The last theoretical part is a thorough introduction to *Locality Sensitive Hashing* (LSH), discussing its use as a tool to efficiently cluster millions of documents or to implement approximate string matching over very large-scale term dictionaries. During the third exercise, students apply LSH to a toy problem: improving the speed of Peter Norvig-style spelling correction.
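To give a flavour of that exercise, here is a minimal MinHash-style LSH index over character bigrams (a simplified sketch of my own, not the course's solution): words with similar bigram sets tend to share at least one band bucket, so a lookup yields a small candidate set instead of the whole dictionary.

```python
import random
from collections import defaultdict

def char_ngrams(word, n=2):
    """Set of character n-grams, with '#' marking the word boundaries."""
    word = f"#{word}#"
    return {word[i:i + n] for i in range(len(word) - n + 1)}

class MinHashLSH:
    """Toy MinHash LSH index over character bigram sets."""

    def __init__(self, num_hashes=20, bands=10, seed=42):
        rng = random.Random(seed)
        # XOR salts stand in for a family of independent hash functions
        self.salts = [rng.randrange(1 << 32) for _ in range(num_hashes)]
        self.bands = bands
        self.rows = num_hashes // bands
        self.buckets = defaultdict(set)

    def _signature(self, word):
        grams = char_ngrams(word)
        return tuple(min(hash(g) ^ salt for g in grams) for salt in self.salts)

    def add(self, word):
        sig = self._signature(word)
        for b in range(self.bands):
            band = sig[b * self.rows:(b + 1) * self.rows]
            self.buckets[(b, band)].add(word)

    def candidates(self, word):
        """All indexed words sharing at least one band bucket with `word`."""
        sig = self._signature(word)
        out = set()
        for b in range(self.bands):
            band = sig[b * self.rows:(b + 1) * self.rows]
            out |= self.buckets.get((b, band), set())
        return out

index = MinHashLSH()
for w in ["hello", "help", "world", "word"]:
    index.add(w)
print(index.candidates("helo"))
```

For spelling correction, the dictionary is indexed once, and the expensive edit-distance comparisons are then restricted to the retrieved candidates rather than run against every dictionary entry.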

## Day 4 - Text Classification

After the more algorithmic topics, the lecture of day four takes the participants into the realm of "real" machine learning. For starters, ways to calculate syntactic and semantic document similarity are presented, culminating in *Latent Semantic Analysis*. The slides introduce the Naive Bayes classifier as an example to prepare the audience for the *Maximum Entropy classifier* (multinomial logistic regression), which forms the central topic of this session. To "cool down" a bit after a math-heavy section, *sentiment analysis* is discussed as a popular domain for text classification. Finally, the important topic of evaluating set-based classifier performance via accuracy, F-measure, and the MCC score is discussed. The fourth practical then walks the participants through a Maximum Entropy sentiment classifier and encourages the audience to improve its performance by designing better features and making clever use of all the knowledge gained in the course so far.
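To illustrate the classification workflow, here is a hedged sketch using the NLTK's built-in Naive Bayes classifier; the tiny training set and feature function are invented for this example, and the course practical itself uses a Maximum Entropy model on real data:

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    """Bag-of-words presence features, the format NLTK classifiers expect."""
    return {word: True for word in text.lower().split()}

# Invented toy training data, for illustration only
train = [
    (features("a wonderful and moving film"), "pos"),
    (features("truly great acting"), "pos"),
    (features("an awful boring mess"), "neg"),
    (features("boring and painful to watch"), "neg"),
]

clf = NaiveBayesClassifier.train(train)
print(clf.classify(features("wonderful great acting")))
clf.show_most_informative_features(3)
```

Designing a better `features` function (bigrams, negation handling, etc.) is exactly the kind of improvement the practical asks for.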

## Day 5 - Graphical Models

On the final day, *probabilistic graphical models* dominate the lecture. The slides for day #5 first introduce the main differences between Bayesian Networks and Markov Random Fields. Connecting to the previous day's text classification topics, *Latent Dirichlet Allocation* is presented as the first graphical model. Then, for the remainder of the talk, we move into *dynamic* (temporal) models and their applications. We revisit the Markov chain and go from *Hidden Markov Models* all the way to *Conditional Random Fields*. To gear participants up for practical text mining, we look into actual applications of the presented models, for example, (semantic) *word representations* or *Named Entity Recognition*. The participants are introduced to the prospect of using the assigned labels for more advanced tasks: *Relationship Extraction*, for example, becomes feasible by combining the shallow parse output as features for the classifiers presented in other sessions of the Advanced Statistics and Data Mining Summer School. The rank-based performance measures of ROC and PR curves are discussed as a way to evaluate the labeled results. During the last practical, implementing a **shallow parser** using the NLTK and the Stanford taggers is demonstrated in class. Overall, the practicals should provide the students with the basic software tools to embark on their own text mining adventures right after the course.
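A shallow parser along those lines can be sketched with the NLTK's `RegexpParser`. The sentence below is pre-tagged by hand so the sketch needs no tagger models; the class demo uses the Stanford taggers for that step instead.

```python
from nltk import RegexpParser

# A POS-tagged sentence (tags written out by hand to keep this self-contained)
sentence = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
            ("jumps", "VBZ"), ("over", "IN"),
            ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

# One chunk rule: a noun phrase is an optional determiner,
# any number of adjectives, and a noun
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(sentence)

for subtree in tree.subtrees(lambda t: t.label() == "NP"):
    print(subtree)
```

The resulting NP chunks are exactly the kind of shallow parse output that can feed a downstream relationship-extraction classifier as features.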

## Remarks

You can contact me if you would like to discuss hosting this tutorial at your own venue, or if you have questions and comments about the slides' content. Other than that, I hope to have awakened your interest in statistical text mining and/or to have provided you with useful material, whether you are a student or a teacher.