## An Introdcution to Statistical Text Mining

Last week I had to a really great time giving my first "practical" workshop, introducing text mining from the "bottom up" to a bunch of highly motivated summer students. This presentation was part of the 9th iteration of the Advanced Statistics and Data Mining Madrid Summer School. In its context, I presented 15 hours worth of practical and theoretical classes on machine learning for text mining. Due to my very extreme Open Source, Open Accesss and Copyleft stance, I feel it is my very obligation to share the slides and the tutorial files with the world (i.e., you, dear reader) for free (but limited to non-commerical use). While I do hope much of the material can speak for itself, I think it will not "save" you from coming to my class (next year?). I definitely would look forward to your particiaption in a presentation that will make you understand and work through all the maths and statistics presented in the work linked here.

## Overview

The theory mostly focuses on explaining the methods and statiscs behind machine learning *for text mining* - without requiring a particlar background other than sharp high-school maths. The practicals on the other hand make use of Python, particularly online IPython Notebooks. Therefore, prior contact with Python will be very useful to profit the most from the practicals, but is not required to follow the course. IPython's Notebooks act very much like MATLAB Notebooks, but run online in any modern browser and provide facilities to place markup annotations and mathematical forumlas between the code snippets and their return values. Therefore, the notebooks should serve as "take home material" to study and play around with, even after the course. Given the power of IPython, learning to work in this environment would immediately enable you to do distributed, high-performance computing right "off the shelf" (e.g., via StarCluster on Amazon EC2 Clusters).

## Day 1

Lecture 1 starts with a light-weight introduction to text mining, natural language understanding and generation, and a statistics approach to artificial intelligence in general. It provides a taste of both what is to come in the course and what might be of interest to look into afterwards. It closes with a brief introduction to Bayesian statistics and conditional probability, particularly an example application of *Bayes' Rule* (the Monty Hall Problem). The practical #1 discusses how to use IPython and its online Notebook, as well as some aspects of NumPy (particularly, contrasting NumPy/SciPy to R) and the most important aspects of the **Natural Language ToolKit** (NLTK) Python library as the second part. The NTLK will be frequently used during the course. As an exercise to become familiar with Python and the NLTK, participants are encouraged to implement a function to let two NLTK chat-bots converse between each other.

## Day 2

On the second day, lecture 2 introduces the chain rule of conditional probablitity and builds on that to take you from the *Markov property* over *language modeling* to *smoothing techniques*. In class, you will learn how to operate on n-gram frequency tables and work out the conditional probability distributions of bi- and trigrams. This should ensure the students will have a well-grounded foundation of statistical language processing. After a quick overview, the accompanying exercises #2 will demonstrate how to build language models in NLTK and you will be tasked with generating your own models using an advanced smoothing technique.

## Day 3

The third day lecture then focuses on *string processing* and the algorithmic aspects of text mining. It quickly presents an overview of state machines, like regular expressions, PATRICIA tries, or *Minimal Acyclic Deterministic Finite [State] Automata* (MADFA). Particular focus will be payed to string metrics and similarity measures and then a thorough introduction to *Locality Sensitive Hashing* (LSH) will give you your first tool to efficiently cluster millions of documents or implement approximate matching over very large-scale term dictionaries. The 3rd exercises will ask you to apply LSH to a toy problem: spelling correction.

## Day 4

After the "algorithmic day", the lecture of day four finally takes the participants into the realm of "true" machine learning. For starters, ways to calculate syntactic and semantic document similarity are presented, culminating in *Latent Semantic Analysis*. The slides introduce the Naive Bayes classifier as an example to prepare the audience for the *Maximum Entropy classifier* (Multinomial Logistic Regression) as the main topic of this session. To "cool down" a bit, *sentiment analysis* is discussed as a common example case of text classification problems. Finally, the important topic of evaluating set-based classifier performance using Accuracy, F-measure, and MCC Score are discussed in closing. The fourth practical then will walk through a Maximum Entropy sentiment classifier and will ask the audience to improve its performance by choosing the right features and making clever use of all the gained knowledge in the course so far.

## Day 5

On the final day, *probabilistic graphical models* will dominate the lecture. With the slides for day #5, the main differences between Bayesian Networks and Markov Random Fields are introduced. As a repriese of yesterday's text classification topics, *Latent Dirichlet Allocation* is presented as the first (static) graphical model. Then, for the remained of the talk, we move into *dynamic* (temporal) models, revisting the Markov chain to go from *Hidden Markov Models* all the way to *Conditional Random Fields*. To provide a closure for the course geared towards practical text mining, we will discuss some applications for the presented content, such as (semantic) *Word Representations* or *Named Entity Recognition*. The participants are left with the prospect of using the labels assigned to "their" texts using these dynamic models for more advanced tasks. For example, *Relationship Extraction* now becomes feasible by combining the shallow parse output we learned to generate as features for classifiers presented in other sessions of the Advanced Statistics and Data Mining course. The rank-based performance measures of ROC and PR curves are discussed as way to evaluate the labeled results. During the last practical, implementing a **shallow parser** using NLTK and the Stanford Taggers is demonstrated in class, which should enable the students to start off on their own text mining adventures right after the course.

## Resume

Apart from next year's ASDM (hopefully!), you can naturally alway contact me if you would like to discuss the possiblitiy of having this tutorial in the context of your own venue. Otherwise, I hope to have wetted your interest in coming to this course next year and/or have provided you with useful material either as a student or a teacher!