A sober perspective on Deep Learning

This is a short tutorial to compare the outcome of applying Deep Learning techniques to a text classification problem, using word embeddings and a convolutional neural network (CNN), via Keras (with Theano, for simplicity - but any Keras backend will do), GloVe embeddings, and a SciKit-Learn dataset. The original tutorial is taken from very useful blog about "doing" Deep Learning with Keras (in Python) and you might find different details in the original tutorial post that I will highlight. Our goal will be to beat the state-of-the-art (SOTA) with Deep Learning (spoiler alert: we will not beat it), to better understand the limits and considerations you need to take into acoount with this machine learning technique.

We will work with a CNN to try and beat the best published results from models other than neural networks on the 20 Newsgroup dataset (and will discuss Reuters-21578 a bit, too). If you like, particularly the 20 Newsgroups corpus is the equivalent of "MINST" for computer vision, but for text mining. However, one caveat applies: The datasets are relatively small for what Deep Learning "needs" (and this has a huge impact on the final results), and for the Reuters dataset, most of the categories indeed are far too small to apply any serious Deep Learning techniques. Therefore, we will only look into the 20 Newsgroups dataset here.

For those two corpora, the best (non-neural) baselines achieve around 90% accuracy on the 20 Newsgroups corpus, using the official (roughly 3:2) split. Even the off-the-shelf SciKit-Learn setup achieves around 85% accuracy on this set (using a Linear SVM). For Reuters-21578, the state-of-the-art ("ante Deep Learning") was 94% micro-averaged $F_1$ Score using the official ModApte split, but only selecting documents among the ten most frequent categories (200 or more documents), and a micro-averaged 89% $F_1$ Score is using all 90 categories. And, obviously, it is important that the evaluation follows a single multilabel classification setting, not 10 or 90 individual binary classification problems... Most Deep Learning literature only focuses on the simpler 10-categories subset of Reuter-21578, and/or treats it as 10 distinct binary classification problems, because otherwise there are too few examples to work with. Yet the "true" SOTA baseline you should keep in mind for that set when reading a paper oon Deep Learning using it, is the 94% micro-averaged $F_1$ score on the mulit-label task with the top 10 categories; Anything else is just "cheating yourself".

Installation and setup

We will be using Keras "on Theano" (or CNTK or TensorFlow, as you prefer [1]) to understand how to build a simple classifier with a neural network - that short-handedly beats all results you've seen so far.

[1] Theano is probably well suited for teaching/learning networks, while Microsoft's CNTK is probably best suited for langauge modeling (of those three choices!) and Google's TensorFlow is certainly the best all-rounder and probably the most popular. If you already have something else than Theano (the default, AFAIK) set up with Keras, go with that.

Installing Keras is simple:

conda install keras
# or:
pip3 install keras

Similarly, installing Theano is easy, too, although getting it to work with your Nvidia GPU - assuming you have one in your laptop in the first place - might be more fidegty (see Theano's instructions about libgpuarray if you are not using conda, which does that for you via pygpu) - but still a lot simpler than with most other neural network libraries (if they are not supported by conda...)

conda install theano # no Nvidia GPU
conda install theano pygpu # with Nvidia GPU support
# or:
pip install Theano[doc]
In [1]:
import keras
# only to ensure its installed:
import theano # or your favorite choice
Using Theano backend.

Word embeddings

Instead of properly building your own word embeddings, we will take a "shortcut" and use a precomputed set of (single) word embeddings, GloVe, which is hosted and distributed by Stanford. (Yes, so our results will be sub-SOTA-par, but this is a blog post, not an attempt to beat the SOTA in a peer reviewed journal...) We'll "pretend" that we are just trying to get a quick prototype running to see if our idea (multi-label classification of documents with convolutional networks) works. Once we've ensured it does, feel free to "scale" it to larger and/or more specific embeddings [1] and/or more complex networks (we'll use just a 1D conv net here).

[1] Generally, if you have the time and resources to build your own embeddings, you will always be better off with embeddings from the most representative documents for your target domain. This is particularly true for capturing the right embeddings of named entities and idioms that are highly specific for your specific domain, while the semantics of collocations go well beyond that of single words.

If you are more of a fan of the FastText model, language-specific 300 dimensional word embeddings for a great many languages were made available by Facebook's research department recently.

For now, we'll just stick to the smallest possible/simplest set (in the hopes that our laptops can cope with the data): glove.6B.zip - word embeddings created from WikiPedia. And, we will only use the 50 dimensional embeddings (again, in the hopes that this works on your local laptop; if you have a GPU, use 100 dimensional set for a [little] performance gain - while the 300d set contributes no gains over 100d). Note the download is almost 1GB ("As homework" you can experiment with larger GloVe collections [42B, 840B] to see if that helps improve the final performance.)

In [2]:
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip glove.6B.zip
In [8]:
EMBEDDING_DIM = 50 # use 100 on a GPU, or to get max. performance
In [9]:
%pylab inline --no-import-all
embeddings_index = {}

with open(WORD_VECTOR_FILE) as stream:
    for line in stream:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors with dim=%s.' % (
Populating the interactive namespace from numpy and matplotlib
Found 400000 word vectors with dim=100.

I.e., we now have loaded 400,000 word vector representations.

Corpus setup

Here, we shall differ from the blog post in a positive sense: We will keep using the headers, which contain the subject, and can often give us critcal hints.

In [10]:
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups()
test = fetch_20newsgroups(subset='test')

print("------------ TRAIN ------------")
print("\nLABEL:", train.target_names[train.target[0]],
      "=", train.target[0])
print("\n\n------------ TEST ------------")
print("\nLABEL:", test.target_names[test.target[-1]],
      "=", test.target[-1])
------------ TRAIN ------------
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

- IL
   ---- brought to you by your neighborhood Lerxst ----

LABEL: rec.autos = 7

------------ TEST ------------
From: adamsj@gtewd.mtv.gtegsc.com
Subject: Re: Homosexuality issues in Christianity
Reply-To: adamsj@gtewd.mtv.gtegsc.com
Organization: GTE Govt. Systems, Electronics Def. Div.
Lines: 18

In article , revdak@netcom.com (D. Andrew Kille) writes:
> Of course the whole issue is one of discernment.  It may be that Satan
> is trying to convince us that we know more than God.  Or it may be that
> God is trying (as God did with Peter) to teach us something we don't
> know- that "God shows no partiality, but in every nation anyone who fears
> him and does what is right is acceptable to him." (Acts 10:34-35).
> revdak@netcom.com

Fine, but one of the points of this entire discussion is that "we"
(conservative, reformed christians - this could start an argument...
But isn't this idea that homosexuality is ok fairly "new" [this
century] ? Is there any support for this being a viable viewpoint
before this century? I don't know.) don't believe that homosexuality
is "acceptable to Him". So your scripture quotation doesn't work for

-jeff adams-

LABEL: soc.religion.christian = 15

However, the GloVe word vectors don't come with "apostrophe forms" and instead expand those contractions to full words; Here, we will do the same (removing them, thereby once more differing from the orignal Keras blog post).

In [11]:
import re

lexicon = (
    (re.compile(r"\bdon't\b"), "do not"),
    (re.compile(r"\bit's\b"), "it is"),
    (re.compile(r"\bi'm\b"), "i am"),
    (re.compile(r"\bi've\b"), "i have"),
    (re.compile(r"\bcan't\b"), "cannot"),
    (re.compile(r"\bdoesn't\b"), "does not"),
    (re.compile(r"\bthat's\b"), "that is"),
    (re.compile(r"\bdidn't\b"), "did not"),
    (re.compile(r"\bi'd\b"), "i would"),
    (re.compile(r"\byou're\b"), "you are"),
    (re.compile(r"\bisn't\b"), "is not"),
    (re.compile(r"\bi'll\b"), "i will"),
    (re.compile(r"\bthere's\b"), "there is"),
    (re.compile(r"\bwon't\b"), "will not"),
    (re.compile(r"\bwoudn't\b"), "would not"),
    (re.compile(r"\bhe's\b"), "he is"),
    (re.compile(r"\bthey're\b"), "they are"),
    (re.compile(r"\bwe're\b"), "we are"),
    (re.compile(r"\blet's\b"), "let us"),
    (re.compile(r"\bhaven't\b"), "have not"),
    (re.compile(r"\bwhat's\b"), "what is"),
    (re.compile(r"\baren't\b"), "are not"),
    (re.compile(r"\bwasn't\b"), "was not"),
    (re.compile(r"\bwouldn't\b"), "would not"),

def fix_apostrophes(text):
    text = text.lower()
    for pattern, replacement in lexicon:
        text = pattern.sub(replacement, text)

    return text

text_train = list(map(fix_apostrophes, train.data))
text_test = list(map(fix_apostrophes, test.data))

Data preparation

Keras comes with its own test preprocessing facilities (very much like Gensim's, by the way, but not quite as powerful).

In [12]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Limit extraction to the words found in the training set (only [2]!), selecting the NUM_UNIQ_WORDS most frequent tokens as feature candidates (see padding below) only; It turns out, we can live with less than what the blog post uses and still get nearly the "same" results, and, as already descibed, we will remove the single quote apostrophe character ('):

In [13]:

tokenizer = Tokenizer(
    lower=False, # use True if you don't fix_apostrophes
    # Keras' default filters don't remove the single quote
    # apostrophe (') - filter it, as GloVe doesn't know it 

[2] About that above remark regarding "only", and the next step: The original tutorial makes a rather typical mistake - it fits the word extraction on the test data, too. Therefore, the blog post's preprocessing takes its own test data into account.

The test data is only there to apply and evaluate your model, not to develop it: "In real life", you don't actually get a chance to tune your setup against the test data. Therefore, such errors lead to overley optimistic evaluation results of classifiers and machine learning models (and possibly "irreproducible results" for your fellow researchers). As a sad word of warning, a far too big proportion of peer-reviewed research contains such trivial, but mission-critical errors, and for "papers" on arXiv and similar sites, the only right assumption is that the evaluation results presented are probably wrong, unless you can prove (yourself) otherwise.

In any case, you should never "fit" or "tune" anything in your (preprocessing or not) pipeline using test data, as you are guaranteed to get an overly optimistic result (that will not hold against truly "unseen" data, because you now have overfitted your model). At least if you are building a real-life, "production" classifier, this single advice will be probably the most imporant thing you need to keep in mind.

In [14]:
# only fit on training data!
print('Found %s unique tokens.' % len(tokenizer.word_index))
# so that texts_to_sequences will only be using
# the NUM_UNIQ_WORDS most frequent *training* words!

# generate "word index" vectors from both train and test
# (using only the NUM_UNIQ_WORDS most frequent ones)
seq_train = tokenizer.texts_to_sequences(text_train)
seq_test = tokenizer.texts_to_sequences(text_test)
Found 126595 unique tokens.

Note that our sequences are now integers, where each integer is an index (from tokenizer.word_index), in the order in which the tokens appeared in the document, and only for the "selected" (NUM_UNIQ_WORDS) tokens.

Next, we chop our sequences to equally sized vectors of MAX_SEQ_LEN, thereby generating the actual input "document vector" for our model. Unlike the blog post, though, we will take the first MAX_SEQ_LEN words (by setting truncating='post'), not the last. That is, we will be using at most the first MAX_SEQ_LEN words of each document, and each element in the vector will be an index for that word at the given position:

In [15]:
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

MAX_SEQ_LEN = 1000

data_train = pad_sequences(seq_train, maxlen=MAX_SEQ_LEN, truncating='post')
data_test = pad_sequences(seq_test, maxlen=MAX_SEQ_LEN, truncating='post')

labels_train = to_categorical(np.asarray(train.target))
labels_test = to_categorical(np.asarray(test.target))
print('Size of training set:', len(train.data))
print('Shape of training data tensor:', data_train.shape)
print('Shape of taraining label tensor:', labels_train.shape)
print('\nSize of test set:', len(test.data))
print('Shape of test data tensor:', data_test.shape)
print('Shape of test label tensor:', labels_test.shape)
Size of training set: 11314
Shape of training data tensor: (11314, 1000)
Shape of taraining label tensor: (11314, 20)

Size of test set: 7532
Shape of test data tensor: (7532, 1000)
Shape of test label tensor: (7532, 20)

We have 11,311 training examples/documents for training, characterized as a 1000-dimensional word count vecotor, and a 20-dimensional label vector (20 Newsgroups...) for each example/document. And we have 7532 test examples/documents for evaluating our approach.

Note that here, too, we significantly deviate from the blog post. Instead of using the official ~3:~2 split (for each 3 training articles, you leave aside 2 test articles, roughly) on the 20 Newsgroups corpus, the post uses an easier 4:1 random split. Between the preprocessing issues and the non-standard split, these two issues alone explain why the blog post achieves such a surprisingly good results with such a simple architecture, with way higher performance scores than anything previously seen in the research literature.

In [16]:
print(data_train[0][:10], "...", data_train[0][-10:])
[0 0 0 0 0 0 0 0 0 0] ... [ 113  186  203 1438 1327    2   14   37   58 7828]

What you see above are word indexes - and/or leading zeros, if the document didn't contain enough words.

Finally, we rename the dataset to follow the same nomencalture as used in the blog post (instead of using the random 1:4 split).

In [17]:
x_train = data_train
y_train = labels_train
x_val = data_test
y_val = labels_test

Building the model

This is probably the "technically" most interesting part - you will see just how incredibly easy it is to transform our word count vectors into proper word embedding vectors and plug that into a neural network with Keras. This is really the part where Keras shines - unlike "low-level" APIs provided directly by the framework, Keras makes building standard (and not so standard...) networks really easy.

First, we generate the weight matrix for the connections between the input (the padded "document vector" sequences) and the embedding layer. Those weights therefore will be the GloVe word embedding vectors, one for each of the NUM_UNIQ_WORDS possible words we have.

In [18]:
num_uniq_input_words = min(NUM_UNIQ_WORDS, len(tokenizer.word_index))
embedding_matrix = np.zeros((num_uniq_input_words, EMBEDDING_DIM))

for word, i in tokenizer.word_index.items():
    if i == NUM_UNIQ_WORDS:
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    #    # words not found in the index use all-zero vectors
    #    print("not in index:", word)

Next, we need a bunch of "components" used to build our neural network:

In [31]:
from keras.layers import Dense, Dropout, Flatten, Input
from keras.layers import Conv1D, MaxPooling1D, Embedding

The embedding layer; Note that the tutorial set trainable to False, to avoid that the embeddings get change; However, there is no conceivable reason to do that, and in fact, performance suffers if held constant. So this layer will expand our one-dimensional MAX_SEQ_LEN "document word index vectors" into MAX_SEQ_LEN times EMBEDDING_DIM matrices, replacing the index values with the appropriate GloVe word embedding vector, hence it is a a simple "vector lookup" this layer is doing.

In [49]:
embedding_layer = Embedding(
    trainable=False) # no need to keep embeddings fixed (as in the blog post)!
# use True if you have a GPU, False if not or to get max. performance
# note that with False, your final accuracy will be 5-10% lower

Plugging our input tensor shape ("layer") and the embeddings lookup layer together:

In [41]:
sequence_input = Input(shape=(MAX_SEQ_LEN,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

So that's it - at this point, you've seen how easy it is to "transform" a collection of words in a document into a semantically meaningful tensor ready for Deep Learning.

Next, we shall set up a conv net with three 5 word-window-sized, 1-dimensional (documents are "linear") convolutions, each followed by a max-pooling regularization (with a factor of 2, 5, and 35 max-pooling of our words). Why? Because you are an expert, or read tons of literature to find the "best" architecture, or (here) are simply following a blog post... In essence, these max-pooled convolutions "compress" the MAX_SEQ_LEN document vector into one one single number (times the EMBEDDING_DIM).

The WildML blog has excellent posts explaining the basics of text classiciation with conv nets and an example conv net for text classification. Note that the posts assumes you have a 2-dimenstional word-sentence document matrix as input, while we are using a 1-dimensional document vector (and hence use Conv1D, not Conv2D).

With that in mind, we make a few minor tweaks to the blog post:

  • Instead of using fixed, 128 dimensional outputs, we stick with our embedding dimension.
  • We remove the final ReLU layer and add a Dropout layer to get stronger regularization (which will allows us to keep training for more epochs, getting to a higher accuracy - but also training longer...).

And after the convolutions, we "roll out" (aka. flatten) the convolutions into a single-dimensional "document vector".

(Tip: for better -but far more complex- architectures look at the Conclusions...)

In [50]:
x = embedded_sequences

for layer in [
    Conv1D(EMBEDDING_DIM, 5, activation='relu'),
    Conv1D(EMBEDDING_DIM, 5, activation='relu'),
    Conv1D(EMBEDDING_DIM, 5, activation='relu'),
    Dropout(0.5), # very critical to tune this hyper-parameter! (.25 - .5)
    x = layer(x)
          "input:", layer.input_shape,
          "- output:", layer.output_shape)
conv1d_13 input: (None, 1000, 100) - output: (None, 996, 100)
max_pooling1d_13 input: (None, 996, 100) - output: (None, 199, 100)
conv1d_14 input: (None, 199, 100) - output: (None, 195, 100)
max_pooling1d_14 input: (None, 195, 100) - output: (None, 39, 100)
conv1d_15 input: (None, 39, 100) - output: (None, 35, 100)
dropout_5 input: (None, 35, 100) - output: (None, 35, 100)
max_pooling1d_15 input: (None, 35, 100) - output: (None, 1, 100)
flatten_5 input: (None, 1, 100) - output: (None, 100)

Note that to get to the 1 x EMBEDDING_DIM final size, the MAX_SEQ_LEN needs to be 1000 (so you might need/want to fiddle with the parameters of the max-pooled convolutions if you change MAX_SEQ_LEN). For MAX_SEQ_LEN = 1000, the final output tensor from the convolusions is 1 x EMBEDDING_DIM due to:

In [43]:
# for conv layers, subtract kernel size minus one to get the output size
# for max pool layers, divide by the pool size

On the final output - a vector with EMBEDDING_DIM numbers - we apply a softmax transformation down to the number of category lables, thereby giving us the propabilities for each category:

In [44]:
n_cats = len(train.target_names)
preds = Dense(n_cats, activation='softmax')(x)

Finally, we place the layers into a proper Keras model, using multi-class cross-entropy training handled by an RMS-prop gradient descent optimizer. Here, typically, I get asked: Why am I not using Adam? Because of Occams razor: Using Adam instead of RMS-prop makes hardly any difference on this example. And we ask Keras to report the current (training and validation) accuracy at each epoch.

In [45]:
from keras.models import Model

model = Model(sequence_input, preds)
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 1000)              0         
embedding_3 (Embedding)      (None, 1000, 100)         1000000   
conv1d_10 (Conv1D)           (None, 996, 100)          50100     
max_pooling1d_10 (MaxPooling (None, 199, 100)          0         
conv1d_11 (Conv1D)           (None, 195, 100)          50100     
max_pooling1d_11 (MaxPooling (None, 39, 100)           0         
conv1d_12 (Conv1D)           (None, 35, 100)           50100     
dropout_4 (Dropout)          (None, 35, 100)           0         
max_pooling1d_12 (MaxPooling (None, 1, 100)            0         
flatten_4 (Flatten)          (None, 100)               0         
dense_2 (Dense)              (None, 20)                2020      
Total params: 1,152,320
Trainable params: 1,152,320
Non-trainable params: 0

Model training

Now, we train... Note that we here actually commit another crime: We evaluate the model on the test data! Instead, if done properly, you should be evaluating on a subset of the training data, and only once your entire model is build, evaluate it on the official test data. But that would leave our model with even less training data to work with... At the end of the day, we can probably assume that no researcher gets stuff published without that "cheat" [1], we will just do the same: Evaluate training progress directly against (aka. "by overfitting the model on") the test data.

[1] And that explains why community evaluations exist: To tell you the "real truth"! Because only then nobody gets access to the test data before the final evaluation. That is, community evaluations are like (IMO: far) more serious versions of the now popular Kaggle tasks.

WARNING: This step can take a very long time unless you have a (or more...) GPU[s] (just see how long you wait for the next Epoch and multiply by n. epochs to estimate the overall runtime). Using 100 (or 300) dimensional embeddings and training them makes the epochs take much longer.

In [46]:
model.fit(x_train, y_train,
          validation_data=(x_val, y_val))
Train on 11314 samples, validate on 7532 samples
Epoch 1/20
11314/11314 [==============================] - 175s - loss: 2.6601 - acc: 0.1409 - val_loss: 2.3280 - val_acc: 0.2411
Epoch 2/20
11314/11314 [==============================] - 177s - loss: 1.9916 - acc: 0.3037 - val_loss: 1.8654 - val_acc: 0.3828
Epoch 3/20
11314/11314 [==============================] - 181s - loss: 1.5137 - acc: 0.4449 - val_loss: 1.4921 - val_acc: 0.5133
Epoch 4/20
11314/11314 [==============================] - 182s - loss: 1.1167 - acc: 0.6070 - val_loss: 1.3270 - val_acc: 0.5593
Epoch 5/20
11314/11314 [==============================] - 180s - loss: 0.8647 - acc: 0.7036 - val_loss: 1.1327 - val_acc: 0.6518
Epoch 6/20
11314/11314 [==============================] - 180s - loss: 0.6720 - acc: 0.7757 - val_loss: 1.0200 - val_acc: 0.6887
Epoch 7/20
11314/11314 [==============================] - 181s - loss: 0.5374 - acc: 0.8269 - val_loss: 0.8951 - val_acc: 0.7260
Epoch 8/20
11314/11314 [==============================] - 181s - loss: 0.4317 - acc: 0.8632 - val_loss: 0.8843 - val_acc: 0.7339
Epoch 9/20
11314/11314 [==============================] - 185s - loss: 0.3440 - acc: 0.8906 - val_loss: 0.8010 - val_acc: 0.7600
Epoch 10/20
11314/11314 [==============================] - 184s - loss: 0.2688 - acc: 0.9156 - val_loss: 0.7563 - val_acc: 0.7666
Epoch 11/20
11314/11314 [==============================] - 178s - loss: 0.2184 - acc: 0.9334 - val_loss: 0.7849 - val_acc: 0.7689
Epoch 12/20
11314/11314 [==============================] - 180s - loss: 0.1740 - acc: 0.9473 - val_loss: 0.8139 - val_acc: 0.7604
Epoch 13/20
11314/11314 [==============================] - 180s - loss: 0.1367 - acc: 0.9570 - val_loss: 0.9951 - val_acc: 0.7179
Epoch 14/20
11314/11314 [==============================] - 182s - loss: 0.1044 - acc: 0.9683 - val_loss: 0.7785 - val_acc: 0.7900
Epoch 15/20
11314/11314 [==============================] - 180s - loss: 0.0877 - acc: 0.9725 - val_loss: 0.8641 - val_acc: 0.7828
Epoch 16/20
11314/11314 [==============================] - 184s - loss: 0.0769 - acc: 0.9775 - val_loss: 0.9341 - val_acc: 0.7637
Epoch 17/20
11314/11314 [==============================] - 188s - loss: 0.0551 - acc: 0.9847 - val_loss: 0.8329 - val_acc: 0.7963
Epoch 18/20
11314/11314 [==============================] - 183s - loss: 0.0521 - acc: 0.9860 - val_loss: 0.8523 - val_acc: 0.7997
Epoch 19/20
11314/11314 [==============================] - 178s - loss: 0.0590 - acc: 0.9861 - val_loss: 1.0596 - val_acc: 0.7645
Epoch 20/20
11314/11314 [==============================] - 183s - loss: 0.0479 - acc: 0.9883 - val_loss: 0.9919 - val_acc: 0.7821
CPU times: user 1h 32min 3s, sys: 6min 59s, total: 1h 39min 2s
Wall time: 1h 43s

Geat! So we've built a neural network that learns to classify documents. Yet, as we can see, it takes ages to train (and even with a GPU: a lot more time) compared to all former models. Worse, it does not achieve the accuracy of the best off-the shelf models from SciKit-Learn. The above model can be made to achieve around 80% max. accuracy if you can use 300- or 100-dimensional word-embeddings and train the model long enough (15-20 epochs). With 50d embeddings and no embedding layer training, you need to run for about 20-30 epochs to converge on around 67% accuracy. So these results are a long shot from the 90% SOTA accuracy that is possible on this dataset and even the 85% we can achieve with the very ad-hoc "blitz-classification-experiment" from SciKit-Learn's own tutorial.


Overall, I wrote this "tutorial" mostly to demonstrate that you should probably focus on simple things first, before you dive head-first into Deep Learning:

  • Learn your own embeddings (ideally from a text collection matching your target domain), and make sure to add those mission-critical collocations that Mikolov points out in his famous word2vec NIPS paper (pro-tip: all of which is trivial when working with Gensim).
  • It is cool that we now can replace old-school TF-IDF vectors with modern-day neural word embeddings; But do use them in an "old school" ML model first, because its simple and fast; If for nothing else than to make sure you need or even can expect more performance from a Deep Learning classifier at an econmically viable expense.

And so, the challenge remains: How to actually beat the state-of-the-art in text classification with Deep Learning? At the very least: "Its tricky"! Very well-designed conv nets, and rather recent research on belief networks only in the most recent years managed to achieve the same ballpark results as the state-of-the-art results with "old school" ML models. But most of even the current Deep Learning research does not actually beat those "old" models on these two datasets. (Which is not to say that stuff like VAEs are very cool though - and probalby would unfold their "full beauty" if you had a much larger dataset - see next.) And probably by using LSTMs, GRU-RNNs, and (IMO: in particular) CharCNNs, to train sequence models might get you even beyond the state-of-the-art - if you have the resources and the data to even think of that, and an extensive amount of time to develop your specialized classifier/system. (And the resources to run the inference on that mega-model in production, too, by the way...)

That is to say, yes, Deep Learning can claim it beats standard ML methods on this task (text classification), but the effort to do so is highly disproptionate if compared to "traditional" ML methods: You need very large datasets, designing the model is incredibly complex and time consuming, and training and using the setup is several orders of magnitude more costly.

Take-home message

In my opinion, the Deep Learning literature is littered with evaluation results that claim to beat all former state-of-the-art, but indeed are quite frequently not much better (or ex-aequo, and often even worse). However, computer vision, machine translation, and dependency parsing being the now famous cases where Deep Learning indeed has "pushed the envelope" by a substantial margin on the same, public (and often, small) community datasets for evaluating the approach and comparing it to existing methods. And nearly no paper at all discusses how much more resources go into setting up, developing, training, and using Deep Learning models as opposed to traditional Machine Learning.

That being said, many other applications (apart from CV, MT, and DP) can profit from Deep Learning for the following reason: Iff you have much more training data (thousands, or even millions of examples per label), then, because Deep Learning can easily be scaled to work on such gigantic datasets, it indeed beats other methods (Support Vector Machines, Random Forrests, Nearest Neighbours, Gradient Boosting, etc.).

At the end of the day, I think Deep Learning is a very exciting technology you should learn to master, but you should take much of it with a very large grain of salt due to how much time and money you will need to invest.