A review of sparse sequence taggers


tl;dr Right now, use Wapiti unless you want to go beyond first-order and/or linear models, need the fastest possible training cycles, or are a Scala programmer, in which case you would be best advised to choose Factorie. OK, so that's that for a stressed-out generation; read on if you want to know why I recommend those two tools.

Overview: The goal of this review is to identify the "best" generic CRF- or MEMM-based sequence tagger software with a free (MIT/BSD-like) license. We will only take discriminative models into account, so if you are after generative models (e.g., HMMs) and/or non-sparse data, you have come to the wrong place. This article will look into the tools' abilities to define and generate features, their training times, tagging throughput, and tagging performance by working through a common sequence labeling problem: tagging the parts-of-speech of natural language. While PoS tagging can be considered a "solved" problem, PoS tagging performance differences are still a source of academic controversy and therefore make an ideal testing ground.

Nearly any interesting natural language processing (NLP) task starts with word tagging, that is, resolving each word's grammatical sense: its part-of-speech (PoS), the phrases the words group into, and the semantic meaning the words carry. PoS refers to a word's morphology, e.g., whether it is used as a noun or an adjective, or the inflection of a verb. Words of the same morphology can often be grouped (we say, chunked) into phrases, such as "a noun phrase". As for the semantic meaning, text miners are usually interested in identifying the relevant entity class/type (such as person, location, date/time, ...) the word refers to. Furthermore, in NLP, words are commonly called tokens to use a name that also covers symbols and numbers, including dots, brackets, or commas. These tokens form the atomic sequence units for many statistical NLP methods.

Similarly, in bioinformatics, you might want to identify properties of biological sequences, e.g., DNA binding sites, or predict locations of post-translational modifications in proteins. In general, any sequence tagger could be used to classify elements in any kind of sequence you can split into discrete units. Therefore, in bioinformatics, those units might be nucleotides (DNA bases) or amino acids (proteins). However, the devil lies in the details: The implementations I am interested in here are for "information-sparse sequences", such as text. The difference is the element (and resulting feature) sparsity: while there are easily 20,000 different tokens contained in any average text collection (such as a book), there are only four DNA bases and twenty-something amino acids (depending on the species). All amino acids and bases will be - compared to text tokens, at least - frequently used in their respective sequences, with the obviously lowest sparsity for the four standard DNA bases (long story hidden here...). So to define some scope for this review, I am interested in learning patterns from extremely sparse data; if your data is "dense", you might be better off looking into more general algorithms, such as hidden Markov models (HMMs).

You can play around with a few example NLP taggers by following the tagging link, and you will see that depending on the system and its training data, results can vary widely. This is due to implementation details, graphical model capabilities, and the sequence features used by the particular model instance. The common approach to this kind of problem is learning a discriminative, dynamic graphical model; commonly, a maximum entropy (MaxEnt, aka logistic regression) based decision function worked into a reverse Markov model or into the more complex Markov random field, in this particular case called a Conditional Random Field (CRF). In the former MaxEnt Markov model (MEMM) scenario, you are only allowed to use the current state (element in your sequence, including any meta-data assigned to that element) and the last tag(s) to predict the current tag (aka label). The latter CRF model allows you to integrate features not just from the current state, but from any state in the sequence being labeled. So with CRFs you can even use states from the "future" (i.e., elements later in the sequence than the one currently being tagged) to predict the current label (aka tag).

While outside the scope of this article, in case you are now asking yourself "Why do I then even want a MEMM instead of a CRF?": Defenders of MEMMs claim that "their" model has an edge because it does not tend to overfit the data (we say, it has a weaker "domain adaptation") as easily as a CRF can (due to the way features can be generated from any part of the sequence) and therefore produces better tagging results on input that is not similar to (of a "different domain" than) the training data. Second, due to its feature selection limitation, training of a MEMM tends to be faster than training a CRF with "long-range" features from distant positions in the sequence.

Both regular Markov models and random fields use the notion of Markov order. That is, the number of former (already assigned) labels in a linear chain (sequence) that may be used to calculate the probabilities of each possible label on the current state. To be more precise, it is the transition probability from one (first order) or more (second, third, ... order) former labels to the label on the current state that is the statistic being modeled. This "nth-order Markov" limit defines how many prior tags the model will consider when tagging the current state: For first order Markov chains you only can make use of the last tag, for second order models you get to use the last two tags, and so forth. Due to the expense of going beyond first-order models, all but one tool we will be looking at do not support more than first-order models: Linear chain models scale exponentially in the Markov order + 1, meaning that a second order model already has "number of labels"-cubed possible label transitions. Going beyond second-order Markov models is only desirable if the number of labels and (as we will see) sometimes even states is small (i.e., "dense data"). In language processing, these states normally are all the unique, observed tokens, also known as the vocabulary.
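To make the exponential growth in the Markov order concrete, here is a minimal Python sketch. The function name is my own, and the tag-set size of 45 is an assumption (roughly the size of the Penn Treebank PoS inventory), not a number taken from any of the tools.

```python
def transition_count(num_labels: int, markov_order: int) -> int:
    """Number of distinct label n-grams a linear chain of the given order scores.

    A first-order model scores label pairs, a second-order model label
    triples, and so on: num_labels ** (markov_order + 1).
    """
    return num_labels ** (markov_order + 1)


# 45 labels is an assumed tag-set size, used for illustration only.
for order in (1, 2, 3):
    print(f"order {order}: {transition_count(45, order):,} label transitions")
```

With a small label set, as in "dense" data, the same formula stays manageable for much higher orders, which is why higher-order models are mostly seen there.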

A quick, shameless self-plug: if you are not familiar with these concepts, have a look at my introduction to text mining slides - or come visit my class in the context of the Madrid summer school on Advanced Statistics and Data Mining next summer (beginning of July)! The lecture will take you from basic Bayesian statistics all the way to the dynamic, graphical models being discussed here.

Sequence tagger selection

Recently, I have become anxious to once and for all resolve my doubts about the "best" sparse sequence tagger in terms of ease of use (documentation/UI), feature modeling capabilities, training times, tagging throughput (tokens/second) and the resulting accuracy. All tools use the same optimization procedures for the learning process, that is, an L-BFGS optimizer and a few light-weight gradient descent implementations as alternatives. However, implementation details, the graphical model abilities, the features the system can work with, the facilities it provides to generate them, the system's throughput, the provided documentation, and its availability (both in open source and free software terms) vary greatly between libraries. The available software for this task that I considered was CRF++ (Kudo), CRF Suite (Okazaki), Factorie (McCallum &al.), FlexCRFs (Phan, Nguyen & Nguyen), LingPipe (Carpenter &al.), MALLET (McCallum &al.), MEGAM (Daume III), [Apache] OpenNLP (Kottmann &al.), Stanford Tagger (Manning, Jurafsky & Liang), SVM Tool (Giménez & Marquez), TreeTagger (Schmid), and Wapiti (Lavergne). There are more options around, particularly in C# and for the .Net platform, but as I do not have the money to pay for the Windows tax, I did not consider them. If you know of a relevant, generic sparse sequence tagger implementation I missed (see my filtering criteria below), please contact me.

I immediately discarded the Stanford Tagger and the SVM Tool, because they are both orders of magnitude slower than most other tools considered (and the same goes for MALLET, too). It is worth mentioning that the Stanford Tagger was one of the earliest tools with the software made available for research and a very high accuracy, and as such usually serves as the performance "baseline" for newcomers. Second, CRF Suite claims to be the fastest first-order CRF around and demonstrates that CRF++ is significantly slower, which led me to discard the latter. That same benchmark claims that CRF Suite is faster than Wapiti, but not only has Lavergne developed several newer versions since then, the difference is also far less pronounced, so that tool was not out of the race for me. Being a free software advocate in the sense of all its aspects - cost, freedom of usage, modifiability, and open source - I feel very uncomfortable about using software with a license that tries to restrict my freedom, including its commercial application. Therefore, I discarded FlexCRFs, LingPipe, MEGAM, and TreeTagger from the list because of their non-free nature (only being "free for research", or GPL'ed). While the GPL is not strictly out of my scope, it creates too many headaches for too many use-cases because it still poses usage restrictions (that I nonetheless support as a necessary evil given the overall copyright SNAFU). Moreover, excluding the GPL only affects FlexCRFs, which is very similar to CRF++ or CRF Suite anyway. Two of the already discarded tools would also not make it across this "free software barrier", by the way (Stanford Tagger and SVM Tool).

So this left me with CRF Suite, Factorie, OpenNLP, and Wapiti to compare against each other. Given these harsh pre-filtering criteria, to be honest, I was astonished that I was left with not just one, but four viable and completely free "tools of the trade"!

Implementation considerations

CRF Suite and Wapiti are both written in C, while Factorie is coded in Scala, and OpenNLP is based on Java. So this makes for yet another classical "binary, platform-specific code versus the Java Virtual Machine" comparison! (Spoiler alert: it does not matter - as you should know already...) But there are real, noteworthy differences between the taggers, starting with the implemented graphical models and optimization procedures: Factorie's PoS tagger implementation only makes use of a forward learning procedure, while the other three use the more common (and more expensive) forward-backward optimization approach during training. So this difference makes for an interesting test: will forward learning alone be good enough in terms of accuracy, and if so, how much faster will training be? (You could also read a paper about this...) Furthermore, Factorie is a library that allows you to design any kind of graphical model from basic factor classes (let that sink in if you know what I mean...), so it actually can represent any model you want (including non-linear models). For the PoS tagging, Factorie uses this forward-only learning approach to maximize a first-order linear-chain CRF. Next up, Wapiti allows you to choose between a (non-dynamic, pure) MaxEnt model, a MEMM, or a CRF model. Similar to the paper linked above, Wapiti can even do dynamic model selection, falling back on simpler models where feasible in the sequence. (Note that Factorie's PoS tagger does not use dynamic model selection.) Finally, OpenNLP only provides a MEMM implementation, while CRF Suite only provides a CRF. These implementation details alone might be enough to make your decision: If you want more than a first-order linear-chain model (say, second-order, or a non-linear graph), your only choice is Factorie.

Software state and documentation

First, a quick look at the code, implementation, the documentation, and each tool's multi-processor capabilities. Two remarkable things about Wapiti are how simple and lean the interface is, and its capability of running in multi-threaded mode. While the code is the typical, long spaghetti-code of C, it is clean and very well commented. The only real downside is that the user documentation is a bit sparse; everything is in there, but the authors could have done a bit better detailing some of the capabilities and/or providing examples. I had to figure out myself that you always need to use both the -c and -s switches when doing (feature-sparse) text labeling, and it took me some time to understand how to do feature extraction (designing "patterns"). The Wapiti authors do not provide a default set of feature pattern templates, only a few pre-trained PoS models for English, German, and Arabic newswire text.

Similarly, Factorie is able to run in multi-threaded mode and claims to be scalable across machines for hyper-parameter search (which I have not tried). To use Factorie, I often have to refer back to the code-base, because few of the specifics of Factorie are entirely documented for now. This means that to use Factorie, you had better know some Scala, as it might be tough having to go through the code otherwise. In my opinion, some parts of the codebase could have been coded just as well in Java, but that only affects a few regions of the library. One should also note that this opinion is purely subjective and might be of little relevance, so you might be better off judging for yourself. Finally, the documentation certainly assumes you are an expert in graphical models with plenty of background knowledge in that domain. Given its state and direction, in comparison to the other tools here, it is probably safe to judge that this library is targeted at the probabilistic programming crack with a background in graphical models, not someone looking for a quick and dirty sequence tagging solution. The main up-side is that, together with OpenNLP, this is the only library offering a full NLP pipeline (segmentation, tokenization, tagging, and parsing). But again, except for PoS tagging, the other NLP functionality has to be deduced from the code, as there is not much more documentation on the NLP pipeline.

CRF Suite comes with a nice interface, and although it does not support multi-threading, the C code is well written and very clear. The functions are short and precise, so it is a nice example of how C code can look if you put some effort into it. The only code-wise downside I detected was the accompanying Python code for preparing and pruning data and benchmarking. It is clearly written by a C expert with little knowledge of idiomatic Python, no offense (see the performance issues for feature extraction below). However, I think this is a minor issue given the good documentation and high-quality C code, which in the end is the part that matters. The worst issue with CRF Suite, however, seems to be that the original author has stopped maintaining the code. The repository on GitHub has a handful of very good pull requests from serious developers that fix things like a minor memory leak and two or three other issues, but the author has never accepted the requests. Nor have there been any updates to the library. To me this means that CRF Suite development seems dead, and the tool would have to be significantly better than the others here to make it worth using.

Finally, OpenNLP has the expected high quality code found in Apache projects, and it is well documented, too. The only downsides worth mentioning are that there is no built-in parallel processing support and that it only comes with a MaxEnt Markov model. As mentioned already, it is important to realize that OpenNLP offers a full NLP pipeline, unlike CRF Suite and Wapiti.

To summarize, unique capabilities of Wapiti are its built-in template-based feature extraction mechanism and its ability to quickly choose either CRF or MEMM as the target model (more on this below). A unique capability of Factorie is that it provides you with the necessary base classes to quickly code your own graphical models of any Markov order, both linear and non-linear. However, admittedly, both Wapiti and Factorie are behind CRF Suite and OpenNLP in terms of documentation and in my opinion, Factorie is not consistently using idiomatic Scala, probably due to its many different developers. Finally, both OpenNLP and Factorie include a full, documented and undocumented (respectively) NLP pipeline.

Feature modeling

This leads to the next important consideration: modeling features, for example via templates (or "patterns") that define the features used by a model's binary indicator functions. An indicator function for assigning the PoS tag "VBD" might, for instance, be triggered when observing the bigram "I went" and having already assigned the PoS label "PRP" to the token "I":

f(last_tag, current_tag, states, pos) =
    last_tag == "PRP" && current_tag == "VBD" &&
    states[pos-1] == "I" && states[pos] == "went"

This example is called a "combined (bigram) transition/label and state/token feature". (It would be minimally encoded as b:%x[-1,0]/%x[0,0] in Wapiti's pattern template language.) It is a rather "expensive" feature template, as it can easily lead to millions of individual indicator functions: The number of indicator functions created from this template will be the squared number of PoS tags (label transitions or "bigrams") times the number of unique token bigrams in the whole training data. In general, for any template - even a constant one - there will be at least as many indicator functions as there are different tags.
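The counting argument for this template can be sketched in a few lines of Python. The function name and the toy corpus are made up for illustration, and the tag-set size of 45 is an assumption; none of the taggers expose such a function.

```python
def count_indicator_functions(sentences, num_tags):
    """Upper bound on indicator functions for a label-bigram/token-bigram template.

    Every unique token bigram in the data is combined with every possible
    label transition, i.e. num_tags squared.
    """
    token_bigrams = {
        (left, right)
        for sentence in sentences
        for left, right in zip(sentence, sentence[1:])
    }
    return num_tags ** 2 * len(token_bigrams)


# Hypothetical toy corpus with four unique token bigrams.
toy_corpus = [
    ["I", "went", "home"],
    ["I", "went", "out"],
    ["we", "went", "home"],
]
print(count_indicator_functions(toy_corpus, num_tags=45))
```

Even this three-sentence toy corpus yields thousands of candidate indicator functions, so a real training set with hundreds of thousands of unique token bigrams quickly reaches the millions mentioned above.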

Except for Wapiti, all taggers come with a pre-defined set of feature templates for common NLP tasks. Depending on your requirements, this might be either very practical or practically useless, particularly for domain-specific language (tweets, for example) and/or NER tagging. For NER tagging, your entities might have unique morphological or orthographic properties; for example, gene names might be used not just as nouns, but as adjectives, too (as in "p53-activated DNA repair") and contain Roman or Arabic numerals, Greek symbols, non-standard dashes and a few other orthographic surprises. In addition, the entity tag might depend on "knowing the future", such as the upcoming head token of the noun phrase currently being tagged (e.g., the head "gene" in "the ABC transporter gene" when looking at the token "ABC").

The predefined Factorie features are, however, pretty good - so rich indeed, that they are more complete than any other set of features I used or provided myself in the experiments here (see ForwardPosTagger.scala, features, for PoS tagging). That means training could be slow for Factorie, because it needs to optimize over a much larger indicator (feature) function space (it turns out it is not, as we will see). As with all systems except for Wapiti, this means you need to do some coding of your own to adapt the features for NER, while it might or might not be necessary for PoS tagging and phrase chunking tasks. The real issue with Factorie is figuring out how to define your own NER tagger, as the documentation so far only covers PoS taggers. More generally speaking, the documentation on generating features and models for your taggers is rather thin in terms of "applied" examples (see User Guide - Learning).

OpenNLP, too, comes with a pre-selected list of feature templates for standard NLP tasks. To change this list, you either have to write your own Java code or, at least for NER tagging, the documentation states that you can conjure up XML configuration files to extract different features. As I am allergic to any use of XML other than its intended use-case - providing structure to unstructured data - and particularly against the use of XML as a vehicle for configuration files (hello Java/Maven world!), I did not even try this path. In other words, for OpenNLP I will be using the predefined features and have not experimented with the "XML feature configuration" option, so I cannot tell how well it works or how easy it is to use. As for PoS tagging, OpenNLP uses pretty much the de facto standard features (prefixes, suffixes, orthographic features, and a window size of 5 [-2,+2] for the n-grams; see the DefaultPOSContextGenerator class), so that seemed good enough for this test.
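To give an idea of what such "de facto standard" features look like, here is a rough Python sketch of a [-2,+2] token window plus affix and capitalization features. This is not OpenNLP's DefaultPOSContextGenerator; the function name, the feature string format, the padding symbol, and the affix length of 3 are all assumptions made for illustration.

```python
def window_features(tokens, pos, window=2, affix_len=3):
    """Build string features for tokens[pos] from a symmetric token window."""
    features = []
    for offset in range(-window, window + 1):
        i = pos + offset
        # Positions outside the sentence get an (assumed) padding symbol.
        word = tokens[i] if 0 <= i < len(tokens) else "<PAD>"
        features.append(f"w[{offset}]={word}")
    current = tokens[pos]
    features.append(f"prefix={current[:affix_len]}")
    features.append(f"suffix={current[-affix_len:]}")
    features.append(f"capitalized={current[:1].isupper()}")
    return features


print(window_features(["The", "dog", "barked"], 1))
```

Each string would then be paired with the candidate tags by the learner; real implementations typically add several prefix and suffix lengths, digit and hyphen tests, and the previously assigned tags.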

CRF Suite comes with a set of Python scripts to convert simple OWPL files (one word [state] per line, with sentences [sequences] separated by an extra empty line, such as the CoNLL format) into the "per-label feature list files" CRF Suite uses as input. To create different features, you modify the "template" defined inside the relevant Python script. For most cases, I think the predefined templates do a pretty good job at generating features for standard NLP tagging tasks. Additional features uniquely generated by CRF Suite and not OpenNLP or Factorie (for PoS tagging) are quadrigrams, pentagrams, and "long-range interactions". The latter are bigrams created from the current word and a word at position +/- 2 to 9 from the current word. If you commonly work with Python, you might even easily assimilate the Python feature generation process, adapting it to your own needs. CRF Suite's feature handling has an important shortcoming, however: It is impossible to combine "label bigrams" (first-order Markov transition features) with other (state) features from the token stream to form more advanced indicator functions. That is, CRF Suite only models either label transition probabilities or the features from the current state, but does not allow you to create "mixed" indicator functions as described in the beginning of this section (b:%x[-1,0]/%x[0,0]). This is an important conceptual shortcoming, because it is not possible to define features that condition on both the previous label (i.e., the transition) and the current token. However, as opposed to the other tools here (which only work with discrete features), CRF Suite provides support for continuous features. For example, when using word embedding techniques, you might want to directly include the numeric word vectors, which you can simply pass on to CRF Suite.
While the lack of continuous feature support in the other tools is commonly circumvented by discretizing the real-valued features, it is a noteworthy difference.
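The discretization workaround for tools without real-valued feature support can be sketched as follows. The bin edges, the feature naming scheme, and the function name are assumptions chosen for illustration; in practice you would probably derive the edges from the value distribution (e.g., quantiles) of your embeddings.

```python
from bisect import bisect_right


def discretize(vector, edges=(-0.5, 0.0, 0.5), prefix="emb"):
    """Turn a real-valued vector into discrete string features, one per dimension.

    Each value is replaced by the index of the bin it falls into, where the
    bins are delimited by the sorted `edges`.
    """
    return [
        f"{prefix}{dim}=bin{bisect_right(edges, value)}"
        for dim, value in enumerate(vector)
    ]


# A made-up three-dimensional "word vector".
print(discretize([-0.8, 0.1, 0.5]))
```

The resulting strings can be fed to any of the discrete-feature taggers like ordinary token features, at the cost of losing the ordering and distance information the raw numbers carry.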

As for Wapiti, you have to figure out how to generate features for each task on your own; the authors do not provide any predefined templates for "standard" NLP tasks. But Wapiti provides you with a mechanism to define "patterns", much like CRF++'s feature templates. Once you fully understand the mechanism, it is indeed quite powerful, and I found myself missing it in the other tools. In particular, this means that feature extraction is done in C, so it will beat a script-based extraction process, while not requiring any C programming knowledge. To provide an even playing field, at first I defined the same feature "patterns" as CRF Suite does via its Python feature generation scripts. The problem is that Wapiti uses all possible label and state combinations for its initial training matrix, not just the combinations present in the data. In other words, it is the only tool that does no feature space reduction prior to going into training. For example, if you define a pattern such as the current token (u:%x[0,0]), it creates one feature for each label in your training set times the number of unique tokens in your data, no matter if a token is observed with that label in the data or not. So for token n-grams or label bigrams, the training matrix can quickly grow to extraordinary sizes. It is worth noting that you can use your own feature extraction method and just feed Wapiti with extracted features directly (i.e., strings that start with "u" or "b", depending on whether you want to use only the current state or integrate the transition, too). The advantage of the full matrix - I assume, at least - is that the optimizer might decide to transfer some probability mass to those zero observations.
However, it is not clear to me if or how much performance Wapiti gains from such transfers, particularly when contrasting this unique feature with the greatly increased space penalty: While Wapiti does compacting and supports sparse matrices during training, it initially starts off with all features, so training becomes rather sluggish when very large feature spaces are defined. By using the same feature templates as CRF Suite, I ended up with an initial matrix containing a few hundred million features. This simply was too much for my weak dual-core i5 processor to handle in realistic time. In the end, I decided to cut down on the number of PoS feature templates with respect to CRF Suite or Factorie. In particular, I removed the feature templates that only CRF Suite has, thereby reducing the initial setup to 44 million "features" (indicator functions). After compacting, Wapiti's final model contained 1.8 million indicator functions ("feature functions", or worse, sometimes just called "features") for the PoS tagging trials (see below). It should be noted that this reduced set was enough to out-compete all but Factorie in terms of accuracy while using each tool's default parameter settings.
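The gap between the full label-by-token cross product and the combinations actually observed in the data is easy to estimate before training. The following Python sketch does this for a simple "current token" template; the function name and the toy data are hypothetical and this is not how Wapiti counts features internally.

```python
def feature_space_sizes(tagged_sentences):
    """Return (full cross-product size, observed size) for a current-token template.

    The first number counts every label paired with every unique token; the
    second counts only the (token, label) pairs that occur in the data.
    """
    labels, tokens, observed = set(), set(), set()
    for sentence in tagged_sentences:
        for token, label in sentence:
            labels.add(label)
            tokens.add(token)
            observed.add((token, label))
    return len(labels) * len(tokens), len(observed)


# Hypothetical toy training data as (token, tag) pairs.
toy_data = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
print(feature_space_sizes(toy_data))
```

On real data the ratio between the two numbers grows with the size of the label set, which is one way to judge up front whether a given pattern file will be trainable on your hardware.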

In summary, with OpenNLP, Factorie, and CRF Suite you will need to work with their respective feature generation API (in Java, Scala, and Python, respectively) to model features for anything beyond newswire PoS tagging, phrase chunking, and some basic NER. CRF Suite, similar to Factorie, has rich, pre-defined feature templates and can handle them, because unused indicator functions are dropped before the actual training starts, thereby keeping the initial feature weight matrix manageable. In addition, it is the only tool in this review that can handle real-valued features. Pre-training feature (space) reduction is done by all tools except Wapiti. Feature compaction after training is done by all tools except OpenNLP, where I could not confirm if any compaction had occurred. Wapiti provides a very powerful pattern language to define feature templates, including mixed state and transition label templates, which are otherwise only possible to generate with Factorie. While Wapiti's template (pattern) language lends great flexibility when modeling features, it has to be used with care unless training times are not an issue, due to the maximal (non-reduced) feature matrix used during the first training cycles. At the end of the day, in terms of feature generation, once you learn how to use Wapiti's pattern "language", it will be very efficient and spare you from writing code.

Training time

Next, I looked into the training run-times to see how long each tool takes to create a PoS model. To make an equal, but simple comparison, I used the CoNLL 2000 PoS tags to train the models using the default feature templates as discussed in the last section. Both Factorie and OpenNLP needed slight, but simple modifications to the downloaded CoNLL files. For Factorie, the reversed parenthesis tags in the CoNLL files had to be fixed. The main observation here is that Factorie is not very helpful in terms of error messages to understand the problem; it just throws some obscure exception at you. This means you will have to figure out on your own what went wrong when you get errors. OpenNLP's problem was simpler to identify: as documented, it expects one sentence per line, with token-tag pairs separated by underscores instead of the de facto standard OWPL format.
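The conversion from OWPL to the one-sentence-per-line format described above can be sketched in a few lines of Python. I am assuming CoNLL 2000 style input with whitespace-separated columns where the first column is the token and the second the PoS tag; the function name is my own, and tokens that themselves contain underscores are not handled specially here.

```python
def owpl_to_sentence_lines(lines):
    """Convert one-word-per-line input into 'token_tag token_tag ...' lines.

    Empty lines separate sentences; only the first two columns are used.
    """
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields:
            # An empty line closes the current sentence.
            if sentence:
                yield " ".join(sentence)
                sentence = []
            continue
        token, tag = fields[0], fields[1]
        sentence.append(f"{token}_{tag}")
    if sentence:
        # The last sentence might not be followed by an empty line.
        yield " ".join(sentence)


sample = ["He PRP B-NP", "runs VBZ B-VP", "", "Stop VB B-VP"]
for converted in owpl_to_sentence_lines(sample):
    print(converted)
```

To convert a whole file, you would pass an open file handle instead of the sample list and write the yielded lines to a new file.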

To train the taggers for part-of-speech tagging, the commands I used were:

# CRF Suite
crfsuite learn -m pos.model train.txt

# Factorie
java -Xmx2g -cp factorie-1.1-SNAPSHOT-nlp-jar-with-dependencies.jar \
     cc.factorie.app.nlp.pos.ForwardPosTrainer \
     --owpl --train-file=train.txt --test-file=empty.txt \
     --model=pos.model --save-model=true

# OpenNLP
opennlp POSTaggerTrainer -type maxent -model pos.model -data train.txt

# Wapiti
wapiti train -c -s -p patterns.txt train.txt pos.model

The input data is provided in train.txt, and the models are saved to pos.model. To measure the training times, I prefixed each command with time and used the sum of user+sys as the measured, total time it took each process to complete. This means the measurement includes all relevant CPU time (i.e., over all processor cores) that was consumed by the run. This might seem unfair to multi-threaded code, which might have an actual runtime lower than the result. However, this is entirely dependent on your machine and its cores, so a direct "total CPU time" comparison seemed fair to me. In my case, it also is a rather minor issue, because I only have two (hyper-threaded) cores on my laptop anyway. To be fair, this only significantly affects the Wapiti runtime, so I report the absolute ("real") time there, too. There are other opinions about performance measurements, e.g., that one should only measure post-warm-up training time (minus JVM startup, input data reading, etc.) or that only a single training cycle/iteration should be measured. I think it is more practical to measure and compare whatever the "out-of-the-box" performance of each tool is. Each training process was run three times, and the shortest measured time is the one I report here.
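If you prefer to script this measurement instead of reading the output of time by hand, the following Python sketch takes the same approach: it sums the user and system CPU time of the child process and keeps the best of three runs. It assumes a Unix-like system (on Windows, the child CPU times reported by os.times() are always zero), and the function names are my own.

```python
import os
import subprocess
import sys


def cpu_seconds(command):
    """Run `command` and return the user+sys CPU seconds its process consumed."""
    before = os.times()
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return user + system


def best_of(command, runs=3):
    """Repeat the measurement and keep the shortest CPU time."""
    return min(cpu_seconds(command) for _ in range(runs))


# Stand-in command for illustration; replace it with a real training command,
# given as a list of arguments.
print(best_of([sys.executable, "-c", "pass"]))
```

Because the CPU time of all threads is accumulated, a multi-threaded run will report more seconds here than the wall-clock time you actually waited.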

Software    Features   Training Time
CRF Suite   3.63 M     10m 22s
Factorie    0.34 M     02m 18s
OpenNLP     ???? M     02m 03s
Wapiti      1.56 M     19m 06s
Wapiti-4*   1.56 M     10m 09s*

[* absolute runtime ("real") in multi-threaded mode using the -t 4 switch to make full use of my hyper-threaded dual-core i5 processor]

As mentioned earlier, with respect to initial feature template richness, CRF Suite and Factorie are taking the lead, with Wapiti and OpenNLP using fewer templates. The models' feature sizes shown here are as reported by each tool after all feature pruning steps (compaction); for Wapiti, that number is calculated as "initial features" minus "removed features". While OpenNLP does not report final feature set sizes, I assume it to be in a similar range (somewhere around a million features). So in terms of feature compaction, Factorie has a clear edge over the competition.

In terms of training times, Factorie and OpenNLP easily outpace both CRF Suite and Wapiti, but this should not be entirely surprising: OpenNLP uses a simpler model (MEMM), so it clearly must be faster. One noteworthy point is the training speed of Factorie - probably due to forward-only learning, it achieves similar training times for its CRF as OpenNLP on a MEMM. On the other end, as expected, Wapiti is by far the most resource-hungry tagger. As Wapiti's learning procedure can easily make use of multiple CPU cores with a simple switch, it gained significantly in terms of absolute ("real") training time from running in multi-threaded mode on my dual core machine, at least. This is important, because while tagging is what is called "embarrassingly parallel", learning/optimization is not. Still, this means the top model training implementation is provided by Factorie, as OpenNLP has a much simpler model to train. The remaining question in this respect will be if OpenNLP and Factorie can keep up with the accuracy of the other two CRFs and how fast they all perform their tagging.

Tagging quality

This section will resolve the final remaining question: Which implementation can provide you with the most efficient tagger? I have a SATA-3-attached SSD drive where the data is read from (but a slow i5 CPU...) and took the CoNLL 2000 test set sequences to measure accuracy on the 47,377 tokens using the models I had trained in the last step. So while my measurements do not include writing the tagged tokens back to the device and reading data should not be an issue with an SSD, my CPU isn't exactly "Speedy Gonzales"... I timed each system while tagging 100 copies of those 47,377 tokens in a single run, read from one file (i.e., about 200,000 sentences or roughly 30,000 scientific abstracts), to make a fair comparison of each system's token throughput and to make any warm-up "penalties" negligible.

To run the taggers on the generated models, the commands I used were:

# CRF Suite
cat test.txt | pos.py > features.txt
crfsuite tag -m pos.model -tq features.txt

# Factorie
java -cp factorie-1.1-SNAPSHOT-nlp-jar-with-dependencies.jar \
     cc.factorie.app.nlp.pos.ForwardPosTester -Xmx2g \
     --owpl --model=pos.model --test-file=test.txt

# OpenNLP
opennlp POSTaggerEvaluator -model pos.model -data test.txt

# Wapiti
wapiti label -m pos.model -c test.txt > /dev/null

And here are the results, in terms of error rate (1 - accuracy over 47k tokens) and throughput (on 201k sentences with 4.7M tokens, as measured with time, sys+user):

Software   Error Rate   Tokens/Second
CRF Suite   3.00 %   30,000
Factorie   2.19 %   23,800
OpenNLP   2.88 %   19,500
Wapiti   2.20 %   21,200
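For reference, both columns boil down to simple ratios. The sketch below spells them out in Python; the function names are mine and not part of any of the reviewed tools, and the timings in the usage note are made-up illustration values, not measurements.

```python
def error_rate(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag differs from the gold tag
    (i.e., 1 - accuracy)."""
    if not gold_tags or len(gold_tags) != len(predicted_tags):
        raise ValueError("need two non-empty tag lists of equal length")
    wrong = sum(g != p for g, p in zip(gold_tags, predicted_tags))
    return wrong / len(gold_tags)


def cpu_throughput(tokens, user_seconds, sys_seconds):
    """Tokens per CPU second, using the user and sys figures reported
    by the time command (not the wall-clock "real" figure)."""
    return tokens / (user_seconds + sys_seconds)
```

As an illustration with invented timings: cpu_throughput(4_737_700, 195.0, 5.0) yields 23,688.5 tokens per CPU second. Dividing by CPU time rather than wall-clock time keeps the comparison independent of how many cores a tagger happens to use.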

Note that for CRF Suite, I pre-generated the features with the Python script. Otherwise, the numbers would look quite bad, as the Python-based feature extraction using the pos.py script is several times slower than the tagger itself! (A good indicator that the underlying Python code could use some polish...) In terms of tagging throughput, rather to my astonishment, it seems fair to say that the libraries perform roughly equally and there are by and large no noteworthy differences. CRF Suite appears to be faster, but then we actually cheated, because we pre-generated its features. So at best there is a minor chance that CRF Suite could be faster than the others, if it had a very fast feature extraction mechanism. Another remark is that Factorie automatically detects the available cores and distributes the tagging load equally among them. (Note that the throughput calculation is based on CPU time (sys+user), so in absolute, wall-clock numbers on my CPU, Factorie tags at about 47k tokens/second.) While tagging is an "embarrassingly parallel" problem (you could just split up the input between as many cores as you have and run your tagger with GNU parallel), it is a nice little extra thrown into the mix. Overall, in terms of raw tagging throughput, there might be no real "winner".
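The "split up the input" idea can be sketched in a few lines of Python. The only real constraint is that a sentence must never be cut in half, because sequence taggers label whole sentences. The code assumes CoNLL-style input (blank lines between sentences), and the function names are hypothetical; each resulting chunk could then be written to its own file and handed to a separate tagger process.

```python
def split_sentences(text):
    """Split CoNLL-style text into sentences (each a list of lines)."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def chunk_for_workers(text, workers):
    """Cut the corpus into `workers` contiguous chunks of whole sentences.

    Contiguous chunks (rather than round-robin) mean the tagged outputs
    can simply be concatenated to restore the original order."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    sentences = split_sentences(text)
    size, extra = divmod(len(sentences), workers)
    chunks, start = [], 0
    for i in range(workers):
        # The first `extra` chunks take one additional sentence.
        end = start + size + (1 if i < extra else 0)
        block = sentences[start:end]
        chunks.append("".join("\n".join(s) + "\n\n" for s in block))
        start = end
    return chunks
```

Splitting by sentence count is only a rough form of load balancing; if sentence lengths vary a lot, balancing by token count would be the fairer choice. The figures above are consistent with this kind of parallelism: 23,800 tokens per CPU second on two busy cores comes to roughly 47,600 tokens per wall-clock second.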

Regarding accuracy, you might want to know that a baseline tagger (assigning each known word its majority PoS tag and tagging all unseen words as nouns) already achieves an accuracy of 90% (or, an error rate of 10%) in standard PoS scenarios. Probably due to the inability to use mixed features and because it does its feature compaction prior to training, CRF Suite has the worst accuracy, closely followed by OpenNLP. OpenNLP's shortcoming can most likely be attributed to its model choice, which only really starts to shine in cross-domain experiments. So in terms of high quality tagging, at least if you have training data specifically for that domain, you will be better off with Factorie or Wapiti - not all that unexpected, given our feature modeling and model implementation insights.
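The baseline mentioned above is simple enough to write down in full. The following is a minimal sketch, assuming the training data is available as lists of (word, tag) pairs and that the noun tag is "NN", as in the Penn Treebank tag set used by CoNLL 2000; the function names are mine.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map every word seen in training to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(model, words, unseen_tag="NN"):
    """Tag known words with their majority tag, unseen words as nouns."""
    return [model.get(word, unseen_tag) for word in words]
```

Note that this tagger looks at each token in isolation; whatever the CRF and MEMM taggers gain over it comes from using context and richer features.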


If you only need to do standard PoS tagging, chunking, and/or NER, and don't mind the tagging quality or performance too much, just go with the tool in your favorite language: OpenNLP for Java developers, Factorie for Scala hackers, and Wapiti for C/C++ or Python programmers. (There is a Python wrapper for Wapiti available, for both Python 2 and 3, if Python is your deal.) The trade-offs are simply not big enough to make a huge (order-of-magnitude) difference. But then, if that were your case, you probably would not have read until here...

I would not recommend the use of CRF Suite: it offers no conceivable advantage in terms of training times, tagging throughput, or accuracy, it has no support for mixed transition/state features, and, in particular, the code seems unmaintained. Its inability to scale training across CPU cores, a limitation it shares with OpenNLP, is another possible issue. Nonetheless, much like for Wapiti, there is a Python wrapper for CRF Suite available, and both the tool and the wrapper have gotten some love from Mikhail Korobov, who has a commercial interest in (maintaining) it, too. So if you need to work with real-valued features (e.g., for adding word-vector representations) and/or do not mind the mentioned issues, CRF Suite might still be an interesting option for you. (Have a look at the discussion linked below.)

At the end of the day, if you are interested in generating high-quality annotations at high throughput while using mixture (state + transition) features, there really is only the choice between Wapiti and Factorie. Foremost, they are the only two tools ready for the multi-core world of today. An error rate reduction of about 25% on the "solved" PoS tagging problem at no cost in throughput is not to be underestimated. Wapiti definitely is the most attractive tagger in terms of off-the-shelf usability: feature generation via patterns is simple and requires no coding, and the overall balance of implementation complexity vs. usability is excellent. The only trade-off is possibly longer training times (in single-core mode or vs. Factorie), so you will need to develop and combine your feature templates with care. Finally, if you are also looking for true flexibility and the full power of graphical models, or want to venture into higher-order Markov space, I see no way around Factorie. In this case, it might be worth living with the sparse documentation and having to study Scala source code. However, the team around McCallum is very actively working on this library, and the documentation is certainly getting better and more extensive every time I come back after a few months have passed. Another advantage is the fact that Factorie offers a (largely undocumented) full NLP pipeline, starting from sentence segmentation all the way to dependency parsing.
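The "about 25%" figure is a relative reduction, not a difference in percentage points. A small Python sketch of the arithmetic, using the error rates from the results table (the helper name is mine):

```python
def relative_error_reduction(old_error, new_error):
    """Share of the old system's errors that the new system avoids."""
    return (old_error - new_error) / old_error


# Error rates in percent, taken from the results table.
weaker = {"CRF Suite": 3.00, "OpenNLP": 2.88}
stronger = {"Factorie": 2.19, "Wapiti": 2.20}

reductions = {
    (old, new): relative_error_reduction(weaker[old], stronger[new])
    for old in weaker
    for new in stronger
}
```

The four resulting values lie between roughly 24% (OpenNLP vs. Wapiti) and 27% (CRF Suite vs. Factorie), so "about 25%" is a fair summary, even though the absolute gap is well under one percentage point.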

Nonetheless, if you asked me to declare a global "winner", unless you are a probabilistic programming or at least a Scala expert, I'd say that honor right now still goes to Wapiti. But once Factorie fixes the mentioned issues and makes the library more accessible to the "general public", that might change, as it certainly already takes the lead in terms of model flexibility and implementation performance.

If you feel like discussing what is written here, I've posted a link to this article on Reddit.