An efficient online sequence tagger resource for GATE

tl;dr for a stressed-out generation: GATE's Generic Tagger framework is a CREOLE plug-in that allows you to wrap any existing sequence tagger and use it to create annotations in your pipeline, but it is a bit slow. Therefore, I have created the Online Tagger GATE plug-in, which works similarly to the Generic Tagger framework, but does not do any disk I/O for inter-process communication or launch more than one "singleton" sub-process per Processing Resource instance. This version can in some cases be several orders of magnitude faster than the built-in framework.

GATE (General Architecture for Text Engineering) has established itself as my preferred tool for teaching text mining and information extraction. While one might argue that there are leaner and faster frameworks around, it has one outstanding, unique quality: it is (mostly) GUI-based.

During any text mining course I teach, the most frequent question I get is what text mining software can be used without any prior programming skills. In other words, most of my audience is looking for a "graphical" text mining environment that can be used without first having to learn how to program. For example, to use NLTK, LingPipe, OpenNLP, StanfordNLP, UIMA, etc., you first have to learn how to program in the chosen framework's API language. Therefore, the only entirely true answer is that only commercial tools offer a pure "graphical user interface" and require no programming experience.

However, GATE can be used mostly without having to write code - with the exception of its "JAPE-glue". JAPE stands for "Java Annotation Patterns Engine" and is GATE's solution for making data interoperable between different text mining resources that commonly have different I/O requirements. Furthermore, JAPE can be used to design entire rule-based annotation resources in their own right. JAPE "grammars" consist of rules where the left-hand side of a rule matches (existing) GATE annotations using a (clear and simple) syntax, while the right-hand side can contain Java code that modifies those annotations. Therefore, GATE rids you of the need to write code of your own, except for (small) blocks of (simple) code for the right-hand sides of JAPE's rules. Luckily, GATE's extensive documentation provides lots of examples for the novice to start with.

Overall, this makes GATE the only free, open source text mining software that provides a graphical interface and requires (nearly) no programming skills. As I stated initially, this fact alone makes it the best fit for my typical tutorial audiences, because most of them are not computer scientists and do not (want to) know how to code.

As mentioned in the introduction, "out-of-the-box" GATE isn't always the fastest solution. However, because it is open source, that only means that if you need to go faster, you can always replace any slow pieces with whatever you consider a better fit (if you know how to program, that is...). For example, the Generic Tagger framework is a CREOLE Processing Resource that allows you to take any existing sequence tagger and use it to create annotations for a GATE pipeline. This is pretty nifty, because you can use whatever Part-of-Speech tagger or Named Entity Recognition system you like. You can even use a generic sequence tagger, train your own model, and integrate it into your text mining and information extraction pipelines, all without having to learn how to program first.

However, precisely due to the highly generic nature of the Generic Tagger framework, it is not very efficient. To create GATE annotations with it, this tagger "wrapper" operates as follows on each input:

  • A file with the input text for the tagger is written to disk.
  • A new tagger sub-process is launched by the wrapper, reading the input from the file.
  • The tagger's results are written back to disk.
  • The wrapper resource reads the result file, generates the annotations, and deletes the two temporary files.
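
To make the cost of that loop concrete, here is a minimal Python sketch of one such file-based round trip. It is not the Generic Tagger's actual (Java) code: the function name tag_via_files and the tiny stand-in tagger FAKE_TAGGER are hypothetical, and the stand-in simply labels every token "NN".

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a real tagger binary: it reads the file named
# by its first argument and prints "token NN" for every token it finds.
FAKE_TAGGER = (
    "import sys\n"
    "for line in open(sys.argv[1], encoding='utf-8'):\n"
    "    for token in line.split():\n"
    "        print(token, 'NN')\n"
)


def tag_via_files(text):
    """One round trip: input file -> fresh tagger process -> result file."""
    # Step 1: write the input text to a temporary file on disk.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as src:
        src.write(text + "\n")
    out_path = src.name + ".tagged"
    try:
        # Steps 2 and 3: launch a NEW tagger process for this one input
        # and let it write its results back to disk.
        with open(out_path, "w", encoding="utf-8") as out:
            subprocess.run(
                [sys.executable, "-c", FAKE_TAGGER, src.name],
                stdout=out,
                check=True,
            )
        # Step 4: read the result file and turn each line into a tuple.
        with open(out_path, encoding="utf-8") as result:
            return [tuple(line.split()) for line in result if line.strip()]
    finally:
        # Step 4 (continued): delete the two temporary files.
        for path in (src.name, out_path):
            if os.path.exists(path):
                os.remove(path)
```

Calling tag_via_files once per document, sentence, or token pays for two file writes, two file reads, and a full process start-up every single time.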

I stress "on each input", because this loop might be run for each processed document, for each sentence in each document, or, even worse, for each and every token in your documents. If you already think that doing this for each document is pretty bad, doing it for each token grinds your pipeline to a standstill. Moreover, if you are a programmer, the expression "a new sub-process is launched" in the second point should alarm you. If the tagger uses some large resources, like a dictionary (which taggers quite frequently do), starting up a new tagger process can be extremely expensive. In general, of all concurrent programming concepts, a new process is probably the most expensive resource you can create "within" a program, and launching one should be done as sparingly as possible. The plug-in is not designed in this peculiar way because the framework was written by inexperienced programmers, however. It is that way because this design makes it truly generic: most I/O formats and taggers can be handled with this wrapper.

However, while the Generic Tagger is pretty cool to have on board so you can try out "foreign" sequence taggers, the way it is implemented makes it rather useless for a "real" pipeline, i.e., beyond experimentation. For example, just tagging all gene mentions in a few thousand PubMed sentences with this wrapper takes days. But PubMed has over 24 million abstracts and (if I recall correctly) roughly 100 million sentences, so go figure...

Therefore, I am releasing my own CREOLE processing resource that works similarly to the Generic Tagger, but does not do any disk I/O for inter-process communication or launch more than one "singleton" process for the entire pipeline you are designing. However, this puts some restrictions on the kinds of taggers you can use:

1. The tagger must support a streaming I/O model. That is, the tagger must be able to read from some "input stream", such as UNIX's STDIN, and write to some "output stream", commonly UNIX's STDOUT. Another way of putting this is that your tagger should be able to handle UNIX's piped command syntax, something like this: cat plain_text.txt | some_tagger > tagged_text.txt.

2. The tagger must work with POSIX's classical line-based interface. That is, the tagger must take one continuous block of text as input, terminated with a newline character. For example, it should take one token, sentence, or block of text as input (not containing any newlines), and, once it receives a newline character, start tagging that input.

3. The tagger must produce one annotation per line as output, and those annotations must be in the same order as the (input) text spans which they annotate. Those annotations are commonly expected to be in the OTPL (one token per line) format. For example, the output line Nouns noun NN B-NP O might annotate the token "Nouns" (verbatim, as found in the input text) with the lemma "noun", the PoS-tag "NN", the BIO-chunk "B-NP" and the BIO-NER-tag "O" ("outside" any entity mention).
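
The third requirement matters because same-order output is what lets a wrapper map each line back to character offsets in the input. The following Python sketch illustrates the idea under the assumption of the five-column example above; the names TaggedToken, parse_otpl_line and align are hypothetical and not part of the plug-in.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    """One OTPL line, using the five columns from the example above."""
    token: str
    lemma: str
    pos: str
    chunk: str
    ner: str


def parse_otpl_line(line):
    """Split one whitespace-separated OTPL line into its five fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError("expected 5 columns, got %d: %r" % (len(fields), line))
    return TaggedToken(*fields)


def align(text, tagged):
    """Yield (start, end, TaggedToken) by scanning the text left to right.

    This only works because the tagger emits its lines in input order:
    each token is searched for after the end of the previous one.
    """
    position = 0
    for item in tagged:
        start = text.find(item.token, position)
        if start < 0:
            raise ValueError("token %r not found after offset %d" % (item.token, position))
        end = start + len(item.token)
        yield start, end, item
        position = end
```

If the tagger reordered its output, or altered the token text, the left-to-right search would fail or attach annotations to the wrong spans.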

If you have a tagger that follows those requirements (it turns out that most sequence taggers I know of work precisely like this), you can instead use my Online Tagger framework. What it does differently from the Generic Tagger is the following:

  • On Processing Resource initialization, you have to specify the location of the tagger, and GATE launches the tagger with the supplied parameters (directory where to run the tagger and any arguments, such as dictionaries to load or command-line flags to set).
  • The Processing Resource configuration is nearly the same, but some of the defaults have been adapted to better reflect the nature of the on-line processing model.
  • Once your pipeline is running, the text is piped into the tagger sub-process and results are read from the output stream, while intermediary files are no longer created.
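
The online model described in these points can be sketched in a few lines of Python. This is an illustration only, not the plug-in's actual implementation: the class name OnlineTagger and the stand-in FAKE_TAGGER are hypothetical, and the sketch assumes that the tagger flushes its output and ends each block of result lines with an empty line, which your tagger may or may not do.

```python
import subprocess
import sys

# Hypothetical stand-in tagger: for every input line on STDIN it prints
# "token NN" per token, then an empty line as end-of-block marker
# (an assumption of this sketch), and flushes STDOUT.
FAKE_TAGGER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    for token in line.split():\n"
    "        print(token, 'NN')\n"
    "    print(flush=True)\n"
)


class OnlineTagger:
    """Keeps ONE tagger sub-process alive and talks to it through pipes."""

    def __init__(self, command, cwd=None):
        # Launched once, at initialization; expensive resources such as
        # dictionaries are therefore loaded only a single time.
        self.process = subprocess.Popen(
            command,
            cwd=cwd,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes on our side
        )

    def tag(self, text):
        """Send one newline-terminated block and read the tagged lines."""
        # The input block itself must not contain newlines.
        self.process.stdin.write(text.replace("\n", " ") + "\n")
        self.process.stdin.flush()
        rows = []
        while True:
            line = self.process.stdout.readline()
            # An empty line ends the block; "" means the tagger exited.
            if not line.strip():
                break
            rows.append(tuple(line.split()))
        return rows

    def close(self):
        """Close the pipes so the tagger sees end-of-input and exits."""
        self.process.stdin.close()
        self.process.wait()
        self.process.stdout.close()
```

A pipeline would create the object once, for example with OnlineTagger([sys.executable, "-c", FAKE_TAGGER]), call tag() for every sentence, and call close() at the end. No temporary files are involved, and the process start-up cost is paid exactly once.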

Please clone the tagger from GitHub (using git clone) into your local CREOLE user plugin directory. Then you can load my plug-in from the CREOLE Plugin Manager. Once you instantiate a new GenericOnlineTagger Processing Resource, you will be asked to supply the initial configuration data to launch the tagger in its own sub-process (tagger binary path, directory to run in [if any], runtime flags and arguments [if any]).

If you run into any issue using this plug-in, please consider filing a bug report on GitHub so I can fix the problem for everybody using it. I hope this plug-in will let you enjoy a new-found efficiency when integrating sequence taggers into your GATE pipelines!