Text Mining


A command-line tool for developing statistical models that can be used to classify text (streams). The library is based on Scikit-Learn and provides classifiers that make sense for this scenario: Ridge Regression, various SVMs, Random Forest, Maximum Entropy/Logistic Regression, and Naïve Bayes classifiers.


My ever-evolving Python 3 API and CLI for supporting all the types of text mining tasks I do. So far, this collection provides the following interfaces:

  • is a tool to develop classifiers for [NER-tagged] text using Scikit-Learn.
  • stores corpora of distinct origins in one unified JSON format.
  • is a lightweight approach to tag tokens using a dictionary.
  • scans and counts gene/protein symbols in MEDLINE.
  • is a script to do inter-annotator agreement evaluations that work well with corpora annotated using our online MyMiner corpora generation tool.
  • and form a suite to train and use an unsupervised sentence segmentation model for text via the NLTK PunktSentenceTokenizer.
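
The dictionary-based tagging idea from the list above can be sketched in a few lines of plain Python. This is only an illustration, not the code of the tool itself: the function name `tag_tokens`, the `O` label for untagged tokens, the dictionary format, and the sample entries are all assumptions made for the example. The sketch uses a greedy longest-match lookup over token sequences.

```python
def tag_tokens(tokens, dictionary, outside="O"):
    """Tag tokens by greedy longest-match lookup in a dictionary.

    `dictionary` maps tuples of lower-cased tokens to a tag
    (a hypothetical format chosen for this sketch).
    """
    max_len = max((len(key) for key in dictionary), default=0)
    tags = [outside] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            tag = dictionary.get(tuple(lowered[i:i + n]))
            if tag is not None:
                tags[i:i + n] = [tag] * n
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return tags


# Made-up example dictionary and sentence.
GENES = {("p53",): "GENE", ("tumor", "necrosis", "factor"): "GENE"}
tokens = "Tumor necrosis factor and p53 are well studied".split()
print(list(zip(tokens, tag_tokens(tokens, GENES))))
```

Matching on lower-cased tokens keeps the lookup case-insensitive. A real tagger would probably need to normalize more than that (hyphens, Greek letters, and so on).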


A modular, command-line based text mining pipeline for biomedicine using the UIMA framework. It provides a set of Analysis Engine (AE) modules that can be combined to run more complex text mining tasks such as Gene Normalization (assigning DB IDs to gene symbols) or mining for biological events (e.g., transcription regulation interactions). My framework adds a special builder pattern implementation to make UIMA/uimaFIT AEs and Resources less of a chore to develop, maintain, and configure. In addition, it provides a "type-free", "one-model-fits-all" type system: a single text annotation type with an annotator@namespace:identifier schema that can express any kind of annotation on a UIMA SOFA, which avoids the issue of needing a specific UIMA type system for every AE developed by other programmers. Last, it has specialized wrappers for the Apache Tika project (a library to extract text content from many different file types) that add functionality beyond raw extraction, particularly for parsing XML from important publishers such as Elsevier or PMC. Given how much time it takes to build useful Java tools compared to my usual Python/C/C++ combo approach, I have heavily reduced the time I spend on this project. Still, it provides a number of useful tools; in particular, it is fully capable of doing the NLP pre-processing of arbitrary input files (Word, PDF, XML, etc.), including (supervised) sentence segmentation, tokenization, PoS tagging, and phrase chunking.


A (surprisingly popular - hence not under sidekicks) library and command-line tool to split sentences and tokenize words (incl. genes/chemicals, date/time, email/web addresses, etc.) using regular expressions.
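
To give a rough idea of how regular expressions can split sentences and tokenize words, here is a minimal sketch using only Python's standard `re` module. It is not the library's actual code or API: the names `TOKEN`, `SENTENCE_END`, `split_sentences`, and `tokenize`, as well as the patterns themselves, are assumptions for illustration and are far simpler than what a real tokenizer needs.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email addresses
  | https?://[^\s<>"]+                # web addresses
  | \d{4}-\d{2}-\d{2}                 # ISO-style dates
  | \w+(?:[-']\w+)*                   # words, incl. hyphenated symbols like IL-2
  | [^\w\s]                           # any other single non-space character
""", re.VERBOSE)

# Assumed boundary rule: terminal punctuation, whitespace,
# then an upper-case ASCII letter or a digit.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text):
    """Split text at the (simplistic) sentence boundaries defined above."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Return all tokens of a sentence; whitespace is skipped."""
    return TOKEN.findall(sentence)


text = "IL-2 binds CD25. Contact jane.doe@example.org on 2014-03-01!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

This boundary rule will wrongly split after abbreviations such as "e.g." when an upper-case word follows, which is one reason real sentence splitters need many more special cases.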



A command-line tool to manage an up-to-date PubMed/MEDLINE DB mirror. This is a Python 3 program that allows you to bootstrap a MEDLINE Citation (PubMed Abstracts) repository in either SQLite or PostgreSQL (technically, it uses SQLAlchemy, so a few other datastores seem to be OK, too). You can import data directly from the NCBI eUtils web service or by parsing MEDLINE XML files. Content can be extracted in plain-text (full, TIAB only, or as a table) or HTML format. The easiest way to install it is from PyPI, using "pip3 install medic". To use it, check out "medic --help" and "man medic". So far, this is the project of mine that has been used most often by fellow researchers, and therefore I dare claim it is production ready. With their great help, many bugs on different platforms and with various datastores have been discovered and eliminated (and, if you find any more, I do try to have a fix ready for you in less than 24 hours).


A command-line tool to maintain an up-to-date repository of gene and protein names together with their most important keywords (kDa, length, chromosome position, etc.) and their references to the NCBI taxonomy and to PubMed. This is particularly useful for building gene normalization systems. Beyond just dumping the data, this tool groups unique genes and proteins from different databases and links all those proteins and genes (as an n-to-m relation). To use it, clone it from GitHub ("git clone") and install it locally ("virtualenv-3.3 gnamed; cd gnamed; . bin/activate; python3 install"). Then, you can learn about it by using "man gnamed" (or just follow the link to the tool...).


An international project that organizes community challenges in text mining for molecular biology. In particular, I have been the main organizer of one of these challenges, BioCreative II.5, and the developer of the Django-based website the organization uses to manage these events and their participants. Another integral part of this project is my official bceval evaluation tool to measure the performance of the participating text mining systems.


An online corpus annotation tool I collaborated on (the development itself was done by David Salgado).


Load (parts of) the Transcriptional Regulatory Element Database into a PostgreSQL DB.

Sidekicks and Minor Projects


A data-structure implementing a minimal acyclic deterministic finite state automaton in Scala.


A generic finite state machine library in Java. Design your own grammar to quickly build generic pattern matching engines.


A PATRICIA tree implementation in pure Python; to install, run pip install patricia-trie.


A tiny and very efficient JavaScript/Node.js client for CouchDB.


My port of the Gensim Python 2 script to Python 3. word2vec, in turn, is a (Google) ANN to detect semantic relationships between words using word vectors.


A simple Python 2/3 tool (install via pip install progress_bar) to display progress bars in a terminal, with a bar header and adapted to the width of the terminal.
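
As a rough illustration of fitting a bar with a header to the terminal width, here is a sketch that does not use the package itself. The function `render_bar` and its parameters are hypothetical, and unlike the tool described above it only runs on Python 3, because it relies on `shutil.get_terminal_size` from the standard library.

```python
import shutil
import sys


def render_bar(done, total, header="", width=None):
    """Return one line of text: header, bar, and percentage, fitted to `width` columns."""
    if width is None:
        # Ask the terminal for its width; fall back to 80 columns.
        width = shutil.get_terminal_size(fallback=(80, 24)).columns
    fraction = min(max(done / total, 0.0), 1.0) if total else 1.0
    suffix = " {:3d}%".format(int(fraction * 100))
    prefix = header + " " if header else ""
    # Space left for the bar itself, minus the two bracket characters.
    room = max(width - len(prefix) - len(suffix) - 2, 0)
    filled = int(room * fraction)
    return "{}[{}{}]{}".format(prefix, "#" * filled, "-" * (room - filled), suffix)


# Redraw the same terminal line by returning the cursor with "\r".
for step in range(101):
    sys.stdout.write("\r" + render_bar(step, 100, header="demo"))
sys.stdout.write("\n")
```

Returning the line as a string, rather than printing inside the function, keeps the width arithmetic easy to check separately from the terminal output.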


Tools to expand Greek letters to Latin names and to replace non-ASCII characters with their closest ASCII representation.
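
Both ideas can be approximated with Python's standard `unicodedata` module. The sketch below is not the code of these tools: the function names `expand_greek` and `to_ascii` are made up for this example, and the approach (reading the letter name from the Unicode character name, and stripping accents via NFKD normalization) is an assumption about one simple way to do it.

```python
import unicodedata


def expand_greek(text):
    """Replace Greek letters with their Latin names (beta, Delta, ...)."""
    out = []
    for char in text:
        name = unicodedata.name(char, "")
        # Unicode names look like 'GREEK SMALL LETTER ALPHA WITH TONOS'.
        if name.startswith("GREEK") and " LETTER " in name:
            letter = name.split(" LETTER ", 1)[1].split(" WITH ")[0]
            letter = letter.replace("FINAL ", "").lower()
            out.append(letter.capitalize() if " CAPITAL " in name else letter)
        else:
            out.append(char)
    return "".join(out)


def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII parts.

    Characters without an ASCII decomposition are simply dropped here,
    which is cruder than a hand-made replacement table would be.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(expand_greek("TGF-β and naïve Bayes")))
```

Expanding Greek letters before the ASCII step matters: applied on its own, `to_ascii` would silently delete them.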


A parallel, high-throughput tokenizer using a deterministic finite state automaton written in Go ("golang").