Feb 2013

Installing a full stack Python data analysis environment on OSX

UPDATE: Installing the Scientific Python stack from "source" has become a lot simpler recently, and this tutorial was updated accordingly in November 2013 for use with OSX Mavericks and, in particular, Python 3.

Installing a full-stack scientific data analysis environment on Mac OSX for Python 3 and making sure the correct underlying Fortran and C libraries are used is (was?) not trivial. Thanks to Apple, some of the required libraries are already on your box when you install XCode (the "Accelerate framework"), and the remaining pieces can easily be installed thanks to the great Homebrew project. In other words, for the BLAS optimizations this setup will use Apple's pre-installed Accelerate framework, and you can choose to add the SuiteSparse and FFTW libraries via Homebrew for some extra speed when factorizing sparse matrices and doing Fourier transforms. This guide describes how to properly install the following software stack on Mac OSX from source while ensuring all the relevant C/Fortran "acceleration" is available:

  • NumPy
  • SciPy
  • matplotlib
  • IPython

With this stack, it is a breeze to add other cool data analysis tools such as scikit-learn, pandas, SymPy, or PyMC in your VirtualEnv.

Preparatory Setup

First, you need to make sure you have Homebrew installed and running without any issues:

brew doctor

If that produces any output other than:

Your system is ready to brew.

you need to stop right now and fix the issues or install Homebrew first. Note that if you upgraded to OSX Mavericks, you also need to upgrade your XCode command line tools (or download them if you have not installed them) by executing:

xcode-select --install

(And this means that you will have to re-install/compile most brew libraries, too, because of a change of XCode libraries...) Once you have a clean version of Homebrew up and running, you can proceed to install the actual requirements.
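
If you set up more than one machine, the check above can also be scripted. The following is only a minimal sketch and not part of Homebrew: the helper names brew_is_ready and run_brew_doctor are made up for illustration, and it assumes that a clean brew doctor run prints exactly the message shown earlier.

```python
import shutil
import subprocess

READY_MESSAGE = "Your system is ready to brew."


def brew_is_ready(doctor_output):
    """Return True only if the output is exactly the 'ready' message."""
    return doctor_output.strip() == READY_MESSAGE


def run_brew_doctor():
    """Run `brew doctor`; return its combined output, or None if brew is missing."""
    if shutil.which("brew") is None:
        return None
    result = subprocess.run(
        ["brew", "doctor"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


if __name__ == "__main__":
    output = run_brew_doctor()
    if output is None:
        print("Homebrew is not installed (no `brew` on the PATH).")
    elif brew_is_ready(output):
        print("Homebrew is ready.")
    else:
        print("Fix these issues before continuing:")
        print(output)
```

Anything other than the single "ready" line is treated as a problem, which mirrors the advice to stop and fix the reported issues first.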

First, you need to install a Fortran compiler and Python3 itself:

brew tap homebrew/science
brew install gfortran
brew install python3

All of these commands should work nicely and you should encounter no issues.

Second, it is obviously necessary to set up a minimal Python environment. This tutorial will be using distribute and pip to install Python packages:

curl -O http://python-distribute.org/distribute_setup.py
python3 distribute_setup.py
curl -O https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py
python3 get-pip.py

Note that you do not need to prefix sudo to any of this - because you installed Python 3 using Homebrew, you are relieved from having to "root" everything. And you should consider using VirtualEnv and nose for your Python development, too:

pip3 install virtualenv
pip3 install nose

With this setup, you have Homebrew plus Python 3000 with pip, nosetests, and virtualenv all set up. This is a great start for any kind of Python development. Normally, it is suggested to "stop" here and install all further Python packages only in each "virtual environment". However, the scientific stack you are building takes quite a lot of work to set up (compile-wise), so it saves time to install this stack globally and then make use of it via --system-site-packages when creating a new virtual environment, instead of having to install it each time.
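
To illustrate what --system-site-packages does, here is a small sketch that uses the venv module from the Python 3 standard library (3.3 and later) as a stand-in for the virtualenv tool used in this tutorial. The helper names create_env and sees_system_packages and the directory name are only examples.

```python
import os
import tempfile
import venv


def create_env(env_dir):
    """Create a virtual environment that can also see the global site-packages."""
    # with_pip=False keeps the sketch fast; it only demonstrates the flag.
    builder = venv.EnvBuilder(system_site_packages=True, with_pip=False)
    builder.create(env_dir)
    return os.path.join(env_dir, "pyvenv.cfg")


def sees_system_packages(cfg_path):
    """Check the environment's config file for the system-site-packages flag."""
    with open(cfg_path) as cfg:
        for line in cfg:
            key, _, value = line.partition("=")
            if key.strip() == "include-system-site-packages":
                return value.strip().lower() == "true"
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg_path = create_env(os.path.join(tmp, "analysis-env"))
        print("global packages visible:", sees_system_packages(cfg_path))
```

With virtualenv itself, the equivalent is to pass --system-site-packages on the command line when creating the environment.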

NumPy

First, download the latest stable NumPy sources from SourceForge and unpack them. When you run the config step from within the source directory, NumPy will automatically detect that you are using OSX and configure itself to use the Accelerate framework for the BLAS/LAPACK optimizations:

python3 setup.py config

Below atlas_info, at the end of the config output, you should see the following message:

FOUND:
  extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
  extra_compile_args = ['-msse3']
  define_macros = [('NO_ATLAS_INFO', 3)]

As NumPy recognized Accelerate, you can proceed with the installation:

python3 setup.py build
python3 setup.py install

If you installed nose (as advised), you can also test that your installation is working correctly (note that you must first move to a directory other than the one where you built NumPy):

python3 -c "import numpy; numpy.test('full')"

All tests should pass without errors or issues.
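
Once NumPy is installed, you can also double-check the linkage from Python itself. This is a rough sketch, assuming that numpy.show_config() prints its report to standard output and that the report mentions "Accelerate" when the framework is linked. The helper names are invented for this example.

```python
import contextlib
import io


def mentions_accelerate(config_text):
    """Return True if a build-configuration report refers to Accelerate."""
    return "accelerate" in config_text.lower()


def numpy_config_text():
    """Capture what numpy.show_config() prints, or return None without NumPy."""
    try:
        import numpy
    except ImportError:
        return None
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()


if __name__ == "__main__":
    text = numpy_config_text()
    if text is None:
        print("NumPy is not installed.")
    else:
        print("Accelerate linked:", mentions_accelerate(text))
```

The text match is deliberately crude: it only tells you whether the word appears in the report, not how well the library performs.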

SciPy

To use SciPy, you need to install Cython and SWIG first:

pip3 install Cython
brew install swig

Optionally, you can also install OpenBLAS, FFTW and SuiteSparse (for the AMD and UMFPACK libraries) for some extra speedups on Fourier transforms and sparse asymmetric matrix factorizations:

brew install openblas
brew install fftw --with-fortran
brew install suite-sparse --with-openblas

This step is recommended, although it is entirely optional.

Next, you can fetch the SciPy sources from SourceForge and build them:

python3 setup.py config
python3 setup.py build
python3 setup.py install

The config step is only there so you can make sure SciPy found the Accelerate framework and the UMFPACK/AMD SuiteSparse libraries. The FFTW library you installed earlier with Homebrew is not listed in this output, but will be used during the build, too.

As with NumPy, you can run some tests to ensure your installation is working properly after moving to another directory:

python3 -c "import scipy; scipy.test()"

None of the tests should fail (except for KNOWNFAIL and SKIP tests, naturally).

If you have come this far, congratulations! Everything from here on will be a cake-walk.

matplotlib

The next step is the installation of matplotlib:

pip3 install matplotlib

As it is trivial to install and only takes a few minutes, you might consider adding it to your virtual environments only. However, the next package that will be installed, IPython, makes use of matplotlib and is quite a hassle to install in every virtual environment.

To ensure the plotting library is working, try this in an interpreter:

>>> from pylab import *; plot([1,2,3]); show()

You should see a plot with a straight diagonal.
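
If you are working over SSH or otherwise without a display, the interactive check will not work. As an alternative, this sketch renders the same line to a PNG file using matplotlib's non-interactive Agg backend. The function name and the output file name are placeholders.

```python
import os
import tempfile


def render_test_plot(path):
    """Draw the line [1, 2, 3] into a PNG; return False if matplotlib is missing."""
    try:
        import matplotlib
    except ImportError:
        return False
    matplotlib.use("Agg")  # file-only backend, no window is opened
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([1, 2, 3])
    fig.savefig(path)
    plt.close(fig)
    return os.path.getsize(path) > 0


if __name__ == "__main__":
    target = os.path.join(tempfile.gettempdir(), "matplotlib_check.png")
    print("plot written:", render_test_plot(target))
```

Opening the resulting file should show the same straight diagonal as the interactive version.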

IPython

Now it is time to install a great MATLAB-like interpreter and environment. The first, optional, step is to install PyQt4 so you can use IPython's qtconsole. This is not required, but it is nice to render plots inline in a Qt terminal window, making the IPython "experience" more like MATLAB:

brew install sip --with-python3
brew install qt --HEAD # currently, on Mavericks, the --HEAD option is required

Finally, you need to download the PyQt4 sources and, from within the unpacked source directory, configure and install them using:

python3 configure-ng.py
make && make install

Apart from PyQt4, installing IPython itself is again straightforward:

pip3 install ipython[zmq,qtconsole,notebook,test]

To make sure the installation worked, execute the newly installed iptest3 script. As before, there should be no failures.

From now on, you should use ipython3 instead of python3 whenever you want to work in a Python interpreter. With that, you have reached the "holy grail" of a MATLAB-like scientific computing environment:

ipython3 qtconsole --pylab=inline

Additional Data Science Libraries

Finally, here is a list of mature, interesting data science libraries that all use the stack you just installed. These can either go into the global site-packages, or you can add them to your projects in your virtual environments as needed. In the latter case, do not forget to enable the global stack with --system-site-packages when creating a new VirtualEnv.

  • scikit-learn machine learning library: pip3 install scikit-learn
  • pandas statistical data analysis: pip3 install pandas
  • SymPy symbolic computer algebra system: pip3 install sympy
  • PyMC probabilistic programming environment (see this PyMC tutorial): pip3 install pymc

Other noteworthy analytical tools include:

  • PyTables large data management: pip3 install tables
  • RPy2 Python-R interface: pip3 install rpy2 (assuming you have R installed)
  • patsy and StatsModels statistical models: pip3 install patsy && pip3 install statsmodels
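
After creating a virtual environment, it is worth confirming which parts of the stack are actually visible from it. The following sketch uses only the standard library. The function name stack_report is made up, and the list holds the import names of the packages mentioned in this guide (note that scikit-learn is imported as sklearn and PyTables as tables).

```python
import importlib.util

STACK = ["numpy", "scipy", "matplotlib", "IPython",
         "sklearn", "pandas", "sympy", "pymc",
         "tables", "rpy2", "patsy", "statsmodels"]


def stack_report(modules=STACK):
    """Map each module name to True if it can be found on the import path."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in sorted(stack_report().items()):
        print("{:<12} {}".format(name, "ok" if found else "MISSING"))
```

Run it with the environment's own interpreter. A package reported as MISSING there, but present globally, usually means the environment was created without --system-site-packages.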

E voilà - you now have a fully functioning environment for running all kinds of statistical data analyses and developing machine learning algorithms!