Published: Mon 11 February 2013
python apple data mining
UPDATE: Installing the Scientific Python stack from "source" has become a lot simpler recently, and this tutorial was updated accordingly in November 2013 for use with OSX Mavericks and, in particular, Python 3.
Installing a full-stack scientific data analysis environment on Mac OSX for Python 3 and making sure the correct, underlying Fortran and C libraries are used is (was?) not trivial.
Thanks to Apple, parts of the required libraries are already on your box when you install XCode (code-named the "Accelerate Framework"), and the remaining pieces can easily be installed with the great Homebrew project.
In other words, for the BLAS optimizations this setup will use Apple's pre-installed Accelerate framework and you can choose to add the SuiteSparse and FFTW libraries via Homebrew for some extra speed when factorizing sparse matrices and doing Fourier transforms.
This guide describes how to properly install the following software stack on Mac OSX from source while ensuring all the relevant C/Fortran "acceleration" is available: NumPy, SciPy, matplotlib, and IPython.
With this stack, it is a breeze to add other cool data analysis tools such as scikit-learn, pandas, SymPy, or PyMC in your VirtualEnv.
First, you need to make sure you have Homebrew installed and running without any issues:
brew doctor
If that produces any output other than:
Your system is ready to brew.
you need to stop right now and fix the issues or install Homebrew first.
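If you want to script this check, the following is a minimal sketch in Python. It assumes the `brew` executable is on your PATH and that a healthy system reports exactly the message quoted above; the function names `is_ready` and `run_brew_doctor` are made up for this illustration.

```python
import shutil
import subprocess

# The all-clear message quoted in this tutorial.
READY_MESSAGE = "Your system is ready to brew."


def is_ready(doctor_output):
    """Return True only if the output is exactly the all-clear message."""
    return doctor_output.strip() == READY_MESSAGE


def run_brew_doctor():
    """Run `brew doctor`; return its combined output, or None if brew is missing."""
    if shutil.which("brew") is None:
        return None
    result = subprocess.run(
        ["brew", "doctor"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # warnings may be written to stderr
        universal_newlines=True,
    )
    return result.stdout


if __name__ == "__main__":
    output = run_brew_doctor()
    if output is None:
        print("Homebrew was not found; install it first.")
    elif is_ready(output):
        print("Homebrew is healthy, carry on.")
    else:
        print("Fix these issues before continuing:")
        print(output)
```

Comparing against the full message rather than searching for a substring means any additional warning text counts as "not ready", which matches the advice to stop on any other output.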
Note that if you upgraded to OSX Mavericks, you also need to upgrade your XCode command line tools (or download them if you have not installed them) by executing:
xcode-select --install
(And this means that you will have to re-install/compile most brew libraries, too, because of a change of XCode libraries...)
Once you have a clean version of Homebrew up and running, you can proceed to install the actual requirements.
First, you need to install a Fortran compiler and Python 3:
brew tap homebrew/science
brew install gfortran
brew install python3
All of these commands should work nicely and you should encounter no issues.
Second, it is obviously necessary to set up a minimal Python environment.
This tutorial will be using distribute and pip to install Python packages:
curl -O http://python-distribute.org/distribute_setup.py
python3 distribute_setup.py
curl -O https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py
python3 get-pip.py
Note that you do not need to prefix sudo to any of this: because you installed Python 3 using Homebrew, you are relieved from having to "root" everything.
And you should consider using VirtualEnv and nose for your Python development, too:
pip3 install virtualenv
pip3 install nose
With this setup, you have Homebrew plus Python 3000 with pip, nosetests, and virtualenv all set up.
This is a great start for any kind of Python development.
Normally, it is suggested to "stop" here and install all further Python packages only in each "virtual environment".
However, this scientific stack you are building is quite a lot of work to set up (compile-wise), so it is a time-saver to have this stack installed globally and then make use of it via --system-site-packages when creating a new virtual environment instead of having to install it each time.
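To illustrate what the --system-site-packages option does, here is a small sketch. Note that it swaps in the `venv` module from the Python standard library (Python 3.3 and later) instead of the virtualenv package used in this tutorial; both expose the same idea. The helper names `create_env` and `sees_global_packages` and the directory name `demo-env` are assumptions made for the example.

```python
import os
import tempfile
import venv


def create_env(path):
    """Create an environment that can also import globally installed packages."""
    # system_site_packages=True is the programmatic twin of --system-site-packages;
    # with_pip=False keeps the sketch fast and independent of ensurepip.
    builder = venv.EnvBuilder(system_site_packages=True, with_pip=False)
    builder.create(path)
    return os.path.join(path, "pyvenv.cfg")


def sees_global_packages(cfg_path):
    """Read pyvenv.cfg and report whether the global site-packages are enabled."""
    with open(cfg_path) as handle:
        for line in handle:
            key, _, value = line.partition("=")
            if key.strip() == "include-system-site-packages":
                return value.strip().lower() == "true"
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env(os.path.join(tmp, "demo-env"))
        print("global stack visible:", sees_global_packages(cfg))
```

An environment created this way still keeps its own site-packages, so project-specific packages installed inside it do not touch the global stack.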
First, download the latest stable NumPy sources from SourceForge.
By installing from source, NumPy will automatically detect that you are using OSX and therefore configure itself to use the Accelerate framework for the BLAS/LAPACK optimizations:
python3 setup.py config
Under atlas_info, at the end of the config output, you should see the following message:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3']
define_macros = [('NO_ATLAS_INFO', 3)]
As NumPy recognized Accelerate, you can proceed with the installation:
python3 setup.py build
python3 setup.py install
If you installed nose (as advised), you can also test that your installation is working correctly (note that you must move to a directory other than the one where you built NumPy before running the tests):
python3 -c "import numpy; numpy.test('full')"
All tests should pass without errors or issues.
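If you saved the output of the config step to a file, the Accelerate check described earlier can be scripted. The following is only a sketch: it assumes the output contains an extra_link_args line like the one shown in this tutorial, and the function name `uses_accelerate` is hypothetical.

```python
def uses_accelerate(config_text):
    """Return True if an extra_link_args line in the config output names Accelerate."""
    return any(
        "extra_link_args" in line and "Accelerate" in line
        for line in config_text.splitlines()
    )


# The three lines quoted in this tutorial, used here as sample input.
SAMPLE = """\
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3']
define_macros = [('NO_ATLAS_INFO', 3)]
"""

if __name__ == "__main__":
    print(uses_accelerate(SAMPLE))  # prints True
```

Once NumPy is installed, calling numpy.show_config() in an interpreter prints the build configuration of the installed package, which you can inspect in the same way.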
To use SciPy, you need to install Cython and SWIG first:
pip3 install Cython
brew install swig
Optionally, you can also install OpenBLAS, FFTW and SuiteSparse (for the AMD and UMFPACK libraries) for some extra speedups on Fourier Transform and sparse asymmetric matrix factorizations:
brew install openblas
brew install fftw --with-fortran
brew install suite-sparse --with-openblas
This step is recommended, although it is entirely optional.
Next, you can fetch the SciPy sources from SourceForge and build them:
python3 setup.py config
python3 setup.py build
python3 setup.py install
The config step is only there so you can make sure SciPy found the Accelerate framework and the UMFPACK/AMD SuiteSparse libraries.
The FFTW library you installed earlier with Homebrew is not listed in this output, but will be used during the build, too.
As with NumPy, you can run some tests to ensure your installation is working properly after moving to another directory:
python3 -c "import scipy; scipy.test()"
None of the tests should fail (except for KNOWNFAIL and SKIP tests, naturally).
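Besides the full test suite, a quick smoke test can confirm that the two features mentioned above, sparse asymmetric factorization and Fourier transforms, work at all. This is a rough sketch, not a benchmark: the function name `scipy_smoke_test` is made up, and whether the sparse solve actually goes through UMFPACK depends on how your SciPy was built (otherwise SciPy falls back to its bundled solver).

```python
import importlib.util


def scipy_smoke_test(n=100):
    """Solve a small asymmetric sparse system and round-trip an FFT.

    Returns None if NumPy or SciPy is not installed, else True/False.
    """
    for name in ("numpy", "scipy"):
        if importlib.util.find_spec(name) is None:
            return None

    import numpy as np
    from scipy import fftpack, sparse
    from scipy.sparse.linalg import spsolve

    # Asymmetric, diagonally dominant tridiagonal matrix in CSC format.
    lower = np.full(n - 1, -1.0)
    main = np.full(n, 4.0)
    upper = np.full(n - 1, -2.0)
    A = sparse.diags([lower, main, upper], [-1, 0, 1], format="csc")

    # Build a right-hand side from a known solution, then recover it.
    x_true = np.arange(n, dtype=float)
    b = A.dot(x_true)
    sparse_ok = np.allclose(spsolve(A, b), x_true)

    # A forward and inverse transform should reproduce the signal.
    signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 256))
    fft_ok = np.allclose(fftpack.ifft(fftpack.fft(signal)).real, signal)

    return bool(sparse_ok and fft_ok)


if __name__ == "__main__":
    print("SciPy smoke test:", scipy_smoke_test())
```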
If you have come this far, congratulations! Everything from here on will be much easier.
The next step is the installation of matplotlib:
pip3 install matplotlib
As it is trivial to install and only takes a few minutes, you might consider adding it to your virtual environments only.
However, the next package that will be installed, IPython, makes use of matplotlib and is quite a hassle to install in every virtual environment.
To ensure the plotting library is working, try this in an interpreter:
>>> from pylab import *; plot([1,2,3]); show()
You should see a plot with a straight diagonal.
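If you are working over SSH or otherwise have no display, the same check can be done off-screen. The sketch below assumes matplotlib's non-interactive "Agg" backend is available (it ships with matplotlib) and writes the diagonal to a PNG file instead of opening a window; the function name `render_test_plot` and the file name `diagonal.png` are chosen for this example.

```python
import importlib.util
import os
import tempfile


def render_test_plot(path):
    """Draw the straight diagonal off-screen and save it as a PNG.

    Returns the path, or None when matplotlib is not installed.
    """
    if importlib.util.find_spec("matplotlib") is None:
        return None

    import matplotlib
    matplotlib.use("Agg")  # select the non-interactive backend before pyplot loads
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([1, 2, 3])
    fig.savefig(path)
    plt.close(fig)
    return path


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "diagonal.png")
    print("saved plot to:", render_test_plot(target))
```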
Now it is time to install a great MATLAB-like interpreter and environment.
The first, optional, step is to install PyQt4 so you can use IPython's Qt console.
This is not required, but it is nice to render plots inline in a Qt terminal window, making the IPython "experience" more like MATLAB:
brew install sip --with-python3
brew install qt --HEAD # currently, on Mavericks, the --HEAD option is required
Finally, you need to download and install PyQt4 using:
make && make install
Apart from PyQt4, installing IPython itself is again straightforward:
pip3 install "ipython[zmq,qtconsole,notebook,test]"
To make sure the installation worked, execute the newly installed iptest3 script.
As before, there should be no failures.
From now on, instead of python3, you should be using ipython3 if you want to work in a Python interpreter. You have now reached the "holy grail" of having set up a MATLAB-like scientific computing environment:
ipython3 qtconsole --pylab=inline
Additional Data Science Libraries
Finally, here is a list of mature, interesting data science libraries that all will use the stack you just installed.
These could all go either into the global site-packages, or you can just add them to your projects in your virtual environments as needed.
In the latter case, do not forget to enable the global stack with --system-site-packages when creating a new VirtualEnv.
scikit-learn machine learning library: pip3 install scikit-learn
pandas statistical data analysis: pip3 install pandas
SymPy symbolic computer algebra system: pip3 install sympy
PyMC probabilistic programming environment (see this PyMC tutorial):
pip3 install pymc
Other noteworthy analytical tools include:
PyTables large data management: pip3 install tables
RPy2 Python-R interface: pip3 install rpy2 (assuming you have R installed)
patsy and StatsModels statistical models:
pip3 install patsy && pip3 install statsmodels
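To see at a glance which of these packages are visible from the interpreter you are using (globally or inside a virtual environment), something like the following sketch may help. It relies on importlib.metadata, which only exists in Python 3.8 and later, so it does not apply to the Python versions this tutorial was written for; the names `PIP_NAMES` and `installed_versions` are made up for the example.

```python
from importlib import metadata

# Distribution names as passed to pip3 in this tutorial.
PIP_NAMES = [
    "numpy", "scipy", "matplotlib", "ipython",
    "scikit-learn", "pandas", "sympy", "pymc",
    "tables", "rpy2", "patsy", "statsmodels",
]


def installed_versions(names=PIP_NAMES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print("{:<14} {}".format(name, version or "not installed"))
```

Looking up distribution metadata avoids importing each package, so the report stays fast and does not trip over packages whose import name differs from their pip name (scikit-learn is imported as sklearn, for instance).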
Et voilà - you now have a fully functioning environment for running
all kinds and sorts of statistical data analyses and developing machine