What are the best books [for programmers] to get into Data Science?


This is a question I frequently get from colleagues who are serious programmers or software developers and are planning to pick up data analytics. Assuming that you are already an expert in software development and informatics, there are three fundamental mathematical topics you need to cover: statistics, linear algebra, and machine learning. I will therefore recommend one book for each of these topics. I still consult these books as reference material myself.

There might be a fourth foundation worth mentioning: calculus. I don't really have a recommendation for that, as it has never been essential to my work. As long as you can remember how to differentiate, integrate, and solve an ODE, you are pretty much set. In other words, fresh high-school knowledge of calculus has always been good enough for me. And there is no need to write any proofs, unless you want to develop your own machine learning methods…

The R Environment

Before presenting the books, let me recommend one tool that you should pick up when doing the books' practical exercises: R. R is the de facto tool for data science today and, I would predict, the environment that will put (or is already putting) S, SPSS, SAS, and a number of other commercial tools into their digital graves. Even banks and insurance companies are apparently beginning to embrace R (not sure that's a good thing, though...). R provides a huge ecosystem of packages that lets you quickly explore and test models, inspect your data, develop theories, and visualize results. Any of these tasks can be extremely tedious with other tools, but a few free and commercial statistics, math, and/or machine learning environments (e.g., MATLAB or Octave) might be viable alternatives if you already know them. Python in particular is extremely well suited (maybe even better than R) if you are interested in language processing or neural networks. Overall, I recommend working through the books' exercises in your preferred interactive environment. Just as with programming and software development, merely reading the material will not give you the experience needed to actually “do” data science.


Statistics

The Statistical Sleuth by Fred Ramsey and Daniel W. Schafer. While there might be “simpler” books around, this one has the most compact and best overview of applied statistics that I have read to date (t and F distributions, ANOVA, multivariate regression, etc.). If you can only read one book, or only need to work the statistics angle, then this is “the” book. Most of the machine learning stuff can be understood easily after digesting this book, and I particularly like the very applied approach that Ramsey & Schafer take.
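To give a flavor of the t-tools mentioned above, here is a minimal sketch of a pooled (equal-variance) two-sample t statistic in plain Python. The function name and the two tiny samples are made up for illustration and do not come from the book; in practice you would more likely reach for R's t.test (with var.equal = TRUE for the pooled version) or a statistics library.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for the equal-variance two-sample t-test."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    # Standard error of the difference between the two sample means.
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / std_error, df


# Hypothetical measurements for two groups (illustrative only).
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_value, dof = pooled_t_statistic(group_a, group_b)
```

For these samples the difference in means is -1 and the standard error is 1, so t comes out as -1 with 8 degrees of freedom. Turning that into a p-value requires the t distribution, which is exactly the kind of thing an environment like R hands you for free.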

Linear Algebra

Introduction to Linear Algebra by Gilbert Strang. In contrast to my statistics recommendation, there certainly are more in-depth introductory books on this topic, but this one covers all I ever needed. It takes you from the dot and outer products through eigenspaces to the Singular Value Decomposition. And the author is exceptionally gifted at teaching algebraic thinking. (The book can be "watched" for free in the form of videos of his classes.) As a bonus, the final chapters explain how this material ties in with advanced data science techniques and machine learning models.
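As a small taste of the starting point mentioned above, the dot and outer products can be sketched in a few lines of plain Python. The helper names and example vectors are my own choices for illustration, not taken from the book; a real project would use a numerical library instead of nested lists.

```python
def dot(u, v):
    """Dot (inner) product: the sum of element-wise products, a single number."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def outer(u, v):
    """Outer product: a len(u) x len(v) matrix with entries u[i] * v[j]."""
    return [[a * b for b in v] for a in u]


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [dot(row, v) for row in m]


u = [1, 2, 3]
v = [4, 5, 6]
w = [1, 0, -1]

inner_uv = dot(u, v)    # 1*4 + 2*5 + 3*6 = 32
outer_uv = outer(u, v)  # a rank-one 3x3 matrix

# Applying the outer product to w just rescales u by the number (v . w).
lhs = matvec(outer_uv, w)
rhs = [a * dot(v, w) for a in u]
```

Here lhs and rhs are both [-2, -4, -6]: every column of an outer product is a multiple of u, which is why such a matrix has rank one. That observation is a first step toward reading the Singular Value Decomposition as a sum of rank-one pieces.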

Machine Learning

An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. While it moves quite a bit slower than the first Hastie & Tibshirani book (The Elements of Statistical Learning), I think it is the better start, and it has a ton of exercises using R. The first few chapters in particular discuss basic aspects that are critical to this line of work, like the bias-variance tradeoff you need to keep in mind when developing your machine learning models. This is an excellent read once the Sleuth above is no longer a problem, as it does assume a solid background in basic statistics, distributions, and regression techniques.
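The bias-variance tradeoff can be made concrete with a toy simulation. The sketch below is not an example from the book: it swaps in a deliberately simple setting, estimating a mean with a shrunken sample average, and all parameter values (true mean 1, standard deviation 2, samples of size 5, 2000 repetitions) are assumptions chosen only for illustration.

```python
import random
import statistics


def simulate(shrink, mu=1.0, sigma=2.0, n=5, trials=2000, seed=0):
    """Return (bias, variance, mse) of the estimator shrink * sample_mean."""
    rng = random.Random(seed)  # same seed, so every shrink value sees the same samples
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        estimates.append(shrink * statistics.fmean(sample))
    bias = statistics.fmean(estimates) - mu
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean((e - mu) ** 2 for e in estimates)
    return bias, variance, mse


# Plain sample mean (shrink = 1.0) versus a mean pulled halfway toward zero.
bias_plain, var_plain, mse_plain = simulate(shrink=1.0)
bias_shrunk, var_shrunk, mse_shrunk = simulate(shrink=0.5)
```

The plain mean is essentially unbiased but noisy on such small samples. The shrunken version is clearly biased, yet its variance drops to a quarter, and under these particular assumptions its overall mean squared error ends up lower. In both cases the error splits exactly into squared bias plus variance, which is the decomposition behind the tradeoff.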


After studying these three books, you will be ready to extend your knowledge with domain-specific techniques. Which ones should depend on the kind of data you want to process (text mining, time series, bioinformatics, signal processing, etc.) and on the specific models you want to apply (neural networks, graphical models, kernel machines, structural equation models, etc.). In other words, the above three books will not turn you into a data science expert, but they will give you an extremely solid foundation from which you can comfortably study the advanced topics and techniques that will.