What are the best books [for programmers] to get into Data Science?


This is a question I frequently get from colleagues who are serious programmers or software developers and are planning to pick up data analytics. Assuming that you are already an expert in software development and informatics, there are three fundamental mathematical topics you need to cover: statistics, linear algebra, and machine learning. I will therefore recommend one book for each of these topics. Even I continue to consult these books as reference material.

There might be a fourth foundation worth mentioning: calculus. I don't really have a recommendation for that, as it has never been essential to my work. As long as you remember how to differentiate, integrate, and solve an ODE, you are pretty much set. In other words, fresh high-school knowledge of calculus has always been good enough for me. And there is no need to write any proofs, unless you want to develop your own machine learning methods…

The R and Python Environments

Before presenting the books, let me recommend one tool that you should pick up for the books' practical exercises: R. R is the de facto tool for statistics and data analytics today and, as I would judge, the RStudio environment has put S, SPSS, SAS, and a number of other commercial tools into their digital grave. Even big pharma, healthcare, media, banks, and insurers are apparently beginning to embrace R. R provides a huge ecosystem of packages that lets you quickly explore and test models, inspect your data, develop theories, and visualize results. Any of these tasks can be extremely tedious with other tools, but a few free and commercial statistics, math, and machine learning environments (e.g., MATLAB or Octave) might be viable alternatives if you already know them. Python is extremely well suited (maybe even better than R) if you are interested in language processing, computer vision, or neural networks, or if you generally strive for a strong machine learning focus. That said, I consider R the clearly better choice for advanced statistics and data science. In any case, I recommend working through the books' exercises in your preferred interactive environment, be it R or Python - or both!


Statistics

The Statistical Sleuth by Fred Ramsey and Daniel W. Schafer. While there might be “simpler” books around, this one gives the most compact and best overview of applied statistics that I have read to date (t and F distributions, ANOVA, multivariate and logistic regression, etc.). It contains exercises that are very good at teaching you how to apply and think with statistics. If you can only read one book, or only need to work the statistics angle, then this is “the” book. Most of the machine learning material can be understood easily after digesting this book, and I particularly like the very applied approach that Ramsey & Schafer take. Be warned, though: if you have absolutely no idea what statistics is about, e.g., if you cannot describe what variance is, this book will probably be a bit too intense and you might want to read something more elementary first.
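
As a rough self-check for the prerequisite mentioned above, here is a minimal sketch in Python (standard library only) of what variance is and how it feeds into a two-sample t statistic. The measurements and the helper name welch_t are made up for illustration and are not taken from the book; I use Welch's variant of the t statistic here simply because it is short to write down.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # statistics.variance is the *sample* variance: the sum of squared
    # deviations from the mean, divided by n - 1.
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / standard_error


# Made-up measurements for two hypothetical groups.
control = [4.1, 3.9, 4.3, 4.0, 4.2]
treated = [4.8, 5.1, 4.9, 5.3, 5.0]

print("sample variance of control:", statistics.variance(control))
print("Welch t statistic:", welch_t(treated, control))
```

If you can explain why the difference in means is divided by a standard error, and why that standard error shrinks with more observations, you are probably ready for the Sleuth.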

Linear Algebra

Introduction to Linear Algebra by Gilbert Strang. Contrary to my statistics recommendation, on this topic there certainly are more in-depth introductory books around, but this one covers all I ever needed. It takes you from the dot and outer product via eigenspaces to the Singular Value Decomposition. The author is also exceptionally gifted at teaching algebraic thinking. (The book can be "watched" for free in the form of videos of his classes.) As a bonus, the final chapters explain how this material ties in with advanced data science techniques and machine learning models.
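
To make the starting points of that journey concrete, here is a small sketch in plain Python of the dot product, the outer product, and one simple way to find a dominant eigenvector. The function names are my own, and power iteration is just a convenient illustration I picked, not necessarily how the book presents eigenvalues. In practice you would use a numerical library rather than hand-rolled lists.

```python
import math


def dot(u, v):
    """Dot (inner) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def outer(u, v):
    """Outer product: a len(u) x len(v) matrix with entries u[i] * v[j]."""
    return [[x * y for y in v] for x in u]


def matvec(m, v):
    """Matrix-vector product, one dot product per row."""
    return [dot(row, v) for row in m]


def dominant_eigenpair(m, iterations=100):
    """Power iteration: repeatedly apply m and renormalise the vector."""
    v = [1.0] * len(m)
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    eigenvalue = dot(v, matvec(m, v))  # Rayleigh quotient, valid because |v| = 1
    return eigenvalue, v


print(dot([1, 2, 3], [4, 5, 6]))   # 32
print(outer([1, 2], [3, 4]))       # [[3, 4], [6, 8]]

A = [[2.0, 1.0], [1.0, 3.0]]       # small symmetric example matrix
eigenvalue, eigenvector = dominant_eigenpair(A)
print(eigenvalue, eigenvector)
```

For this symmetric example the iteration settles on the largest eigenvalue, (5 + sqrt(5)) / 2, and multiplying A by the returned vector simply rescales it by that value.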

Machine Learning

R Focus Option

An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. While it moves quite a bit more slowly and is less detailed than the first Hastie & Tibshirani book (The Elements of Statistical Learning), I think it is the better start, and it has a ton of exercises using R. The first few chapters in particular discuss basic aspects that are critical to this line of work, such as the bias-variance tradeoff to keep in mind when developing your machine learning models. This is an excellent read once the Sleuth above is no longer a problem, as it does assume a solid background in basic statistics, distributions, and regression techniques.
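
To illustrate the bias-variance tradeoff mentioned above, here is a toy simulation in plain Python. Everything in it is an assumption of mine rather than an example from the book: the sine-shaped "true" signal, the noise level, the k-nearest-neighbour averaging model, and all function names. The idea is to refit the same model on many freshly drawn training sets and see how its prediction at one fixed point is off on average (bias) and how much it jumps around (variance).

```python
import math
import random
import statistics


def true_signal(x):
    """Hypothetical ground truth that the model is trying to recover."""
    return math.sin(2 * math.pi * x)


def knn_predict(train, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x0))[:k]
    return statistics.mean([y for _, y in nearest])


def bias_variance(k, x0=0.25, n_train=30, n_trials=200, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of the k-NN prediction at x0."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_trials):
        # Draw a fresh noisy training set for every trial.
        xs = [rng.random() for _ in range(n_train)]
        train = [(x, true_signal(x) + rng.gauss(0, noise_sd)) for x in xs]
        predictions.append(knn_predict(train, x0, k))
    bias_squared = (statistics.mean(predictions) - true_signal(x0)) ** 2
    variance = statistics.pvariance(predictions)
    return bias_squared, variance


for k in (1, 5, 25):
    b2, var = bias_variance(k)
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}")
```

With a small k the model is flexible: it tracks the signal closely (low bias) but inherits the noise of individual points (high variance). With a large k it averages over most of the data, which calms the variance but pulls the prediction away from the true value (high bias).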

Python Focus Option

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron. This book has, for me, taken the crown as the best ML-focused introduction to Data Science. First, because the practical examples are in Python and use Scikit-Learn and TensorFlow, arguably the most important language, library, and framework for ML today, respectively. Second, the extremely well thought-out exercises and practical examples are an excellent source of "learning the hard way". I have never seen a better, more complete practical course than the examples in this book. Just like the ISL book (above), it has an excellent introduction on how to actually "do" machine learning. The biggest difference is its much stronger focus on deep learning.

Either way, if you want to learn more than just the basics and dive into the rich world of ML models, you will probably have to stick your nose into Machine Learning: A Probabilistic Perspective by Kevin P. Murphy, or the upcoming Probabilistic Machine Learning: An Introduction and Probabilistic Machine Learning: Advanced Topics by the same author.


After studying these three books, you will be ready to extend your knowledge with domain-specific techniques. Which ones you pick should depend on the kind of data you want to process (text mining, time series, bioinformatics, signal processing, etc.) and on the specific models you want to apply (neural networks, graphical models, kernel machines, structural equation models, etc.). In other words, the above three books will not turn you into a data science expert, but they will provide you with an extremely solid foundation from which you can readily study the advanced topics and techniques that will.