2018-02-15 A quick TensorFlow reference

# A quick TensorFlow reference

# Introduction

Given the last few years of hype around Deep Learning, knowing one of the major frameworks is probably no longer optional, at least if you are a professional Machine Learning engineer. Personally, I always favor free, open source solutions, so Apache MXNet would be the natural fit. However, I must admit that for research & development, Facebook's PyTorch is probably the nicer API, and for operations & production, Google's TensorFlow is beyond doubt the best fit right now. And, as I've mostly moved out of academia/research lately, and am pretty dedicated to consulting by now (and loving it!), I am mostly interested in tooling that I can put to good use in my day-to-day work. Hence, I have been - not without remorse - focusing on TensorFlow and want to share my personal notes with you, to use as a quick reference when setting up a new tensor graph.

I mostly learned how to use TensorFlow from Aurélien Géron's book "[Hands-On Machine Learning with Scikit-Learn and TensorFlow](http://shop.oreilly.com/product/0636920052289.do)". So if you are familiar with it and my notes below remind you of it, that is not by accident: these are my modified excerpts, all taken from that book. In fact, if you really are interested in using TensorFlow, I cannot emphasize enough how much I recommend reading this book. And if you are not familiar with the Scikit-Learn world, or simply want to learn how you can combine these two essential Machine Learning frameworks into "one ring to rule them all", you need to read this book. Full stop. If you are a serious Machine Learning practitioner, you should either have read it, or at least make sure you already know the techniques presented in it. Beyond just giving you tips, the book is chock-a-block full of up-to-date examples and highly relevant exercises.
Aurélien even recently [released more example code](https://github.com/ageron/handson-ml/blob/master/extra_capsnets.ipynb) to reproduce G. E. Hinton's Capsule Network architecture (using dynamic routing) with TensorFlow, and generally ensures the practical material is very much up to date with the latest TensorFlow releases. (Disclaimer: I am in no way affiliated with Aurélien or O'Reilly, nor would I otherwise benefit from sales of this book!)

That being said, let's dive right in!

# Setup

## Miniconda

I strongly recommend using the [Anaconda](https://www.anaconda.com/download/) distribution to set up TensorFlow, and Python in general. And, for the more expert users, do go directly to [miniconda](https://conda.io/miniconda.html) ("don't go over Anaconda, and don't collect 200MB of (irrelevant) bytes"). The packages you want to install (and that will also fetch all other relevant dependencies, like Jupyter, SciPy, or NumPy) are:

``` bash
conda install -c conda-forge tensorflow
conda install -c conda-forge scikit-learn
conda install -c conda-forge matplotlib
conda install -c conda-forge jupyter_contrib_nbextensions
```

# Graph Design

## Back-propagation Basics

Input is fed into TensorFlow's (static) computational graphs via [tf.placeholder](https://www.tensorflow.org/api_docs/python/tf/placeholder) (**placeholder**) nodes. Typically, those placeholders will accept a tensor `X` and some array or matrix of labels `y`. Placeholders are part of the [io_ops](https://www.tensorflow.org/api_guides/python/io_ops) package. Their values must be provided during graph evaluation (see "Feeding the Graph"); evaluating a node that depends on an unfed placeholder raises an exception.

**Weights** `W` and **bias** `B` tensors are represented by [tf.Variable](https://www.tensorflow.org/api_docs/python/tf/Variable) nodes, via the [state_ops](https://www.tensorflow.org/api_guides/python/state_ops) package. **Variables** are stateful, that is, they maintain their values across session runs.

**Operations** are defined either by applying standard Python operators (+, -, /, \*) to nodes, or by using `tf.add`, `tf.matmul`, and the other functions from the TF [math_ops](https://www.tensorflow.org/api_guides/python/math_ops) package. The final operation typically evaluates the error or cost between the predicted `y_pred`, computed from the variable nodes, and the true `y`, fed through a placeholder. Common loss functions for this task are found in TensorFlow's [losses](https://www.tensorflow.org/api_docs/python/tf/losses) package. The outcome of this operation (following Aurélien's nomenclature, the `cost_op` node) is the one you typically want to plot on your TensorBoard (more on that at the end of this post).

``` python
import tensorflow as tf

n = 3  # number of input features (example value)
X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")  # n variables + 1 constant bias input
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
# ... graph setup with tf.Variables; a plain linear model serves as an example here ...
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
y_pred = tf.matmul(X, theta, name="y_pred")  # some last step...
# reduce the raw error (y_pred - y) to a scalar cost, here the mean squared error
cost_op = tf.reduce_mean(tf.square(y_pred - y), name="mse")
```

The final touch is to create and append the optimizer (e.g., `tf.train.GradientDescentOptimizer`). That creates the `training_op` (for minimizing the cost/error), which typically will be the node that gets sent to evaluations of a TensorFlow graph via a **Session** (`sess.run`):

``` python
# the learning rate is a required argument; 0.01 is just an example value
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(cost_op)
init = tf.global_variables_initializer()
```

## Placeholder Nodes

As discussed already, to supply data to the TF graph you designed, you insert [tf.placeholder](https://www.tensorflow.org/api_docs/python/tf/placeholder) nodes in the graph (e.g., in the bottom-most layer of your net). Placeholders don't perform any computation; they just output any data you tell them to during the graph evaluation ("execution") phase.
Optionally, you can also specify the node's `shape` if you want to enforce it. If you furthermore specify `None` for any tensor dimension, that dimension will accept any size. In the example shown below, the placeholder `A` must be of rank 2 (i.e., two-dimensional), and the tensor must have three columns, but it can have any number of rows (which typically will be the examples of the current batch).

``` python
A = tf.placeholder(tf.float32, shape=(None, 3))
```

## Reassignable Variables

Note that it is possible to feed values into variables, too, not just into placeholders, even if it is a tad unusual. To set a variable to some value during graph evaluation, use the [tf.assign](https://www.tensorflow.org/versions/master/api_docs/python/tf/assign) state operator:

``` python
# sample a random variable:
x = tf.Variable(tf.random_uniform(shape=(), minval=0.0, maxval=1.0))
# or feed a variable:
x_new_val = tf.placeholder(shape=(), dtype=tf.float32)
x_assign = tf.assign(x, x_new_val)
# now, you can feed new values into x:
with tf.Session():
    x_assign.eval(feed_dict={x_new_val: 0.5})
    print(x.eval())  # always 0.5
```

## Optimizing the Graph

The [tf.gradients](https://www.tensorflow.org/versions/master/api_docs/python/tf/gradients) function takes a cost operator (e.g., one that calculates the MSE) and a list of variables to optimize, and creates a list of ops (one per variable) that compute the gradient of the cost with regard to each variable. In other words, it performs *reverse-mode autodiff* and returns the desired list of gradients:

``` python
# one gradient node per variable; extend both lists for more variables
var1grad, var2grad = tf.gradients(cost_op, [var1, var2])
```

However, TensorFlow provides a good number of [optimizers](https://www.tensorflow.org/api_guides/python/train#Optimizers) right out of the box, for example, the Adam optimizer, saving you the need to explicitly calculate gradients or update variables yourself:

``` python
optimizer = tf.train.AdamOptimizer()
training_op = optimizer.minimize(cost_op)
```

## Adding Regularization

Regularization (typically, L1 or L2) helps prevent overfitting and therefore allows you to train your model for more epochs. Simply add the appropriate operations to your graph, to get to a regularized cost (or `loss`):

``` python
...  # construct the neural network
base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
reg_losses = tf.reduce_sum(tf.abs(weights1)) + tf.reduce_sum(tf.abs(weights2))  # L1
# scale is the regularization strength, a hyperparameter you choose
loss = tf.add(base_loss, scale * reg_losses, name="loss")
```

However, if you have many layers, this approach quickly becomes inconvenient. Instead, most TensorFlow functions in the [tf.contrib.layers](https://www.tensorflow.org/api_docs/python/tf/contrib/layers) package that create variables accept a `..._regularizer` argument for their weights and biases. Those arguments need to be functions that take the weights as their argument and return the regularization loss of that layer.

``` python
from tensorflow.contrib.framework import arg_scope
from tensorflow.contrib.layers import fully_connected

with arg_scope(
        [fully_connected],
        weights_regularizer=tf.contrib.layers.l1_regularizer(scale=0.01)):
    hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
    logits = fully_connected(hidden2, n_outputs, activation_fn=None, scope="out")
```

TensorFlow automatically adds these nodes to a special collection containing all the regularization losses.
Then, when calculating the final loss, you add them up to find your overall, final loss:

``` python
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name="loss")
```

**Max-norm regularization** (clipping the L2-norm at some threshold) has become quite popular, yet TensorFlow does not provide an off-the-shelf max-norm regularizer. The following code creates a node `clip_weights` that will clip your `weights` along their second axis (`axes=1`), so that every resulting row vector will have a max-norm of 1.0:

``` python
threshold = 1.0
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)
```

The issue then becomes accessing the weights of a [tf.contrib.layers](https://www.tensorflow.org/api_docs/python/tf/contrib/layers) module. A better solution therefore is to create a function equivalent to the [l1_regularizer](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/l1_regularizer) found in the layers module:

``` python
def max_norm_regularizer(threshold, axes=1, name="max_norm", collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None  # there is no regularization loss term
    return max_norm
```

Now, this regularization function can be used as an argument like any other regularizer would be:

``` python
hidden1 = fully_connected(X, n_hidden1, scope="hidden1",
                          weights_regularizer=max_norm_regularizer(threshold=1.0))
```

And to actually clip the weights using max-norm during session evaluation, finally, add this to your execution phase:

``` python
clip_all_weights = tf.get_collection("max_norm")  # note we used "max_norm" above

with tf.Session() as sess:
    [...]
    for epoch in range(n_epochs):
        [...]
        for X_batch, y_batch in zip(X_batches, y_batches):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)  # clip right after each training step
```

## Scopes, Modules, and Shared Variables

Variables and nodes created within a named scope are prefixed with the scope's name, and such scopes are collapsed into single nodes on the TensorBoard:

``` python
with tf.name_scope("loss") as scope:
    error = y_pred - y
    mse = tf.reduce_mean(tf.square(error), name="mse")
```

``` python
>>> print(error.op.name)
loss/sub
>>> print(mse.op.name)
loss/mse
```

E.g., to define a ReLU within a named scope (just in case: normally, you would use `tf.nn.relu`):

``` python
def relu(X):
    with tf.name_scope("relu"):
        w_shape = (int(X.get_shape()[1]), 1)
        w = tf.Variable(tf.random_normal(w_shape), name="weights")
        b = tf.Variable(0.0, name="bias")
        z = tf.add(tf.matmul(X, w), b, name="z")
        return tf.maximum(z, 0., name="relu")
```

If you want to share a variable between various components of your graph (e.g., thresholds, biases, etc.), TensorFlow provides the [tf.get_variable](https://www.tensorflow.org/api_docs/python/tf/get_variable) function, which creates a shared variable if it does not exist, or reuses it if it does. The desired behavior (creating or reusing) is controlled by the `reuse` attribute of the current [tf.variable_scope](https://www.tensorflow.org/versions/master/api_docs/python/tf/variable_scope). Note that [tf.get_variable](https://www.tensorflow.org/api_docs/python/tf/get_variable) raises an exception in both mismatched cases: if `reuse` is `False` (the default) and the variable already exists, or if `reuse` is `True` (or `scope.reuse_variables()` has been called) and the variable does not exist yet.

``` python
# as attribute:
with tf.variable_scope("relu", reuse=True):
    threshold = tf.get_variable("threshold")

# as function:
with tf.variable_scope("relu") as scope:
    scope.reuse_variables()
    threshold = tf.get_variable("threshold")
```

## Xavier and He Initialization of Model Weights

You need to use an initializer to avoid initializing all weights uniformly to the same value.
By default, the fully connected layers in `tf.contrib.layers` use Xavier initialization: either a normal distribution with µ = `0` and sigma = `sqrt[ 2 / (n_in + n_out) ]`, or a uniform distribution within `+/- sqrt[ 6 / (n_in + n_out) ]`, where the `n`'s are the sizes of the input/output connections. To use He initialization instead (which is mostly a matter of preference, but has been made popular with ResNet), you can use variance scaling initialization:

``` python
from tensorflow.contrib.layers import fully_connected, variance_scaling_initializer

# the defaults (factor=2.0, mode="FAN_IN") correspond to He initialization;
# pass mode="FAN_AVG" to scale by the average of fan-in and fan-out instead
initializer = variance_scaling_initializer()
hidden1 = fully_connected(X, n_hidden1, weights_initializer=initializer, scope="h1")
```

## Implementing a Learning Rate Scheduler

Normally, it is not necessary to add a learning rate scheduler, because the AdaGrad, RMSProp, and Adam optimizers automatically reduce the learning rate for you during training. Yet, implementing a learning rate scheduler is fairly straightforward with TensorFlow. Typically, exponential decay is recommended, because it is easy to tune and tends to converge (slightly) faster to the optimal solution. Here, we adapt Momentum to use a dynamic learning rate. Note how the decay depends on the current global step, which is incremented by the optimizer's `minimize()` call.

``` python
initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 0.1  # written as a float: 1/10 would be 0 under Python 2 integer division
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                           decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
training_op = optimizer.minimize(cost_op, global_step=global_step)
```

# Interactive Sessions

Before we get to the actual evaluation of TF graphs with sessions, let me add in a few tips that come in handy when working in interactive Python sessions.

## Resetting the Default Graph

``` python
tf.get_default_graph()
```

In Jupyter (or if using TF in a Python shell), it is common to run the same commands more than once while you are experimenting. As a result, you may end up with a [default graph](https://www.tensorflow.org/api_docs/python/tf/get_default_graph) containing many duplicate nodes. One solution is to restart the Jupyter kernel (or the Python shell), but a more convenient solution is to just reset the default graph by running:

``` python
tf.reset_default_graph()
```

In single-process TensorFlow, multiple sessions do not share any state, even if they reuse the same graph (each session gets its own copy of every variable). Beware though that in distributed TensorFlow, variable state is stored on the servers, not in the sessions, so multiple sessions can *share* the same *variables* (which is, of course, exactly what you want there).

## Using TensorFlow's \[Interactive\] Sessions

Use `InteractiveSession` in notebooks to automatically set a default session, relieving you from the need of a `with` block for the evaluation/execution phase. But do remember to close the session manually when you are done with it!

``` python
sess = tf.InteractiveSession()
...  # do graph setup
# create the initializer after the graph setup, so it covers all variables
init = tf.global_variables_initializer()
init.run()
...  # do evaluation
sess.close()
```

When running TensorFlow **locally**, the sessions manage your variable values. So if you create a graph, then start two threads, and open a local session in each thread, both will use the same graph, yet each session will have its *own* copy of the variables. However, in **distributed** TensorFlow sessions, variable values are stored in containers managed by the TF cluster (see [tf.get_variable](https://www.tensorflow.org/api_docs/python/tf/get_variable)). So if both sessions connect to the same cluster and use the same container, then they will share the same variable values.
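
To make the local case concrete, here is a minimal sketch of two local sessions working on the same graph. It is an illustration under assumptions, not something taken from the book: it assumes a TensorFlow build that exposes the 1.x-style API via `tensorflow.compat.v1` (on a plain 1.x install, `import tensorflow as tf` without the `disable_v2_behavior()` call should do), and the variable `w` and the `increment` op are hypothetical names made up for this example.

```python
# Assumption: the 1.x-style API is available through tensorflow.compat.v1.
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # static graphs and sessions, as in the rest of this post

graph = tf.Graph()
with graph.as_default():
    w = tf.Variable(1.0, name="w")      # hypothetical example variable
    increment = tf.assign_add(w, 1.0)   # op that adds 1 to w
    init = tf.global_variables_initializer()

# two local sessions that both use the very same graph
sess1 = tf.Session(graph=graph)
sess2 = tf.Session(graph=graph)
sess1.run(init)
sess2.run(init)

sess1.run(increment)  # only touches the copy of w managed by sess1

value_in_sess1 = sess1.run(w)
value_in_sess2 = sess2.run(w)
print(value_in_sess1, value_in_sess2)

sess1.close()
sess2.close()
```

If the sessions really keep separate copies, as described above, the increment run in `sess1` should not be visible in `sess2`, so the two printed values should differ.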

# Model Evaluation

## Scaling Variables

When using a Gradient Descent method, remember that it is important to **normalize** the input feature vectors, or else training may progress much more slowly. You can do this using TensorFlow, NumPy, Scikit-Learn's StandardScaler, or any other solution you prefer. In fact, with NumPy arrays, this is pretty straightforward:

``` python
import numpy as np

# divide every column by its largest absolute value (max-abs scaling)
scaled = data / np.max(np.abs(data), 0)
```

## Running the Graph

Once the graph has been designed (incl. a `training_op` node) and the initializer (an `init` node) has been set up (see Graph Design), a typical snippet for the (batched) execution of the training phase is:

``` python
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(data.train.num_examples // batch_size):
            X_batch, y_batch = next_batch(batch_size)  # helper returning the next mini-batch
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
```

## Feeding the Graph with Data

When you evaluate the graph, you pass a feed dictionary (`feed_dict`) to the target ("output") node's `eval()` method (or to `run()`, if the target is an operation such as `training_op`). You specify the value of a placeholder (input) node by using the node itself as the key of the feed dictionary.

``` python
A = tf.placeholder(tf.float32, shape=(None, 3))
...  # more graph setup, down to the training_op
with tf.Session():
    ...  # initialize the variables
    training_op.run(feed_dict={A: data})  # <- feeding
```

Note that you can feed data into *any* kind of node, not just placeholders. When you feed a node that is not a placeholder, TensorFlow does not evaluate that node's operation; it simply uses the value you fed (see Reassignable Variables in Graph Design for more info).

## Mini-batching with TensorFlow

Instead of feeding all data at once, you typically will mini-batch your data as follows:

1. Create a session (`with tf.Session() as sess`)
2. Run the variable initializer (`sess.run(init)`)
3. Loop over the epochs and batches, feeding each mini-batch to the session (`sess.run(training_op, feed_dict={X: X_batch, y: y_batch})`)
4. Optionally: Write a summary every n mini-batches, to visualize the progress on your TensorBoard.

``` python
def get_batch(epoch, batch_index, batch_size):
    # somehow fetch data and labels (numpy arrays) to feed...
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = get_batch(epoch, batch_index, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
```

## Saving and Restoring Models

Create a Saver node at the end of the construction phase (after all variable nodes are created). Then, during the execution phase, call the node's `save()` method whenever you want to save the model, passing it the session and the path of the **checkpoint** file to create:

``` python
import os
import numpy as np

init = tf.global_variables_initializer()
saver = tf.train.Saver()

checkpoint_path = "/tmp/my_classifier.tfckpt"
checkpoint_epoch_path = checkpoint_path + ".tfepoch"
final_model_path = "./my_classifier.tfmodel"
best_loss = np.inf

with tf.Session() as sess:
    if os.path.isfile(checkpoint_epoch_path):
        # if the checkpoint file exists, restore the model and load the epoch number
        with open(checkpoint_epoch_path, "rb") as f:
            start_epoch = int(f.read())
        print("Training was interrupted. Continuing at epoch", start_epoch)
        saver.restore(sess, checkpoint_path)
    else:
        start_epoch = 0
        sess.run(init)

    for epoch in range(start_epoch, n_epochs):
        if epoch % 100 == 0:  # checkpoint every 100 epochs
            saver.save(sess, checkpoint_path)
            with open(checkpoint_epoch_path, "wb") as f:
                # the checkpoint holds the state *before* this epoch is trained,
                # so this is the epoch to resume from
                f.write(b"%d" % epoch)
        # training_op itself returns None, so fetch the cost node alongside it
        # (add a feed_dict here if your graph uses placeholders)
        _, loss_val = sess.run([training_op, cost_op])
        if loss_val < best_loss:
            saver.save(sess, final_model_path)
            best_loss = loss_val
            # best_parameters = parameters.eval()
```

To use a trained model *in production*, restoring a model is just as easy: you create a Saver node at the end of the graph, just like before, but then, when beginning the execution phase, instead of initializing the variables using the typical `init` node, you call the `restore()` method of the Saver object:

``` python
with tf.Session() as sess:
    saver.restore(sess, final_model_path)  # the path the best model was saved to
    X_unseen = [...]  # some unseen (scaled) data
    # evaluate the prediction node (not the label placeholder y)
    y_pred_val = y_pred.eval(feed_dict={X: X_unseen})
```

# Monitoring with TensorBoard

One of the biggest advantages of TensorFlow over many other frameworks is the TensorBoard. It allows you to visualize the progression of any variable in your graph.

## Writing Session Summaries

To provide TensorBoard with data, you need to write TF's graph definition and some training stats (like the cost/loss) to a log directory that TensorBoard reads from. You need to use a different log directory on every run, because otherwise TensorBoard will merge the output of different runs. The solution used here is to include a timestamp in the log directory name.

``` python
from datetime import datetime

root_logdir = "tf_logs"
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
logdir = "{}/run-{}/".format(root_logdir, now)
```

Next, add a **summary node** for the node you wish to visualize on your TensorBoard and create a **file writer**. The `FileWriter` shown below will create any missing directories for you:

``` python
mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())
```

The first line creates a node in the graph that will evaluate the MSE value and write it to a TensorBoard-compatible binary log string called a **summary**. The second line creates a `FileWriter` that you will use to write summaries into the log directory. Its second (optional) parameter is the graph you want to visualize. Upon creation, the FileWriter creates the directory path if it does not exist, and writes the graph definition to a binary log file called an **events** file.

Next, you need to update the execution phase to evaluate the summary node regularly during training, and you should not forget to close the writer after training:

``` python
with tf.Session() as sess:
    [...]
    update_summary = lambda n: n % 10 == 0  # write a summary every 10 mini-batches
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            # get_batch is the helper from the mini-batching section above
            X_batch, y_batch = get_batch(epoch, batch_index, batch_size)
            if update_summary(batch_index):
                summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                file_writer.add_summary(summary_str, step)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    [...]

file_writer.close()
```

Finally, you can now visualize the stats you are recording by starting the TensorBoard server and pointing it at the log directory:

``` bash
$ tensorboard --logdir tf_logs/
```

# Epilogue

As I already advised in the beginning, if you want to learn more, I can warmly recommend that you get Aurélien Géron's fantastic book "[Hands-On Machine Learning with Scikit-Learn and TensorFlow](http://shop.oreilly.com/product/0636920052289.do)". The more advanced topics it covers (and that would blow up this blog post...) are transfer learning, distributed training, designing Recurrent Networks and Auto-encoders, and even a "beginner's guide" to Deep Reinforcement Learning. Yet I hope this tiny taste of the book's contents, spiced up with a bit of my own "opinionated" modifications, will provide you with a handy quick reference when building and using TensorFlow graphs. (And that I don't get sued by him or O'Reilly for plagiarism! :-) Please just contact me if this post is an issue: I have no problem taking it down again if it is problematic.)