## Introduction

Given the last few years of hype around Deep Learning, knowing at least one of the major frameworks is probably no longer optional, at least if you are a professional Machine Learning engineer. Personally, I always favor free, open source solutions, so Apache MXNet would be the natural fit. However, I must admit that for research & development, Facebook's PyTorch probably has the nicer API, and for operations & production, Google's TensorFlow is arguably the best fit right now. And, as I've mostly moved out of academia/research lately, and am pretty dedicated to consulting by now (and loving it!), I am mostly interested in tooling that I can put to good use in my day-to-day work. Hence, I have been focusing on TensorFlow (not without remorse) and want to share my personal notes with you, to use as a quick reference when setting up a new tensor graph.

I mostly learned how to use TensorFlow from Aurélien Géron's book "Hands-On Machine Learning with Scikit-Learn and TensorFlow". So if you are familiar with it and my notes below remind you of it, that is not by accident: these are my modified excerpts, all taken from that book. In fact, if you really are interested in using TensorFlow, I cannot emphasize enough how much I recommend reading this book. And, if you are not familiar with the Scikit-Learn world, or simply want to learn how you can combine these two essential Machine Learning frameworks into "one ring to rule them all", you need to read this book. Full stop. In fact, if you are a serious Machine Learning practitioner, you should either have read it, or at least make sure you already know the techniques presented in it. Beyond just giving you tips, the book is chock-a-block full of up-to-date examples and highly relevant exercises. Aurélien even recently released more example code, to reproduce G. E. Hinton's Capsule Network architecture (using dynamic routing) with TensorFlow, and generally ensures the practical material is very much up to date with the latest TensorFlow releases. (Disclaimer: I am in no way affiliated with Aurélien or O'Reilly, nor do I otherwise benefit from sales of this book!)

That being said, let's dive right in!

## Setup

### Miniconda

I strongly recommend using the Anaconda distribution to set up TensorFlow, and Python in general. More experienced users can go directly to Miniconda ("don't go over Anaconda, and don't collect 200MB of (irrelevant) bytes"). The packages you want to install (which will also fetch all other relevant dependencies, like Jupyter, SciPy, or NumPy) are:

```
conda install -c conda-forge tensorflow
conda install -c conda-forge scikit-learn
conda install -c conda-forge matplotlib
conda install -c conda-forge jupyter_contrib_nbextensions
```

## Graph Design

### Back-propagation Basics

Input is fed into TensorFlow's (static) computational graphs via tf.placeholder (**placeholder**) nodes. Typically, those placeholders accept a tensor `X` and some array or matrix of labels `y`. Placeholders are part of the io_ops package; their values must be provided during graph evaluation (see "Feeding the Graph"), or an exception is raised.

**Weights** `W` and **bias** `B` tensors are represented by tf.Variable nodes, via the state_ops package. **Variables** are stateful, that is, they maintain their values across session runs.
**Operations** are defined by either applying standard Python operators (+, -, /, *) to nodes, or by using the `tf.add`, `tf.matmul`, etc. functions from the TF math_ops package.

The final operation typically evaluates the error or cost between the predicted `y_pred`, computed from the state nodes, and the true `y`, modeled as an io node, as shown below.
Common loss functions for this task are found in TensorFlow's losses package.
The outcome of this operation (following Aurélien's nomenclature, the `cost_op` node) is the one you typically want to plot on your TensorBoard (more on that at the end of this post).

```
X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")  # n variables + 1 constant bias input
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
# ... graph setup with tf.Variables ...
y_pred = ...  # output of the last step of your graph
error = y_pred - y
cost_op = tf.reduce_mean(tf.square(error), name="mse")  # scalar cost to minimize (here: MSE)
```

The final touch is to create and append the optimizer (e.g., `tf.train.GradientDescentOptimizer`).
That creates the `training_op` (for minimizing the cost/error), which typically will be the node that gets evaluated when running a TensorFlow graph via a **Session** (`sess.run`):

```
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)  # learning_rate is required; 0.01 is just an example
training_op = optimizer.minimize(cost_op)
init = tf.global_variables_initializer()
```
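To make concrete what a `training_op` does on every run, here is a minimal NumPy sketch of batch gradient descent for a linear model with an MSE cost. It is not TensorFlow code: the toy data, the learning rate `eta`, and the number of steps are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data (assumed for illustration): y = 4 + 3 * x + a little noise
m = 200
x = rng.uniform(0.0, 2.0, size=(m, 1))
y = 4.0 + 3.0 * x + rng.normal(0.0, 0.1, size=(m, 1))

# Prepend the constant bias input, mirroring the (None, n + 1) placeholder shape
X = np.hstack([np.ones((m, 1)), x])

theta = np.zeros((2, 1))  # plays the role of the tf.Variable holding the weights
eta = 0.1                 # learning rate (hypothetical value)

for step in range(2000):
    error = X @ theta - y              # y_pred - y
    gradients = 2.0 / m * X.T @ error  # gradient of the MSE with regard to theta
    theta -= eta * gradients           # the update one training step performs

mse = float(np.mean((X @ theta - y) ** 2))
print(theta.ravel(), mse)
```

In TensorFlow, the optimizer builds the gradient and update nodes for you; the loop body above is roughly what each `sess.run(training_op, ...)` amounts to for this simple model.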

### Placeholder Nodes

As discussed already, to supply data to the TF graph you designed, you insert tf.placeholder nodes in the graph (e.g., in the bottom-most layer of your net). Placeholders don’t perform any computation; they just output whatever data you tell them to during the graph evaluation ("execution") phase. Optionally, you can also specify the node's `shape` if you want to enforce it. If you specify `None` for any tensor dimension, that dimension accepts any size (determined by the data you feed).

In the example shown below, the placeholder `A` must be of rank 2 (i.e., two-dimensional), and the tensor must have three columns, but it can have any number of rows (which typically will be the current batch's examples).

```
A = tf.placeholder(tf.float32, shape=(None, 3))
```

### Reassignable Variables

Note that it is possible to feed values into variables, too, not just to placeholders, even if it is a tad unusual. To set a variable to some value during graph evaluation, use the tf.assign state operator:

```
# sample a random variable:
x = tf.Variable(tf.random_uniform(shape=(), minval=0.0, maxval=1.0))
# or feed a variable:
x_new_val = tf.placeholder(shape=(), dtype=tf.float32)
x_assign = tf.assign(x, x_new_val)
# now, you can feed new values into x:
with tf.Session():
    x_assign.eval(feed_dict={x_new_val: 0.5})
    print(x.eval())  # always 0.5
```

### Optimizing the Graph

The tf.gradients function takes a cost operator (e.g., to calculate the MSE) and a list of variables to optimize, and creates a list of ops (one per variable) that compute the gradient of the cost with regard to each variable, returning the desired list of gradients (a.k.a. performing *reverse-mode autodiff*):

```
var1_grad, var2_grad = tf.gradients(cost_op, [var1, var2])  # one gradient tensor per variable
```

However, TensorFlow provides a good number of optimizers right out of the box, for example, the Adam optimizer, saving you the need to explicitly calculate gradients or update variables yourself:

```
optimizer = tf.train.AdamOptimizer()
training_op = optimizer.minimize(cost_op)
```
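If you ever want to sanity-check what a gradient node should return, you can derive the gradient by hand and compare it against finite differences. The NumPy sketch below does this for the MSE of a linear model. The data and the helper names (`mse`, `analytic_grad`, `numeric_grad`) are hypothetical, and this is not how TensorFlow computes gradients internally (it uses reverse-mode autodiff).

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error of the linear model X @ theta."""
    return float(np.mean((X @ theta - y) ** 2))

def analytic_grad(theta, X, y):
    """Hand-derived gradient of the MSE with regard to theta."""
    m = X.shape[0]
    return 2.0 / m * X.T @ (X @ theta - y)

def numeric_grad(theta, X, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = eps
        grad.flat[i] = (mse(theta + step, X, y) - mse(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=(50, 1))
theta = rng.normal(size=(3, 1))

max_diff = np.max(np.abs(analytic_grad(theta, X, y) - numeric_grad(theta, X, y)))
print(max_diff)  # should be tiny if the analytic gradient is right
```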

### Adding Regularization

Regularization (typically, L1 or L2) helps to prevent overfitting, which in turn lets you train your model for more epochs. Simply add the appropriate operations to your graph to get a regularized cost (or `loss`):

```
... # construct the neural network
base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
reg_losses = tf.reduce_sum(tf.abs(weights1)) + tf.reduce_sum(tf.abs(weights2)) # L1
loss = tf.add(base_loss, scale * reg_losses, name="loss")
```

However, if you have many layers, this approach quickly becomes inconvenient. Instead, most TensorFlow functions in the tf.contrib.layers package that create variables accept a `..._regularizer` argument for their weights and biases.
Those arguments need to be functions that take the weights as their argument and return the regularization loss of that layer.

```
from tensorflow.contrib.framework import arg_scope
from tensorflow.contrib.layers import fully_connected

with arg_scope(
        [fully_connected],
        weights_regularizer=tf.contrib.layers.l1_regularizer(scale=0.01)):
    hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
    logits = fully_connected(hidden2, n_outputs, activation_fn=None, scope="out")
```

TensorFlow automatically adds these nodes to a special collection containing all the regularization losses. Then, when calculating the final loss, you add them up to find your overall, final loss:

```
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name="loss")
```
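Numerically, the regularized loss is just the base loss plus the scaled sum of absolute weights. The following NumPy sketch mirrors the manual L1 example with made-up weight matrices; `l1_penalty`, the weights, `base_loss`, and `scale` are all hypothetical values for illustration.

```python
import numpy as np

def l1_penalty(weights, scale):
    """Scaled sum of absolute weight values (the L1 regularization term)."""
    return scale * float(np.sum(np.abs(weights)))

# Made-up weights of two layers and a made-up base loss
weights1 = np.array([[0.5, -1.0], [2.0, 0.0]])
weights2 = np.array([[-0.25], [0.75]])
base_loss = 0.3
scale = 0.01

# Overall loss: base loss plus one penalty term per layer
loss = base_loss + sum(l1_penalty(w, scale) for w in (weights1, weights2))
print(loss)
```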

**Max-norm regularization** (constraining the L2 norm of the weights to some threshold) has become quite popular, yet TensorFlow does not provide an off-the-shelf max-norm regularizer. The following code creates a node `clip_weights` that will clip your `weights` along their second axis (`axes=1`), so that every resulting row vector will have an L2 norm of at most 1.0:

```
threshold = 1.0
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)
```

The issue then becomes accessing the weights of a tf.contrib.layers module. A better solution therefore is to create a function equivalent to the l1_regularizer found in the layers module:

```
def max_norm_regularizer(threshold, axes=1, name="max_norm",
                         collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None  # there is no regularization loss term
    return max_norm
```

Now, this regularization function can be used as an argument like any other regularizer would be:

```
hidden1 = fully_connected(X, n_hidden1, scope="hidden1",
                          weights_regularizer=max_norm_regularizer(threshold=1.0))
```

And to actually clip weights using max-norm during session evaluation, finally, add this to your execution phase:

```
clip_all_weights = tf.get_collection("max_norm")  # note we used "max_norm" above
with tf.Session() as sess:
    [...]
    for epoch in range(n_epochs):
        [...]
        for X_batch, y_batch in zip(X_batches, y_batches):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)
```

### Xavier and He Initialization of Model Weights

Weights must not all start at the same value, or every neuron in a layer receives the same gradient and learns the same thing; a random initializer breaks this symmetry.

By default, the fully connected layers in `tf.contrib.layers` use Xavier initialization: either a normal distribution with µ = `0` and sigma = `sqrt[ 2 / (n_in + n_out) ]`, or a uniform distribution within `+/- sqrt[ 6 / (n_in + n_out) ]`, where the `n`'s are the sizes of the input/output connections.

To use He initialization instead (which is mostly a matter of preference, but has been made popular with ResNet), you can use variance scaling initialization:

```
from tensorflow.contrib.layers import fully_connected, variance_scaling_initializer
# mode="FAN_AVG" scales by the average of n_in and n_out; the default, "FAN_IN", uses n_in only
initializer = variance_scaling_initializer(mode="FAN_AVG")
hidden1 = fully_connected(X, n_hidden1, weights_initializer=initializer, scope="h1")
```
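To get a feeling for the magnitudes involved, the formulas above are easy to evaluate in plain Python. The sketch below assumes the fan-average variant of He initialization (the Xavier standard deviation times `sqrt(2)`); the function names and the layer sizes are hypothetical.

```python
import math

def xavier_normal_stddev(n_in, n_out):
    """Standard deviation for Xavier initialization with a normal distribution."""
    return math.sqrt(2.0 / (n_in + n_out))

def xavier_uniform_limit(n_in, n_out):
    """Weights are drawn uniformly from [-limit, +limit]."""
    return math.sqrt(6.0 / (n_in + n_out))

def he_normal_stddev(n_in, n_out):
    """Assumed He variant for ReLU layers, using the fan average: sqrt(2) times Xavier."""
    return math.sqrt(2.0) * xavier_normal_stddev(n_in, n_out)

n_in, n_out = 784, 300  # hypothetical layer sizes
print(xavier_normal_stddev(n_in, n_out))
print(xavier_uniform_limit(n_in, n_out))
print(he_normal_stddev(n_in, n_out))
```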

### Implementing a Learning Rate Scheduler

Normally, it is not necessary to add a learning rate scheduler, because the AdaGrad, RMSProp, and Adam optimizers automatically reduce the learning rate for you during training. Yet, implementing a learning rate scheduler is fairly straightforward with TensorFlow. Typically, exponential decay is recommended, because it is easy to tune and converges (slightly) faster to the optimal solution. Here, we adapt Momentum to use a dynamic learning rate. Note how the decay depends on the current global step, which the optimizer's `minimize()` function increments on every training step.

```
initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1/10
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                           decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
training_op = optimizer.minimize(cost_op, global_step=global_step)
```
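The schedule itself is a one-liner: without staircase mode, the learning rate at a given step is `initial_learning_rate * decay_rate ** (global_step / decay_steps)`. A plain-Python sketch, reusing the parameter values from the snippet above (the function is a stand-in for illustration, not the TensorFlow op):

```python
def exponential_decay(initial_learning_rate, global_step, decay_steps, decay_rate):
    """Continuous (non-staircase) exponential decay of the learning rate."""
    return initial_learning_rate * decay_rate ** (global_step / decay_steps)

# With decay_rate = 1/10, the rate shrinks tenfold every decay_steps steps
for step in (0, 5000, 10000, 20000):
    print(step, exponential_decay(0.1, step, 10000, 0.1))
```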

## Interactive Sessions

Before we get to the actual evaluation of TF graphs with sessions, let me add in a few tips that come in handy when working in interactive Python sessions.

### Resetting the Default Graph

```
tf.get_default_graph()  # the graph that every new node gets added to
```

In Jupyter (or if using TF in a Python shell), it is common to run the same commands more than once while you are experimenting. As a result, you may end up with a default graph containing many duplicate nodes. One solution is to restart the Jupyter kernel (or the Python shell), but a more convenient solution is to just reset the default graph by running:

```
tf.reset_default_graph()
```

In single-process TensorFlow, multiple sessions do not share any state, even if they reuse the same graph (and each session gets its own copy of every variable). Beware though, that in distributed TensorFlow, variable state is stored on the servers, not in the sessions, so multiple sessions can *share* the same *variables* (actually, that is a good, desired thing, obviously).

### Using TensorFlow's [Interactive] Sessions

Use `InteractiveSession` in notebooks to automatically set a default session, relieving you from the need of a `with` block for the evaluation/execution phase. But do remember to close the session manually when you are done with it!

```
sess = tf.InteractiveSession()
... # do graph setup
init = tf.global_variables_initializer()  # create after all variables are defined
init.run()
... # do evaluation
sess.close()
```

When running TensorFlow **locally**, the sessions manage your variable values. So if you create a graph, then start two threads, and open a local session in each thread, both will use the same graph, yet each session will have its *own* copy of the variables.

However, in **distributed** TensorFlow sessions, variable values are stored in containers managed by the TF cluster (see tf.get_variable). So if both sessions connect to the same cluster and use the same container, then they will share the same value for a given variable (say, `w`).

## Model Evaluation

### Scaling Variables

When using a Gradient Descent method, remember that it is important to **normalize** the input feature vectors, or else training may progress much more slowly.
You can do this using TensorFlow, NumPy, Scikit-Learn’s StandardScaler, or any other solution you prefer. In fact, with NumPy arrays, this is pretty straightforward (here, by dividing each feature by its largest absolute value):

```
import numpy as np
scaled = data / np.max(np.abs(data), 0)
```
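If you prefer zero mean and unit variance per feature (which is what a standard scaler does), a NumPy-only sketch could look as follows. The helpers `fit_scaler` and `transform` and the toy arrays are hypothetical; the important design choice is that the statistics come from the training data only and are then reused for any unseen data.

```python
import numpy as np

def fit_scaler(train):
    """Compute per-feature mean and standard deviation on the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant features unscaled instead of dividing by zero
    return mean, std

def transform(data, mean, std):
    """Apply previously fitted statistics to any data set."""
    return (data - mean) / std

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
unseen = np.array([[2.0, 40.0]])

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
unseen_scaled = transform(unseen, mean, std)  # reuses the training statistics
print(train_scaled)
print(unseen_scaled)
```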

### Running the Graph

Once the graph has been designed (incl. a `training_op` node) and the initializer (an `init` node) has been set up (see Graph Design), a typical snippet for the (batched) execution of the training phase is:

```
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(data.train.num_examples // batch_size):
            X_batch, y_batch = next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
```

### Feeding the Graph with Data

When you evaluate the graph, you pass a feed dictionary (`feed_dict`) to the target ("output") node's `eval()` method (for tensors) or `run()` method (for operations, such as a `training_op`). You specify the value of a placeholder (input) node by using the node itself as the key of the feed dictionary.

```
A = tf.placeholder(tf.float32, shape=(None, 3))
... # more graph setup, down to the training_op
training_op.run(feed_dict={A: data}) # <- feeding; operations have run(), tensors have eval()
```

Note that you can feed data into *any* kind of node, not just placeholders. When you feed a node other than a placeholder, TensorFlow will not evaluate that node's operation; it simply uses the value you fed (see Reassignable Variables in Graph Design for more info).

### Mini-batching with TensorFlow

Instead of feeding all data at once, you typically will mini-batch your data as follows:

- Create a session (`with tf.Session() as sess`)
- Run the variable initializer (`sess.run(init)`)
- Loop over the epochs and batches, feeding each mini-batch to the session (`sess.run(training_op, feed_dict={X: X_batch, y: y_batch})`)
- Optionally: Write a summary every n mini-batches, to visualize the progress on your TensorBoard.

```
def get_batch(epoch, batch_index, batch_size):
    # somehow fetch data and labels (numpy arrays) to feed...
    return X_batch, y_batch

sess.run(init)
for epoch in range(n_epochs):
    for batch_index in range(n_batches):
        X_batch, y_batch = get_batch(epoch, batch_index, batch_size)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
```
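How you fetch the batches is up to you. One possible way, assuming the whole data set fits into memory as NumPy arrays, is a generator that shuffles the row indices once per epoch and slices them into batches. The name `iterate_minibatches` and the toy arrays below are hypothetical.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that together cover the data once."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]  # same indices for both, so rows stay aligned

rng = np.random.default_rng(0)
X_data = np.arange(20, dtype=np.float32).reshape(10, 2)
y_data = np.arange(10, dtype=np.float32).reshape(10, 1)

# One epoch with batch_size=4: two full batches and a final, smaller one
batch_shapes = [(X_batch.shape, y_batch.shape)
                for X_batch, y_batch in iterate_minibatches(X_data, y_data, 4, rng)]
print(batch_shapes)
```

Inside the training loop, each yielded pair would then go straight into the `feed_dict`.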

### Saving and Restoring Models

Create a Saver node at the end of the construction phase (after all variable nodes are created). Then, during the execution phase, call the node's `save()` method whenever you want to save the model, passing it the session and path of the **checkpoint** file to create:

```
import os
import numpy as np

init = tf.global_variables_initializer()
saver = tf.train.Saver()
checkpoint_path = "/tmp/my_classifier.tfckpt"
checkpoint_epoch_path = checkpoint_path + ".tfepoch"
final_model_path = "./my_classifier.tfmodel"
best_loss = np.inf
with tf.Session() as sess:
    if os.path.isfile(checkpoint_epoch_path):
        # if the checkpoint file exists, restore the model and load the epoch number
        with open(checkpoint_epoch_path, "rb") as f:
            start_epoch = int(f.read())
        print("Training was interrupted. Continuing at epoch", start_epoch)
        saver.restore(sess, checkpoint_path)
    else:
        start_epoch = 0
        sess.run(init)
    for epoch in range(start_epoch, n_epochs):
        # training_op itself returns None, so fetch the cost alongside it
        _, loss_val = sess.run([training_op, cost_op])
        if epoch % 100 == 0:  # checkpoint every 100 epochs, after the training step
            saver.save(sess, checkpoint_path)
            with open(checkpoint_epoch_path, "wb") as f:
                f.write(b"%d" % (epoch + 1))  # the epoch to resume from
        if loss_val < best_loss:
            saver.save(sess, final_model_path)
            best_loss = loss_val
            # best_parameters = parameters.eval()
```

To use a trained model *in production*, restoring a model is just as easy: you create a Saver node at the end of the construction phase, just like before, but then, at the beginning of the execution phase, instead of initializing the variables using the typical `init` node, you call the `restore()` method of the Saver object:

```
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    X_unseen = [...]  # some unseen (scaled) data
    y_pred_val = y_pred.eval(feed_dict={X: X_unseen})  # evaluate the prediction node, not the label placeholder y
```
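The checkpointing snippet further above stores the epoch to resume from in a small side file next to the checkpoint. That bookkeeping can be tried out without TensorFlow. In this sketch, `save_epoch` and `load_epoch` are hypothetical helpers, and a temporary directory stands in for the real checkpoint location.

```python
import os
import tempfile

def save_epoch(path, epoch):
    """Store the epoch to resume from as ASCII digits."""
    with open(path, "wb") as f:
        f.write(b"%d" % epoch)

def load_epoch(path, default=0):
    """Return the stored epoch, or `default` if there is no side file yet."""
    if not os.path.isfile(path):
        return default
    with open(path, "rb") as f:
        return int(f.read())

with tempfile.TemporaryDirectory() as tmp_dir:
    epoch_path = os.path.join(tmp_dir, "my_classifier.tfckpt.tfepoch")
    first_start = load_epoch(epoch_path)    # no file yet: start from scratch
    save_epoch(epoch_path, 101)             # pretend epoch 100 just finished
    resumed_start = load_epoch(epoch_path)  # continue right after it

print(first_start, resumed_start)
```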

## Monitoring with TensorBoard

One of the biggest advantages of TensorFlow over many other frameworks is the TensorBoard. It allows you to visualize the progression of any variable in your graph.

### Writing Session Summaries

To provide TensorBoard with data, you need to write TF's graph definition and some training stats (like the cost/loss) to a log directory that TensorBoard reads from. You need to use a different log directory on every run; otherwise, TensorBoard will merge the output of different runs. The solution used here is to include a timestamp in the log directory name.

```
from datetime import datetime
root_logdir = "tf_logs"
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
logdir = "{}/run-{}/".format(root_logdir, now)
```

Next, add a **summary node** for the value you wish to visualize on your TensorBoard and create a **file writer**. The `FileWriter` shown below will create any missing directories for you:

```
mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())
```

The first line creates a node in the graph that will evaluate the MSE value and write it to a TensorBoard-compatible binary log string called a **summary**. The second line creates a `FileWriter` that you will use to write summaries into the log directory. The second (optional) parameter is the graph you want to visualize. Upon creation, the FileWriter creates the directory path if it does not exist, and writes the graph definition in a binary log file called an **events** file.

Next, you need to update the execution phase, to evaluate the summary node regularly during training, and you should not forget to close the writer after training:

```
with tf.Session() as sess:
    [...]
    update_summary = lambda n: n % 10 == 0
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            if update_summary(batch_index):
                summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                file_writer.add_summary(summary_str, step)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    [...]
file_writer.close()
```

Finally, you now can visualize the stats you are recording by starting the TensorBoard server and pointing it at the log directory:

```
$ tensorboard --logdir tf_logs/
```

## Epilogue

As I already advised in the beginning, if you want to learn more, I can warmly recommend you get Aurélien Géron's fantastic book "Hands-On Machine Learning with Scikit-Learn and TensorFlow". The more advanced topics it covers (and that would blow up this blog post...) are transfer learning, distributed training, designing Recurrent Networks and Auto-encoders, and even a "beginner's guide" to Deep Reinforcement Learning. Yet, I hope this tiny taste of the book's contents, spiced up with a bit of my own "opinionated" modifications, will provide you with a handy quick reference when building and using TensorFlow graphs. (And, that I don't get sued by him or O'Reilly for plagiarism! :-) Please just contact me if this post is an issue -- I have no problem taking the post down again, if it is problematic.)