A Practical guide to Autoencoders using Keras
Usually in a conventional neural network, one tries to predict a target vector $y$ from input vectors $x$. In an autoencoder network, one tries to predict $x$ from $x$. It is trivial to learn a mapping from $x$ to $x$ if the network has no constraints, but if the network is constrained the learning process becomes more interesting.
In this article, we are going to take a detailed look at the mathematics of different types of autoencoders (with different constraints), along with sample implementations using Keras with a TensorFlow backend.
Basic Autoencoders
The simplest autoencoder (AE) has an MLP-like (multi-layer perceptron) structure:
 Input Layer
 Hidden Layer, and
 Output Layer
However, unlike an MLP, autoencoders do not require any target data. As the network is trying to learn $x$ itself, the learning algorithm is a special case of unsupervised learning.
Mathematically, let's define:
 Input vector: $x \in \Big[ 0, 1 \Big]^d$
 Activation function: $a(h)$ applied to every neuron of layer $h$
 $W_i \in \mathbb{R}^{I_{di} \times O_{di}}$, the parameter matrix of the $i$-th layer, projecting an $I_{di}$-dimensional input into an $O_{di}$-dimensional space
 $b_i \in \mathbb{R}^{O_{di}}$ bias vector
The simplest AE can then be summarized as:
$$ \begin{aligned} z &= a(x W_1 + b_1) \cr x' &= a(z W_2 + b_2) \end{aligned} $$
The AE model tries to minimize the reconstruction error between the input value $x$ and the reconstructed value $x'$. A typical definition of the reconstruction error is the $L_p$ distance (like $L_2$ norm) between the $x$ and $x'$ vectors:
$$ \min \mathcal{L} = \min E(x, x') \stackrel{e.g.}{=} \min \| x - x' \|_p $$
Another common loss function for AEs (especially for images) is the cross-entropy between the input and its reconstruction.
$$ \mathcal{L}(x, x') = -\sum_{c=1}^{M} x_c \log (x'_c) $$
where $M$ is the dimensionality of the input data $x$ (e.g. the number of pixels in an image).
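To make the notation concrete, here is a minimal NumPy sketch of the forward pass and the two reconstruction errors discussed in this section. It assumes a sigmoid activation for $a(\cdot)$ and small random weights; the function names and the sizes (784 inputs, 32 hidden units) are illustrative choices, not part of any library. The cross-entropy is written in its binary form, which adds a $(1 - x_c)\log(1 - x'_c)$ term for inputs in $[0, 1]$.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def autoencoder_forward(x, W1, b1, W2, b2):
    """z = a(x W1 + b1), x' = a(z W2 + b2), with a = sigmoid (an assumption)."""
    z = sigmoid(x @ W1 + b1)
    x_rec = sigmoid(z @ W2 + b2)
    return z, x_rec

def l2_error(x, x_rec):
    """L2 distance between each input row and its reconstruction."""
    return np.linalg.norm(x - x_rec, ord=2, axis=-1)

def cross_entropy_error(x, x_rec, eps=1e-7):
    """Binary cross-entropy summed over the M components of each row."""
    x_rec = np.clip(x_rec, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(x * np.log(x_rec) + (1.0 - x) * np.log(1.0 - x_rec), axis=-1)

rng = np.random.default_rng(0)
d, k = 784, 32                      # input and hidden sizes (illustrative)
x = rng.random((4, d))              # four fake inputs in [0, 1]^d
W1, b1 = rng.normal(scale=0.01, size=(d, k)), np.zeros(k)
W2, b2 = rng.normal(scale=0.01, size=(k, d)), np.zeros(d)

z, x_rec = autoencoder_forward(x, W1, b1, W2, b2)
print(z.shape, x_rec.shape)
print(l2_error(x, x_rec), cross_entropy_error(x, x_rec))
```

Training an AE amounts to adjusting `W1`, `b1`, `W2` and `b2` so that one of these errors becomes small on average.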
Autoencoders in Practice
The above example of an autoencoder is too simplistic for any real use case. If the number of units in the hidden layer is greater than or equal to the number of input units, the network can easily learn the identity function. Hence, the simplest constraint used in real-life autoencoders is that the number of hidden units (the dimension of $z$) should be less than the dimension $d$ of the input ($\dim(z) < d$).
By limiting the amount of information that can flow through the network, an AE model can learn the most important attributes of the input data and how to best reconstruct the original input from an “encoded” state. Ideally, this encoding will learn and describe latent attributes of the input data. Because of the nonlinearity and the type of constraints applied, dimensionality reduction using AEs can lead to better results than classical dimensionality reduction techniques such as PCA.
PCA and Autoencoders
If we were to construct a linear network (i.e. without the use of nonlinear activation functions at each layer) we would observe a similar dimensionality reduction as observed in PCA. See Geoffrey Hinton’s discussion.
A practical autoencoder network consists of an encoding function (encoder), and a decoding function (decoder). Following is an example architecture for the reconstruction of images.
In this article we will build different types of autoencoders for the Fashion-MNIST dataset. Instead of using the more common MNIST dataset, I prefer to use the Fashion-MNIST dataset for the reasons described here.
For an example using MNIST data, please have a look at the article by François Chollet, the creator of Keras. The code below is heavily adapted from his article.
We’ll start simple, with a single fully-connected neural layer as encoder and as decoder.
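A minimal sketch of such a model is shown below. It assumes the `tensorflow.keras` functional API, a ReLU encoder with a sigmoid output layer, and a 32-unit bottleneck (the size mentioned later in the article); the variable names are illustrative and this is not necessarily the original code.

```python
from tensorflow.keras import layers, Model

input_dim = 784      # 28 * 28 pixels, flattened
encoding_dim = 32    # bottleneck size (assumed)

input_img = layers.Input(shape=(input_dim,))
# Encoder: one dense layer compressing 784 values down to 32
encoded = layers.Dense(encoding_dim, activation="relu")(input_img)
# Decoder: one dense layer expanding back to 784 values in [0, 1]
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)

# The autoencoder maps an image to its reconstruction
autoencoder = Model(input_img, decoded)
autoencoder.summary()
```

The sigmoid on the output keeps reconstructed pixels in $[0, 1]$, which matches inputs normalized to that range.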


We will also create separate encoder and decoder models, which can be used to extract latent features at test time.
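One way to do this, sketched here under the assumption of a single-layer `tensorflow.keras` autoencoder with illustrative names, is to build two extra models that share layers (and therefore weights) with the autoencoder. The small autoencoder is rebuilt inside the snippet only so that it runs on its own.

```python
import numpy as np
from tensorflow.keras import layers, Model

input_dim, encoding_dim = 784, 32   # assumed sizes

# Single-layer autoencoder; the output layer is kept in a variable for reuse.
input_img = layers.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation="relu")(input_img)
decoder_layer = layers.Dense(input_dim, activation="sigmoid")
autoencoder = Model(input_img, decoder_layer(encoded))

# Encoder: image -> latent code (shares weights with `autoencoder`)
encoder = Model(input_img, encoded)

# Decoder: latent code -> image, reusing the autoencoder's output layer
encoded_input = layers.Input(shape=(encoding_dim,))
decoder = Model(encoded_input, decoder_layer(encoded_input))

x = np.random.rand(3, input_dim).astype("float32")
codes = encoder.predict(x, verbose=0)               # latent features
reconstructed = decoder.predict(codes, verbose=0)   # decoded images
print(codes.shape, reconstructed.shape)
```

Because the layers are shared, training `autoencoder` also updates what `encoder` and `decoder` compute.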


We can now set the optimizer and the loss function before training the autoencoder model.
autoencoder.compile(optimizer='rmsprop', loss='binary_crossentropy')
Next, we need to get the Fashion-MNIST data and normalize it for training. Furthermore, we will flatten the $28\times28$ images to a vector of size 784. Please note that running the code below for the first time will download the full dataset and hence might take a few minutes.
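A sketch of this preprocessing step follows. The helper names are hypothetical; the only library call assumed is `fashion_mnist.load_data()` from `tensorflow.keras.datasets`, and the download is wrapped in a function so that nothing is fetched until you call it.

```python
import numpy as np

def normalize_and_flatten(images):
    """Scale uint8 pixels to [0, 1] and flatten each image to a vector."""
    images = images.astype("float32") / 255.0
    return images.reshape((len(images), -1))

def load_fashion_mnist_flat():
    """Download Fashion-MNIST through Keras; labels are discarded."""
    from tensorflow.keras.datasets import fashion_mnist
    (x_train, _), (x_test, _) = fashion_mnist.load_data()
    return normalize_and_flatten(x_train), normalize_and_flatten(x_test)

# Usage (triggers the download on first run):
# x_train, x_test = load_fashion_mnist_flat()
# print(x_train.shape, x_test.shape)
```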


Output: (60000, 784) (10000, 784)
We can now train our model for 100 epochs:


This will print the training and validation loss for each epoch. We can also plot the loss history during training using the history object.
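The article's code calls a helper named `plot_train_history_loss(history)`. Its body is not shown here, so the following is only a plausible reconstruction, assuming `matplotlib` and the standard `history.history` dictionary that Keras returns from `fit`.

```python
import matplotlib.pyplot as plt

def plot_train_history_loss(history):
    """Plot per-epoch training (and, if present, validation) loss."""
    fig, ax = plt.subplots()
    ax.plot(history.history["loss"], label="train")
    if "val_loss" in history.history:
        ax.plot(history.history["val_loss"], label="test")
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    ax.set_title("model loss")
    ax.legend(loc="upper right")
    return fig
```

Call `plt.show()` afterwards when running outside a notebook.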


After 100 epochs, the autoencoder reaches a stable train/test loss value of about 0.282. Let us look at how good a reconstruction this simple model produces!
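The plotting helper used throughout the article is called `display_reconstructed`; it is invoked both with and without reconstructed images. Its original body is not shown, so the version below is a hedged sketch, assuming `matplotlib` and flattened or 2D inputs that can be reshaped to $28\times28$.

```python
import numpy as np
import matplotlib.pyplot as plt

def display_reconstructed(x_test, decoded_imgs=None, n=10):
    """Show n originals (top row) and, if given, reconstructions (bottom row)."""
    rows = 1 if decoded_imgs is None else 2
    fig, axes = plt.subplots(rows, n, figsize=(2 * n, 2 * rows), squeeze=False)
    for i in range(n):
        axes[0, i].imshow(np.reshape(x_test[i], (28, 28)), cmap="gray")
        axes[0, i].axis("off")
        if decoded_imgs is not None:
            axes[1, i].imshow(np.reshape(decoded_imgs[i], (28, 28)), cmap="gray")
            axes[1, i].axis("off")
    return fig

# Typical usage after training:
# decoded_imgs = autoencoder.predict(x_test)
# display_reconstructed(x_test, decoded_imgs, 10)
```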


The top row is the original image, while the bottom row is the reconstructed image. We can see that we are losing a lot of fine details.
Sparsity Constraint
We can add an additional constraint to the above AE model: a sparsity constraint on the latent variables. Mathematically, this is achieved by adding a sparsity penalty $\Omega(\mathbf{h})$ on the bottleneck layer $\mathbf{h}$.
$$ \min \mathcal{L} = \min E(x, x') + \Omega(h) $$
where $\mathbf{h}$ is the encoder output.
Sparsity is a desired characteristic for an autoencoder, because it allows us to use a greater number of hidden units (even more than the number of inputs) and therefore gives the network the ability to learn different connections and extract different features (compared with the features extracted when the only constraint is on the number of hidden units). Moreover, sparsity can be used together with the constraint on the number of hidden units: the combination of these hyperparameters then needs to be tuned to achieve better performance.
In Keras, a sparsity constraint can be imposed by adding an `activity_regularizer` to our Dense layer:
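A minimal sketch, assuming `tensorflow.keras` and an L1 penalty on the bottleneck activations: the coefficient `1e-5` and the layer sizes are illustrative assumptions that would normally be tuned.

```python
from tensorflow.keras import layers, regularizers, Model

input_img = layers.Input(shape=(784,))
# The L1 activity regularizer adds a penalty proportional to sum(|h|),
# which plays the role of the sparsity penalty Omega(h) on the encoder output.
encoded = layers.Dense(
    32,
    activation="relu",
    activity_regularizer=regularizers.l1(1e-5),  # assumed strength
)(input_img)
decoded = layers.Dense(784, activation="sigmoid")(encoded)

sparse_autoencoder = Model(input_img, decoded)
sparse_autoencoder.compile(optimizer="rmsprop", loss="binary_crossentropy")
```

Keras adds the regularization term to the reconstruction loss during training, so no change to the `fit` call is needed.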


Similar to the previous model, we can train this one too, this time for 150 epochs. A regularized model is less likely to overfit and hence can be trained for longer.


We get a loss very similar to that of the previous example. Here is a plot of loss values during training.
As expected, the reconstructed images also look quite similar to those from before.


Deep Autoencoders
So far we have used only single layers for the encoder and the decoder. Given enough data, nothing stops us from building deeper networks for both.
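For illustration, a deeper fully-connected autoencoder could look like the sketch below. The layer widths (128, 64, 32) are assumptions chosen to narrow gradually toward a 32-unit bottleneck, and `tensorflow.keras` is assumed.

```python
from tensorflow.keras import layers, Model

input_img = layers.Input(shape=(784,))

# Encoder: 784 -> 128 -> 64 -> 32
x = layers.Dense(128, activation="relu")(input_img)
x = layers.Dense(64, activation="relu")(x)
encoded = layers.Dense(32, activation="relu")(x)

# Decoder: 32 -> 64 -> 128 -> 784
x = layers.Dense(64, activation="relu")(encoded)
x = layers.Dense(128, activation="relu")(x)
decoded = layers.Dense(784, activation="sigmoid")(x)

deep_autoencoder = Model(input_img, decoded)
```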


We can train this model, same as before.
autoencoder.compile(optimizer='rmsprop', loss='binary_crossentropy')
history = autoencoder.fit(x_train, x_train,
                          epochs=150,
                          batch_size=256,
                          shuffle=True,
                          validation_data=(x_test, x_test))
plot_train_history_loss(history)
The average loss is now 0.277, as compared to ~0.285 before! Visually, the reconstructed images also look slightly better.
decoded_imgs = autoencoder.predict(x_test)
display_reconstructed(x_test, decoded_imgs, 10)
Convolutional Autoencoders
Since our inputs are images, it makes sense to use convolutional neural networks (convnets) as encoders and decoders. In practical settings, autoencoders applied to images are almost always convolutional autoencoders: they simply perform much better.
The encoder will consist of a stack of `Conv2D` and `MaxPooling2D` layers (max pooling being used for spatial downsampling), while the decoder will consist of a stack of `Conv2D` and `UpSampling2D` layers. We will also be using `BatchNormalization`. One major difference between this network and prior ones is that now we have 256 (4x4x16) elements in the bottleneck layer as opposed to just 32 before!
You can read more about convolution-based autoencoders in further detail here.
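The sketch below shows one architecture that matches this description, with a $4\times4\times16$ bottleneck. It assumes `tensorflow.keras`, channels-last inputs of shape (28, 28, 1), and illustrative filter counts; the original network may differ in its details.

```python
from tensorflow.keras import layers, Model

input_img = layers.Input(shape=(28, 28, 1))

# Encoder: 28x28 -> 14x14 -> 7x7 -> 4x4 (padding="same" rounds 7/2 up to 4)
x = layers.Conv2D(16, (3, 3), padding="same")(input_img)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.MaxPooling2D((2, 2), padding="same")(x)
x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((2, 2), padding="same")(x)
x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(x)
encoded = layers.MaxPooling2D((2, 2), padding="same")(x)   # (4, 4, 16)

# Decoder: 4x4 -> 8x8 -> 16x16 -> 14x14 -> 28x28
x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(encoded)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(x)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(16, (3, 3), activation="relu")(x)   # "valid" padding: 16 -> 14
x = layers.UpSampling2D((2, 2))(x)
decoded = layers.Conv2D(1, (3, 3), activation="sigmoid", padding="same")(x)

conv_encoder = Model(input_img, encoded)
conv_autoencoder = Model(input_img, decoded)
```

The one convolution without `padding="same"` is what brings the spatial size from 16 back to 14, so that the final upsampling step lands exactly on $28\times28$.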


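To train it, we will use the original Fashion-MNIST images with shape (samples, 1, 28, 28), or (samples, 28, 28, 1) under the channels-last data format that the TensorFlow backend uses by default, and we will just normalize pixel values between 0 and 1.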


Similar to before, we can train this model for 150 epochs. However, unlike before, we will checkpoint the model during training so that the best model, i.e. the one with the lowest validation loss, is saved.
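Checkpointing can be sketched with the Keras `ModelCheckpoint` callback as shown below. The filename pattern is hypothetical, and the `.weights.h5` suffix is an assumption made because recent Keras releases require it when `save_weights_only=True`; older releases also accept other suffixes such as `.hdf5`.

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Save weights only when the validation loss improves on the best value so far.
checkpoint = ModelCheckpoint(
    filepath="weights-ae-{epoch:03d}-{val_loss:.3f}.weights.h5",  # hypothetical name
    monitor="val_loss",
    mode="min",
    save_best_only=True,
    save_weights_only=True,
)

def train_with_checkpoint(model, x_train, x_test, epochs=150, batch_size=256):
    """Fit an autoencoder (input == target) while checkpointing the best weights."""
    return model.fit(
        x_train, x_train,
        epochs=epochs,
        batch_size=batch_size,
        shuffle=True,
        validation_data=(x_test, x_test),
        callbacks=[checkpoint],
    )
```

The epoch number and validation loss are formatted into each filename, which makes it easy to pick the best file afterwards and pass it to `load_weights`.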


We find the lowest validation loss now is 0.265, significantly lower than the previous best value of 0.277. We will first load the saved best model weights, and then plot the original and the reconstructed images from the test dataset.
autoencoder.load_weights('weights-ae-146-0.266.hdf5')
decoded_imgs = autoencoder.predict(x_test)
display_reconstructed(x_test, decoded_imgs, 10)
At first glance, this does not seem like much of an improvement over the deep autoencoder result. However, if you look closely, small feature details start to appear in the reconstructed images. To improve these models further, we will likely have to use deeper and more complex convolutional networks.
Denoising Autoencoders
Another common variant of AE networks is the one that learns to remove noise from the input. Mathematically, this is achieved by modifying the reconstruction error of the loss function.
Traditionally, autoencoders minimize some loss function:
$$ L\Big(x, g\big(f(x)\big)\Big) $$
where $L$ is a loss function penalizing the reconstructed input $g\big(f(x)\big)$ for being dissimilar to the input $x$. Also, $g(.)$ is the decoder and $f(.)$ is the encoder. A denoising autoencoder (DAE) instead minimizes
$$ L\Big(x, g\big(f(\hat{x})\big)\Big) $$
where, $\hat{x}$ is a copy of $x$ that has been corrupted by some form of noise. DAEs must therefore undo this corruption rather than simply copying their input. Training of DAEs forces $f(.)$, the encoder and $g(.)$, the decoder to implicitly learn the structure of $p_{data}(x),$ the distribution of the input data $x$. Please refer to the works of Alain and Bengio (2013) and Bengio et al. (2013).
As an example, we will first introduce noise to our train and test data by adding a Gaussian noise matrix and clipping the images between 0 and 1.
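The corruption step can be sketched in NumPy as follows. The function name and the `noise_factor` value of 0.5 are assumptions; a larger factor gives more heavily corrupted images.

```python
import numpy as np

def add_gaussian_noise(images, noise_factor=0.5, seed=None):
    """Add zero-mean, unit-variance Gaussian noise scaled by noise_factor,
    then clip the result back to the valid pixel range [0, 1]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=1.0, size=images.shape)
    return np.clip(images + noise_factor * noise, 0.0, 1.0)

# Usage on data already scaled to [0, 1]:
# x_train_noisy = add_gaussian_noise(x_train)
# x_test_noisy = add_gaussian_noise(x_test)
# A DAE is then fit with noisy inputs and clean targets:
# autoencoder.fit(x_train_noisy, x_train, ...)
```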


Here is how the corrupted images look. They are barely recognizable now!
display_reconstructed(x_test_noisy, None)
We will use a slightly modified version of the previous convolutional autoencoder, one with a larger number of filters in the intermediate layers. This increases the capacity of our model.


We can now train this for 150 epochs. Notice the change in the training data!


The loss has converged to a value of 0.287. Let’s take a look at the results: the top row shows the noisy images and the bottom row shows the images reconstructed by the DAE.
autoencoder.load_weights('weights-dae-146-0.287.hdf5')
decoded_imgs = autoencoder.predict(x_test_noisy)
display_reconstructed(x_test_noisy, decoded_imgs, 10)
Variational Autoencoders (VAE)
Variational autoencoders (VAEs) are a stochastic version of the regular autoencoders. A VAE is a type of autoencoder with added constraints on the encoded representations being learned. More precisely, it is an autoencoder that learns a latent variable model for its input data. So instead of letting your neural network learn an arbitrary function, you are learning the parameters of a probability distribution modeling your data. If you sample points from this distribution, you can generate new input data samples: a VAE is a “generative model”. The cartoon on the side shows a typical architecture of a VAE model. Please refer to the research papers by Kingma et al. and Rezende et al. for a thorough mathematical analysis.
In the probability model framework, a variational autoencoder contains a specific probability model of data $x$ and latent variables $z$ (most commonly assumed to be Gaussian). We can write the joint probability of the model as $p(x, z) = p(x \vert z)p(z)$. The generative process can be written as, for each data point $i$:
 Draw latent variables $z_i \sim p(z)$
 Draw data point $x_i \sim p(x\vert z)$
In terms of an implementation of a VAE, the encoder produces the parameters of a distribution over the latent variables, and the decoder maps latent samples back to data points. The latent variable is hence a random variable drawn from the approximate posterior distribution $q(z\vert x)$. To train the encoder and the decoder as a neural network, you need to backpropagate through random sampling, and that is a problem because gradients cannot flow through a random node. To overcome this, the reparameterization trick is used. Most commonly, the approximate posterior over the latent space is assumed to be Gaussian, so a sample can be written as a deterministic function of the encoder outputs and an independent standard normal noise term:
$$ z = \mu + L \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) $$
Here $\mu$ and $L$ are the outputs of the encoder. Therefore, during backpropagation, all we need are the partial derivatives w.r.t. $\mu$ and $L$. In the cartoon above, $\mu$ represents the mean vector latent variable and $L$ represents the standard deviation latent variable.
You can read more about VAE models at Reference 1, Reference 2, Reference 3 and Reference 4.
In more practical terms, VAEs represent the latent space (bottleneck layer) as a Gaussian random variable (enabled by a constraint on the loss function). Hence, the loss function for VAEs consists of two terms: a reconstruction loss forcing the decoded samples to match the initial inputs (just like in our previous autoencoders), and the KL divergence between the learned latent distribution and the prior distribution, acting as a regularization term.
$$ \begin{aligned} \min \mathcal{L}(x, x') &= \min E(x, x') \cr &+ KL\big(q(z\vert x)\vert \vert p(z)\big) \end{aligned} $$
Here, the first term is the reconstruction loss as before (in a typical autoencoder). The second term is the Kullback-Leibler divergence between the encoder’s distribution, $q(z\vert x)$, and the prior $p(z)$, typically a Gaussian.
Since the binary cross-entropy is typically used as the reconstruction loss term (especially for images), the above loss for VAEs can be written as
$$ \begin{aligned} \min \mathcal{L}(x, x') &= \min \; -\mathbf{E}_{z\sim q(z\vert x)}\big[ \log p(x' \vert z)\big] \cr &+ KL\big(q(z\vert x) \vert \vert p(z)\big) \end{aligned} $$
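For reference, the KL term has a well-known closed form in the most common setup. Assume (this is a modeling choice, not a requirement) that the encoder distribution is a diagonal Gaussian, $q(z\vert x) = \mathcal{N}\big(\mu, \mathrm{diag}(\sigma^2)\big)$, that the prior is a standard normal, $p(z) = \mathcal{N}(0, I)$, and that $J$ is the dimension of the latent space. Then:

$$ KL\big(q(z\vert x) \vert \vert p(z)\big) = -\frac{1}{2} \sum_{j=1}^{J} \Big( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \Big) $$

This is the expression that the `kl_loss` line in the code later in this section evaluates, with `z_mean` standing for $\mu_j$ and `z_log_var` for $\log \sigma_j^2$. The term is zero exactly when $\mu_j = 0$ and $\sigma_j = 1$ for every $j$, that is, when the encoder distribution equals the prior.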
To summarize a typical implementation of a VAE: first, an encoder network turns the input samples $x$ into two parameters in a latent space, `z_mean` and `z_log_sigma`. Then, we randomly sample similar points $z$ from the latent normal distribution that is assumed to generate the data, via $z$ = `z_mean` + exp(`z_log_sigma`) * $\mathbf{\epsilon}$, where $\mathbf{\epsilon}$ is a random normal tensor. Finally, a decoder network maps these latent space points back to the original input data.
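The sampling step can be illustrated in plain NumPy. This is only a sketch of the reparameterization formula, with hypothetical names and no neural network involved: the randomness enters through an independent standard normal tensor, so the sample is a deterministic (and differentiable) function of `z_mean` and `z_log_sigma`.

```python
import numpy as np

def sample_latent(z_mean, z_log_sigma, rng):
    """Reparameterized sample: z = z_mean + exp(z_log_sigma) * epsilon."""
    epsilon = rng.standard_normal(size=z_mean.shape)   # epsilon ~ N(0, 1)
    return z_mean + np.exp(z_log_sigma) * epsilon

rng = np.random.default_rng(0)
n = 20000
# Pretend the encoder produced the same parameters for n inputs:
z_mean = np.tile([[1.0, -2.0]], (n, 1))
z_log_sigma = np.tile(np.log([[0.5, 0.1]]), (n, 1))

samples = sample_latent(z_mean, z_log_sigma, rng)
print(samples.mean(axis=0))   # close to the requested means
print(samples.std(axis=0))    # close to the requested standard deviations
```

If the encoder outputs a log-variance instead (as the name `z_log_var` in the loss code suggests), the scale factor becomes `exp(0.5 * z_log_var)`.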
We can now implement VAE for the fashion MNIST data. To demonstrate its generalization, we will generate two versions: one with MLP and the other with the use of convolution and deconvolution layers.
In the first implementation below, we will be using a simple 2-layer deep encoder and a 2-layer deep decoder. Note the use of the reparameterization trick via the `sampling()` method and a `Lambda` layer.


As described above, we need to include two loss terms: the binary cross-entropy as before, and the KL divergence between the encoder latent variable distribution (calculated using the reparameterization trick) and the prior distribution, a normal distribution!
# Compute VAE loss
xent_loss = original_dim * metrics.binary_crossentropy(x, x_decoded_mean)
kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=1)
vae_loss = K.mean(xent_loss + kl_loss)
vae.add_loss(vae_loss)
vae.compile(optimizer='rmsprop')
We can now load the Fashion-MNIST dataset, normalize it and reshape it to the correct dimensions so that it can be used with our VAE model.


We will now train our model for 100 epochs.
history = vae.fit(x_train,
                  shuffle=True,
                  epochs=epochs,
                  batch_size=batch_size,
                  validation_data=(x_test, None))
plot_train_history_loss(history)
Below is the loss for the training and the validation datasets during training epochs. We find that the loss has converged in 100 epochs without any sign of overfitting.
Because our latent space is two-dimensional, there are a few cool visualizations that can be done at this point. One is to look at the neighborhoods of different classes on the latent 2D plane:


Each of these colored clusters is a type of fashion item. Clusters that are close together are items that are structurally similar (i.e. items that share information in the latent space). We can also look at this plot from a different perspective: the better our VAE model, the larger the separation between the clusters of very dissimilar fashion items!
Because the VAE is a generative model (as described above), we can also use it to generate new images! Here, we will scan the latent plane, sampling latent points at regular intervals, and generating the corresponding image for each of these points. This gives us a visualization of the latent manifold that “generates” the fashion MNIST images.


We find our model has done only a so-so job in generating new images. Still, given the very small amount of simple code we had to write, this is quite incredible.
We can next build a more realistic VAE using `conv` and `deconv` layers. Below is the full code to build and train the model.


Similar to the case of simple VAE model, we can look at the neighborhoods of different classes on the latent 2D plane:


We can now see that the separation between different classes of images is larger than with the simple MLP-based VAE model.
Finally, we can now generate new images from our, hopefully, better VAE model.


Usage of Autoencoders
The most common uses of autoencoders are:
 Dimensionality Reduction: Dimensionality reduction was one of the first applications of representation learning and deep learning. Lower-dimensional representations can improve performance on many tasks, such as classification. Models of smaller spaces consume less memory and runtime. The hints provided by the mapping to the lower-dimensional space aid generalization. Due to their nonlinear nature, autoencoders can perform better than traditional techniques like PCA, kernel PCA etc.
 Denoising and Transformation: You can distort the data and add some noise to it before feeding it to DAEs. This can help in generalizing over the test set. AEs are also useful in image transformation tasks, e.g. document cleaning, applying color to images, medical image segmentation using U-Net (a variant of autoencoders) etc.
Information Retrieval: the task of finding entries in a database that resemble a query entry. This task derives the usual benefits from dimensionality reduction that other tasks do, but also derives the additional benefit that search can become extremely efficient in certain kinds of low-dimensional spaces.
In Natural Language Processing
 Word Embeddings
 Machine Translation
 Document Clustering
 Sentiment Analysis
 Paraphrase Detection
Image/data Generation: We saw theoretical details of the generative nature of VAEs above. See this blog post by OpenAI for a detailed review of image generation.
Anomaly detection: For highly imbalanced data (like credit card fraud detection, defects in manufacturing etc.) you may have sufficient data for the positive class and very little or no data for the negative class. In such situations, you can train an AE on your positive data to learn features, and then compute the reconstruction error on the training set to find a threshold. During testing, you can use this threshold to reject those test instances whose reconstruction error is greater than the threshold. However, choosing a threshold that generalizes well to unseen test cases is challenging. VAEs have been used as an alternative for this task, where the reconstruction error is probabilistic and hence easier to generalize. See this article by FICO where they use autoencoders for detecting anomalies in credit scores.
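The thresholding recipe for anomaly detection can be sketched in NumPy as follows. The helper names, the use of mean squared error, and the 99% quantile are all assumptions made for illustration; in practice the errors would come from a trained autoencoder rather than from the synthetic numbers used here.

```python
import numpy as np

def reconstruction_errors(x, x_rec):
    """Mean squared reconstruction error per sample (one possible choice)."""
    return np.mean((x - x_rec) ** 2, axis=1)

def fit_threshold(train_errors, quantile=0.99):
    """Pick the threshold as a high quantile of errors on the training data."""
    return np.quantile(train_errors, quantile)

def flag_anomalies(errors, threshold):
    """Reject instances whose reconstruction error exceeds the threshold."""
    return errors > threshold

# Synthetic stand-in for errors of an AE on the data it was trained on:
rng = np.random.default_rng(0)
train_errors = rng.gamma(shape=2.0, scale=0.01, size=1000)
threshold = fit_threshold(train_errors, quantile=0.99)

# Two typical test errors and one much larger error:
test_errors = np.array([0.01, 0.02, 0.5])
flags = flag_anomalies(test_errors, threshold)
print(threshold, flags)
```

The quantile controls the trade-off: a lower value flags more anomalies but also rejects more ordinary instances.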
That’s it! It’s been quite a long article. I hope it is helpful to some of you. Please let me know via the comments below if there are any particular issues or concepts you would like me to go over in more detail. I would also love to know if there is any particular topic in machine/deep learning you would like me to cover in future posts.