
A Practical guide to Autoencoders using Keras

Usually in a conventional neural network, one tries to predict a target vector y from input vectors x. In an auto-encoder network, one tries to predict x from x. It is trivial to learn a mapping from x to x if the network has no constraints, but if the network is constrained the learning process becomes more interesting.

In this article, we will take a detailed look at the mathematics of different types of autoencoders (with different constraints), along with a sample implementation of each using Keras with a TensorFlow back-end.

Basic Autoencoders

The simplest AutoEncoder (AE) has an MLP-like (Multi Layer Perceptron) structure:

  • Input Layer
  • Hidden Layer, and
  • Output Layer

However, unlike MLP, autoencoders do not require any target data. As the network is trying to learn $x$ itself, the learning algorithm is a special case of unsupervised learning.

A Typical Autoencoder Network

Mathematically, let's define:

  • Input vector: $x \in \Big[ 0, 1 \Big]^d$
  • Activation function: $a(h)$, applied to every neuron of layer $h$
  • $W_i \in \mathbb{R}^{I_{di} \times O_{di}}$, the parameter matrix of the $i$-th layer, projecting an $I_{di}$-dimensional input into an $O_{di}$-dimensional space
  • $b_i \in \mathbb{R}^{O_{di}}$, the bias vector of the $i$-th layer

The simplest AE can then be summarized as:

$$\begin{aligned} z &= a(x W_1 + b_1) \\ x' &= a(z W_2 + b_2) \end{aligned}$$

The AE model tries to minimize the reconstruction error between the input value $x$ and the reconstructed value $x'$. A typical definition of the reconstruction error is the $L_p$ distance (like the $L_2$ norm) between the $x$ and $x'$ vectors:

$$\min \mathcal{L} = \min E(x, x') \stackrel{e.g.}{=} \min || x - x' ||_p$$

Another common loss function for AEs (especially for images) is the cross-entropy:

$$\mathcal{L}(x, x') = -\sum_{c=1}^{M} x_c \log (x'_c)$$

where $M$ is the dimensionality of the input data $x$ (e.g. the number of pixels in an image), the input $x_c$ acts as the target and $x'_c$ is its reconstruction.
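To make the forward pass and the two reconstruction errors concrete, here is a minimal NumPy sketch. The layer sizes, the random (untrained) weights and the choice of a sigmoid for the activation are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(h):
    """Element-wise logistic activation, used here as a(h)."""
    return 1.0 / (1.0 + np.exp(-h))

d, k = 8, 3                                  # input and hidden sizes (arbitrary)
x = rng.uniform(0.0, 1.0, size=(5, d))       # 5 samples, each in [0, 1]^d

# Randomly initialised parameters; nothing is trained in this sketch
W1, b1 = rng.normal(0.0, 0.1, size=(d, k)), np.zeros(k)
W2, b2 = rng.normal(0.0, 0.1, size=(k, d)), np.zeros(d)

z = sigmoid(x @ W1 + b1)                     # encoding:       z  = a(x W1 + b1)
x_rec = sigmoid(z @ W2 + b2)                 # reconstruction: x' = a(z W2 + b2)

# Per-sample reconstruction errors
l2_error = np.linalg.norm(x - x_rec, ord=2, axis=1)    # ||x - x'||_2
cross_entropy = -np.sum(x * np.log(x_rec), axis=1)     # input as target, x' as prediction

print(z.shape, x_rec.shape)
print(l2_error.round(3), cross_entropy.round(3))
```

Training would then adjust the weights and biases to drive one of these errors down over the whole dataset.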

Autoencoders in Practice

The above example of an auto-encoder is too simplistic for any real use case. If the number of units in the hidden layer is greater than or equal to the number of input units, the network can easily learn the identity function. Hence, the simplest constraint used in real-life autoencoders is that the number of hidden units ($z$) should be less than the dimension ($d$) of the input ($z < d$).

By limiting the amount of information that can flow through the network, the AE model can learn the most important attributes of the input data and how to best reconstruct the original input from an "encoded" state. Ideally, this encoding will learn and describe latent attributes of the input data. Because of the non-linearity and the type of constraints applied, dimensionality reduction using AEs can lead to better results than classical dimensionality reduction techniques such as PCA.

A practical auto-encoder network consists of an encoding function (encoder), and a decoding function (decoder). Following is an example architecture for the reconstruction of images.

In this article we will build different types of autoencoders for the fashion MNIST dataset. Instead of using the more common MNIST dataset, I prefer to use the fashion MNIST dataset for the reasons described here.

For an example using MNIST data, please have a look at the article by Francois Chollet, the creator of Keras. The code below is heavily adapted from his article.

We'll start simple, with a single fully-connected neural layer as encoder and as decoder.

We will also create separate encoding and decoding functions, that can be used to extract latent features at test time.

We can now set the optimizer and the loss function before training the auto-encoder model.
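A minimal sketch of these three steps (the model, the separate encoder and decoder, and the compile call) could look as follows. The 32-unit bottleneck matches the size quoted later in this article; the activations, the Adam optimizer and the binary cross-entropy loss are assumptions and may differ from the settings used for the reported results.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 784       # a flattened 28x28 image
encoding_dim = 32     # size of the bottleneck

# Single fully-connected layer as encoder and as decoder
inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation="relu")(inputs)
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)
autoencoder = keras.Model(inputs, decoded)

# Separate encoder and decoder models that share weights with the autoencoder
encoder = keras.Model(inputs, encoded)
latent_inputs = keras.Input(shape=(encoding_dim,))
decoder = keras.Model(latent_inputs, autoencoder.layers[-1](latent_inputs))

# Optimizer and loss (both are assumptions in this sketch)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Quick shape check on a dummy batch
dummy = np.zeros((2, input_dim), dtype="float32")
codes = encoder.predict(dummy, verbose=0)
print(codes.shape, decoder.predict(codes, verbose=0).shape)
```

Because the encoder and decoder reuse the autoencoder's layers, they do not need to be trained separately.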

Next, we need to get the fashion MNIST data and normalize it for training. Furthermore, we will flatten the $28\times28$ images to vectors of size 784. Please note that running the code below for the first time will download the full dataset and hence might take a few minutes.
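One way to sketch this step is shown below. The helper names prepare and load_fashion_mnist are hypothetical; the download itself goes through the fashion_mnist loader that ships with Keras.

```python
import numpy as np

def prepare(images):
    """Scale uint8 images to [0, 1] and flatten each one to a vector."""
    images = images.astype("float32") / 255.0
    return images.reshape((len(images), -1))

def load_fashion_mnist():
    """Download (on first call) and prepare the train/test images; labels are dropped."""
    from tensorflow.keras.datasets import fashion_mnist
    (x_train, _), (x_test, _) = fashion_mnist.load_data()
    return prepare(x_train), prepare(x_test)
```

Calling x_train, x_test = load_fashion_mnist() and printing x_train.shape and x_test.shape shows the shapes of the two flattened arrays.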

Output: (60000, 784) (10000, 784)

We can now train our model for 100 epochs:

This will print the per-epoch training and validation loss. We can also plot the loss history during training using the history object.

After 100 epochs, the auto-encoder reaches a stable train/test loss value of about 0.282. Let us see visually how well this simple model reconstructs the images!

The top row is the original image, while the bottom row is the reconstructed image. We can see that we are losing a lot of fine details.

Sparsity Constraint

We can add an additional constraint to the above AE model: a sparsity constraint on the latent variables. Mathematically, this is achieved by adding a sparsity penalty $\Omega(\mathbf{h})$ on the bottleneck layer $\mathbf{h}$.

$$\min \mathcal{L} = \min E(x, x') + \Omega(\mathbf{h})$$

where $\mathbf{h}$ is the encoder output.

Sparsity is a desirable characteristic for an auto-encoder because it allows the use of a greater number of hidden units (even more than the number of inputs). This gives the network the ability to learn different connections and extract different features than it would with only the constraint on the number of hidden units. Moreover, sparsity can be combined with the constraint on the number of hidden units; these hyper-parameters then need to be tuned together to achieve better performance.

In Keras, a sparsity constraint can be added via an activity_regularizer on our Dense layer:
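As a minimal sketch, the encoder layer of the simple model can be given an L1 penalty on its activations. The penalty strength of 1e-5, the optimizer and the loss are assumptions here, not necessarily the values behind the results reported in this section.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

inputs = keras.Input(shape=(784,))
# The L1 penalty acts on the layer's activations (not its weights),
# pushing most units of the code towards zero for any given input.
encoded = layers.Dense(
    32,
    activation="relu",
    activity_regularizer=regularizers.l1(1e-5),   # strength is an assumption
)(inputs)
decoded = layers.Dense(784, activation="sigmoid")(encoded)

sparse_autoencoder = keras.Model(inputs, decoded)
sparse_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Quick shape check on a dummy batch
dummy = np.zeros((2, 784), dtype="float32")
print(sparse_autoencoder.predict(dummy, verbose=0).shape)
```

The regularization term plays the role of the sparsity penalty in the loss and is added to the reconstruction loss automatically during training.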

Similar to the previous model, we can train this one as well, this time for 150 epochs. A regularized model is less likely to overfit and hence can be trained for longer.

We get a very similar loss as the previous example. Here is a plot of loss values during training.

As expected, the reconstructed images also look quite similar to the earlier ones.

Deep Autoencoders

We have been using only single layers for the encoder and the decoder. Given enough data, nothing stops us from building deeper networks for both.
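A deeper variant can be sketched by stacking several Dense layers on each side of the bottleneck. The layer widths (128, 64, 32) and the training settings below are assumptions chosen for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))

# Encoder: progressively narrower layers down to the 32-unit bottleneck
x = layers.Dense(128, activation="relu")(inputs)
x = layers.Dense(64, activation="relu")(x)
encoded = layers.Dense(32, activation="relu")(x)

# Decoder: mirror image of the encoder
x = layers.Dense(64, activation="relu")(encoded)
x = layers.Dense(128, activation="relu")(x)
decoded = layers.Dense(784, activation="sigmoid")(x)

deep_autoencoder = keras.Model(inputs, decoded)
deep_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Quick shape check on a dummy batch
dummy = np.zeros((2, 784), dtype="float32")
print(deep_autoencoder.predict(dummy, verbose=0).shape)
```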

We can train this model the same way as before.

The average loss is now 0.277, as compared to ~0.285 before! We can also see that the reconstructed images look slightly better.

Convolutional Autoencoders

Since our inputs are images, it makes sense to use convolutional neural networks (conv-nets) as encoders and decoders. In practical settings, autoencoders applied to images are almost always convolutional autoencoders, as they simply perform much better.

The encoder will consist of a stack of Conv2D and MaxPooling2D layers (max pooling being used for spatial down-sampling), while the decoder will consist of a stack of Conv2D and UpSampling2D layers. We will also be using BatchNormalization. One major difference between this network and prior ones is that now we have 256 (4x4x16) elements in the bottleneck layer as opposed to just 32 before!
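The architecture described above can be sketched as follows. Only the layer types and the 4x4x16 bottleneck come from the text; the filter counts, the placement of BatchNormalization, the training settings and the channels-last input shape (28, 28, 1) are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def conv_block(x, filters, padding="same"):
    """3x3 convolution followed by batch normalization and ReLU."""
    x = layers.Conv2D(filters, (3, 3), padding=padding)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

inputs = keras.Input(shape=(28, 28, 1))            # channels-last layout

# Encoder: 28x28 -> 14x14 -> 7x7 -> 4x4
x = conv_block(inputs, 32)
x = layers.MaxPooling2D((2, 2), padding="same")(x)
x = conv_block(x, 16)
x = layers.MaxPooling2D((2, 2), padding="same")(x)
x = conv_block(x, 16)
encoded = layers.MaxPooling2D((2, 2), padding="same")(x)   # 4 x 4 x 16 = 256 values

# Decoder: 4x4 -> 8x8 -> 16x16 -> 14x14 -> 28x28
x = conv_block(encoded, 16)
x = layers.UpSampling2D((2, 2))(x)
x = conv_block(x, 16)
x = layers.UpSampling2D((2, 2))(x)
x = conv_block(x, 32, padding="valid")             # unpadded 3x3 conv: 16x16 -> 14x14
x = layers.UpSampling2D((2, 2))(x)
decoded = layers.Conv2D(1, (3, 3), activation="sigmoid", padding="same")(x)

conv_autoencoder = keras.Model(inputs, decoded)
conv_encoder = keras.Model(inputs, encoded)
conv_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Quick shape check on a dummy batch
dummy = np.zeros((2, 28, 28, 1), dtype="float32")
print(conv_encoder.predict(dummy, verbose=0).shape)
print(conv_autoencoder.predict(dummy, verbose=0).shape)
```

The single unpadded convolution in the decoder is one way to get from 16x16 back to 14x14, so that the final up-sampling step lands exactly on 28x28.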

You can read more about convolution-based auto-encoders in further detail here.

To train it, we will use the original fashion MNIST images with shape (samples, 1, 28, 28), and we will just normalize pixel values between 0 and 1.

Similar to before, we can train this model for 150 epochs. However, unlike before, we will checkpoint the model during training to save the best model, based on the minimum validation loss.

We find the lowest validation loss now is 0.265, significantly lower than the previous best value of 0.277. We will first load the saved best model weights, and then plot the original and the reconstructed images from the test dataset.

At first glance, this does not seem like much of an improvement over the deep autoencoder result. However, if you look closely, small feature details start to appear in the reconstructed images. To improve these models further, we will likely have to use deeper and more complex convolutional networks.

Denoising Autoencoders

Another common variant of AE networks is the one that learns to remove noise from the input. Mathematically, this is achieved by modifying the reconstruction error of the loss function.

Traditionally, autoencoders minimize some loss function:

$$L\Big(x, g\big(f(x)\big)\Big)$$

where $L$ is a loss function penalizing the reconstructed input $g\big(f(x)\big)$ for being dissimilar to the input $x$. Also, $g(.)$ is the decoder and $f(.)$ is the encoder. A denoising autoencoder (DAE) instead minimizes

$$L\Big(x, g\big(f(\hat{x})\big)\Big)$$

where $\hat{x}$ is a copy of $x$ that has been corrupted by some form of noise. DAEs must therefore undo this corruption rather than simply copy their input. Training a DAE forces the encoder $f(.)$ and the decoder $g(.)$ to implicitly learn the structure of $p_{data}(x)$, the distribution of the input data $x$. Please refer to the works of Alain and Bengio (2013) and Bengio et al. (2013).

As an example, we will first introduce noise to our train and test data by adding a Gaussian noise matrix and clipping the images between 0 and 1.
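The corruption step can be sketched in NumPy as below. The helper name add_gaussian_noise and the noise factor of 0.5 are assumptions; the stand-in array only mimics the shape of a small batch of normalized images.

```python
import numpy as np

def add_gaussian_noise(images, noise_factor=0.5, seed=0):
    """Add scaled standard-normal noise and clip the result back to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = images + noise_factor * rng.normal(loc=0.0, scale=1.0, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

# Stand-in for a batch of normalized images with shape (samples, 28, 28, 1)
x_clean = np.random.default_rng(1).uniform(0.0, 1.0, size=(4, 28, 28, 1))
x_noisy = add_gaussian_noise(x_clean)
print(x_noisy.shape, x_noisy.min(), x_noisy.max())
```

A DAE is then trained with the noisy array as the input and the clean array as the target, so the network is rewarded for undoing the corruption.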

Here is how the corrupted images look. They are barely recognizable now!

We will use a slightly modified version of the previous convolutional autoencoder, one with a larger number of filters in the intermediate layers. This increases the capacity of our model.

We can now train this for 150 epochs. Notice the change in the training data!

The loss has converged to a value of 0.287. Let's take a look at the results: the top row shows the noisy images and the bottom row shows the images reconstructed by the DAE.

Sequence-to-Sequence Autoencoders

If your inputs are sequences, rather than 2D images, then you may want to use as encoder and decoder a type of model that can capture temporal structure, such as an LSTM. To build an LSTM-based auto-encoder, first use an LSTM encoder to turn your input sequences into a single vector that contains information about the entire sequence, then repeat this vector $n$ times (where $n$ is the number of time steps in the output sequence), and run an LSTM decoder to turn this constant sequence into the target sequence.
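The recipe above maps almost one-to-one onto Keras layers. In the sketch below, the sequence length, feature size and latent size are arbitrary values picked for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, input_dim, latent_dim = 10, 4, 8    # illustrative sizes

inputs = keras.Input(shape=(timesteps, input_dim))
encoded = layers.LSTM(latent_dim)(inputs)                  # whole sequence -> one vector
repeated = layers.RepeatVector(timesteps)(encoded)         # repeat the vector n times
decoded = layers.LSTM(input_dim, return_sequences=True)(repeated)  # back to a sequence

sequence_autoencoder = keras.Model(inputs, decoded)
sequence_encoder = keras.Model(inputs, encoded)

# Quick shape check on a dummy batch
dummy = np.zeros((2, timesteps, input_dim), dtype="float32")
print(sequence_encoder.predict(dummy, verbose=0).shape)
print(sequence_autoencoder.predict(dummy, verbose=0).shape)
```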


Variational Autoencoders (VAE)

Variational autoencoders (VAEs) are a stochastic version of regular autoencoders. A VAE is a type of autoencoder with added constraints on the encoded representations being learned. More precisely, it is an autoencoder that learns a latent variable model for its input data. So instead of letting your neural network learn an arbitrary function, you are learning the parameters of a probability distribution modeling your data. If you sample points from this distribution, you can generate new input data samples: a VAE is a "generative model". The cartoon on the side shows a typical architecture of a VAE model. Please refer to the research papers by Kingma et al. and Rezende et al. for a thorough mathematical analysis.

In the probability model framework, a variational autoencoder contains a specific probability model of data $x$ and latent variables $z$ (most commonly assumed to be Gaussian). We can write the joint probability of the model as $p(x, z) = p(x \vert z)p(z)$. The generative process can be written as follows, for each data point $i$:

  • Draw latent variables $z_i \sim p(z)$
  • Draw data point $x_i \sim p(x\vert z)$

In terms of an implementation of a VAE, the latent variables are generated by the encoder and the data points are drawn by the decoder. The latent variable is hence a random variable drawn from the encoder's approximate posterior distribution, $q(z \vert x)$. To implement the encoder and the decoder as a neural network, you need to backpropagate through random sampling, which is a problem because backpropagation cannot flow through a random node. To overcome this, the reparameterization trick is used. Most commonly, the approximate posterior over the latent space is assumed to be Gaussian. A sample from a Gaussian can be written as a deterministic function of its parameters and a sample from a standard normal distribution, $\mathcal{N}(0, 1)$:

$$z = \mu + L \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$

Here $\mu$ and $L$ are the outputs of the encoder. Therefore, during backpropagation, all we need are the partial derivatives w.r.t. $\mu$ and $L$; the randomness is isolated in $\epsilon$. In the cartoon above, $\mu$ represents the mean vector of the latent variable and $L$ represents its standard deviation.
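The trick can be illustrated with a few lines of NumPy: the noise is drawn from a standard normal, then shifted and scaled by the encoder outputs. The helper name sample_latent and the numeric values are assumptions for this demo, and sigma plays the role of $L$.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = rng.standard_normal(size=np.shape(mu))
    # z is a deterministic function of (mu, sigma) once eps is fixed, so
    # dz/dmu = 1 and dz/dsigma = eps are well-defined gradients.
    return mu + sigma * eps, eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])       # stand-in for the encoder's mean output
sigma = np.array([0.5, 2.0])     # stand-in for the encoder's standard deviation output

# Draw many samples to check that they follow N(mu, sigma^2)
n = 100_000
z, eps = sample_latent(np.tile(mu, (n, 1)), np.tile(sigma, (n, 1)), rng)
print(z.mean(axis=0), z.std(axis=0))   # empirical mean and std, to compare with mu and sigma
```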

You can read more about VAE models at Reference 1, Reference 2, Reference 3 and Reference 4.

In more practical terms, VAEs represent the latent space (bottleneck layer) as a Gaussian random variable (enabled by a constraint on the loss function). Hence, the loss function for VAEs consists of two terms: a reconstruction loss forcing the decoded samples to match the initial inputs (just like in our previous autoencoders), and the KL divergence between the learned latent distribution and the prior distribution, acting as a regularization term.

$$\begin{aligned} \min \mathcal{L}(x, x') &= \min E(x, x') \\ &+ KL\big(q(z\vert x)\vert \vert p(z)\big) \end{aligned}$$

Here, the first term is the reconstruction loss as before (in a typical auto-encoder). The second term is the Kullback-Leibler divergence between the encoder's distribution $q(z\vert x)$ and the prior $p(z)$, typically a standard Gaussian.

Since the binary cross-entropy is typically used as the reconstruction loss term (especially for images), the above loss for VAEs can be written as

$$\begin{aligned} \min \mathcal{L}(x, x') &= \min \Big( - \mathbf{E}_{z\sim q(z\vert x)}\big[ \log p(x \vert z)\big] \\ &+ KL\big(q(z\vert x) \vert \vert p(z)\big) \Big) \end{aligned}$$

To summarize a typical implementation of a VAE: first, an encoder network turns the input samples $x$ into two parameters in a latent space, z_mean and z_log_sigma. Then, we randomly sample similar points $z$ from the latent normal distribution that is assumed to generate the data, via $z$ = z_mean + exp(z_log_sigma) * $\mathbf{\epsilon}$, where $\mathbf{\epsilon}$ is a random normal tensor. Finally, a decoder network maps these latent space points back to the original input data.

We can now implement VAE for the fashion MNIST data. To demonstrate its generalization, we will generate two versions: one with MLP and the other with the use of convolution and deconvolution layers.

In the first implementation below, we will be using a simple 2-layer deep encoder and a 2-layer deep decoder. Note the use of the reparameterization trick via the sampling() method and a Lambda layer.

As described above, we need to include two loss terms: the binary cross-entropy as before, and the KL divergence between the encoder's latent variable distribution (sampled using the reparameterization trick) and the prior distribution, a standard normal distribution!
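For a diagonal Gaussian encoder and a standard normal prior, the KL term has a well-known closed form, so both loss terms can be sketched in NumPy as below. The function names are hypothetical, and z_log_sigma is assumed to be the log of the standard deviation, matching the sampling formula z_mean + exp(z_log_sigma) * epsilon.

```python
import numpy as np

def binary_cross_entropy(x, x_rec, eps=1e-7):
    """Per-sample reconstruction loss, summed over pixels."""
    x_rec = np.clip(x_rec, eps, 1.0 - eps)      # avoid log(0)
    return -np.sum(x * np.log(x_rec) + (1.0 - x) * np.log(1.0 - x_rec), axis=-1)

def kl_to_standard_normal(z_mean, z_log_sigma):
    """KL( N(z_mean, exp(z_log_sigma)^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * np.sum(
        1.0 + 2.0 * z_log_sigma - z_mean**2 - np.exp(2.0 * z_log_sigma), axis=-1
    )

def vae_loss(x, x_rec, z_mean, z_log_sigma):
    """Batch average of reconstruction loss plus KL regularizer."""
    return float(np.mean(binary_cross_entropy(x, x_rec)
                         + kl_to_standard_normal(z_mean, z_log_sigma)))

# The KL term vanishes when the encoder outputs exactly the prior
print(kl_to_standard_normal(np.zeros((1, 2)), np.zeros((1, 2))))
```

In a Keras model the same two expressions would be written with backend tensor operations so that gradients can flow through them.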

We can now load the fashion MNIST dataset, normalize it and reshape it to the correct dimensions so that it can be used with our VAE model.

We will now train our model for 100 epochs.

Below is the loss for the training and the validation datasets during training epochs. We find that the loss has converged in 100 epochs without any sign of overfitting.

Because our latent space is two-dimensional, there are a few cool visualizations that can be done at this point. One is to look at the neighborhoods of different classes on the latent 2D plane:

Each of these colored clusters is a type of fashion item. Clusters that are close together contain items that are structurally similar (i.e. items that share information in the latent space). We can also look at this plot from a different perspective: the better our VAE model, the larger the separation between the clusters of very dissimilar fashion items!

Because the VAE is a generative model (as described above), we can also use it to generate new images! Here, we will scan the latent plane, sampling latent points at regular intervals, and generating the corresponding image for each of these points. This gives us a visualization of the latent manifold that "generates" the fashion MNIST images.

We find our model has done only a so-so job of generating new images. Still, given the simplicity and the very small amount of code we had to write, this is quite incredible.

We can next build a more realistic VAE using conv and deconv layers. Below is the full code to build and train the model.

Similar to the case of simple VAE model, we can look at the neighborhoods of different classes on the latent 2D plane:

We can now see that the separation between different classes of images is larger than with the simple MLP-based VAE model.

Finally, we can now generate new images from our, hopefully, better VAE model.

Usage of Autoencoders

The most common uses of autoencoders are:

  • Dimensionality Reduction: Dimensionality reduction was one of the first applications of representation learning and deep learning. Lower-dimensional representations can improve performance on many tasks, such as classification. Models of smaller spaces consume less memory and runtime. The hints provided by the mapping to the lower-dimensional space aid generalization. Due to their non-linear nature, autoencoders can perform better than traditional techniques like PCA, kernel PCA etc.
  • Denoising and Transformation: You can distort the data and add some noise to it before feeding it to DAEs. This can help in generalizing over the test set. AEs are also useful in image transformation tasks, e.g. document cleaning, applying color to images, medical image segmentation using U-Net (a variant of autoencoders) etc.
  • Information Retrieval: the task of finding entries in a database that resemble a query entry. This task derives the usual benefits from dimensionality reduction that other tasks do, but also derives the additional benefit that search can become extremely efficient in certain kinds of low-dimensional spaces.

  • In Natural Language Processing

    • Word Embeddings
    • Machine Translation
    • Document Clustering
    • Sentiment Analysis
    • Paraphrase Detection
  • Image/data Generation: We saw theoretical details of the generative nature of VAEs above. See this blog post by OpenAI for a detailed review of image generation.

  • Anomaly detection: For highly imbalanced data (like credit card fraud detection, defects in manufacturing etc.) you may have sufficient data for the positive class and very little or no data for the negative class. In such situations, you can train an AE on your positive data to learn features, and then compute the reconstruction error on the training set to find a threshold. During testing, you can use this threshold to reject those test instances whose reconstruction errors are greater than this threshold. However, optimizing the threshold so that it generalizes well to unseen test cases is challenging. VAEs have been used as an alternative for this task, where the reconstruction error is probabilistic and hence easier to generalize. See this article by FICO where they use autoencoders for detecting anomalies in credit scores.
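The thresholding step of the anomaly detection recipe can be sketched as follows. The function names, the mean-squared reconstruction error, the 99th-percentile rule and the simulated error values are all assumptions for illustration; in practice the errors would come from a trained autoencoder.

```python
import numpy as np

def reconstruction_error(x, x_rec):
    """Mean squared error between each sample and its reconstruction."""
    return np.mean((x - x_rec) ** 2, axis=1)

def fit_threshold(train_errors, quantile=0.99):
    """Pick the threshold as a high quantile of the training errors."""
    return float(np.quantile(train_errors, quantile))

def flag_anomalies(errors, threshold):
    """True for instances whose reconstruction error exceeds the threshold."""
    return errors > threshold

# Toy demonstration with simulated training errors (no real model involved)
rng = np.random.default_rng(0)
train_errors = rng.gamma(shape=2.0, scale=0.01, size=1000)
threshold = fit_threshold(train_errors)

test_errors = np.array([0.005, 0.02, 0.5])
print(threshold, flag_anomalies(test_errors, threshold))
```

The choice of quantile trades false alarms against missed anomalies, which is exactly the part that is hard to tune for unseen data.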

That's it! It's been quite a long article. I hope it is helpful to some of you. Please let me know via the comments below if there are any particular issues/concepts you would like me to go over in more detail. I would also love to know if there is any particular topic in machine/deep learning you would like me to cover in future posts.

