
A Practical guide to Autoencoders using Keras

DeepLearning, Python, DataScience (11 min read)

In a conventional neural network, one usually tries to predict a target vector $y$ from input vectors $x$. In an autoencoder network, one tries to predict $x$ from $x$. It is trivial to learn a mapping from $x$ to $x$ if the network has no constraints, but if the network is constrained the learning process becomes more interesting. In this article, we are going to take a detailed look at the mathematics of different types of autoencoders (with different constraints), along with a sample implementation of each using Keras with a TensorFlow back-end.

Please note that this post was written using the TensorFlow 1.x version of Keras.

Basic Autoencoders

The simplest AutoEncoder (AE) has an MLP-like (Multi Layer Perceptron) structure:

  • Input Layer
  • Hidden Layer, and
  • Output Layer

However, unlike an MLP, autoencoders do not require any target data. As the network is trying to learn $x$ itself, the learning algorithm is a special case of unsupervised learning.


Mathematically, let's define:

  • Input vector: $x \in [0, 1]^d$
  • Activation function: $a(h)$, applied to every neuron of layer $h$
  • $W_i \in \mathbb{R}^{I_{di} \times O_{di}}$, the parameter matrix of the $i$-th layer, projecting an $I_{di}$-dimensional input into an $O_{di}$-dimensional space
  • $b_i \in \mathbb{R}^{O_{di}}$, the bias vector

The simplest AE can then be summarized as:

$$\begin{aligned} z &= a(x W_1 + b_1) \\ x' &= a(z W_2 + b_2) \end{aligned}$$

The AE model tries to minimize the reconstruction error between the input value $x$ and the reconstructed value $x'$. A typical definition of the reconstruction error is the $L_p$ distance (like the $L_2$ norm) between the $x$ and $x'$ vectors:

$$\min \mathcal{L} = \min E(x, x') \stackrel{e.g.}{=} \min || x - x' ||_p$$
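As a minimal, framework-free sketch of the two equations above and the $L_2$ reconstruction error, the snippet below runs one forward pass in NumPy. The sizes, the sigmoid activation and the random, untrained weights are illustrative assumptions, not values used elsewhere in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 8, 3                      # input and hidden sizes (arbitrary toy values)
x = rng.random((5, d))           # 5 inputs with entries in [0, 1)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# randomly initialised, untrained parameters (assumption for illustration)
W1, b1 = rng.normal(scale=0.1, size=(d, k)), np.zeros(k)
W2, b2 = rng.normal(scale=0.1, size=(k, d)), np.zeros(d)

z = sigmoid(x @ W1 + b1)         # encoder: z = a(x W1 + b1)
x_rec = sigmoid(z @ W2 + b2)     # decoder: x' = a(z W2 + b2)

# L2 reconstruction error per sample
l2_error = np.linalg.norm(x - x_rec, ord=2, axis=1)
print(z.shape, x_rec.shape, l2_error.mean())
```

Training would adjust `W1`, `b1`, `W2` and `b2` to push this error down; here they are left at their random initial values.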

Another common choice of loss function for AEs (especially for images) is the cross-entropy function.

$$\mathcal{L}(x, x') = -\sum_{c=1}^{M} x_c \log (x'_c)$$

where $M$ is the dimensionality of the input data $x$ (e.g. the number of pixels in an image).
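To make the cross-entropy loss concrete, here is a small NumPy sketch. It uses the binary form, which treats every pixel as a Bernoulli target and therefore also carries a $(1 - x)$ term; the function name and the clipping constant `eps` are assumptions made for this illustration.

```python
import numpy as np

def binary_cross_entropy(x, x_rec, eps=1e-7):
    """Mean per-pixel binary cross-entropy between targets x and reconstructions x_rec."""
    x_rec = np.clip(x_rec, eps, 1.0 - eps)   # avoid log(0)
    return float(np.mean(-(x * np.log(x_rec) + (1.0 - x) * np.log(1.0 - x_rec))))

x = np.array([0.0, 1.0, 1.0, 0.0])           # toy "pixels"
good = np.array([0.1, 0.9, 0.8, 0.2])        # close to the targets
bad = np.array([0.9, 0.1, 0.2, 0.8])         # far from the targets
print(binary_cross_entropy(x, good), binary_cross_entropy(x, bad))
```

A reconstruction closer to the targets gives a lower loss, and an uninformative reconstruction of 0.5 everywhere gives $\log 2$ per pixel.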

Autoencoders in Practice

The above example of an autoencoder is too simplistic for any real use case. It is easy to see that if the number of units in the hidden layer is greater than or equal to the number of input units, the network can simply learn the identity function. Hence, the simplest constraint used in real-life autoencoders is that the number of hidden units ($z$) should be less than the dimension ($d$) of the input ($z < d$).

By limiting the amount of information that can flow through the network, the AE model can learn the most important attributes of the input data and how to best reconstruct the original input from an "encoded" state. Ideally, this encoding will learn and describe latent attributes of the input data. Because of the non-linearity and the type of constraints applied, dimensionality reduction using AEs can lead to better results than classical dimensionality reduction techniques such as PCA.

PCA and Autoencoders

If we were to construct a linear network (i.e. without the use of nonlinear activation functions at each layer) we would observe a similar dimensionality reduction as observed in PCA. See Geoffrey Hinton's discussion.
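The PCA side of this comparison can be sketched in a few lines of NumPy. The example projects centred toy data onto the top-$k$ principal directions and reconstructs it with the transposed projection, which is the same encode/decode structure as a linear autoencoder with tied weights. For squared error, no linear autoencoder with $k$ hidden units can reconstruct better than this. The synthetic data and the function name are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 200 samples in 10 dimensions that mostly live on a 3D subspace
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
Xc = X - X.mean(axis=0)                  # PCA works on centred data

def pca_reconstruction_error(Xc, k):
    """Mean squared error after projecting onto the top-k principal directions."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                       # (d, k): plays the role of encoder weights
    Z = Xc @ V_k                         # "encode"
    X_rec = Z @ V_k.T                    # "decode" with the transposed weights
    return float(np.mean((Xc - X_rec) ** 2))

errors = [pca_reconstruction_error(Xc, k) for k in (1, 2, 3)]
print(errors)
```

The error drops as $k$ grows and becomes tiny at $k = 3$, the true dimensionality of the toy data.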

A practical auto-encoder network consists of an encoding function (encoder), and a decoding function (decoder). Following is an example architecture for the reconstruction of images.


In this article we will build different types of autoencoders for the fashion MNIST dataset. Instead of using the more common MNIST dataset, I prefer to use the fashion MNIST dataset for the reasons described here.

For an example using the MNIST data, please have a look at the article by Francois Chollet, the creator of Keras. The code below is heavily adapted from his article.

We'll start simple, with a single fully-connected neural layer as encoder and as decoder.

from keras.layers import Input, Dense
from keras.models import Model
import numpy as np

# size of bottleneck latent space
encoding_dim = 32
# input placeholder
input_img = Input(shape=(784,))
# encoded representation
encoded = Dense(encoding_dim, activation='relu')(input_img)
# lossy reconstruction
decoded = Dense(784, activation='sigmoid')(encoded)

# full AE model: map an input to its reconstruction
autoencoder = Model(input_img, decoded)
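As a quick sanity check on the size of this model, the sketch below counts the trainable parameters of the two Dense layers by hand (weights plus biases) and the compression ratio of the bottleneck. The helper name `dense_params` is hypothetical, introduced only for this illustration.

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters in a fully-connected layer with bias."""
    return n_in * n_out + n_out

input_dim, encoding_dim = 784, 32
encoder_params = dense_params(input_dim, encoding_dim)   # 784*32 + 32
decoder_params = dense_params(encoding_dim, input_dim)   # 32*784 + 784
total = encoder_params + decoder_params
compression = input_dim / encoding_dim
print(encoder_params, decoder_params, total, compression)
```

Each 784-pixel image is squeezed through 32 numbers, a compression factor of 24.5.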

We will also create separate encoding and decoding functions, that can be used to extract latent features at test time.

# encoder: map an input to its encoded representation
encoder = Model(input_img, encoded)
# placeholder for an encoded input
encoded_input = Input(shape=(encoding_dim,))
# last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]
# decoder
decoder = Model(encoded_input, decoder_layer(encoded_input))

We can now set the optimizer and the loss function before training the auto-encoder model.

autoencoder.compile(optimizer='rmsprop', loss='binary_crossentropy')

Next, we need to get the fashion MNIST data and normalize it for training. Furthermore, we will flatten the $28\times28$ images to vectors of size 784. Please note that running the code below for the first time will download the full dataset and hence might take a few minutes.

from keras.datasets import fashion_mnist

(x_train, _), (x_test, _) = fashion_mnist.load_data()
# scale pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
# flatten each 28x28 image into a vector of 784 values
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
print(x_train.shape, x_test.shape)

Output: (60000, 784) (10000, 784)
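The normalization and flattening steps can be checked without downloading anything. The sketch below applies the same two operations to a small random `uint8` array that stands in for a batch of images; the array and its size are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for a batch of 28x28 grayscale images with uint8 pixels
images = rng.integers(0, 256, size=(6, 28, 28), dtype=np.uint8)

scaled = images.astype('float32') / 255.0                        # pixels now in [0, 1]
flat = scaled.reshape((len(scaled), np.prod(scaled.shape[1:])))  # (6, 784)

# the reshape is lossless: undoing it gives back the scaled images
restored = flat.reshape((-1, 28, 28))
print(flat.shape, flat.dtype, flat.min(), flat.max())
```

Scaling to $[0, 1]$ matters here because the decoder ends in a sigmoid and the loss is a binary cross-entropy, both of which assume targets in that range.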

We can now train our model for 100 epochs:

# input and target are the same data
history = autoencoder.fit(x_train, x_train,
                          epochs=100,
                          batch_size=256,
                          shuffle=True,
                          validation_data=(x_test, x_test))

This will print the per-epoch training and validation loss. We can also plot the loss history during training using the history object.

import matplotlib.pyplot as plt

def plot_train_history_loss(history):
    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper right')
    plt.show()

plot_train_history_loss(history)


After 100 epochs, the autoencoder reaches a stable train/test loss value of about 0.282. Let us look at how good a reconstruction this simple model produces!

# encode and decode some images from test set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder.predict(encoded_imgs)

def display_reconstructed(x_test, decoded_imgs, n=10):
    plt.figure(figsize=(20, 4))
    for i in range(n):
        # display original
        ax = plt.subplot(2, n, i + 1)
        plt.imshow(x_test[i].reshape(28, 28))
        plt.gray()
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)

        if decoded_imgs is not None:
            # display reconstruction
            ax = plt.subplot(2, n, i + 1 + n)
            plt.imshow(decoded_imgs[i].reshape(28, 28))
            plt.gray()
            ax.get_xaxis().set_visible(False)
            ax.get_yaxis().set_visible(False)
    plt.show()


display_reconstructed(x_test, decoded_imgs, 10)

The top row shows the original images, while the bottom row shows the reconstructed images. We can see that we are losing a lot of fine details.

[Figure: reconstructions from the basic autoencoder on fashion MNIST]

Sparsity Constraint

We can add an additional constraint to the above AE model: a sparsity constraint on the latent variables. Mathematically, this is achieved by adding a sparsity penalty $\Omega(\mathbf{h})$ on the bottleneck layer $\mathbf{h}$.

$$\min \mathcal{L} = \min E(x, x') + \Omega(\mathbf{h})$$

where $\mathbf{h}$ is the encoder output.

Sparsity is a desirable characteristic for an autoencoder because it allows the use of a greater number of hidden units (even more than the number of inputs), which gives the network the ability to learn different connections and extract different features (compared with the features extracted when the only constraint is on the number of hidden units). Moreover, sparsity can be used together with the constraint on the number of hidden units; the combination of these hyper-parameters then has to be tuned to achieve better performance.
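A common choice for the penalty $\Omega(\mathbf{h})$ is an L1 term on the activations. The NumPy sketch below shows why it favors sparse codes: two codes with the same L2 norm receive different penalties, and the one with fewer active units is cheaper. The function name, the weight `lam` and the toy codes are assumptions for illustration, and the scaling may differ from what a particular library applies internally.

```python
import numpy as np

def l1_activity_penalty(h, lam):
    """Omega(h) = lam * sum(|h|), summed over all activations."""
    return lam * float(np.sum(np.abs(h)))

dense_code = np.array([[1.0, 1.0, 1.0, 1.0]])    # L2 norm 2, all units active
sparse_code = np.array([[2.0, 0.0, 0.0, 0.0]])   # L2 norm 2, one unit active
lam = 1e-2
print(l1_activity_penalty(dense_code, lam), l1_activity_penalty(sparse_code, lam))
```

With the same overall activation energy, the sparse code pays half the penalty of the dense one.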

In Keras, a sparsity constraint can be added by passing an activity_regularizer to our Dense layer:

from keras import regularizers

encoding_dim = 32

input_img = Input(shape=(784,))
# add a Dense layer with a L1 activity regularizer
encoded = Dense(encoding_dim, activation='relu',
                activity_regularizer=regularizers.l1(1e-8))(input_img)
decoded = Dense(784, activation='sigmoid')(encoded)

autoencoder = Model(input_img, decoded)

Similar to the previous model, we can train this one as well, this time for 150 epochs. A regularized model is less likely to overfit and hence can be trained for longer.

autoencoder.compile(optimizer='rmsprop', loss='binary_crossentropy')
history = autoencoder.fit(x_train, x_train,
                          epochs=150,
                          batch_size=256,
                          shuffle=True,
                          validation_data=(x_test, x_test))

We get a loss very similar to that of the previous example. Here is a plot of loss values during training.

As expected, the reconstructed images also look quite similar to the previous ones.

decoded_imgs = autoencoder.predict(x_test)
display_reconstructed(x_test, decoded_imgs, 10)

[Figure: reconstructions from the sparse autoencoder on fashion MNIST]

Deep Autoencoders

So far we have been using only a single layer each for the encoder and the decoder. Given enough data, there is nothing that stops us from building deeper networks for both.

input_img = Input(shape=(784,))
encoded = Dense(128, activation='relu')(input_img)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)

decoded = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(784, activation='sigmoid')(decoded)

autoencoder = Model(input_img, decoded)

We can train this model, same as before.

autoencoder.compile(optimizer='rmsprop', loss='binary_crossentropy')
history = autoencoder.fit(x_train, x_train,
                          epochs=150,
                          batch_size=256,
                          shuffle=True,
                          validation_data=(x_test, x_test))

The average loss is now 0.277, as compared to ~0.285 before! Visually, the reconstructed images also look slightly better.

decoded_imgs = autoencoder.predict(x_test)
display_reconstructed(x_test, decoded_imgs, 10)

[Figure: reconstructions from the deep autoencoder on fashion MNIST]

Convolutional Autoencoders

Since our inputs are images, it makes sense to use convolutional neural networks (conv-nets) as encoders and decoders. In practical settings, autoencoders applied to images are almost always convolutional autoencoders, as they simply perform much better.

The encoder will consist of a stack of Conv2D and MaxPooling2D layers (max pooling being used for spatial down-sampling), while the decoder will consist of a stack of Conv2D and UpSampling2D layers. We will also be using BatchNormalization. One major difference between this network and the prior ones is that we now have 256 ($4 \times 4 \times 16$) elements in the bottleneck layer, as opposed to just 32 before!

You can read more about convolution-based autoencoders in further detail here.

from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D, BatchNormalization
from keras.models import Model
from keras import backend as K

input_img = Input(shape=(28, 28, 1))

x = Conv2D(32, (3, 3), activation='relu', padding='same', use_bias=False)(input_img)
x = BatchNormalization(axis=-1)(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(16, (3, 3), activation='relu', padding='same', use_bias=False)(x)
x = BatchNormalization(axis=-1)(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(16, (3, 3), activation='relu', padding='same', use_bias=False)(x)
x = BatchNormalization(axis=-1)(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

x = Conv2D(16, (3, 3), activation='relu', padding='same', use_bias=False)(encoded)
x = BatchNormalization(axis=-1)(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu', padding='same', use_bias=False)(x)
x = BatchNormalization(axis=-1)(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation='relu', padding='valid', use_bias=False)(x)
x = BatchNormalization(axis=-1)(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same', use_bias=False)(x)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='rmsprop', loss='binary_crossentropy')
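It can help to trace the spatial size of the feature maps through this network by hand. The sketch below does that arithmetic in plain Python, assuming the usual rules that pooling with padding='same' rounds the size up and that a stride-1 'valid' convolution shrinks it by kernel size minus one. The helper names are hypothetical.

```python
import math

def pool_same(size, pool=2):
    """Output size of pooling with padding='same' and stride equal to the pool size."""
    return math.ceil(size / pool)

def conv_valid(size, kernel=3):
    """Output size of a stride-1 convolution with padding='valid'."""
    return size - kernel + 1

# encoder: three Conv2D('same') + MaxPooling2D('same') stages
s = 28
encoder_sizes = []
for _ in range(3):
    s = pool_same(s)          # 'same' convolutions leave the size unchanged
    encoder_sizes.append(s)
bottleneck_elements = s * s * 16   # 16 filters in the last encoder stage

# decoder: upsample x2, upsample x2, one 'valid' 3x3 conv, upsample x2
s = s * 2                     # 4 -> 8
s = s * 2                     # 8 -> 16
s = conv_valid(s)             # 16 -> 14
s = s * 2                     # 14 -> 28
print(encoder_sizes, bottleneck_elements, s)
```

This is why one decoder convolution uses padding='valid': it trims 16 down to 14, so that the last up-sampling step lands exactly on 28.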

To train it, we will use the original fashion MNIST images with shape (samples, 28, 28, 1), and we will just normalize pixel values between 0 and 1.

(x_train, _), (x_test, _) = fashion_mnist.load_data()

x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = np.reshape(x_train, (len(x_train), 28, 28, 1))
x_test = np.reshape(x_test, (len(x_test), 28, 28, 1))

Similar to before, we can train this model for 150 epochs. However, unlike before, we will checkpoint the model during training to save the best model, based on the minimum validation loss.

from keras.callbacks import ModelCheckpoint

fpath = "weights-ae-{epoch:02d}-{val_loss:.3f}.hdf5"
# save a checkpoint only when the validation loss improves
callbacks = [ModelCheckpoint(fpath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')]
history = autoencoder.fit(x_train, x_train,
                          epochs=150,
                          batch_size=256,
                          shuffle=True,
                          validation_data=(x_test, x_test),
                          callbacks=callbacks)
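To see what the file-name template and the save_best_only logic produce, here is a plain-Python imitation of that behavior. It is only a sketch of the idea, not the callback's actual implementation, and the validation losses are made-up numbers.

```python
fpath = "weights-ae-{epoch:02d}-{val_loss:.3f}.hdf5"

# hypothetical validation losses for five epochs (made-up numbers)
val_losses = [0.310, 0.295, 0.297, 0.289, 0.291]

best = float("inf")
saved = []
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best:                       # mode='min': lower is better
        best = val_loss
        saved.append(fpath.format(epoch=epoch, val_loss=val_loss))
print(saved)
```

Only epochs that improve on the best validation loss so far produce a file, so the most recently written file holds the best weights.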

The lowest validation loss is now 0.265, significantly lower than the previous best value of 0.277. We will first load the saved best model weights, and then plot the original and the reconstructed images from the test dataset.

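# load the best checkpoint saved above with autoencoder.load_weights(...) before predicting;
# the exact file name depends on the epoch and validation loss of your run
decoded_imgs = autoencoder.predict(x_test)
display_reconstructed(x_test, decoded_imgs, 10)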

[Figure: reconstructions from the convolutional autoencoder on fashion MNIST]

At first glance, this does not seem to be much of an improvement over the deep autoencoder result. However, if you look closely, small feature details start to appear in the reconstructed images. To improve these models further, we will likely have to go for deeper and more complex convolutional networks.

Denoising Autoencoders

Another common variant of AE networks is the one that learns to remove noise from the input. Mathematically, this is achieved by modifying the reconstruction error of the loss function.

Traditionally, autoencoders minimize some loss function:

$$L\Big(x, g\big(f(x)\big)\Big)$$

where $L$ is a loss function penalizing the reconstructed input $g\big(f(x)\big)$ for being dissimilar to the input $x$. Here, $g(.)$ is the decoder and $f(.)$ is the encoder. A denoising autoencoder (DAE) instead minimizes

$$L\Big(x, g\big(f(\hat{x})\big)\Big)$$

where $\hat{x}$ is a copy of $x$ that has been corrupted by some form of noise. DAEs must therefore undo this corruption rather than simply copying their input. Training of DAEs forces $f(.)$, the encoder, and $g(.)$, the decoder, to implicitly learn the structure of $p_{data}(x)$, the distribution of the input data $x$. Please refer to the works of Alain and Bengio (2013) and Bengio et al. (2013).
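The key point of the DAE objective is that the loss is always measured against the clean $x$, even though the model only sees $\hat{x}$. The toy NumPy sketch below illustrates this on a 1D signal. The moving-average filter is a hypothetical stand-in for a trained $g(f(\cdot))$, and the signal, noise level and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

t = np.linspace(0.0, 4.0 * np.pi, 200)
x = np.sin(t)                                   # clean signal
x_hat = x + 0.3 * rng.normal(size=x.shape)      # corrupted copy fed to the model

def mse(target, prediction):
    return float(np.mean((target - prediction) ** 2))

def copy_input(v):
    """What an unconstrained autoencoder could learn: the identity."""
    return v

def moving_average(v, width=5):
    """Hypothetical stand-in for a trained g(f(.)): a simple smoothing filter."""
    kernel = np.ones(width) / width
    return np.convolve(v, kernel, mode="same")

# the DAE loss always compares against the CLEAN x, never against x_hat
loss_copy = mse(x, copy_input(x_hat))
loss_smooth = mse(x, moving_average(x_hat))
print(loss_copy, loss_smooth)
```

Copying the input is no longer a good solution under this loss; any mapping that removes part of the noise does better.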

As an example, we will first introduce noise into our train and test data by adding a Gaussian noise matrix and clipping the images between 0 and 1.

(x_train, _), (x_test, _) = fashion_mnist.load_data()

x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = np.reshape(x_train, (len(x_train), 28, 28, 1))
x_test = np.reshape(x_test, (len(x_test), 28, 28, 1))

noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_test_noisy = x_test + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape)

x_train_noisy = np.clip(x_train_noisy, 0., 1.)
x_test_noisy = np.clip(x_test_noisy, 0., 1.)

Here is how the corrupted images look. They are barely recognizable!

display_reconstructed(x_test_noisy, None)

[Figure: sample of noisy fashion MNIST images]

We will use a slightly modified version of the previous convolutional autoencoder, one with a larger number of filters in the intermediate layers. This increases the capacity of our model.

input_img = Input(shape=(28, 28, 1))

x = Conv2D(32, (3, 3), activation='relu', padding='same', use_bias=False)(input_img)
x = BatchNormalization(axis=-1)(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same', use_bias=False)(x)
x = BatchNormalization(axis=-1)(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same', use_bias=False)(x)
x = BatchNormalization(axis=-1)(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

x = Conv2D(32, (3, 3), activation='relu', padding='same', use_bias=False)(encoded)
x = BatchNormalization(axis=-1)(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same', use_bias=False)(x)
x = BatchNormalization(axis=-1)(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation='relu', padding='valid', use_bias=False)(x)
x = BatchNormalization(axis=-1)(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same', use_bias=False)(x)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='rmsprop', loss='binary_crossentropy')

We can now train this for 150 epochs. Notice the change in the training data!

fpath = "weights-dae-{epoch:02d}-{val_loss:.3f}.hdf5"
callbacks = [ModelCheckpoint(fpath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')]
# noisy images are the input, clean images are the target
history = autoencoder.fit(x_train_noisy, x_train,
                          epochs=150,
                          batch_size=256,
                          shuffle=True,
                          validation_data=(x_test_noisy, x_test),
                          callbacks=callbacks)

The loss has converged to a value of 0.287. Let's take a look at the results: the top row shows the noisy images and the bottom row shows the images reconstructed by the DAE.

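# load the best checkpoint saved above with autoencoder.load_weights(...) before predicting;
# the exact file name depends on the epoch and validation loss of your run
decoded_imgs = autoencoder.predict(x_test_noisy)
display_reconstructed(x_test_noisy, decoded_imgs, 10)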

[Figure: noisy inputs and reconstructions from the convolutional denoising autoencoder]

Sequence-to-Sequence Autoencoders

If your inputs are sequences, rather than 2D images, then you may want to use as encoder and decoder a type of model that can capture temporal structure, such as an LSTM. To build an LSTM-based autoencoder, first use an LSTM encoder to turn your input sequences into a single vector that contains information about the entire sequence, then repeat this vector $n$ times (where $n$ is the number of time steps in the output sequence), and run an LSTM decoder to turn this constant sequence into the target sequence.
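The encode, repeat, decode data flow described above can be sketched at the level of array shapes in NumPy. To keep the example short, a plain tanh RNN with random, untrained weights is swapped in for the LSTM; all sizes and parameter names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, features, latent = 4, 6, 3, 5

x = rng.normal(size=(batch, timesteps, features))

# untrained toy parameters of a plain tanh RNN (a simplification of an LSTM)
Wx = rng.normal(scale=0.1, size=(features, latent))
Wh = rng.normal(scale=0.1, size=(latent, latent))

def encode(x):
    """Run the RNN over time and return only the final hidden state."""
    h = np.zeros((x.shape[0], latent))
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ Wx + h @ Wh)
    return h                                   # (batch, latent)

code = encode(x)                               # one vector per sequence
repeated = np.repeat(code[:, None, :], timesteps, axis=1)   # (batch, n, latent)

# decoder: another toy RNN that reads the constant sequence and emits one
# output vector per time step
Ux = rng.normal(scale=0.1, size=(latent, latent))
Uh = rng.normal(scale=0.1, size=(latent, latent))
Wout = rng.normal(scale=0.1, size=(latent, features))

h = np.zeros((batch, latent))
outputs = []
for t in range(timesteps):
    h = np.tanh(repeated[:, t, :] @ Ux + h @ Uh)
    outputs.append(h @ Wout)
x_rec = np.stack(outputs, axis=1)              # (batch, timesteps, features)
print(code.shape, repeated.shape, x_rec.shape)
```

In Keras, the same structure would typically be built from the LSTM and RepeatVector layers, with the decoder LSTM returning the full output sequence.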

Variational Autoencoders (VAE)

Variational autoencoders (VAEs) are a stochastic version of regular autoencoders. A VAE is a type of autoencoder with added constraints on the encoded representations being learned. More precisely, it is an autoencoder that learns a latent variable model for its input data. So instead of letting your neural network learn an arbitrary function, you are learning the parameters of a probability distribution modeling your data. If you sample points from this distribution, you can generate new input data samples: a VAE is a "generative model". The cartoon on the side shows a typical architecture of a VAE model. Please refer to the research papers by Kingma et al. and Rezende et al. for a thorough mathematical analysis.


In the probability model framework, a variational autoencoder contains a specific probability model of data $x$ and latent variables $z$ (most commonly assumed to be Gaussian). We can write the joint probability of the model as $p(x, z) = p(x \vert z)p(z)$. The generative process can be written as follows, for each data point $i$:

  • Draw latent variables $z_i \sim p(z)$
  • Draw data point $x_i \sim p(x \vert z)$

In terms of an implementation of a VAE, the latent variables are generated by the encoder and the data points are drawn by the decoder. The latent variable is hence a random variable, sampled from the distribution that the encoder outputs. To train the encoder and the decoder as a neural network, you need to backpropagate through this random sampling, and that is a problem because gradients cannot flow through a random node. To overcome this, the reparameterization trick is used. Most commonly, the distribution over the latent space is assumed to be Gaussian, so a sample can be written as a deterministic function of the encoder outputs plus noise drawn from a standard normal distribution, $\mathcal{N}(0, 1)$:

$$z = \mu + L \, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$$

Here $\mu$ and $L$ are the outputs of the encoder. Therefore, during backpropagation, all we need are the partial derivatives w.r.t. $\mu$ and $L$. In the cartoon above, $\mu$ represents the mean vector of the latent variable and $L$ represents its standard deviation.
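The reparameterization trick is easy to check numerically. In the NumPy sketch below, the toy values of `mu` and `sigma` stand in for encoder outputs (an assumption for illustration); the randomness lives entirely in `eps`, and the samples still have the intended mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])          # mean produced by the encoder (toy values)
sigma = np.array([0.5, 2.0])        # standard deviation produced by the encoder

eps = rng.standard_normal(size=(100_000, 2))   # noise, independent of mu and sigma
z = mu + sigma * eps                            # deterministic in mu and sigma

# dz/dmu = 1 and dz/dsigma = eps, so gradients can reach the encoder outputs
print(z.mean(axis=0), z.std(axis=0))
```

Because `z` is an ordinary differentiable function of `mu` and `sigma` once `eps` is fixed, backpropagation can treat the sampling step like any other layer.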

You can read more about VAE models at Reference 1, Reference 2, Reference 3 and Reference 4.

In more practical terms, VAEs represent the latent space (bottleneck layer) as a Gaussian random variable (enabled by a constraint on the loss function). Hence, the loss function for VAEs consists of two terms: a reconstruction loss forcing the decoded samples to match the initial inputs (just like in our previous autoencoders), and the KL divergence between the learned latent distribution and the prior distribution, acting as a regularization term.

$$\min \mathcal{L}(x, x') = \min E(x, x') + KL\big(q(z\vert x)\vert \vert p(z)\big)$$

Here, the first term is the reconstruction loss as before (in a typical autoencoder). The second term is the Kullback-Leibler divergence between the encoder's distribution $q(z\vert x)$ and the prior $p(z)$, typically a standard Gaussian.

As typically (especially for images) the binary cross-entropy is used as the reconstruction loss term, the above loss term for the VAEs can be written as,

$$\min \mathcal{L}(x, x') = \min \Big( -\mathbf{E}_{z\sim q(z\vert x)}\big[ \log p(x' \vert z)\big] + KL\big(q(z\vert x) \vert \vert p(z)\big) \Big)$$

To summarize a typical implementation of a VAE: first, an encoder network turns the input samples $x$ into two parameters in a latent space, z_mean and z_log_sigma. Then, we randomly sample similar points $z$ from the latent normal distribution that is assumed to generate the data, via $z$ = z_mean + exp(z_log_sigma) * $\epsilon$, where $\epsilon$ is a random normal tensor. Finally, a decoder network maps these latent space points back to the original input data.
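For a diagonal Gaussian encoder and a standard normal prior, the KL term has a well-known closed form. The NumPy sketch below evaluates it, parameterized by the log-variance (named `z_log_var` here, so that the variance is `exp(z_log_var)`); the function name and test values are assumptions for illustration.

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_log_var):
    """KL( N(z_mean, exp(z_log_var)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1)

# encoder output that already matches the prior: zero mean, unit variance
matches_prior = kl_to_standard_normal(np.zeros((1, 2)), np.zeros((1, 2)))
# encoder output with the mean shifted to 1 in both latent dimensions
shifted_mean = kl_to_standard_normal(np.ones((1, 2)), np.zeros((1, 2)))
print(matches_prior, shifted_mean)
```

The term vanishes when the encoder's distribution equals the prior and grows as the mean or variance drifts away from it, which is what makes it act as a regularizer.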

We can now implement VAE for the fashion MNIST data. To demonstrate its generalization, we will generate two versions: one with MLP and the other with the use of convolution and deconvolution layers.

In the first implementation below, we will be using a simple 2-layer deep encoder and a 2-layer deep decoder. Note the use of the reparameterization trick via the sampling() method and a Lambda layer.

from scipy.stats import norm

from keras.layers import Input, Dense, Lambda, Flatten, Reshape
from keras.layers import Conv2D, Conv2DTranspose
from keras.models import Model
from keras import backend as K
from keras import metrics

batch_size = 128
original_dim = 784
latent_dim = 2
intermediate_dim = 256
epochs = 100
epsilon_std = 1.0


x = Input(shape=(original_dim,))
h = Dense(intermediate_dim, activation='relu')(x)
z_mean = Dense(latent_dim)(h)
z_log_var = Dense(latent_dim)(h)


def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim), mean=0.,
                              stddev=epsilon_std)
    return z_mean + K.exp(z_log_var / 2) * epsilon

# note that "output_shape" isn't necessary with the TensorFlow backend
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])

# to reuse these later
decoder_h = Dense(intermediate_dim, activation='relu')
decoder_mean = Dense(original_dim, activation='sigmoid')
h_decoded = decoder_h(z)
x_decoded_mean = decoder_mean(h_decoded)

# instantiate VAE model
vae = Model(x, x_decoded_mean)

As described above, we need to include two loss terms: the binary cross-entropy as before, and the KL divergence between the encoder's latent variable distribution (sampled via the reparameterization trick) and the prior, a standard normal distribution.

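# Compute VAE loss
xent_loss = original_dim * metrics.binary_crossentropy(x, x_decoded_mean)
kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
vae_loss = K.mean(xent_loss + kl_loss)

# attach the custom loss to the model; no `loss` argument is needed in compile
vae.add_loss(vae_loss)
vae.compile(optimizer='rmsprop')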

We can now load the fashion MNIST dataset, normalize it and reshape it to the correct dimensions so that it can be used with our VAE model.

# train the VAE on fashion MNIST images
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
# flatten each 28x28 image into a vector of 784 values
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))

We will now train our model for 100 epochs.

# the loss is attached to the model, so no separate target is passed
history = vae.fit(x_train,
                  shuffle=True,
                  epochs=epochs,
                  batch_size=batch_size,
                  validation_data=(x_test, None))

Below is the loss for the training and the validation datasets during the training epochs. We find that the loss has converged in 100 epochs without any sign of overfitting.

Because our latent space is two-dimensional, there are a few cool visualizations that can be done at this point. One is to look at the neighborhoods of different classes on the latent 2D plane:

# build a model to project inputs on the latent space
encoder = Model(x, z_mean)

# display a 2D plot of the fashion classes in the latent space
def plot_latentSpace(encoder, x_test, y_test, batch_size):
    x_test_encoded = encoder.predict(x_test, batch_size=batch_size)
    plt.figure(figsize=(6, 6))
    plt.scatter(x_test_encoded[:, 0], x_test_encoded[:, 1], c=y_test, cmap='tab10')
    plt.colorbar()
    plt.show()

plot_latentSpace(encoder, x_test, y_test, batch_size)

Each of these colored clusters is a type of fashion item. Clusters that are close together are items that are structurally similar (i.e. items that share information in the latent space). We can also look at this plot from a different perspective: the better our VAE model, the larger the separation between clusters of very dissimilar fashion items.

Because the VAE is a generative model (as described above), we can also use it to generate new images! Here, we will scan the latent plane, sampling latent points at regular intervals, and generating the corresponding image for each of these points. This gives us a visualization of the latent manifold that "generates" the fashion MNIST images.

# generator that can sample from the learned distribution
decoder_input = Input(shape=(latent_dim,))
_h_decoded = decoder_h(decoder_input)
_x_decoded_mean = decoder_mean(_h_decoded)
generator = Model(decoder_input, _x_decoded_mean)

def plot_generatedImages(generator):
    # 2D manifold of the fashion images
    n = 15  # figure with 15x15 images
    image_size = 28
    figure = np.zeros((image_size * n, image_size * n))
    # linearly spaced coordinates on the unit square were transformed through the
    # inverse CDF (ppf) of the Gaussian
    # to produce values of the latent variables z, since the prior of the latent
    # space is Gaussian
    grid_x = norm.ppf(np.linspace(0.005, 0.995, n))
    grid_y = norm.ppf(np.linspace(0.005, 0.995, n))

    for i, yi in enumerate(grid_x):
        for j, xi in enumerate(grid_y):
            z_sample = np.array([[xi, yi]])
            x_decoded = generator.predict(z_sample)
            digit = x_decoded[0].reshape(image_size, image_size)
            figure[i * image_size: (i + 1) * image_size,
                   j * image_size: (j + 1) * image_size] = digit

    plt.figure(figsize=(10, 10))
    plt.imshow(figure, cmap='Greys_r')
    plt.show()

plot_generatedImages(generator)

We find our model has done only a so-so job of generating new images. Still, given the simplicity of the model and the very small amount of code we had to write, this is quite incredible.

We can next build a more realistic VAE using conv and deconv layers. Below is the full code to build and train the model.

# input image dimensions
img_rows, img_cols, img_chns = 28, 28, 1
# number of convolutional filters to use
filters = 64
# convolution kernel size
num_conv = 3

batch_size = 128
if K.image_data_format() == 'channels_first':
    original_img_size = (img_chns, img_rows, img_cols)
else:
    original_img_size = (img_rows, img_cols, img_chns)
latent_dim = 2
intermediate_dim = 128
epsilon_std = 1.0
epochs = 150

x = Input(shape=original_img_size)
conv_1 = Conv2D(img_chns,
                kernel_size=(2, 2),
                padding='same', activation='relu')(x)
conv_2 = Conv2D(filters,
                kernel_size=(2, 2),
                padding='same', activation='relu',
                strides=(2, 2))(conv_1)
conv_3 = Conv2D(filters,
                kernel_size=num_conv,
                padding='same', activation='relu',
                strides=1)(conv_2)
conv_4 = Conv2D(filters,
                kernel_size=num_conv,
                padding='same', activation='relu',
                strides=1)(conv_3)
flat = Flatten()(conv_4)
hidden = Dense(intermediate_dim, activation='relu')(flat)

z_mean = Dense(latent_dim)(hidden)
z_log_var = Dense(latent_dim)(hidden)


def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim),
                              mean=0., stddev=epsilon_std)
    # standard deviation = exp(log variance / 2), consistent with the KL term below
    return z_mean + K.exp(z_log_var / 2) * epsilon

# note that "output_shape" isn't necessary with the TensorFlow backend
# so you could write `Lambda(sampling)([z_mean, z_log_var])`
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])

# we instantiate these layers separately so as to reuse them later
decoder_hid = Dense(intermediate_dim, activation='relu')
decoder_upsample = Dense(filters * 14 * 14, activation='relu')

if K.image_data_format() == 'channels_first':
    output_shape = (batch_size, filters, 14, 14)
else:
    output_shape = (batch_size, 14, 14, filters)

decoder_reshape = Reshape(output_shape[1:])
decoder_deconv_1 = Conv2DTranspose(filters,
                                   kernel_size=num_conv,
                                   padding='same',
                                   strides=1,
                                   activation='relu')
decoder_deconv_2 = Conv2DTranspose(filters,
                                   kernel_size=num_conv,
                                   padding='same',
                                   strides=1,
                                   activation='relu')
if K.image_data_format() == 'channels_first':
    output_shape = (batch_size, filters, 29, 29)
else:
    output_shape = (batch_size, 29, 29, filters)
decoder_deconv_3_upsamp = Conv2DTranspose(filters,
                                          kernel_size=(3, 3),
                                          strides=(2, 2),
                                          padding='valid',
                                          activation='relu')
decoder_mean_squash = Conv2D(img_chns,
                             kernel_size=2,
                             padding='valid',
                             activation='sigmoid')

hid_decoded = decoder_hid(z)
up_decoded = decoder_upsample(hid_decoded)
reshape_decoded = decoder_reshape(up_decoded)
deconv_1_decoded = decoder_deconv_1(reshape_decoded)
deconv_2_decoded = decoder_deconv_2(deconv_1_decoded)
x_decoded_relu = decoder_deconv_3_upsamp(deconv_2_decoded)
x_decoded_mean_squash = decoder_mean_squash(x_decoded_relu)

# instantiate VAE model
vae = Model(x, x_decoded_mean_squash)

# define the loss function
xent_loss = img_rows * img_cols * metrics.binary_crossentropy(
    K.flatten(x),
    K.flatten(x_decoded_mean_squash))
kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
vae_loss = K.mean(xent_loss + kl_loss)

# attach the custom loss to the model; no `loss` argument is needed in compile
vae.add_loss(vae_loss)
vae.compile(optimizer='rmsprop')

# load the data
(x_train, _), (x_test, y_test) = fashion_mnist.load_data()

x_train = x_train.astype('float32') / 255.
x_train = x_train.reshape((x_train.shape[0],) + original_img_size)
x_test = x_test.astype('float32') / 255.
x_test = x_test.reshape((x_test.shape[0],) + original_img_size)

# train the VAE model
history = vae.fit(x_train,
                  shuffle=True,
                  epochs=epochs,
                  batch_size=batch_size,
                  validation_data=(x_test, None))

Similar to the case of simple VAE model, we can look at the neighborhoods of different classes on the latent 2D plane:

# project inputs on the latent space
encoder = Model(x, z_mean)
plot_latentSpace(encoder, x_test, y_test, batch_size)

We can now see that the separation between the different classes of images is larger than in the simple MLP-based VAE model.

Finally, we can now generate new images from our, hopefully, better VAE model.

# generator that can sample from the learned distribution
decoder_input = Input(shape=(latent_dim,))
_hid_decoded = decoder_hid(decoder_input)
_up_decoded = decoder_upsample(_hid_decoded)
_reshape_decoded = decoder_reshape(_up_decoded)
_deconv_1_decoded = decoder_deconv_1(_reshape_decoded)
_deconv_2_decoded = decoder_deconv_2(_deconv_1_decoded)
_x_decoded_relu = decoder_deconv_3_upsamp(_deconv_2_decoded)
_x_decoded_mean_squash = decoder_mean_squash(_x_decoded_relu)
generator = Model(decoder_input, _x_decoded_mean_squash)

# reuse the plotting helper defined for the MLP-based VAE
plot_generatedImages(generator)

[Figure: images generated by the convolutional VAE]

Usage of Autoencoders

The most common uses of autoencoders are:

  • Dimensionality Reduction: Dimensionality reduction was one of the first applications of representation learning and deep learning. Lower-dimensional representations can improve performance on many tasks, such as classification. Models of smaller spaces consume less memory and runtime. The hints provided by the mapping to the lower-dimensional space aid generalization. Due to their non-linear nature, autoencoders can perform better than traditional techniques like PCA, kernel PCA etc.
  • Denoising and Transformation: You can distort the data and add some noise to it before feeding it to DAEs. This can help in generalizing over the test set. AEs are also useful in image transformation tasks, e.g. document cleaning, applying color to images, medical image segmentation using U-net (a variant of autoencoders) etc.
  • Information Retrieval: the task of finding entries in a database that resemble a query entry. This task derives the usual benefits from dimensionality reduction that other tasks do, but also derives the additional benefit that search can become extremely efficient in certain kinds of low-dimensional spaces.

  • In Natural Language Processing

    • Word Embeddings
    • Machine Translation
    • Document Clustering
    • Sentiment Analysis
    • Paraphrase Detection
  • Image/data Generation: We saw theoretical details of the generative nature of VAEs above. See this blog post by OpenAI for a detailed review of image generation.

  • Anomaly detection: For highly imbalanced data (like credit card fraud detection, defects in manufacturing etc.) you may have sufficient data for the positive class and very few or no data for the negative class. In such situations, you can train an AE on your positive data to learn features, and then compute the reconstruction error on the training set to find a threshold. During testing, you can use this threshold to reject those test instances whose reconstruction error is greater than this threshold. However, finding a threshold that generalizes well to unseen test cases is challenging. VAEs have been used as an alternative for this task, where the reconstruction error is probabilistic and hence easier to generalize. See this article by FICO where they use autoencoders for detecting anomalies in credit scores.
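The thresholding step in the anomaly detection item above can be sketched with NumPy. The reconstruction errors here are synthetic numbers drawn from made-up distributions, standing in for errors computed by a trained autoencoder, and the 99th percentile is just one possible choice of threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up reconstruction errors: normal data reconstructs well, anomalies do not
train_errors = rng.gamma(shape=2.0, scale=0.01, size=5000)      # normal class only
test_normal = rng.gamma(shape=2.0, scale=0.01, size=1000)
test_anomalous = rng.gamma(shape=2.0, scale=0.01, size=50) + 0.2

# threshold: the 99th percentile of errors seen on normal training data
threshold = np.percentile(train_errors, 99)

def is_anomaly(errors, threshold):
    """Flag instances whose reconstruction error exceeds the threshold."""
    return errors > threshold

false_alarm_rate = is_anomaly(test_normal, threshold).mean()
detection_rate = is_anomaly(test_anomalous, threshold).mean()
print(threshold, false_alarm_rate, detection_rate)
```

The percentile trades false alarms against missed anomalies; on real data the two error distributions overlap far more than in this toy setup, which is why choosing the threshold is the hard part.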

This is it! It's been quite a long article. I hope it is helpful to some of you. Please let me know via the comments below if there are any particular issues/concepts you would like me to go over in more detail. I would also love to know if there is any particular topic in machine/deep learning you would like me to cover in future posts.
