Proof of Neural Ordinary Differential Equation’s augmented state

In this blog post, my objective is to clarify the proofs of Appendix B.2 of this paper. This is a follow up of my previous article about Neural ODEs detailing the maths but without much proofs. Here, I will rewrite the proofs of the derivative of the augmented adjoint state, that is, the gradient of the loss L w.r.t. z(t), \theta(t), and t.


  1. f(z(t), \theta(t), t) = \dfrac{dz(t)}{dt}
  2. Z = \begin{bmatrix} z(t), & \theta(t), & t \end{bmatrix}
  3. a(t) = \dfrac{dL}{dZ(t)} = \begin{bmatrix} \dfrac{dL}{dz(t)}, & \dfrac{dL}{d\theta(t)}, & \dfrac{dL}{dt} \end{bmatrix} = \begin{bmatrix} a_z(t), & a_\theta(t), & a_t(t) \end{bmatrix}
  4. g(Z(t), t) = \dfrac{dZ(t)}{dt} = \begin{bmatrix} \dfrac{dz(t)}{dt}, & \dfrac{d\theta(t)}{dt}, & \dfrac{dt}{dt} \end{bmatrix} = \begin{bmatrix} f, & 0_\textbf{E}, & 1 \end{bmatrix} where 0_\textbf{E} is a null matrix of appropriate size

Then, we can define
Z(t+\varepsilon) = Z(t) + \int_{t}^{t+\varepsilon}{g(Z(t), t)dt} = T_\varepsilon(Z(t), t)
notice that
a(t) = \dfrac{dL}{dZ(t)} = \dfrac{dL}{dZ(t+\varepsilon)}\dfrac{dZ(t+\varepsilon)}{dZ(t)} = a(t+\varepsilon)\dfrac{\partial T_\varepsilon(Z(t), t)}{\partial Z(t)}
Thus we can derive a(t):

\frac{d a(t)}{d t} = \lim_{\varepsilon \rightarrow 0^{+}} \frac{a(t+\varepsilon)-a(t)}{\varepsilon} \\= \lim_{\varepsilon \rightarrow 0^{+}} \frac{a(t+\varepsilon)-a(t+\varepsilon) \frac{\partial}{\partial Z(t)} T_{\varepsilon}(Z(t), t)}{\varepsilon} \\=\lim_{\varepsilon \rightarrow 0^{+}} \frac{a(t+\varepsilon)-a(t+\varepsilon) \frac{\partial}{\partial Z(t)}\left(Z(t)+\varepsilon g(Z(t), t)+\mathcal{O}\left(\varepsilon^{2}\right)\right)}{\varepsilon}\\=\lim_{\varepsilon \rightarrow 0^{+}} \frac{a(t+\varepsilon)-a(t+\varepsilon)\left(I+\varepsilon \frac{\partial g(Z(t), t)}{\partial Z(t)}+\mathcal{O}\left(\varepsilon^{2}\right)\right)}{\varepsilon}\\=\lim_{\varepsilon \rightarrow 0^{+}} \frac{-\varepsilon a(t+\varepsilon) \frac{\partial g(Z(t), t)}{\partial Z(t)}+\mathcal{O}(\varepsilon^2)}{\varepsilon}\\=\lim_{\varepsilon \rightarrow 0^{+}} -a(t+\varepsilon) \frac{\partial g(Z(t), t)}{\partial Z(t)}+\mathcal{O}(\varepsilon)\\=-a(t) \frac{\partial g(Z(t), t)}{\partial Z(t)}

As you can see, the proof of appendix B.1 holds for any vector Z. What’s more interesting is the way we can exploit this to get \dfrac{da_\theta(t)}{dt}:

=\begin{bmatrix} a_z(t), & a_\theta(t), & a_t(t) \end{bmatrix} \cdot \begin{pmatrix}\dfrac{df(z(t), \theta(t), t)}{dz} & \dfrac{df(z(t), \theta(t), t)}{d\theta} & \dfrac{df(z(t), \theta(t), t)}{dt} \\ \dfrac{d\theta/dt}{dz} & \dfrac{d\theta/dt}{d\theta} & \dfrac{d\theta/dt}{dt} \\ \dfrac{dt/dt}{dz} & \dfrac{dt/dt}{d\theta} & \dfrac{dt/dt}{dt} \end{pmatrix}
But as mentioned in the paper, \dfrac{d\theta(t)}{dt} = 0 and \dfrac{dt}{dt} = 1 which means that the second and third rows are null (derivatives of constants). Hence:
=\begin{bmatrix} a_z(t), & a_\theta(t), & a_t(t) \end{bmatrix} \cdot \begin{pmatrix} \dfrac{df}{dz} & \dfrac{df}{d\theta} & \dfrac{df}{dt} \\ 0_\textbf{E} & 0_\textbf{E} & 0_\textbf{E} \\ 0_\textbf{E} & 0_\textbf{E} & 0_\textbf{E} \end{pmatrix} \\= -\begin{bmatrix}a_z(t)\dfrac{df}{dz} + a_\theta 0_\textbf{E} + a_t 0_\textbf{E} \\ a_z(t)\dfrac{df}{d\theta} + a_\theta 0_\textbf{E} + a_t 0_\textbf{E} \\ a_z(t)\dfrac{df}{dt} + a_\theta 0_\textbf{E} + a_t 0_\textbf{E} \end{bmatrix} = -\begin{bmatrix}a_z(t)\dfrac{df}{dz} \\ a_z(t)\dfrac{df}{d\theta} \\ a_z(t)\dfrac{df}{dt}\end{bmatrix}

Now, we’ve got:
\dfrac{da(t)}{dt} = - \begin{bmatrix} a_z(t)\dfrac{df}{dz}, & a_z(t)\dfrac{df}{d\theta}, & a_z(t)\dfrac{df}{dt} \end{bmatrix} = \dfrac{d}{dt}\begin{bmatrix} \dfrac{dL}{dz(t)}, & \dfrac{dL}{d\theta(t)}, & \dfrac{dL}{dt} \end{bmatrix}
\dfrac{d}{dt}\left(\dfrac{dL}{d\theta(t)}\right) = -a_z(t)\dfrac{df}{d\theta}\\ \implies \dfrac{dL}{d\theta(t)} = -\int{a_z(t)\dfrac{df}{d\theta}dt}


The PDF version of this is available at this link (with better formatting)

Neural Ordinary Differential Equations

This paper is rather not for the mathematically faint hearted, hence I’d advise a quick linear algebra and differential multivariable calculus review beforehand. To follow along the reviewing, you can toy with this amazing visualisation. The goal of this article is to get data scientists and programmers up to speed with the necessary mathematical background to understand this paper, not to merely restate what the paper says.


Layer paradigm

This paper presents a novel view of the whole neural network paradigm.
Up to that point, networks consisted mainly of assembling (sequentially or not) layers, which outputs are noted h_t. These layers are in reality functions depending on parameters (noted \theta_t) influencing their output in diverse manners. You’ve hopefully heard of the classical Dense/Logistic/Linear/Fully Connected layer which are usually followed by a non-linear activation function (sigmoïd, tanh, ReLU and many others), but also the convolutional layers used in CNN.

In the classical networks, these assembled layers are trained using gradient descent (and its variants) which is an algorithm that tries to optimise the loss function (noted L(h_{t+T}) ) by tuning the parameters of these layers.

To do that, it first computes the derivative of the loss (written \dfrac{dL(h_{t+T})}{dh_{t+T}} ) with respect to each intermediate layer’s output {h_t, h_{t+1}, h_{t+2}, ..., h_{t+T}} and then the derivative of the loss with respect to each parameter (\dfrac{dL(h_{t+T})}{d\theta_{t+T}} = \dfrac{dL(h_{t+T})}{dh_{t+T}} \dfrac{dh_{t+T}}{d\theta_{t+T}}), this is called back propagation (a.k.a. reverse mode derivation). The gradient descend algorithm then tunes the parameters by these derivatives.

To make predictions using those layers, we perform forward propagation, which is just applying consecutively each layer operation onto our input to get to our output. Formally, for one layer (the f function) with tuned parameters \theta_t this can be written like so:

h_{t+1} = f(h_t, \theta_t)

However, recently, a new kind of convolutional neural network has outperform significantly the state of the art by adding a simple little tweak: bypassing the layers. Their idea is as follows: once a sufficient depth is reached and the network judges it needn’t go deeper, it should be able to bypass himself and directly skip useless layers. To do that quickly, they simply added the previous input (the residual) to the output, hence were born Residual Neural Networks.
Mathematically, that tweak looks like this:

h_{t+1} = \bm{h_t} + f(h_t, \theta_t)

Differential Equations

To dive nose first into the mathematical requirements, let’s talk a bit about differential equations. Differential equations are mathematical objects which study dates back to the invention of calculus itself (chapter 2 of Newton’s 1671 work “Methodus fluxionum et Serierum Infinitarum”). They are equations relating a function to its derivative(s). This method allows physician to describe complex dynamical systems’ evolutions through time, like the Navier-Stokes equations allowing us to model weather somewhat accuratly.

To enter the mathematical formalities, what we call Ordinary Differential Equations (ODE) of the n-th order are equations of the following form:

a_1(t)z(t) + a_2(t)\dfrac{dz(t)}{dt} + a_3(t)\dfrac{d^{2}z(t)}{dt^2} + ... + a_n(t)\dfrac{d^{n}z(t)}{dt^n} = 0

This equation can be rewritten under the more digest and familiar form:

\dfrac{dz(t)}{dt} = f(z(t), t)

(where f(z(t), t) = -\frac{1}{a_2(t)}\left(a_1(t)z(t) + a_3(t)\dfrac{d^{2}z(t)}{dt^2} + ... + a_n(t)\dfrac{d^{n}z(t)}{dt^n}\right) but that’s beside the point.)

Notice that the above form doesn’t care about the ODE’s order, meaning any ODE can be written in that form, nay, any differential equation if we change f(z(t), t) for something else !

ODEs also allow to predict the future state(s) of a system given its current state. This kind of problem is called an Initial Value Problem and is very well studied in mathematics. A fundamental theorem in this kind of problem is the Picard-Lindelöf theorem (or Cauchy-Lipschitz theorem as I was taught). This theorem tells us that if f(z(t), t) meets certains conditions, the future states are unique for a given time.

It’s all well and good but what do ODE have to do with neural networks ?

That’s what you might be wondering at this point, and the authors address that question at the very beginning of their work but understanding that link requires understand both ODE and residual neural networks as described above.

To solve ODE, there are two methods: analytically and numerically. The former is what you’re usually taught in college, you use calculus and arithemtic tricks to get an analytical solution to the equation that you can then write on you test to get points. This usually ressemble z(t) = e^u(A\cos(vt) + B\sin(vt)), but sometimes reality is messy and maths can’t describe it exactly in a few complicated symbols. In that case we’re forced to use the latter: we compute a numerical solution that is essentially a list of values for a list of given times you want to know the value of z(t) at. To do that, we use solvers that use linear algebra (matrices and stuff) to simulate the passing of time from you initial condition.

The connection between that and neural network is the following:

z(t+1) = z(t) + f(z(t), \theta(t)) \\ \Leftrightarrow z(t+1) - z(t) = f(z(t), \theta(t)) \\ \approx \dfrac{dz(t)}{dt} = f(z(t), \theta(t))

Here, the derivative arises from its definition (the difference quotient):

\dfrac{dz(t)}{dt} = \lim_{\varepsilon\to 0}\frac{z(t+\varepsilon) - z(t)}{\varepsilon}

If we consider an approximation of \varepsilon = 1, then we get the Residual Neural Network’s formula where \frac{z(t+\varepsilon) - z(t)}{\varepsilon} = f(z(t), \theta(t)) but when \varepsilon \to 0, we get the derivative. We’re therefore left with a differential equation !

But the careful reader will notice the missing t in the function’s definition. Indeed, our layer still doesn’t know what time it is, so our derivative can’t change with respect to time. We can therefore give t to f(z(t), \theta(t)) to get the ODE that appears in the papers’ first page:

\dfrac{dz(t)}{dt} = f(z(t), t, \theta(t))

We changed h_t and \theta_t to z(t) and \theta(t) to signify that they’re not values at discrete timesteps t \in [\![0,T]\!] \subset \mathbb{N} but in a continuous range t \in [0 ,T] \subset \mathbb{R}.

This change of step from \varepsilon = 1 to \varepsilon \to 0 is illustrated in the paper’s Figure 1 :

NeuralODE figure 1 discrete vs continue

Figure 1: Left: A Residual network defines a discrete sequence of finite transformations. Right: A ODE network defines a vector field, which continuously transforms the state. Both: Circles represent evaluation locations.

With that realisation under our belt, we can start to devise strategies to optimise the way we train such neural networks.

How to get the gradients ?

To perform gradient descent, we need two gradients: \dfrac{\partial L}{\partial z(t_0)} and \dfrac{\partial L}{\partial \theta(t_0)}. Normally, you’d back propagate through the ODE solver’s operations to know how to tune \theta, but the authors have a better idea: let’s use a 1962 Lenin Prize of Science and Technology winning technique to compute it with constant memory cost and linear complexity !

Here is what they suggest. Suppose we have 3 values:

\left[\begin{matrix} z(t_1) & \dfrac{\partial L}{\partial z(t_1)} & \dfrac{\partial L}{\partial \theta(t_1)}\end{matrix}\right] (eventhough \dfrac{\partial L}{\partial \theta(t_1)} = 0, c.f. Appendix B.2 between eq 50 and 51)

Since we know how z(t) evolves in time (f(z(t), t, \theta)), and how \dfrac{\partial L}{\partial z(t)} and \dfrac{\partial L}{\partial \theta(t)} evolve in time too (you will see how we found that later I promise), we can give an ODE solver those 3 initial conditions (that we obtained at the end of the forward pass) and those 3 derivatives and ask it to compute back in time what these 3 quantities were at t_0, that is:

\left[\begin{matrix} z(t_0) & \dfrac{\partial L}{\partial z(t_0)} & \dfrac{\partial L}{\partial \theta(t_0)}\end{matrix}\right]

The adjoint method defines a function of time \mathbf{a}(t) = \dfrac{\partial L}{\partial z(t)} and the authors prove in Appendix B.1 that its derivative is given by:

\dfrac{d\mathbf{a}(t)}{dt} = - \mathbf {a}(t)^{\top}\frac{\partial f(z(t), t, \theta (t))}{\partial z}

The authors also generalises the proof in Appendix B.2 to define \mathbf{a_\theta}(t) = \dfrac{\partial L}{\partial \theta(t)} with derivative \dfrac{d\mathbf{a_\theta}(t)}{dt} = - \mathbf {a}(t)^{\top}\frac{\partial f(z(t), t, \theta (t))}{\bm{\partial \theta}}

In which \mathbf{a}(t) is still the same function \mathbf{a}(t) = \dfrac{\partial L}{\partial z(t)} (this and the above derivative puzzled me at first, here is the proof of appendix B.2 rewritten for clarity ) !

I won’t venture into the Contiunous Normalizing flow part of the paper here since it’d require an article of its own but there are some quite heavy maths involved as well. They are mostly well explained on this blog article series.

The most challenging part of these last two derivative are that \frac{\partial f(z(t), t, \theta (t))}{\partial z(t)} and \frac{\partial f(z(t), t, \theta (t))}{\partial \theta(t)} since both represent Jacobian Matrices.

Jacobian matrices are what link differential calculus and linear algebra together: they allow us to locally characterise the behaviour of a vector valued function, i.e. know what linear transformation is applied at each point in space, how each change in value along one input dimension affects each and every value on the output dimensions. The mathematical expression of the jacobian matrix is:

\frac { \partial f(z(t),t,\theta (t)) }{ \partial z(t) } =\left[ \begin{array}{ccc} { \frac { \partial f(z(t),t,\theta (t))_{ 1 } }{ \partial z(t)_{ 1 } } } & { \cdots } & { \frac { \partial f(z(t),t,\theta (t))_{ 1 } }{ \partial z(t)_{ { n } } } } \\ { \vdots } & { \ddots } & { \vdots } \\ { \frac { \partial f(z(t),t,\theta (t))_{ m } }{ \partial z(t)_{ { 1 } } } } & { \cdots } & { \frac { \partial f(z(t),t,\theta (t))_{ m } }{ \partial z(t)_{ { n } } } } \end{array} \right] \in \mathbb{R^{m \times n}}

This is only a concern for vector valued function such as here since
f : \mathbb{R}^{n_z} \times \mathbb{R} \times \mathbb{R}^{n_\theta} \to \mathbb{R}^{n_z}

Now that you’re familliar with jacobian matrices, you understand why -\mathbf {a}(t)^{\top}\frac{\partial f(z(t), t, \theta (t))}{\partial z} is refered in the paper as a Vector-Jacobian Product (named vjp in the code). More information can be found on this amazing blog post.


Ok, first things first, the authors provide two things for us right off the bat: a pseudo-code example (Appendix C) detailing the high level working of their code. Secondly, in Appendix D they also give us a rather large code snippet (implemented on Autograd in Python) which runs an ODE solver backward in time starting from a given augmented state (a vector) which contains \left[\begin{matrix} z(t_1) & \dfrac{\partial L}{\partial z(t_1)} & \dfrac{\partial L}{\partial \theta(t_1)}\end{matrix}\right] and return a list of vectors of gradients at the requested timesteps \left[\begin{matrix} \dfrac{\partial L}{\partial z(t_0)} & \dfrac{\partial L}{\partial \theta(t_0)}\end{matrix}\right]

Lastly and more importantly, the authors actively maintain a GitHub repo which host a Python package which makes using the adjoint method onto an arbitrary function f a breeze (as long as you’re comfortable using PyTorch, but it’s not really hard to pick up).

To those already familliar with that framework, to use the torchdiffeq library in your project, you simply need to define a new torch.nn.Module class and implement the forward method which ought to take two arguments: the first is the time t at which the solver evaluates the function, the second is the state (z(t)) for which the solver wants to know the derivative. Then, you can take these two inputs, combine them as you please and return something that will learn to model your ODE.

To train it, you can use torchdiffeq‘s odeint function in another module’s forward method and return its output as the Module’s output. Then, in the main training loop, just use any old optimiser and loss function you like and when you call loss.backward() pytorch’s autograd will call torchdiffeq‘s .backward() overloading in the process and the magic will happen behind the scene and the adjoint method will automatically compute and return its gradient which can then be chained with other traditional layers from torch.nn. All of this will happen several times per second on high ends GPUs and your ODE Model will train to fit the actual ODE (described only partially by your data).

To get familiar with the usage of the torchdiffeq library, I’d suggest you to read through and rewrite from the the ground up their example as it covers the basic usage of the library’s odeint function. Don’t hesitate to also look in the library’s workings itself too !


Grad-CAM is a method dating back to October 2016 that helps explain why Convolutional Neural Network (CNN) based models perform as they do. This revolutionary technology defies the efficiency/explainability trade-off and allows the developer to verify what feature the model pays attention to.


When classifying, CNN-based models like VGG perform a series of convolutions operation, that is, it hovers a number of filters (or kernels) over the entire input image and multiply each pixel value by each value in the kernel to produce a (usually) smaller image.

This operation is repeated with different filters for each convolution layer in the model and then the convolved image is pooled and normalized to obtain an n-dimensional vector called y. This vector is then fed to a fully connected neural network which task is to point out what class (dog, cat, lama etc.) this is matching to.

Architecture of VGG16

Let’s say we want to know what the convolution layer A “looks at” in the input image when it classifies it as belonging to class c.

Exploring the gradient

First we need to know what impact each of the d filters in A have on our class prediction y^c. To do that, we compute the gradient of y^c with respect to each component in the k-th filter of our convolutional layer (A^k). For the filter component at (i,j), the gradient is:

\dfrac{\partial y^c}{\partial A^{k}_{i,j}}

This could be represented as the sentence

” What does a small change in the (i, j) component of my k-th filter does on my prediction y^c ? “

How much each filter matter

That is therefore a direct measurement of the role of each component feature in the k-th feature map of A. We can average these mesurements to get the importance of the k-th feature map by doing a Global Average Pooling (i.e. an average over (i, j) ).
We therefore get the scalar \alpha_k^c:

\alpha_k^c = \frac{1}{n\times m}\sum^n_{i=1}\sum^m_{j=1}\dfrac{\partial y^c}{\partial A^{k}_{i,j}}

Displaying their importance

Then, the heatmap indicating where the network “looks” (called L_{GradCAM}^c, for localisation map) can be obtain by weighing each k-th feature map’s values by these \alpha_k^c coefficients and passing them through a ReLU activation function (to get rid of negative values):

L_{GradCAM}^c = ReLU\left(\sum^d_{k=1}\alpha_k^c A^k\right)

We can then simply scale up that map of importance thanks to a rather nice property of convolutional networks: filters work locally. This means that the location of detected features is conserved through each layer of our CNN.

After scaling up and superimposing L_{GradCAM}^c to our initial picture, we get:


You can download the working implementation from this repository by running:

git clone
pip3 install -r GradCAM-Keras/requirements.txt
cd GradCAM-Keras

The actual code can be found in, let’s look at it:

First we need to load the input image file into memory, resize it to (224, 224) since that’s what the model expects as input.
We also need to preprocess it, in this case we’re using a VGG19 model pretrained on ImageNet. This means the training images were preprocessed by subtracting the mean of each colour channel. Hopefully Keras takes care of that for us, so we can just use the preprocess_input function. Before doing that, we also need to add a dimension to the image: the batch dimension. Indeed, the model expects to run on a batch of images, however we only got one image here, so we need to reshape it from (224, 224, 3) to (1, 224, 224, 3) using the np.expand_dims function.

image = np.array(load_img(args.input, target_size=(224, 224)), dtype=np.uint8)
image_processed = preprocess_input(np.expand_dims(image, axis=0))

Once our input is treated, we can instantiate the model and run it on the image:

model = VGG19(include_top=True, input_shape=(224, 224, 3))
prediction = model.predict(image_processed)
predicted_class = np.argmax(prediction)
predicted_class_name = decode_predictions(prediction, top=1)[0][0][1]

Specifying the include_top flag tells Keras we want the model to predict classes and not just extract features from our image. This is what constraints the image shape to (224, 224, 3), since the fully connected network expects a certain fixed shape. Otherwise we could just run the series of convolution upon any image shape considering the filters for the convolutions don’t need to be changed.

We then run the model onto the image and look which class is most activated (predicted_class is our c from earlier), we also get the class label to display it later (can be “llama”, “sunglasses” etc).
We then extract the tensors we’re interested in from the model:

y_c_tensor = model.output[0, predicted_class]
A_tensor = model.get_layer(args.layer).output
gradient_tensor = K.gradients(y_c_tensor, A_tensor)[0]
run_graph = K.function([model.input], [A_tensor, gradient_tensor])

In TensorFlow, models are defined by their static graph (tensors linked with operations) first and then the input values “flow” into it (hence the name “tensor flow”), this mean that all the objects we’re manipulating don’t yet contain any value, but will eventually when we run the model. That’s why I called every tensor object with the _tensor suffix.

Here, we link our model’s input (the image) to the tensors we want it to output for us, namely the gradient of the output w.r.t. the filters’ output, as well as the filters’ output itself.

A, gradient = run_graph([image_processed])
A, gradient = A[0], gradient[0]  # Gets the result of the first batch dimension

We can make our input image flow through this new function run_graph and get the actual values. Notice the _tensor suffix is now gone since those are not tensors placeholders but outputted values.

Then, we can compute our \vec{\alpha^c} vector by taking the mean value of each filter’s output:

alpha_c = np.mean(gradient, axis=(0, 1))

Finally, we need to compute the dot product of vector alpha_c and the tensor A:

L_c =, alpha_c)

And there we go ! However that L_c isn’t at the same resolution as our image: since we use 'valid' padding, the width and height diminished gradually after each convolution, we therefore need to upscale it a bit by zooming it by a factor of  224/L_c.shape[0] to end up with a (224, 224) grayscale image.

Using a bit of SciPy and Matplotlib magic we can output the superimposed heat map:

L_c = zoom(L_c, 224/L_c.shape[0])

plt.subplots(nrows=1, ncols=2, dpi=160, figsize=(7, 4))
plt.subplots_adjust(left=0.01, bottom=0.0, right=0.99, top=0.96, wspace=0.11, hspace=0.2)
plt.title("Original image")
plt.title("{}th dimension ({}) \nw.r.t layer {}".format(predicted_class, predicted_class_name, args.layer))
plt.imshow(L_c, alpha=0.5, cmap="jet")


What is a Convolution Neural Network

A Convolutional Neural Network, or CNN is a type of neural network that applies a series of convolutions onto an input image to produce an output image. However, contrary to more classic image filtering techniques the coefficients of the filters (or kernels) applied onto the image can be tuned using gradient descent or any other optimisation algorithm.


To perform a convolution, a CNN hovers a number of filters (below in yellow) over the entire input image (below in green) and multiply each pixel value by each value in the kernel to produce a (usually) smaller image (below in pink) as shown below:


The filters are moved with a given stride in each direction, here 1×1 which is classic, but to accelerate the convolution it can move in stride of 2×2 etc. This operation is repeated with different filters for each convolution layer in the model.

The number of trainable parameter N can be computed using the following formulae:

N = K_1 \times K_2 \times F \times C + F

For a convolutional layer with F filters of kernel shape K_1 \times K_2 applied to an input of shape (h, w, C).

Hence, if the RGB image in the beginning was of shape (200, 200, 3), after a 10 filters convolution with kernel size 5×5 and stride 1, the shape will be (196, 196, 10) and that layer will have 5\times5\times10\times3 + 10 = 760 trainable parameters (the filter have shape (5, 5, 3) ) as demonstrated in Keras:

from tensorflow.python import keras
from keras.models import Sequential
from keras.layers import Conv2D

    Conv2D(input_shape=(200, 200, 3), filters=10, kernel_size=5, strides=1)
Layer (type)                 Output Shape              Param #
conv2d_2 (Conv2D)            (None, 196, 196, 10)      760
Total params: 760
Trainable params: 760
Non-trainable params: 0


Using François Chollet’s Keras framework, a convolutional layer can be used in a model using the Conv2D class.
Its constructor with the most common arguments is as follow:

keras.layers.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', activation=None)


  • filters is the number of filter to apply on with this layer (i.e. F)
  • kernel_size is either a tuple of integers or an integer specifying the size of the kernel to hover, this has to be odd numbers since the kernel’s centre will determine the position of the output value in the output image.
  • strides is either a tuple of integers or an integer specifying the pace at which the kernel should hover
  • padding is a string with either 'valid' or 'same', defaults to 'valid'.
    • 'valid' padding in Keras (and TensorFlow) means no padding, i.e. the kernel will stop hovering near the borders to avoid falling over the edge and have missing values. This means that the border pixel wont have a convolution value and hence that the output is a tad smaller in height and width (in the above example, the padding was 'valid' and so the 200×200 image became 196×196).
    • 'same' padding means padding the edge with zeros so that the output image has the same width and height as the input image (hence the name). However this is not the default in Keras as it means the border get corrupted with fake data.
  • activation is either a string like 'tanh', 'relu', 'softmax' etc (list here) or an object of type keras.activations. By default this will be the linear activation function f(x)=x. Each output value of the layer will be passed through this function before being passed to the next layer.