On the type of the gradient
We focus on differentiating a scalar-valued function f :: V -> R, where V is a vector space over the real numbers R.
Reverse mode automatic differentiation computes gradients for all intermediate variables. The gradient is usually given the same type as the variable itself. This is the obvious thing to do when you see all variables as numbers or arrays of numbers. Yet we can do better by making good use of the type system.
Mixing variables and gradients generally makes no sense. Code is more readable and easier to write if we give them different types. As we flip the edges of the computational graph, variables and gradients effectively swap roles, which would be incredibly confusing if the compiler didn't help track which is which.
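As a minimal sketch of such a separation (the names here are hypothetical, not from any library), a newtype wrapper is enough to let the compiler keep the two apart:

newtype Grad v = Grad v

-- A backpropagation step for z = x * y now advertises in its type
-- that it turns the gradient of the output into gradients of the inputs.
backMul :: Double -> Double -> Grad Double -> (Grad Double, Grad Double)
backMul x y (Grad d) = (Grad (d * y), Grad (d * x))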
Dimensional analysis
Units make a great example of why we might want to keep a distinction between variables and their gradients. Take an expression computing the mass ratio of a rocket as an example:
dry, fuel :: Kg
ratio = (dry+fuel) / dry :: R
Here dry is the mass of the empty rocket and fuel is the mass of the fuel, both measured in kilograms. The mass ratio, as well as its gradient, is a dimensionless real number. The example was deliberately chosen so that the output is dimensionless: any units appearing in the gradients then come from the inputs alone, keeping the argument below from being circular.
Let's compute the gradients of dry and fuel, starting backpropagation from ratio:
d_ratio = 1 :: R
d_fuel = 1/dry :: Kg^(-1)
d_dry = -fuel/dry^2 :: Kg^(-1)
See? Variables have units of \(\mathrm{kg}\), while their gradients are measured in \(\frac{1}{\mathrm{kg}}\). That's because variables are contravariant vectors, while gradients are covariant vectors; their units are inverses of each other.
Superficial Hilbert space
Use of the same type V for both variables and gradients is typically justified by the Riesz representation theorem.
The gradient of a variable x :: V is fundamentally a linear function of type V -> R. Of course, such a representation is unworkable in practice: we need to store gradients as numbers and actually compute sums during reverse accumulation. Otherwise we run into a combinatorial explosion of repeated computations.
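To see the explosion, here is a toy comparison (hypothetical code, not from any library): a chain of n nodes, where each node contributes to the next one twice.

-- Gradient kept as a function: every use re-runs the whole chain
-- below it, so the backward pass makes 2^n recursive calls.
gradAsFunction :: Int -> (Double -> Double)
gradAsFunction 0 = id
gradAsFunction n = \d -> gradAsFunction (n - 1) d + gradAsFunction (n - 1) d

-- Gradient accumulated as a number: the two contributions are summed
-- before moving on, so each node is visited once, linear in n.
gradAsNumber :: Int -> Double -> Double
gradAsNumber 0 d = d
gradAsNumber n d = gradAsNumber (n - 1) (d + d)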
Now if V happens to be a Hilbert space, then V -> R is isomorphic to V.
For our purposes the Hilbert space requirement boils down to the existence of a well-behaved inner product (<.>) :: V -> V -> R. This seems a benign requirement: any finite-dimensional vector space over the real numbers can be equipped with one.
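Concretely, here is a sketch of the isomorphism for V = (Double, Double), assuming the InnerSpace instances that the vector-space package provides for Doubles and pairs: a linear functional is represented by the vector of its values on a basis.

import Data.VectorSpace ((<.>))

-- Riesz representation for pairs of Doubles: evaluate the functional
-- on a basis to recover the representing vector.
represent :: ((Double, Double) -> Double) -> (Double, Double)
represent l = (l (1, 0), l (0, 1))

-- For any linear l and any v, l v == represent l <.> v (up to floating
-- point), so V -> R and V can be used interchangeably here.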
We have a solid theory behind using the type V for gradients. Nonetheless, we saw that the units don't match. How come?
It turns out we can't take the inner product for granted. A record with a variety of units shows this clearly:
data Rocket = Rocket
  { rocketMass :: Kg
  , rocketVelocity :: Meter/Second
  }

-- InnerSpace comes from the vector-space package.
instance InnerSpace Rocket where
  Rocket m1 v1 <.> Rocket m2 v2 = (m1 <.> m2) ^+^ (v1 <.> v2)
We are adding \(\frac{\mathrm{m}^2}{\mathrm{s}^2}\) to \(\mathrm{kg}^2\) and expecting to get a dimensionless scalar! It is possible to provide an InnerSpace Rocket instance (all we need to do is drop the units), but we'd very much rather not. An isomorphism grounded on such an instance is akin to an isomorphism between kilograms and meters: it does indeed exist, but we wouldn't want to invoke it automatically.
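For the record, the units-dropping instance would look something like this sketch (unKg and unMps are hypothetical unwrappers down to plain Doubles, with superclass instances elided as above):

instance InnerSpace Rocket where
  Rocket m1 v1 <.> Rocket m2 v2 =
    unKg m1 * unKg m2 + unMps v1 * unMps v2  -- units silently discarded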
Gradient descent
Having established the rule that variables and gradients must not be mixed, we see that gradient descent violates it outright: $$ \mathbf{a}_{n+1} = \mathbf{a}_n-\gamma\nabla F(\mathbf{a}_n) $$
We need a metric tensor to relate gradients and variables here. It makes sense when you think about it: gradient descent moves in the direction of steepest descent, and the ability to measure distances is needed to make any sense of steepness.
The metric tensor brings different units together into commensurable quantities. It also plays the role of a preconditioner. Therefore, we opt to pass the metric tensor explicitly instead of demanding the existence of a canonical one via a Hilbert space.
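A sketch of what such a step could look like (all names here are hypothetical): the caller supplies the inverse metric, which maps a gradient back into the variable space and doubles as a preconditioner.

newtype Grad v = Grad v  -- same wrapper as in the first sketch

descentStep
  :: (Grad (Double, Double) -> (Double, Double))  -- inverse metric
  -> Double                                       -- step size gamma
  -> (Double, Double)                             -- current point
  -> Grad (Double, Double)                        -- gradient at that point
  -> (Double, Double)                             -- next point
descentStep invMetric gamma (x1, x2) g =
  let (d1, d2) = invMetric g
  in (x1 - gamma * d1, x2 - gamma * d2)

-- Example: a diagonal inverse metric, scaling each coordinate so that
-- both the units and the conditioning work out.
diagonal :: (Double, Double) -> Grad (Double, Double) -> (Double, Double)
diagonal (m1, m2) (Grad (g1, g2)) = (m1 * g1, m2 * g2)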