ReQU might not be the activation you want

The book The Principles of Deep Learning Theory proposes in chapter 5 a method to study the quality of activation functions. The network they study is a fully connected network. Their working assumption is that a desirable property for a network is that the covariance of its activations remains constant through the network, neither exploding nor imploding. This desideratum lets them use a fixed-point analysis to study how well a given activation encourages such representations.

To understand things better, I like to reproduce the steps of their methodology with a small twist that forces me to read more carefully. Here I follow their methodology using an activation they don't use in the book, chosen to be simple enough that I know it won't make the math unbearably harder. ReLU squared, known as ReQU, is defined as $\sigma(z) = \max (z, 0)^2$. Equation 5.10 from the book implies a parallel susceptibility of

\[\begin{align*}
\chi_{\parallel}(K) &= C_W\frac{d}{dK}\langle\sigma(z)\sigma(z)\rangle_K\\
&= C_W \frac{d}{dK}\Big[\frac{1}{\sqrt{2\pi K}}\int_{-\infty}^{\infty}dz\, e^{-\frac{z^2}{2K}}\sigma(z)\sigma(z)\Big]\\
&= C_W \frac{d}{dK}\Big[\frac{1}{\sqrt{2\pi K}}\int_{0}^{\infty}dz\, e^{-\frac{z^2}{2K}}z^4\Big]\\
&= C_W \frac{d}{dK}\Big[\frac{1}{\sqrt{2\pi K}}\Big[ -e^{-\frac{z^2}{2 K}} K z (3 K + z^2) + 3 K^{5/2} \sqrt{\frac{\pi}{2}}\, \operatorname{erf}\Big(\frac{z}{\sqrt{2K}}\Big) \Big]_{0}^{\infty}\Big]\\
&= C_W \frac{d}{dK}\Big[\frac{1}{\sqrt{2\pi K}}\Big[ 3 K^{5/2} \sqrt{\frac{\pi}{2}}\, \operatorname{erf}\Big(\frac{z}{\sqrt{2K}}\Big) \Big]_{0}^{\infty}\Big]\\
&= C_W \frac{d}{dK}\Big[\frac{1}{\sqrt{2\pi K}}\, 3 K^{5/2} \sqrt{\frac{\pi}{2}} \Big] = \frac{3}{2}C_W \frac{d}{dK}\Big[K^{2}\Big] = 3C_W K
\end{align*}\]
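As a sanity check on this algebra, here is a small numerical sketch (mine, not from the book; the helper names are my own) that estimates the Gaussian expectation by quadrature and differentiates it by finite differences, which should reproduce $\langle\sigma(z)\sigma(z)\rangle_K = \frac{3}{2}K^2$ and $\chi_{\parallel}(K) = 3C_WK$:

```python
import numpy as np
from scipy.integrate import quad

def requ(z):
    # ReLU squared: sigma(z) = max(z, 0)^2
    return np.maximum(z, 0.0) ** 2

def gaussian_expectation(f, K):
    # <f(z)>_K for z ~ N(0, K), by numerical quadrature
    integrand = lambda z: np.exp(-z**2 / (2 * K)) / np.sqrt(2 * np.pi * K) * f(z)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

def g(K):
    # g(K) = <sigma(z) sigma(z)>_K, which should equal (3/2) K^2
    return gaussian_expectation(lambda z: requ(z) * requ(z), K)

C_W, eps = 1.0, 1e-4
for K in [0.25, 1.0, 2.0]:
    chi_par = C_W * (g(K + eps) - g(K - eps)) / (2 * eps)  # C_W d/dK g(K)
    print(f"K={K}: g(K)={g(K):.4f} vs 1.5*K^2={1.5 * K**2:.4f}, "
          f"chi_par={chi_par:.4f} vs 3*C_W*K={3 * C_W * K:.4f}")
```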

The perpendicular susceptibility is given by equation 5.51, taking the derivative of ReQU to be $\sigma'(z) = \mathrm{ReQU}'(z) = 2\,\mathrm{ReLU}(z)$:

\[\begin{align*}
\chi_{\bot}(K) &= C_W\langle\sigma'(z)\sigma'(z)\rangle_K\\
&= C_W \Big[\frac{1}{\sqrt{2\pi K}}\int_{-\infty}^{\infty}dz\, e^{-\frac{z^2}{2K}}\sigma'(z)\sigma'(z)\Big]\\
&= C_W \Big[\frac{4}{\sqrt{2\pi K}}\int_{0}^{\infty}dz\, e^{-\frac{z^2}{2K}}z^2\Big]\\
&= C_W \Big[\frac{4}{\sqrt{2\pi K}}\Big[ -e^{-\frac{z^2}{2 K}} K z + K^{\frac{3}{2}} \sqrt{\frac{\pi}{2}}\, \operatorname{erf}\Big(\frac{z}{\sqrt{2K}}\Big) \Big]_{0}^{\infty}\Big]\\
&= C_W \Big[\frac{4}{\sqrt{2\pi K}}\, K^{\frac{3}{2}} \sqrt{\frac{\pi}{2}} \Big] = 2C_W K
\end{align*}\]

and $h(K) = \frac{1}{2}\frac{d}{dK}\chi_{\bot}(K) = C_W$.
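The same kind of numerical check (again my own sketch, reusing `gaussian_expectation` from the snippet above) should agree with $\chi_{\bot}(K) = 2C_WK$:

```python
def requ_prime(z):
    # Derivative of ReQU: sigma'(z) = 2 * max(z, 0) = 2 * ReLU(z)
    return 2.0 * np.maximum(z, 0.0)

C_W = 1.0
for K in [0.25, 1.0, 2.0]:
    chi_perp = C_W * gaussian_expectation(lambda z: requ_prime(z) ** 2, K)
    print(f"K={K}: chi_perp={chi_perp:.4f} vs 2*C_W*K={2 * C_W * K:.4f}")
```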

Since their proposed definition of criticality holds when

\[\begin{align*} \chi_{\parallel}(K) =1 \quad \quad \quad \quad \chi_{\bot}(K) =1 \end{align*}\]

as far as I understand, ReQU cannot be held at criticality with constant choices of $C_W$ and $C_b$: since $\chi_{\parallel}(K) = 3C_WK$ and $\chi_{\bot}(K) = 2C_WK$, no single $C_W$ makes both equal to one at the same $K$. I think only activations that are asymptotically linear can be tuned to their definition of criticality.

Let’s try another take:

\[\begin{align*} g(K) = \langle\sigma(z)\sigma(z)\rangle_K = \frac{3}{2}K^2 \end{align*}\]

Plugging that into equation 5.7 gives

\[\begin{align*} K = C_b + C_W \frac{3}{2}K^2 \\ C_W \frac{3}{2}K^2 - K + C_b = 0 \\ K = \frac{1 \pm \sqrt{1-6C_WC_b}}{3C_W} \end{align*}\]
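Just to double-check the algebra, a tiny snippet (mine; the particular values of $C_W$ and $C_b$ are arbitrary) verifying that both roots satisfy the fixed-point condition:

```python
import numpy as np

C_W, C_b = 0.5, 0.1
disc = np.sqrt(1 - 6 * C_W * C_b)
for K_star in [(1 + disc) / (3 * C_W), (1 - disc) / (3 * C_W)]:
    # Each root should satisfy K = C_b + (3/2) * C_W * K^2
    print(K_star, np.isclose(K_star, C_b + 1.5 * C_W * K_star**2))
```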

But I don’t think that equation can hold for every $K$ without introducing a dependency of $C_W$ and $C_b$ on $K$. Still, a few observations: if $C_b=0$ and $C_W=1$, we get $K=\frac{3}{2}K^2$, with fixed points at $K=0$ and $K=\frac{2}{3}$. This wouldn’t maintain criticality for every choice of $K$, but $K$ can be controlled by normalizing the input data. Phrased differently, if we normalize the input data so that its variance is $K=1$, the optimal choice of $C_W$ and $C_b$ will have to satisfy

\[\begin{align*} 1 = C_b + C_W \frac{3}{2} \end{align*}\]

e.g. $C_b=0$ and $C_W=\frac{2}{3}$.
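As a rough illustration (again my own sketch, not from the book), iterating the layer recursion $K^{(\ell+1)} = C_b + \frac{3}{2}C_W\big(K^{(\ell)}\big)^2$ with $C_b=0$, $C_W=\frac{2}{3}$ shows that the variance stays put only when the input is normalized to $K=1$; nearby starting values shrink toward zero or blow up:

```python
def iterate_K(K0, C_W, C_b, depth=6):
    # Layer recursion for ReQU: K_{l+1} = C_b + (3/2) * C_W * K_l^2
    Ks = [K0]
    for _ in range(depth):
        Ks.append(C_b + 1.5 * C_W * Ks[-1] ** 2)
    return Ks

for K0 in [0.9, 1.0, 1.1]:
    traj = iterate_K(K0, C_W=2/3, C_b=0.0)
    print(f"K0={K0}: " + ", ".join(f"{k:.3f}" for k in traj))
```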

Written on October 2, 2021