Deep learning made easy with TERSE networks

Sebastiaan Vermeulen

EIPC 2020

Why deep learning in finance / econ?

  • More accurate predictions (Cao, Chen & Hull (2019); Gu & Kelly (2019); Kim (2019); Tashiro et al. (2019); ...)

Why not?

  • Models are hard or impossible to interpret or summarize
  • Architecture choices are arbitrary
  • Slow to train
  • No definition of optimality
  • And still, theoretically suboptimal

Econometrics: Can we recognize \(f\)?

Machine learning: Can we approximate \(f\)?

\[y=f(x)+\epsilon\]

\[\hat y=\hat f(x)\]

\[\mathbb E (\hat y - y)^2 = \mathbb E\epsilon^2 + \mathbb E (f(x) - \hat f(x))^2\]
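The decomposition above holds when \(\epsilon\) has mean zero and is independent of \(x\), so the cross term vanishes. A quick Monte Carlo check of this identity (the choices of \(f\), \(\hat f\), and the noise level are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-1, 1, n)
eps = rng.normal(0, 0.5, n)          # mean-zero noise, independent of x

f = lambda x: np.sin(3 * x)          # "true" regression function (illustrative)
fhat = lambda x: 3 * x - 4.5 * x**3  # some fixed approximation of f

y = f(x) + eps                       # y = f(x) + eps
yhat = fhat(x)                       # yhat = fhat(x)

lhs = np.mean((yhat - y) ** 2)                              # E(yhat - y)^2
rhs = np.mean(eps**2) + np.mean((f(x) - fhat(x)) ** 2)      # E eps^2 + E(f - fhat)^2
print(lhs, rhs)  # the two sides agree up to Monte Carlo error
```

The irreducible term \(\mathbb E\epsilon^2\) is the same for every estimator; machine learning only attacks the approximation term.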

Neural Networks

\[ h_1{( x_t)} = \begin{bmatrix} u\left(a_{11} + x_t' b_{11}\right) \\ \dots \\ u\left(a_{n_1 1} + x_t' b_{n_1 1}\right) \end{bmatrix} \quad h_k{( x_t)} = \begin{bmatrix} u\left({a_{1k} + h_{k-1}{( x_t)}}' b_{1k}\right) \\ \dots \\ u\left({a_{n_k k} + h_{k-1}{(x_t)}}' b_{n_k k}\right) \end{bmatrix} \quad y_t = h_k{( x_t)}'\beta \]
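The recursion above can be sketched directly in NumPy. The layer sizes, weights, and the ReLU choice of \(u\) below are arbitrary illustrations, not part of the definition:

```python
import numpy as np

def relu(z):
    # one common choice of activation u
    return np.maximum(z, 0.0)

def network(x, layers, beta, u=relu):
    """Evaluate y = h_k(x)' beta, where each layer computes
    h_k(x) = u(a_k + h_{k-1}(x)' B_k) elementwise and h_0(x) = x."""
    h = x
    for a, B in layers:   # a has shape (n_k,), B has shape (n_{k-1}, n_k)
        h = u(a + h @ B)
    return h @ beta       # final linear read-out

# A tiny 2-layer example with random parameters
rng = np.random.default_rng(1)
layers = [(rng.normal(size=4), rng.normal(size=(3, 4))),
          (rng.normal(size=5), rng.normal(size=(4, 5)))]
beta = rng.normal(size=5)
x = rng.normal(size=3)
print(network(x, layers, beta))  # a scalar prediction
```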

Activation functions

Partitioning the 'predictor space'

\[(x_2-x_1)^+ - (x_2-0.5)^+\]
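Each combination of active ReLU units defines one region on which the expression above is linear; counting the distinct activation patterns on a grid makes the partition visible (grid resolution is an arbitrary choice):

```python
import numpy as np

def g(x1, x2):
    # (x2 - x1)^+ - (x2 - 0.5)^+ : two ReLU units, one linear read-out
    return np.maximum(x2 - x1, 0.0) - np.maximum(x2 - 0.5, 0.0)

# The kink lines x2 = x1 and x2 = 0.5 split the unit square into regions;
# label each grid point by which of the two ReLUs is active there.
xs = np.linspace(0, 1, 101)
X1, X2 = np.meshgrid(xs, xs)
active = (X2 > X1).astype(int) * 2 + (X2 > 0.5).astype(int)
print(np.unique(active))  # distinct activation patterns = linear regions
```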

Redundancy in the network formula

  • Every node is scale-invariant
  • Every layer is permutation-invariant

\[ h_1{( x_t)} = \begin{bmatrix} u\left(a_{11} + x_t' b_{11}\right) \\ \dots \\ u\left(a_{n_1 1} + x_t' b_{n_1 1}\right) \end{bmatrix} \]\[ h_k{( x_t)} = \begin{bmatrix} u\left({a_{1k} + h_{k-1}{( x_t)}}' b_{1k}\right) \\ \dots \\ u\left({a_{n_k k} + h_{k-1}{(x_t)}}' b_{n_k k}\right) \end{bmatrix} \]\[ y_t = h_k{( x_t)}'\beta \]
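The scale invariance is easy to verify numerically: with a positively homogeneous activation such as ReLU, rescaling a node's weights by any \(c>0\) and its read-out weight by \(1/c\) leaves the output unchanged (all parameter values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 0.3, rng.normal(size=3)   # one node: u(a + x'b)
beta = 1.7                       # its read-out weight
x = rng.normal(size=3)
c = 5.0                          # any positive rescaling

relu = lambda z: np.maximum(z, 0.0)
y1 = relu(a + x @ b) * beta
y2 = relu(c * a + x @ (c * b)) * (beta / c)  # rescaled node, same output
print(y1, y2)
```

Together with permuting the nodes of a layer (and their weights), this means many parameter vectors encode the same function, so the fitted parameters are not identified.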

Example: Regions 'found' on the volatility surface

TERSEnet: devoid of superfluity

Fix the partitioning

Ensure every region has identifiable regression

TEsselated Regression Simplices Encoding network

TERSEnet is like a linear model:

\[ h_{1,i+dj}{( x_t)} = {(j - n{(x_{t,i})}^+)}^+ \]\[ h_{2,k}{( x_t)} = \left(1-\sum_{i=1}^d h_{1,i+dj_i}{( x_t)} \right)^+, \quad j_i = w(k,i) \]\[ \hat y_t = h_{2}{( x_t)}'\beta \]

(Regularized) least squares with OLS or LARS
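Once the partition is fixed, the output weights \(\beta\) solve a plain least-squares problem. The sketch below uses a hand-rolled one-dimensional ReLU spline basis as a stand-in for the TERSEnet encoding \(h_2\) (the knot placement and data-generating process are illustrative; the LARS selection step is described but not implemented here):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 500)

# Fixed piecewise-linear basis: one (x - knot)^+ column per knot.
knots = np.linspace(0, 1, 11)[:-1]
H = np.maximum(x[:, None] - knots[None, :], 0.0)

# With the basis fixed, fitting beta is ordinary least squares:
# a convex problem with a unique (pseudoinverse) solution.
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
resid = y - H @ beta
r2 = 1 - resid.var() / y.var()
print(r2)  # in-sample fit of the piecewise-linear model
```

A LARS pass over the same design matrix would instead add basis columns one at a time, keeping only the most useful regions.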

Performance (theoretical)

  • TERSEnet is parameter efficient for \(C^2\) functions: with \(W\) parameters, \[\|f-g\|_\infty=O(W^{-2/d})\]

  • TERSEnet parameters have a unique global optimum that is easy to compute

Performance (practical)

TERSEnet is

  • interpretable
  • robust
  • fast

What about high-dimensional data?

All approximation models suffer from curse of dimensionality

100 variables, 10 linear segments ⇒ \(10^{100}\) regression parameters
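The count above is just segments raised to the number of variables, with one regression coefficient per region:

```python
# One coefficient per region; regions grow as segments ** variables.
segments, variables = 10, 100
regions = segments ** variables
print(regions == 10 ** 100)  # True
```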

TERSEnet can be used with the LARS algorithm: keep only the most relevant regions

Time series needs work

Classification needs work