(eq1) $a_{n+1} = a_n - \mu \frac{F' (a_n) }{F'' (a_n) }$
The cost function at the (n+1)-th step is
(eq2) $F(a_{n+1}) = F(a_{n}) - \mu \frac{F' (a_n) }{F'' (a_n) } F' (a_n) + O(\mu^2) $
(eq3) $F(a_{n+1}) = F(a_{n}) - \mu \frac{F' (a_n) ^2}{F'' (a_n) } + O(\mu^2) $
The error in eq2 is second order in the learning rate, and eq3 shows that the Newton-Raphson method reduces the cost at every step (provided $F''(a_n) > 0$), as long as the learning rate and the initial guess are chosen appropriately.
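To make this concrete, here is a minimal Python sketch of the damped update in eq1 on a toy one-parameter cost. The cost function, learning rate, and starting point are illustrative choices only, not from my experiments:

```python
# Toy one-parameter cost F(a) = (a - 3)^4 with its exact derivatives
# (the cost, learning rate and starting point are illustrative choices).
F   = lambda a: (a - 3.0) ** 4
dF  = lambda a: 4.0 * (a - 3.0) ** 3
d2F = lambda a: 12.0 * (a - 3.0) ** 2

a, mu = 0.0, 0.5
for n in range(20):
    a -= mu * dF(a) / d2F(a)               # eq1: damped Newton-Raphson step
    print(n, a, F(a))                      # F(a) shrinks every step, as eq3 predicts
```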
I am writing this post to investigate what errors result from approximating the derivatives in (eq1) with Taylor-series approximations. With approximate derivatives, the Newton-Raphson update rule becomes
(eq5) $a_{n+1} = a_n - \mu \frac{\bar{F}' (a_n) }{\bar{F}'' (a_n) }$.
As before, let h quantify the perturbation (step) size. The first- and second-derivative approximations then carry the following errors,
(eq6) $\bar{F}'(a_{n}) = F'(a_{n}) + \frac{ha_n^2}{2}F''(a_n) + O(h^2) $
(eq7) $\bar{F}''(a_{n}) = F''(a_n) + O(h^2) $
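For reference, one common way to form such estimates is with finite differences. Below is a small sketch using an absolute perturbation h; the scaling convention of the perturbation can change the exact form of the leading error term in eq6, so treat this only as an illustration:

```python
def fd_first(F, a, h):
    """Forward-difference estimate of F'(a); the leading error is about (h/2)*F''(a)."""
    return (F(a + h) - F(a)) / h

def fd_second(F, a, h):
    """Central second-difference estimate of F''(a); the error is O(h^2)."""
    return (F(a + h) - 2.0 * F(a) + F(a - h)) / h ** 2

# Quick check on F(a) = a^3 at a = 2 (exact F' = 12, F'' = 12):
F = lambda a: a ** 3
print(fd_first(F, 2.0, 1e-3), fd_second(F, 2.0, 1e-3))
```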
Neglecting the second-order terms and substituting these derivative estimates into eq3,
(eq8) $F(a_{n+1}) = F(a_{n}) - \mu \frac{F'(a_{n}) + \frac{ha_n^2}{2}F''(a_n) + O(h^2)}{F''(a_n)} F'(a_n) + O(\mu^2)$.
Expanding the fraction and again neglecting the second-order terms,
(eq9) $F(a_{n+1}) = F(a_{n}) - \mu \frac{F'(a_{n})^2}{F''(a_n)} - \frac{\mu ha_n^2}{2} F'(a_n) + O(h^2) + O(\mu^2)$.
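To get a feel for the size of that extra term, here is a small numerical check (a sketch reusing the toy cost from above; the values of a, $\mu$ and h are arbitrary). For a fixed $\mu$, the gap between the exact and the approximate step shrinks roughly in proportion to h, consistent with the $O(\mu h)$ term in eq9:

```python
F   = lambda a: (a - 3.0) ** 4             # same toy cost as above (illustrative)
dF  = lambda a: 4.0 * (a - 3.0) ** 3
d2F = lambda a: 12.0 * (a - 3.0) ** 2

a, mu = 1.0, 0.01
exact = F(a - mu * dF(a) / d2F(a))          # cost after one exact damped Newton step

for h in (1e-1, 1e-2, 1e-3):
    g = (F(a + h) - F(a)) / h                            # approximate F'(a)
    H = (F(a + h) - 2.0 * F(a) + F(a - h)) / h ** 2      # approximate F''(a)
    approx = F(a - mu * g / H)                           # cost after one approximate step
    print(h, approx - exact)                 # gap shrinks roughly in proportion to h
```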
Again, choosing
$\mu = ch$
for $0 < c < 1$ makes the extra error second order in the learning rate: with $\mu = ch$, the additional term $\frac{\mu h a_n^2}{2}F'(a_n)$ in eq9 becomes $\frac{c h^2 a_n^2}{2}F'(a_n)$. I would recommend choosing c as close to zero as practical, because the Newton-Raphson method tolerates a fairly large learning rate, while the Taylor-series approximation of the derivatives does not. Choosing a very small c (0.0001, for example) gives stable solutions, and all the usual adaptive methods for adjusting the learning rate can still be applied. I tested this on exponential fitting and it gave solutions much faster than standard gradient descent. The next post will be about using this for the conjugate-gradient method. Although I love it and am a fan of smart algorithms, I am not looking forward to it. Another issue is that all these checks were done for one-parameter problems. Numerically they apply to multi-parameter problems too; the next step, however, is to check under what conditions approximations using these methods fail.
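As an end-to-end sketch, here is roughly what the full recipe looks like on a synthetic exponential-fitting problem. The data, model, and every constant below are made up for illustration (and c is taken larger than I recommend above, just so the toy example converges in a few thousand iterations):

```python
import numpy as np

# Synthetic 1-parameter exponential fit: model exp(-a*x), least-squares cost.
# (The data and all constants below are illustrative, not from my experiments.)
x = np.linspace(0.0, 4.0, 50)
y = np.exp(-0.7 * x)                        # fake data with true a = 0.7

def F(a):
    r = np.exp(-a * x) - y
    return np.sum(r ** 2)

h  = 0.01                                   # perturbation (step) size for the derivative estimates
c  = 0.5                                    # larger than recommended above, to keep the demo short
mu = c * h                                  # learning rate mu = c*h

a = 0.0                                     # initial guess
for _ in range(5000):
    g = (F(a + h) - F(a)) / h                            # forward-difference estimate of F'(a) (cf. eq6)
    H = (F(a + h) - 2.0 * F(a) + F(a - h)) / h ** 2      # central estimate of F''(a) (cf. eq7)
    a -= mu * g / H                                      # eq5: approximate damped Newton step
print(a)                                    # ends up near 0.7, up to an O(h) bias from the forward difference
```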