(eq1) $a_{n+1} = a_n - \mu \frac{F' (a_n) }{F'' (a_n) }$
The cost function at the (n+1)-th step is
(eq2) $F(a_{n+1}) = F(a_{n}) - \mu \frac{F' (a_n) }{F'' (a_n) } F' (a_n) + O(\mu^2) $
(eq3) $F(a_{n+1}) = F(a_{n}) - \mu \frac{F' (a_n) ^2}{F'' (a_n) } + O(\mu^2) $
The error in eq2 is second order in the learning rate, and eq3 shows that the Newton-Raphson method reduces the cost at every step (provided $F''(a_n) > 0$), as long as the learning rate and the initial guess are chosen appropriately.
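To make this concrete, here is a minimal Python sketch of the damped update in eq1 on a toy one-parameter cost. The cost function, learning rate, and starting point are illustrative choices only, not from my experiments:

```python
# Toy one-parameter cost F(a) = (a - 3)^4 with its exact derivatives
# (the cost, learning rate and starting point are illustrative choices).
F   = lambda a: (a - 3.0) ** 4
dF  = lambda a: 4.0 * (a - 3.0) ** 3
d2F = lambda a: 12.0 * (a - 3.0) ** 2

a, mu = 0.0, 0.5
for n in range(20):
    a -= mu * dF(a) / d2F(a)               # eq1: damped Newton-Raphson step
    print(n, a, F(a))                      # F(a) shrinks every step, as eq3 predicts
```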
I am writing this post to investigate what errors result from approximating the derivatives in (eq1) with Taylor-series approximations. With approximate derivatives, the Newton-Raphson update rule becomes
(eq5) $a_{n+1} = a_n - \mu \frac{\bar{F}' (a_n) }{\bar{F}'' (a_n) }$.
As before, let h quantify the perturbation (step) size. The first- and second-derivative approximations then carry the following errors,
(eq6) $\bar{F}'(a_{n}) = F'(a_{n}) + \frac{ha_n^2}{2}F''(a_n) + O(h^2) $
(eq7) $\bar{F}''(a_{n}) = F''(a_n) + O(h^2) $
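For reference, one common way to form such estimates is with finite differences. Below is a small sketch using an absolute perturbation h; the scaling convention of the perturbation can change the exact form of the leading error term in eq6, so treat this only as an illustration:

```python
def fd_first(F, a, h):
    """Forward-difference estimate of F'(a); the leading error is about (h/2)*F''(a)."""
    return (F(a + h) - F(a)) / h

def fd_second(F, a, h):
    """Central second-difference estimate of F''(a); the error is O(h^2)."""
    return (F(a + h) - 2.0 * F(a) + F(a - h)) / h ** 2

# Quick check on F(a) = a^3 at a = 2 (exact F' = 12, F'' = 12):
F = lambda a: a ** 3
print(fd_first(F, 2.0, 1e-3), fd_second(F, 2.0, 1e-3))
```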
Neglecting the second-order terms and substituting these derivative estimates into eq3,
(eq8) $F(a_{n+1}) = F(a_{n}) - \mu \frac{F'(a_{n}) + \frac{ha_n^2}{2}F''(a_n) + O(h^2)}{F''(a_n)} F'(a_n) + O(\mu^2)$.
Expanding the fraction and again neglecting the second-order terms,
(eq9) $F(a_{n+1}) = F(a_{n}) - \mu \frac{F'(a_{n})^2}{F''(a_n)} - \frac{\mu ha_n^2}{2} F'(a_n) + O(h^2) + O(\mu^2)$.
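To get a feel for the size of that extra term, here is a small numerical check (a sketch reusing the toy cost from above; the values of a, $\mu$ and h are arbitrary). For a fixed $\mu$, the gap between the exact and the approximate step shrinks roughly in proportion to h, consistent with the $O(\mu h)$ term in eq9:

```python
F   = lambda a: (a - 3.0) ** 4             # same toy cost as above (illustrative)
dF  = lambda a: 4.0 * (a - 3.0) ** 3
d2F = lambda a: 12.0 * (a - 3.0) ** 2

a, mu = 1.0, 0.01
exact = F(a - mu * dF(a) / d2F(a))          # cost after one exact damped Newton step

for h in (1e-1, 1e-2, 1e-3):
    g = (F(a + h) - F(a)) / h                            # approximate F'(a)
    H = (F(a + h) - 2.0 * F(a) + F(a - h)) / h ** 2      # approximate F''(a)
    approx = F(a - mu * g / H)                           # cost after one approximate step
    print(h, approx - exact)                 # gap shrinks roughly in proportion to h
```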
Again, choosing
$\mu = ch$
for $0 < c < 1$ makes the extra error second order in the learning rate: with $\mu = ch$, the additional term $\frac{\mu h a_n^2}{2}F'(a_n)$ in eq9 becomes $\frac{c h^2 a_n^2}{2}F'(a_n)$. I would recommend choosing c as close to zero as practical, because the Newton-Raphson method tolerates a fairly large learning rate, while the Taylor-series approximation of the derivatives does not. Choosing a very small c (0.0001, for example) gives stable solutions, and all the usual adaptive methods for adjusting the learning rate can still be applied. I tested this on exponential fitting and it gave solutions much faster than standard gradient descent. The next post will be about using this for the conjugate-gradient method. Although I love it and am a fan of smart algorithms, I am not looking forward to it. Another issue is that all these checks were done for one-parameter problems. Numerically they apply to multi-parameter problems too; the next step, however, is to check under what conditions approximations using these methods fail.
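As an end-to-end sketch, here is roughly what the full recipe looks like on a synthetic exponential-fitting problem. The data, model, and every constant below are made up for illustration (and c is taken larger than I recommend above, just so the toy example converges in a few thousand iterations):

```python
import numpy as np

# Synthetic 1-parameter exponential fit: model exp(-a*x), least-squares cost.
# (The data and all constants below are illustrative, not from my experiments.)
x = np.linspace(0.0, 4.0, 50)
y = np.exp(-0.7 * x)                        # fake data with true a = 0.7

def F(a):
    r = np.exp(-a * x) - y
    return np.sum(r ** 2)

h  = 0.01                                   # perturbation (step) size for the derivative estimates
c  = 0.5                                    # larger than recommended above, to keep the demo short
mu = c * h                                  # learning rate mu = c*h

a = 0.0                                     # initial guess
for _ in range(5000):
    g = (F(a + h) - F(a)) / h                            # forward-difference estimate of F'(a) (cf. eq6)
    H = (F(a + h) - 2.0 * F(a) + F(a - h)) / h ** 2      # central estimate of F''(a) (cf. eq7)
    a -= mu * g / H                                      # eq5: approximate damped Newton step
print(a)                                    # ends up near 0.7, up to an O(h) bias from the forward difference
```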