In the last post I tested whether the forward-difference approximation of the first derivative, obtained from a Taylor series expansion, can be used in optimization routines. I was surprised to find that this method gave stable results, in spite of my repeated attempts to make it unstable. The stability of the method did, however, depend on the perturbation size (or step size) used to compute the approximate first derivative. As long as it was small (below 5% for the exponential fit) the algorithm converged, but for larger values it did not. Below is an attempt to estimate the source of error due to the Taylor series expansion, and to provide a formal way to choose the perturbation size.
The optimization goal is to minimize F(a) using the update rule,
(eq1) $a^{n+1} = a^n - \mu \left. \frac{dF}{da} \right|_{a^n}$
The estimated derivative is computed as,
(eq2) $\left. \frac{\bar{dF}}{da} \right|_{a^n} = \frac{F(a^n + a^n h) - F(a^n)}{a^n h}$
Noting that,
(eq3) $F(a^n + a^n h) = F(a^n) + a^n h \left. \frac{dF}{da} \right|_{a^n} + \frac{(a^n)^2 h^2}{2} \left. \frac{d^2F}{da^2} \right|_{a^n} + O(h^3)$,
The approximate derivative is
(eq4) $\left. \frac{\bar{dF}}{da} \right|_{a^n} = \left. \frac{dF}{da} \right|_{a^n} + \frac{a^n h}{2} \left. \frac{d^2F}{da^2} \right|_{a^n} + O(h^2)$.
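As a quick sanity check on eq2 and eq4, the sketch below compares the relative forward difference against the exact derivative and against the leading error term $\frac{a h}{2} F''(a)$ predicted by eq4. The test function $F(a) = a^3$ and the numbers are illustrative choices, not from the post:

```python
# Forward-difference estimate of dF/da with a *relative* perturbation a*h
# (eq2), compared against the exact derivative and the leading Taylor
# error predicted by eq4: (a*h/2) * F''(a).

def F(a):
    return a**3          # assumed test function: F' = 3a^2, F'' = 6a

def dF_exact(a):
    return 3 * a**2

def dF_approx(a, h):
    """Forward difference with perturbation a*h, as in eq2."""
    return (F(a + a * h) - F(a)) / (a * h)

a, h = 2.0, 0.01
approx = dF_approx(a, h)
exact = dF_exact(a)
predicted_error = (a * h / 2) * (6 * a)   # (a h / 2) * F''(a)

print(approx - exact)    # ≈ 0.1204: actual bias of the estimate
print(predicted_error)   # ≈ 0.12: eq4's leading term; remainder is O(h^2)
```

The actual bias and the eq4 prediction agree to within the $O(h^2)$ remainder, as the expansion says they should.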
The cost function after the update of eq1, with the estimated derivative of eq2 standing in for the exact one, is
(eq5) $F(a^{n+1}) = F(a^n) - \mu \left. \frac{\bar{dF}}{da} \right|_{a^n} \left. \frac{dF}{da} \right|_{a^n} + O(\mu^2)$.
Substituting the approximate derivative from eq4,
(eq6) $F(a^{n+1}) = F(a^n) - \mu \left( \left. \frac{dF}{da} \right|_{a^n} + \frac{a^n h}{2} \left. \frac{d^2F}{da^2} \right|_{a^n} + O(h^2) \right) \left. \frac{dF}{da} \right|_{a^n} + O(\mu^2)$.
Rearranging gives
(eq7) $F(a^{n+1}) = F(a^n) - \mu \left( \left. \frac{dF}{da} \right|_{a^n} \right)^2 \underbrace{ - \frac{a^n h \mu}{2} \left. \frac{d^2F}{da^2} \right|_{a^n} \left. \frac{dF}{da} \right|_{a^n} + O(\mu h^2) }_{\text{Effect of Taylor series approximation}} + O(\mu^2)$
Based on eq7, choosing
(eq8) $h = c \mu$
for any $0 < c < 1$ makes the Taylor-series error term of second order in the learning rate, and the cost function reduces to
(eq9) $F(a^{n+1}) = F(a^n) - \mu \left( \left. \frac{dF}{da} \right|_{a^n} \right)^2 + O(\mu^2)$,
i.e. choosing h according to eq8 leaves only errors of second order in the learning rate.
CONCLUSION: If h is chosen according to eq8, the error from computing the derivative via the Taylor series expansion is of the same order as that from computing the derivative analytically. This way, all the adaptive methods for modifying the learning rate can be applied to this algorithm too.
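The whole scheme can be sketched as below, with the perturbation size tied to the learning rate per eq8. The quadratic test function, the learning rate, and the constant c are illustrative assumptions, not from the post:

```python
# Gradient descent (eq1) driven entirely by the forward-difference
# gradient of eq2, with the perturbation size chosen per eq8: h = c*mu.

def F(a):
    return (a - 3.0) ** 2   # assumed test function, minimized at a = 3

mu = 0.1        # learning rate
c = 0.5         # any 0 < c < 1
h = c * mu      # perturbation size per eq8

a = 1.0         # initial guess (nonzero, since the step is a*h)
for _ in range(100):
    grad = (F(a + a * h) - F(a)) / (a * h)   # estimated derivative, eq2
    a -= mu * grad                           # update rule, eq1

print(a)   # ≈ 2.93: close to the minimizer a = 3, biased by O(h)
```

The iteration converges to within the $O(h)$ bias of the true minimizer; shrinking $\mu$ (and hence $h = c\mu$) shrinks that bias along with it.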