matheqns

Tuesday, January 14, 2014

Can Hessian be substituted by its corresponding diagonal matrix?


Am up again, and its 4 am. The gym does not open until 6, so time to ramble on. Previous posts focused on using gradient descent methods for minimization. However, they were for optimization of cost functions that were parametrized by one variable. For multi-variate cases, the update rules for a to minimize F(a) are,
(eq1) $a_{n+1} = a_n - \mu  \left. \nabla F \right| _{a_n}$ and
(eq2) $a_{n+1} = a_n - \mu  \left. H(a_n) \right|_{a_n}^{-1} \left. \nabla F \right| _{a_n}$
The approximated errors estimated earlier all apply to gradient descent eq1 and eq2, however, computations using eq2, require computing hessian which is computationally intensive. Most algorithms (Quasi-newton methods) approximate the hessian based on changes in gradient function $\left. \nabla F \right| _{a_n}$ when solution is updated to  $a_{n+1}$ from  $a_{n}$. However, this still involves computing and inverting a large matrix of dimension $n^2$. In this post I check errors introduced by replacing the hessian in eq2 by a diagonal matrix whose diagonal elements are same as the hessian. 
(eq3) $a_{n+1} = a_n - \mu  \left.H_d(a_n) \right|_{a_n}^{-1} \left. \nabla F \right| _{a_n}$
This way, you need to store and compute n additional terms as each time step, and as the inverse of a diagonal matrix is much easier to compute. Now will try and investigate how this approximation influences the gradient descent method, and can stability be guaranteed? If so, under which conditions? The results below are random thoughts for now, will use them to show convergence in upcoming post. 

Rewriting eq2, 
(eq4) $a_{n+1} = a_n - \mu  \left.H(a_n) \right|_{a_n}^{-1} \left. \nabla F \right| _{a_n}$
Now splitting hessian into diagonal and non-diagonal terms, 
(eq5) $a_{n+1} = a_n - \mu  \left( \left.H_d(a_n) \right|_{a_n} + \left.H_{nd}(a_n) \right|_{a_n} \right)^{-1}  \left. \nabla F \right| _{a_n}$

Using matrix inversion Lemma:
$(A-B D^{-1} C)^{-1} =  A^{-1} + A^{-1} B ( D - C A^{-1} B)^{-1} C A^{-1}  $, 
When B and D are identity matrices and C = -C,
$(A+ C)^{-1} =  A^{-1} - A^{-1}( I + C A^{-1} )^{-1} C A^{-1}  $,
On further simplification,
$(A+ C)^{-1} =  A^{-1} - A^{-1}( I + A C^{-1} )^{-1}  $, 
Dropping subscripts and rewriting the hessian,
$(H_d+ H_{nd})^{-1} = H_d^{-1} - H_d^{-1}( I + H_d H_{nd}^{-1} )^{-1}  $,
Substituting in the update equation eq3, 
(eq6) $a_{n+1} = a_n - \mu  (H_d^{-1} - H_d^{-1}( I + H_d H_{nd}^{-1} )^{-1}) \left. \nabla F \right| _{a_n}$
(eq7) $a_{n+1} = a_n - \mu  H_d^{-1} \left. \nabla F \right| _{a_n} +  \underbrace{\mu H_d^{-1}( I + H_d H_{nd}^{-1} )^{-1} \left. \nabla F \right| _{a_n}}_{\text{error due to ignoring non diagonal terms}}$
Further simplification gives,
(eq8) $a_{n+1} = a_n - \mu  H_d^{-1} \left. \nabla F \right| _{a_n} +  \mu H_d^{-1}H_{nd} H^{-1} \left. \nabla F \right| _{a_n}$

Will stop now, will post proof using this over the weekend. its almost 6, gym time :D 





No comments:

Post a Comment