Understanding Adam Optimizer
28 Dec 2018
Optimizers are the backbone of any machine learning algorithm, whether it is a simple Linear Regression fit by OLS or models such as GANs and ResNets. In this post I would like to discuss one optimizer that has become very popular in recent years: the Adam optimizer.
Inspiration
If you are able to understand and code up the optimizers yourself, you get a good understanding of hyperparameter tuning.
Methodology
Let us consider a simple Linear Regression problem statement for this example.
- Equation
With two input features x1 and x2, the representative equation is ŷ = w1·x1 + w2·x2, or in matrix form ŷ = X·wᵀ.
- Objective Function
The objective function translates to minimizing the sum of squared errors, J(w) = ½ · Σ (y − ŷ)².
- Gradient
Gradient calculation, in mathematical terms, means finding the derivative of your objective function with respect to the variable you are trying to estimate, which here gives ∇J(w) = −Xᵀ(y − X·wᵀ).
error = y - np.matmul(x, weights.T)
gradient = -np.matmul(x.T, error)
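To make the three definitions above concrete, here is a minimal, self-contained sketch on toy values (the small x, y, and weights below are made up for illustration, not part of the original example):
import numpy as np

x = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])            # 3 samples, 2 features
y = np.array([[1.0], [2.0], [3.0]])   # targets
weights = np.array([[0.1, 0.2]])      # 1 x 2 row of weights

error = y - np.matmul(x, weights.T)           # y - X w^T
objective = 0.5 * np.sum(np.power(error, 2))  # J(w) = 1/2 * sum of squared errors
gradient = -np.matmul(x.T, error)             # -X^T (y - X w^T)
print(objective, gradient.ravel())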
- Moment 1
Moment 1 is an exponentially weighted moving average of the gradients. There is a slight twist here: there are two contributions.
- Prior Moment: a large weight (beta1) is assigned to the previous moment value
- Current Gradient: a small weight (1 - beta1) is assigned to the current gradient
moment1 = (beta1 * moment1) + (1 - beta1) * gradient
- Moment 2
Moment 2 is an exponentially weighted moving average of the squared gradients. The same twist applies here: there are two contributions.
- Prior Moment: a large weight (beta2) is assigned to the previous moment value
- Current Gradient: a small weight (1 - beta2) is assigned to the current squared gradient
moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)
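As a small standalone illustration (the scalar gradient values below are made up), unrolling these two updates shows how recent gradients dominate while the contribution of a gradient from k steps ago shrinks by a factor of beta each step:
beta1, beta2 = 0.9, 0.999
grads = [4.0, -2.0, 3.0, 1.0]   # hypothetical scalar gradients

moment1, moment2 = 0.0, 0.0
for g in grads:
    moment1 = (beta1 * moment1) + (1 - beta1) * g
    moment2 = (beta2 * moment2) + (1 - beta2) * g ** 2
    print(round(moment1, 4), round(moment2, 6))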
- Moment 1 Scaling
Moment 1 scaling (the bias correction) adjusts the Moment 1 value to compensate for its initialization at zero: in the early iterations the denominator (1 - beta1^t) is small, so the moment is scaled up, and as the iteration count grows the denominator approaches 1 and the correction fades away.
moment1hat = moment1 / (1 - np.power(beta1, iterationCount + 1))
- Moment 2 Scaling
Moment 2 scaling applies the same bias correction to Moment 2, with beta2 in the denominator.
moment2hat = moment2 / (1 - np.power(beta2, iterationCount + 1))
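A quick standalone check of the correction denominators shows why this matters: at the first few iterations they are much smaller than 1 (so the zero-initialized moments get scaled up), and they approach 1 as training proceeds:
import numpy as np

beta1, beta2 = 0.9, 0.999
for t in [1, 10, 100, 1000]:
    print(t, 1 - np.power(beta1, t), 1 - np.power(beta2, t))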
- Weight updates
The adjustment to the weights takes moment1hat, scales it by the inverse square root of moment2hat, and multiplies the result by the learning rate (alpha); epsilon guards against division by zero. The division by (1 - beta^t) in the two scaling steps above is what the Adam paper calls the initialization bias correction.
weights = weights - ((alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)).T
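A small sketch with made-up moment values shows the effect of this scaling: because moment1hat is divided by the square root of moment2hat, weights with large and small gradients end up with steps of roughly the same size, close to alpha:
import numpy as np

alpha, epsilon = 0.0001, 1e-8
moment1hat = np.array([40.0, 0.004])      # two weights with very different gradient scales
moment2hat = np.array([1600.0, 1.6e-5])   # roughly the squares of the values above
step = (alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)
print(step)   # both entries come out close to alpha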
- Final Code
# Adam Optimizer
import numpy as np

# Sample Data: y = 1..10, with two features (2*i and 3*i) for each sample
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(10, 1)
x = np.array([[2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
              [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]]).T   # transpose so each row is one sample

# Weight Initialization and hyperparameters
weights = np.random.rand(2).reshape(1, 2)
alpha = 0.0001      # learning rate
beta1 = 0.9         # decay rate for Moment 1
beta2 = 0.999       # decay rate for Moment 2
epsilon = 1e-8      # guards against division by zero
moment1 = np.zeros((x.shape[1], 1))
moment2 = np.zeros((x.shape[1], 1))
for iterationCount in range(1000000):
    error = y - np.matmul(x, weights.T)
    gradient = -np.matmul(x.T, error)
    moment1 = (beta1 * moment1) + (1 - beta1) * gradient
    moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)
    moment1hat = moment1 / (1 - np.power(beta1, iterationCount + 1))
    moment2hat = moment2 / (1 - np.power(beta2, iterationCount + 1))
    weights = weights - ((alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)).T
    mse = np.sum(np.power(error, 2)) / x.shape[0]   # mean squared error for this iteration
    if iterationCount % 10000 == 0:
        print("epoch {0} MSE {1}".format(iterationCount, mse))
    if mse < 1:
        break
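Once the loop above has finished, the learned weights can be used for prediction. This short usage sketch simply reuses the x, y, and weights variables from the code above:
predictions = np.matmul(x, weights.T)
print(np.hstack([y, np.round(predictions, 2)]))   # targets next to fitted values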
- Analysis
We will now analyse the changes in the moments with each iteration to understand the progress of the algorithm.
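One way to do this analysis is to rerun the same loop while recording the corrected moments at every step. The helper below is a sketch added for illustration (adam_with_history is not part of the original post); it returns the fitted weights together with the norms of moment1hat and moment2hat per iteration, which can then be plotted against the iteration count:
import numpy as np

def adam_with_history(x, y, alpha=0.0001, beta1=0.9, beta2=0.999,
                      epsilon=1e-8, steps=5000):
    weights = np.random.rand(1, x.shape[1])
    moment1 = np.zeros((x.shape[1], 1))
    moment2 = np.zeros((x.shape[1], 1))
    history = []
    for t in range(1, steps + 1):
        error = y - np.matmul(x, weights.T)
        gradient = -np.matmul(x.T, error)
        moment1 = (beta1 * moment1) + (1 - beta1) * gradient
        moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)
        moment1hat = moment1 / (1 - np.power(beta1, t))
        moment2hat = moment2 / (1 - np.power(beta2, t))
        weights = weights - ((alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)).T
        history.append((np.linalg.norm(moment1hat), np.linalg.norm(moment2hat)))
    return weights, history

# Record the moment norms for the same sample data used earlier
x = np.array([[2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
              [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]], dtype=float).T
y = np.arange(1, 11, dtype=float).reshape(10, 1)
weights, history = adam_with_history(x, y)
print(history[0], history[-1])   # moment norms at the first and last recorded iteration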