Understanding Adam Optimizer 28 Dec 2018
Optimizers are the backbone of any machine learning algorithm, be it a simple OLS Linear Regression or GANs and ResNets. I would like to discuss one of the optimizers that has become very popular nowadays: the Adam Optimizer.
If you are able to understand and code up the optimizers, you get a good understanding of hyperparameter tuning.
Let us consider a simple Linear Regression problem statement for this example
- Equation
For the above image, the representative equation is
- Objective Function
The Objective function gets translated to
- Gradient
Gradient calculation, in mathematical terms, means finding the derivative of your Objective Function w.r.t. the variable that you are trying to calculate (here, the weights)
error= y - np.matmul(x,weights.T)
gradient = - np.matmul(x.T,error)
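As a sanity check on the two lines above, here is a minimal sketch that compares the analytic gradient with a finite-difference estimate. It assumes the objective is half the sum of squared errors, 0.5 * sum((y - x.weights^T)^2), which is the form whose derivative matches -x^T.error; the toy data, the `objective` helper and the step size `h` are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 features (values are illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(5, 1))
weights = rng.normal(size=(1, 2))

def objective(w):
    # Assumed objective: half the sum of squared errors.
    error = y - np.matmul(x, w.T)
    return 0.5 * np.sum(np.power(error, 2))

# Analytic gradient, written exactly as in the snippet above.
error = y - np.matmul(x, weights.T)
gradient = -np.matmul(x.T, error)  # shape (2, 1), one entry per weight

# Central finite differences, nudging one weight at a time.
h = 1e-6
numeric = np.zeros_like(gradient)
for j in range(weights.shape[1]):
    step = np.zeros_like(weights)
    step[0, j] = h
    numeric[j, 0] = (objective(weights + step) - objective(weights - step)) / (2 * h)

print(np.allclose(gradient, numeric, atol=1e-5))
```

If the two agree, the gradient code is consistent with that objective. A different scaling of the objective (for example a mean instead of a sum) would only rescale the gradient.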
- Moment 1
Moment 1 is an exponential moving average of the gradient values. There is a slight twist here: each update combines two contributions
- Prior Moment : There is a large weight (beta1) assigned to the prior moment
- Current Gradient : There is a small weight (1 - beta1) assigned to the current gradient
moment1 = (beta1 * moment1) + ( 1 - beta1) * gradient
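To see what the large and small weights do, here is a small sketch that runs the update over three made-up gradients and compares the result with the unrolled weighted sum. The value beta1 = 0.9 is an assumption (it is the default suggested in the Adam paper), and `update_moment1` is just a helper name for this example.

```python
import numpy as np

beta1 = 0.9  # assumed value, not taken from this post

def update_moment1(moment1, gradient, beta1):
    # Large weight on the prior moment, small weight on the current gradient.
    return (beta1 * moment1) + (1 - beta1) * gradient

# Hypothetical gradient sequence for a single weight.
gradients = [np.array([1.0]), np.array([2.0]), np.array([3.0])]

moment1 = np.zeros(1)
for gradient in gradients:
    moment1 = update_moment1(moment1, gradient, beta1)

# Unrolled, the same value is a weighted sum whose weights decay geometrically:
# the oldest gradient is multiplied by beta1 twice, the newest not at all.
unrolled = (1 - beta1) * (beta1**2 * 1.0 + beta1 * 2.0 + 3.0)

print(moment1, unrolled)
```

Older gradients are never dropped, they just fade by a factor of beta1 at every step.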
- Moment 2
Moment 2 is an exponential moving average of the squared gradient values. There is a slight twist here too: each update combines two contributions
- Prior Moment : There is a large weight (beta2) assigned to the prior moment
- Current Gradient : There is a small weight (1 - beta2) assigned to the current squared gradient
moment2 = (beta2 * moment2) + ( 1 - beta2) * np.power(gradient,2)
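Because the gradient is squared, Moment 2 tracks the size of the gradients and ignores their sign. The sketch below feeds both moments a made-up gradient that flips between +2 and -2; beta1 = 0.9 and beta2 = 0.999 are assumed values.

```python
import numpy as np

beta1, beta2 = 0.9, 0.999  # assumed values, not taken from this post

moment1 = np.zeros(1)
moment2 = np.zeros(1)

# Hypothetical gradient that alternates in sign but keeps the same magnitude.
for step in range(200):
    gradient = np.array([2.0]) if step % 2 == 0 else np.array([-2.0])
    moment1 = (beta1 * moment1) + (1 - beta1) * gradient
    moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)

# moment1 stays close to zero because opposite signs cancel out;
# moment2 keeps accumulating toward the squared magnitude (4.0).
print(moment1, moment2)
```

So Moment 1 answers "which direction, on average?" while Moment 2 answers "how big are the gradients, on average?".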
- Moment 1 Scaling
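Moment 1 Scaling corrects the bias of the moving average. Because the average starts from zero (as in standard Adam), the early values of Moment 1 are too small. Dividing by 1 - beta1^(iterationCount+1) scales them up, and with every step this correction shrinks towards 1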
moment1hat = moment1 / ( 1 - np.power(beta1,iterationCount+1))
- Moment 2 Scaling
Moment 2 Scaling corrects the same bias for Moment 2. Dividing by 1 - beta2^(iterationCount+1) scales the early values up, and with every step this correction shrinks towards 1
moment2hat = moment2 / ( 1 - np.power(beta2,iterationCount+1))
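A quick way to see what the two scaling steps do is to feed the moments a constant gradient. The sketch below assumes beta1 = 0.9, beta2 = 0.999, moments starting at zero, and a made-up gradient of 0.5.

```python
import numpy as np

beta1, beta2 = 0.9, 0.999            # assumed values, not taken from this post
constant_gradient = np.array([0.5])  # hypothetical gradient, same at every step

moment1 = np.zeros(1)
moment2 = np.zeros(1)
for iterationCount in range(3):
    moment1 = (beta1 * moment1) + (1 - beta1) * constant_gradient
    moment2 = (beta2 * moment2) + (1 - beta2) * np.power(constant_gradient, 2)
    moment1hat = moment1 / (1 - np.power(beta1, iterationCount + 1))
    moment2hat = moment2 / (1 - np.power(beta2, iterationCount + 1))
    # The raw moments start far below the gradient, the scaled ones do not.
    print(iterationCount, moment1, moment1hat, moment2, moment2hat)
```

With a constant gradient the raw moments creep up slowly from zero, while the scaled versions equal the gradient and the squared gradient from the very first step.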
- Weight updates
The adjustment to the weights is moment1hat divided by the square root of moment2hat, multiplied by the learning rate ( alpha ). The term 1/sqrt(moment2hat) scales the step of each weight by the recent size of its gradient, and epsilon guards against division by zero
weights = weights - ((alpha * moment1hat) / ( np.sqrt(moment2hat) + epsilon )).T
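One consequence of dividing by the square root of moment2hat is that the size of a step depends very little on the scale of the gradient. Here is a minimal sketch of a single update for two weights whose gradients differ by a factor of 1000. The `adam_step` helper and all hyperparameter values (alpha = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) are assumptions for illustration.

```python
import numpy as np

alpha, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-8  # assumed values

def adam_step(weights, gradient, moment1, moment2, iterationCount):
    # One full update, using the same formulas as the steps above.
    moment1 = (beta1 * moment1) + (1 - beta1) * gradient
    moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)
    moment1hat = moment1 / (1 - np.power(beta1, iterationCount + 1))
    moment2hat = moment2 / (1 - np.power(beta2, iterationCount + 1))
    weights = weights - ((alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)).T
    return weights, moment1, moment2

weights = np.zeros((1, 2))
zeros = np.zeros((2, 1))

# Hypothetical gradients: one tiny and positive, one large and negative.
gradient = np.array([[0.01], [-10.0]])
new_weights, _, _ = adam_step(weights, gradient, zeros, zeros, 0)

# Both weights move by roughly alpha, in the direction opposite to their gradient.
print(new_weights)
```

On the first step moment1hat equals the gradient and moment2hat equals its square, so each weight moves by about alpha regardless of how large its gradient is.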
- Final Code
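# Adam Optimizer
import numpy as np

# Sample Data
# (x is expected to be the feature matrix of shape (n, d), y the targets of shape (n, 1))
# Weight Initialization
# (weights is expected to have shape (1, d); moment1, moment2, beta1, beta2 and alpha are set here)
epsilon = 1e-8
for iterationCount in range(1000000):
    error = y - np.matmul(x, weights.T)
    gradient = -np.matmul(x.T, error)
    # Moving averages of the gradient and of the squared gradient
    moment1 = (beta1 * moment1) + (1 - beta1) * gradient
    moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)
    # Scaling (bias correction) of both moments
    moment1hat = moment1 / (1 - np.power(beta1, iterationCount + 1))
    moment2hat = moment2 / (1 - np.power(beta2, iterationCount + 1))
    weights = weights - ((alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)).T
    # Mean squared error of this iteration (computed before the weight update)
    mse = np.sum(np.power(error, 2)) / x.shape[0]
    if iterationCount % 10000 == 0:
        print("epoch {0} MSE Error {1}".format(iterationCount, mse))
    if mse < 1:
        break  # stop once the error is small enough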
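To try the loop end to end, here is a self-contained sketch with synthetic data. Everything that the code above leaves open is an assumption here: the data (generated from y = 3*x1 - 2*x2 plus a little noise), the zero initialization, the hyperparameters (alpha = 0.01, beta1 = 0.9, beta2 = 0.999), the iteration cap and the stopping threshold. It is not the data or the settings used in this post.

```python
import numpy as np

# Hypothetical sample data: 200 points from y = 3*x1 - 2*x2 plus small noise.
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 2))
true_weights = np.array([[3.0, -2.0]])
y = np.matmul(x, true_weights.T) + 0.01 * rng.normal(size=(200, 1))

# Weight initialization and hyperparameters (assumed values).
weights = np.zeros((1, 2))
moment1 = np.zeros((2, 1))
moment2 = np.zeros((2, 1))
alpha, beta1, beta2, epsilon = 0.01, 0.9, 0.999, 1e-8

for iterationCount in range(5000):
    error = y - np.matmul(x, weights.T)
    gradient = -np.matmul(x.T, error)
    moment1 = (beta1 * moment1) + (1 - beta1) * gradient
    moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)
    moment1hat = moment1 / (1 - np.power(beta1, iterationCount + 1))
    moment2hat = moment2 / (1 - np.power(beta2, iterationCount + 1))
    weights = weights - ((alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)).T
    # Mean squared error measured before this iteration's update.
    mse = np.sum(np.power(error, 2)) / x.shape[0]
    if mse < 1e-3:
        break  # assumed stopping threshold for this toy problem

print(iterationCount, mse, weights)
```

The recovered weights should land close to the ones used to generate the data. Changing alpha or the betas and watching how many iterations it takes is a simple way to get a feel for the hyperparameters.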
- Analysis
We will now analyse the changes in the moments with each iteration to understand the progress of the algorithm