Understanding Adam Optimizer
28 Dec 2018
Optimizers are the backbone of any machine learning algorithm, whether it is a simple Linear Regression fit by OLS or models such as GANs and ResNets. In this post I would like to discuss one optimizer that has become very popular in recent years: the Adam optimizer.
Inspiration
If you are able to understand and code up the optimizers yourself, you get a good understanding of hyperparameter tuning.
Methodology
Let us consider a simple Linear Regression problem statement for this example.
- Equation
With two input features x1 and x2, the representative equation is ŷ = w1·x1 + w2·x2, or in matrix form ŷ = X·wᵀ.
- Objective Function
The objective function translates to minimizing the sum of squared errors, J(w) = ½ · Σ (y − ŷ)².
- Gradient
Gradient calculation, in mathematical terms, means finding the derivative of your objective function with respect to the variable you are trying to estimate, which here gives ∇J(w) = −Xᵀ(y − X·wᵀ).
error = y - np.matmul(x, weights.T)
gradient = -np.matmul(x.T, error)
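To make the three definitions above concrete, here is a minimal, self-contained sketch on toy values (the small x, y, and weights below are made up for illustration, not part of the original example):
import numpy as np

x = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])            # 3 samples, 2 features
y = np.array([[1.0], [2.0], [3.0]])   # targets
weights = np.array([[0.1, 0.2]])      # 1 x 2 row of weights

error = y - np.matmul(x, weights.T)           # y - X w^T
objective = 0.5 * np.sum(np.power(error, 2))  # J(w) = 1/2 * sum of squared errors
gradient = -np.matmul(x.T, error)             # -X^T (y - X w^T)
print(objective, gradient.ravel())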
- Moment 1
Moment 1 is an exponentially weighted moving average of the gradients. There is a slight twist here: there are two contributions.
- Prior Moment: a large weight (beta1) is assigned to the previous moment value
- Current Gradient: a small weight (1 - beta1) is assigned to the current gradient
moment1 = (beta1 * moment1) + (1 - beta1) * gradient
- Moment 2
Moment 2 is an exponentially weighted moving average of the squared gradients. The same twist applies here: there are two contributions.
- Prior Moment: a large weight (beta2) is assigned to the previous moment value
- Current Gradient: a small weight (1 - beta2) is assigned to the current squared gradient
moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)
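As a small standalone illustration (the scalar gradient values below are made up), unrolling these two updates shows how recent gradients dominate while the contribution of a gradient from k steps ago shrinks by a factor of beta each step:
beta1, beta2 = 0.9, 0.999
grads = [4.0, -2.0, 3.0, 1.0]   # hypothetical scalar gradients

moment1, moment2 = 0.0, 0.0
for g in grads:
    moment1 = (beta1 * moment1) + (1 - beta1) * g
    moment2 = (beta2 * moment2) + (1 - beta2) * g ** 2
    print(round(moment1, 4), round(moment2, 6))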
- Moment 1 Scaling
Moment 1 scaling (the bias correction) adjusts the Moment 1 value to compensate for its initialization at zero: in the early iterations the denominator (1 - beta1^t) is small, so the moment is scaled up, and as the iteration count grows the denominator approaches 1 and the correction fades away.
moment1hat = moment1 / (1 - np.power(beta1, iterationCount + 1))
- Moment 2 Scaling
Moment 2 scaling applies the same bias correction to Moment 2, with beta2 in the denominator.
moment2hat = moment2 / (1 - np.power(beta2, iterationCount + 1))
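A quick standalone check of the correction denominators shows why this matters: at the first few iterations they are much smaller than 1 (so the zero-initialized moments get scaled up), and they approach 1 as training proceeds:
import numpy as np

beta1, beta2 = 0.9, 0.999
for t in [1, 10, 100, 1000]:
    print(t, 1 - np.power(beta1, t), 1 - np.power(beta2, t))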
- Weight updates
The adjustment to the weights takes moment1hat, scales it by the inverse square root of moment2hat, and multiplies the result by the learning rate (alpha); epsilon guards against division by zero. The division by (1 - beta^t) in the two scaling steps above is what the Adam paper calls the initialization bias correction.
weights = weights - ((alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)).T
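A small sketch with made-up moment values shows the effect of this scaling: because moment1hat is divided by the square root of moment2hat, weights with large and small gradients end up with steps of roughly the same size, close to alpha:
import numpy as np

alpha, epsilon = 0.0001, 1e-8
moment1hat = np.array([40.0, 0.004])      # two weights with very different gradient scales
moment2hat = np.array([1600.0, 1.6e-5])   # roughly the squares of the values above
step = (alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)
print(step)   # both entries come out close to alpha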
- Final Code
# Adam Optimizer
import numpy as np

# Sample Data: y = 1..10, with two features (2*i and 3*i) for each sample
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(10, 1)
x = np.array([[2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
              [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]]).T   # transpose so each row is one sample

# Weight Initialization and hyperparameters
weights = np.random.rand(2).reshape(1, 2)
alpha = 0.0001      # learning rate
beta1 = 0.9         # decay rate for Moment 1
beta2 = 0.999       # decay rate for Moment 2
epsilon = 1e-8      # guards against division by zero
moment1 = np.zeros((x.shape[1], 1))
moment2 = np.zeros((x.shape[1], 1))
for iterationCount in range(1000000):
    error = y - np.matmul(x, weights.T)
    gradient = -np.matmul(x.T, error)
    moment1 = (beta1 * moment1) + (1 - beta1) * gradient
    moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)
    moment1hat = moment1 / (1 - np.power(beta1, iterationCount + 1))
    moment2hat = moment2 / (1 - np.power(beta2, iterationCount + 1))
    weights = weights - ((alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)).T
    mse = np.sum(np.power(error, 2)) / x.shape[0]   # mean squared error for this iteration
    if iterationCount % 10000 == 0:
        print("epoch {0} MSE {1}".format(iterationCount, mse))
    if mse < 1:
        break
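Once the loop above has finished, the learned weights can be used for prediction. This short usage sketch simply reuses the x, y, and weights variables from the code above:
predictions = np.matmul(x, weights.T)
print(np.hstack([y, np.round(predictions, 2)]))   # targets next to fitted values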
- Analysis
We will now analyse the changes in the moments with each iteration to understand the progress of the algorithm.
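One way to do this analysis is to rerun the same loop while recording the corrected moments at every step. The helper below is a sketch added for illustration (adam_with_history is not part of the original post); it returns the fitted weights together with the norms of moment1hat and moment2hat per iteration, which can then be plotted against the iteration count:
import numpy as np

def adam_with_history(x, y, alpha=0.0001, beta1=0.9, beta2=0.999,
                      epsilon=1e-8, steps=5000):
    weights = np.random.rand(1, x.shape[1])
    moment1 = np.zeros((x.shape[1], 1))
    moment2 = np.zeros((x.shape[1], 1))
    history = []
    for t in range(1, steps + 1):
        error = y - np.matmul(x, weights.T)
        gradient = -np.matmul(x.T, error)
        moment1 = (beta1 * moment1) + (1 - beta1) * gradient
        moment2 = (beta2 * moment2) + (1 - beta2) * np.power(gradient, 2)
        moment1hat = moment1 / (1 - np.power(beta1, t))
        moment2hat = moment2 / (1 - np.power(beta2, t))
        weights = weights - ((alpha * moment1hat) / (np.sqrt(moment2hat) + epsilon)).T
        history.append((np.linalg.norm(moment1hat), np.linalg.norm(moment2hat)))
    return weights, history

# Record the moment norms for the same sample data used earlier
x = np.array([[2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
              [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]], dtype=float).T
y = np.arange(1, 11, dtype=float).reshape(10, 1)
weights, history = adam_with_history(x, y)
print(history[0], history[-1])   # moment norms at the first and last recorded iteration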