The simulated annealing algorithm is a global search optimization algorithm inspired by the annealing technique in metallurgy. In this article, let’s understand the exact algorithm behind simulated annealing and then implement it in Python from scratch.
First, What is Annealing?
In simple terms, ‘annealing’ is a technique in which a metal is heated to a high temperature and slowly cooled down to improve its physical properties. When the metal is hot, the molecules randomly re-arrange themselves at a rapid pace.
As the metal starts to cool down, the re-arranging process occurs at a much slower rate. In the end, the result is a desirable, workable metal. Time, and the metal’s energy at a particular time, govern the entire process.
In machine learning, the simulated annealing algorithm mimics this process and is used to find optimal (or most predictive) features during feature selection.
Let’s now try to draw parallels between annealing in metallurgy and simulated annealing for feature selection:
In terms of feature selection,
- Set of features represents the arrangement of molecules in the material(metal).
- No. of iterations represents time. Therefore, as the number of iterations increases, the temperature decreases.
- Change in Predictive performance between the previous and the current iteration represents the change in material’s energy.
Simulated annealing is a stochastic global search optimization algorithm, which means it operates well even on non-linear objective functions, where many local search algorithms do not.
Ok, it sounds somewhat similar to Stochastic hill climbing. What’s the difference?
Stochastic Hill Climbing vs. Simulated Annealing
Simulated annealing is a considerably upgraded version of stochastic hill climbing.
Consider that you are climbing a hill and trying to find the optimal steps to reach the top. The main difference is that in stochastic hill climbing, steps are taken at random and the current point is replaced with the new point only if the new point is an improvement over the previous one.
In simulated annealing, the search works the same way, but sometimes worse points are also accepted, which allows the algorithm to find answers that are eventually better.
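The two acceptance rules can be contrasted in a short sketch (a minimisation setting; the function names here are illustrative, not from any library):

```python
import math
import random

def accept_hill_climb(current_eval, new_eval):
    # Stochastic hill climbing keeps the new point only if it improves
    # (lower evaluation = better, since we are minimising)
    return new_eval < current_eval

def accept_annealing(current_eval, new_eval, temperature, rng=random.random):
    # Simulated annealing always keeps improvements, but may also keep a
    # worse point, with probability exp(-difference / temperature)
    difference = new_eval - current_eval
    return difference < 0 or rng() < math.exp(-difference / temperature)
```

For a worse point (difference > 0) at a high temperature, exp(-difference / temperature) is close to 1, so annealing frequently accepts it; hill climbing never does.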
Simulated annealing algorithm
Let’s go over the exact Simulated Annealing algorithm, step-by-step.
- The initial step is to select a subset of features at random.
- Then choose the number of iterations. An ML model is then built and its predictive performance (otherwise called the objective function value) is calculated.
- A small percentage of features are randomly included/excluded from the model. This is just to ‘perturb’ the features. Then the predictive performance is calculated once again for this new set of features.
Two things can happen here:
- If performance Increases in the new set then the new feature set is Accepted.
- If the new feature set performs worse, then the acceptance probability (otherwise called the Metropolis acceptance criterion) is calculated. (You will see the formula and its significance shortly. Stay with the flow for now.)
Once the acceptance probability is calculated, generate a random number between 0 and 1, and:
- If the random number > acceptance probability, the new feature set is rejected and the previous feature set will continue to be used.
- If the Random Number < Acceptance Probability then the new feature set is Accepted.
This element of randomness helps simulated annealing avoid getting stuck in local optima while searching for the global optimum.
Keep doing this for the chosen number of iterations.
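The steps above can be sketched in Python. Everything below is illustrative: the toy score function, the perturbation size, and the exact acceptance formula are assumptions for demonstration, not a production feature selector.

```python
import math
import random

def sa_feature_selection(score, n_features, iterations, c=1.0, seed=42):
    """Sketch of simulated annealing for feature selection. `score` is any
    function mapping a set of feature indices to a value to be maximized."""
    rng = random.Random(seed)
    # 1. Start from a random subset of features
    current = set(rng.sample(range(n_features), k=max(1, n_features // 2)))
    current_score = score(current)
    best, best_score = set(current), current_score
    for i in range(1, iterations + 1):
        # 2. Perturb: randomly include/exclude a small fraction of features
        candidate = set(current)
        for f in rng.sample(range(n_features), k=max(1, n_features // 10)):
            candidate ^= {f}
        if not candidate:  # never evaluate an empty feature set
            candidate.add(rng.randrange(n_features))
        candidate_score = score(candidate)
        if candidate_score > current_score:
            # Improvement: always accept
            current, current_score = candidate, candidate_score
        else:
            # 3. Metropolis-style acceptance: the probability of keeping a
            # worse feature set shrinks as the iteration count i grows
            prob = math.exp(i * (candidate_score - current_score)
                            / (c * max(abs(current_score), 1e-9)))
            if rng.random() < prob:
                current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = set(current), current_score
    return best, best_score

# Toy score (an assumption, not a real model): features {0, 1, 2} are
# informative; every extra feature costs a small penalty
def toy_score(features):
    return len(features & {0, 1, 2}) / 3 - 0.01 * len(features - {0, 1, 2})

best, best_score = sa_feature_selection(toy_score, n_features=10, iterations=300)
print(best, best_score)
```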
Now, how is all of this related to the ‘annealing’ concept of a cooling temperature, you might wonder?
The ‘acceptance probability’ takes care of that. The formula for acceptance probability is designed in such a way that, as the number of iterations increase, the probability of accepting bad performance comes down. As a result, fewer changes are accepted.
Let’s look at the formula now.
Formula for acceptance probability (a.k.a. Metropolis acceptance criterion)
The formula for acceptance probability is as follows:

Acceptance Probability = exp( i × (new − old) / (c × old) )

where:
- i = no. of iterations,
- c = a constant that controls the amount of perturbation that can happen,
- old = old score,
- new = new score.
The acceptance probability can be understood as a function of time and change in performance with a constant ‘c’, which is used to control the rate of perturbation happening in the features. Usually, ‘c’ is set to be 1.
Working example of acceptance probability formula:
Consider that the problem at hand is to optimize the accuracy of a machine learning model. Assume that the previous solution scores 77% (0.77) and the current solution scores 73% (0.73). Plugging these into the formula with c = 1:

- at iteration i = 1: exp(1 × (0.73 − 0.77) / 0.77) ≈ 0.9494
- at iteration i = 5: exp(5 × (0.73 − 0.77) / 0.77) ≈ 0.7713
- at iteration i = 10: exp(10 × (0.73 − 0.77) / 0.77) ≈ 0.5948

As you can see, after 10 iterations the acceptance probability came down to roughly 0.59. Thus, as the number of iterations increases, the chance of accepting a worse solution decreases.
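As a sketch, the criterion can be written as a small function. The function name and the exact form used here, exp(i × (new − old) / (c × old)) with c = 1, are assumptions in the standard Metropolis style, not code from any particular library:

```python
import math

def acceptance_probability(old, new, i, c=1.0):
    # Probability of accepting a worse score `new` over `old` at iteration i.
    # The exponent is negative when new < old, so the result is below 1,
    # and it shrinks further as i grows.
    return math.exp(i * (new - old) / (c * old))

for i in (1, 5, 10):
    print(i, round(acceptance_probability(0.77, 0.73, i), 4))
# → 1 0.9494
# → 5 0.7713
# → 10 0.5948
```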
Now, why does simulated annealing accept worse-performing feature sets?
If the algorithm only ever accepted the best-performing feature sets, the probability of getting stuck in a local optimum would be very high. By occasionally accepting poorer-performing feature sets, the algorithm gains a better chance of finding a global optimum that may exist elsewhere. And since the acceptance probability decreases with time (iterations), the search gradually settles back toward the best-known solution, so the chance of ending on a worse-performing result diminishes.
When the temperature is high, the chance of a worse-performing feature set being accepted is high; as the number of iterations goes up, the temperature decreases, which in turn decreases that chance.
The intent here is that when the temperature is high, the algorithm moves freely in the search space, and as the temperature decreases the algorithm is nudged to converge towards the global optimum.
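A quick illustration of this effect, using the Metropolis criterion with a decaying temperature (the fixed worsening of 0.04 and the schedule values here are assumptions for demonstration):

```python
import math

def metropolis(difference, t):
    # Probability of accepting a move that worsens the objective by
    # `difference`, at temperature t
    return math.exp(-difference / t)

initial_temperature = 12
for i in (0, 10, 100, 1000):
    t = initial_temperature / (i + 1)  # temperature decays with iterations
    print(f"iteration {i:4d}  temperature {t:8.4f}  acceptance {metropolis(0.04, t):.4f}")
```

Early on, the acceptance probability for this worsening is near 1 (free movement); by the late iterations it is close to 0 (convergence).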
Implementing simulated annealing from scratch in Python
Consider the problem of hill climbing. A person named ‘Mia’ is trying to climb to the top of the hill, i.e. the global optimum. For this search hunt towards the global optimum, the required attributes will be:
- The area of the search space; let’s say the area is [-6, 6].
- A start point where ‘Mia’ can begin her search hunt.
- The step_size ‘Mia’ is going to take.
- The number of attempts ‘Mia’ is going to make; in the algorithm, this is the number of iterations.
- The steeps and slopes she climbs as she tries to reach the top/global optimum; in the algorithm, this maps to the temperature.
Another thing to note here is that both the temperature and no. of iterations arguments will be predefined.
Now, how would ‘Mia’ know whether her step is an improvement over the previous one?
Her steps are validated by a function called objective. The objective function here is the square of the step taken. This is because the steps ‘Mia’ takes are totally random within the bounds of the specified area, which means there is a chance of landing on negative values too; squaring makes every evaluation positive and comparable. If the new step is an improvement, she continues on that path.
If her step is not good: The acceptance probability/Metropolis acceptance criterion is calculated. After that, a random number will be generated using rand().
If random_number > Acceptance probability: Reject the new step Else: Accept the new step
Just an overview: in this code, the steps taken by ‘Mia’ will be random, not user-fed values. Each time there is an improvement in the steps taken towards the global optimum, the improved objective evaluation gets saved into a list called outputs.
The initial step is to import necessary libraries.
```python
from numpy import asarray, exp
from numpy.random import randn, rand, seed
from matplotlib import pyplot
```
Let’s define the objective function to evaluate the steps taken by Mia.
```python
# Objective function is the square of the step taken
def objective(step):
    return step ** 2
```
Now that the objective function is defined, ‘Mia’ needs a start point for her search hunt, right? Only with a start point can she progress towards the global optimum.
The code cell below gives us a random start point within the range of the area of the search space, and also evaluates it with the objective function.
```python
seed(1)
area = asarray([[-6.0, 6.0]])
# area[:, 0] = -6.0, area[:, 1] = 6.0
# rand(len(area)) generates a random number within [0, 1); len(area) is 1
start_point = area[:, 0] + rand(len(area)) * (area[:, 1] - area[:, 0])
print(start_point)
print('start_point=', start_point)
print('objective function evaluation of start point=', objective(start_point))
```
```
[-0.99573594]
start_point= [-0.99573594]
objective function evaluation of start point= 0.9914900693154707
```
The printed values show the start point and its objective function evaluation.
seed(1) seeds the pseudorandom number generator. By using seed(1), the same random numbers are generated each time the code cell is run, which makes the results reproducible.
Let’s now define the simulated annealing algorithm as a function. The parameters needed are:
- The objective function.
- The area of the search space.
- The number of iterations.
- The step size.
- The temperature.
Inside the function, the start_point is initialized and then evaluated by the objective function; the result is stored in start_point_eval.
```python
def sa(objective, area, iterations=1200, step_size=0.1, temperature=12):
    # Generate a random start point for the search hunt
    start_point = area[:, 0] + rand(len(area)) * (area[:, 1] - area[:, 0])
    # Evaluate the start_point
    start_point_eval = objective(start_point)
```
The objective function evaluation of the start point (start_point_eval) needs to be stored so that the progress can be seen each time an improvement happens.
```python
    # Store the start point and its objective evaluation into
    # mia_start_point and mia_start_eval
    mia_start_point, mia_start_eval = start_point, start_point_eval
    # This empty list will be updated over time once looping starts
    outputs = []
```
‘Mia’s start point and its evaluation are stored into mia_start_point and mia_start_eval. outputs is an empty list that will get updated over time once looping starts. As of now, ‘Mia’ has started at a point and evaluated that point. Now she has to take her first step of the search hunt, and to do so, a for loop is defined ranging from 0 to the iteration number we specify.
```python
    # Loop from 0 to the iteration number we specify
    for i in range(iterations):
        # First step taken by Mia
        mia_step = mia_start_point + randn(len(area)) * step_size
```
The step follows a Gaussian distribution, where the mean is the current point and the standard deviation is the step_size. In a nutshell, this means almost all steps taken will land within 3 * step_size of the current point.
```python
        # Evaluate the step
        mia_step_eval = objective(mia_step)
```
This new point must be checked against the current best point: if it is better, replace the best point with the new point and append the new evaluation to the outputs list. When the new point is better, the following get printed: (i) iteration count, (ii) previous best, (iii) new best.
```python
        # Check whether the new step is better than the current best.
        # If better, the best point is replaced with the new point.
        if mia_step_eval < start_point_eval:
            start_point, start_point_eval = mia_step, mia_step_eval
            # The evaluation gets appended into the list
            outputs.append(start_point_eval)
            # Print the iteration number, best_so_far and new_best
            print('iteration Number = ', i, ' ', 'best_so_far = ', start_point,
                  ' ', 'new_best = ', start_point_eval)
```
If the new point isn’t a promising solution, the difference between the objective function evaluation of the new step (mia_step_eval) and the current working solution (mia_start_eval) is calculated. One of the popular ways of calculating the temperature is the “fast simulated annealing” schedule, which is as follows:
temperature = initial_temperature / (iteration_number + 1)
```python
        difference = mia_step_eval - mia_start_eval
        # Temperature is calculated with the fast annealing schedule
        t = temperature / float(i + 1)
```
difference gives us the gap between the old point and the new point, which is needed to calculate the acceptance probability/Metropolis acceptance criterion, i.e. the probability of accepting a point that performs worse than the current one.
```python
        # Acceptance probability is calculated
        mac = exp(-difference / t)
```
Then a random number is generated using rand(). If the random number is greater than the acceptance probability, the new point is rejected; if it is less than the acceptance probability, the new point is accepted.
```python
        if difference < 0 or rand() < mac:
            # Store the values
            mia_start_point, mia_start_eval = mia_step, mia_step_eval
    # The return statement is indented at function level because it
    # belongs to sa(), not to the loop
    return [start_point, start_point_eval, outputs]
```
The last step is to pass values to the parameters of the simulated annealing function.
```python
seed(1)
# define the area of the search space
area = asarray([[-6.0, 6.0]])
# initial temperature
temperature = 12
# define the total no. of iterations
iterations = 1200
# define the maximum step_size
step_size = 0.1
# perform the simulated annealing search
start_point, start_point_eval, outputs = sa(objective, area, iterations, step_size, temperature)
```
Putting all of this together into a single code cell, this is how the final code looks:
```python
from numpy import asarray, exp
from numpy.random import randn, rand, seed
from matplotlib import pyplot

# Define objective function
def objective(step):
    return step ** 2.0

# Define simulated annealing algorithm
def sa(objective, area, iterations, step_size, temperature):
    # Create and evaluate the initial point
    start_point = area[:, 0] + rand(len(area)) * (area[:, 1] - area[:, 0])
    start_point_eval = objective(start_point)
    # Assign the previous solution and its evaluation
    mia_start_point, mia_start_eval = start_point, start_point_eval
    outputs = []
    # Initialise mac so the first print cannot fail before it is computed
    mac = 1.0
    for i in range(iterations):
        # Step taken by Mia
        mia_step = mia_start_point + randn(len(area)) * step_size
        mia_step_eval = objective(mia_step)
        if mia_step_eval < start_point_eval:
            start_point, start_point_eval = mia_step, mia_step_eval
            # Append the new best value into the outputs list
            outputs.append(start_point_eval)
            print('Acceptance Criteria = %.5f' % mac, ' ', 'iteration Number = ', i,
                  ' ', 'best_so_far = ', start_point, ' ',
                  'new_best = %.5f' % start_point_eval)
        difference = mia_step_eval - mia_start_eval
        t = temperature / float(i + 1)
        # Calculate the Metropolis Acceptance Criterion / acceptance probability
        mac = exp(-difference / t)
        # Check whether the new point is acceptable
        if difference < 0 or rand() < mac:
            mia_start_point, mia_start_eval = mia_step, mia_step_eval
    return [start_point, start_point_eval, outputs]

seed(1)
# define the area of the search space
area = asarray([[-6.0, 6.0]])
# initial temperature
temperature = 12
# define the total no. of iterations
iterations = 1200
# define the maximum step_size
step_size = 0.1
# perform the simulated annealing search
start_point, start_point_eval, outputs = sa(objective, area, iterations, step_size, temperature)

# plot the improvements found during the search
pyplot.plot(outputs, 'ro-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Objective Function Evaluation')
pyplot.show()
```
```
Acceptance Criteria = 0.76232   iteration Number = 37    best_so_far = [-0.95340594]   new_best = 0.90898
Acceptance Criteria = 0.87733   iteration Number = 39    best_so_far = [-0.91563305]   new_best = 0.83838
Acceptance Criteria = 1.44710   iteration Number = 40    best_so_far = [-0.85680363]   new_best = 0.73411
Acceptance Criteria = 1.42798   iteration Number = 41    best_so_far = [-0.8221177]    new_best = 0.67588
Acceptance Criteria = 1.22608   iteration Number = 42    best_so_far = [-0.68541443]   new_best = 0.46979
Acceptance Criteria = 2.09273   iteration Number = 43    best_so_far = [-0.61804282]   new_best = 0.38198
Acceptance Criteria = 1.16478   iteration Number = 50    best_so_far = [-0.42564785]   new_best = 0.18118
Acceptance Criteria = 0.81414   iteration Number = 66    best_so_far = [-0.35632231]   new_best = 0.12697
Acceptance Criteria = 1.74595   iteration Number = 67    best_so_far = [-0.33780667]   new_best = 0.11411
Acceptance Criteria = 1.02155   iteration Number = 72    best_so_far = [-0.3368772]    new_best = 0.11349
Acceptance Criteria = 1.20897   iteration Number = 75    best_so_far = [-0.29671621]   new_best = 0.08804
Acceptance Criteria = 1.47133   iteration Number = 88    best_so_far = [-0.29346186]   new_best = 0.08612
Acceptance Criteria = 1.78859   iteration Number = 89    best_so_far = [-0.2525718]    new_best = 0.06379
Acceptance Criteria = 0.29369   iteration Number = 93    best_so_far = [-0.20893326]   new_best = 0.04365
Acceptance Criteria = 1.68933   iteration Number = 94    best_so_far = [-0.12724854]   new_best = 0.01619
Acceptance Criteria = 1.24284   iteration Number = 95    best_so_far = [-0.00883141]   new_best = 0.00008
Acceptance Criteria = 0.99774   iteration Number = 102   best_so_far = [0.00581387]    new_best = 0.00003
Acceptance Criteria = 1.05301   iteration Number = 118   best_so_far = [0.00137051]    new_best = 0.00000
Acceptance Criteria = 0.99611   iteration Number = 138   best_so_far = [0.0009528]     new_best = 0.00000
Acceptance Criteria = 1.33458   iteration Number = 158   best_so_far = [-0.00047896]   new_best = 0.00000
Acceptance Criteria = 0.50157   iteration Number = 473   best_so_far = [0.00045742]    new_best = 0.00000
Acceptance Criteria = 1.02387   iteration Number = 482   best_so_far = [0.00010925]    new_best = 0.00000
```
This output shows in which iteration each improvement happened, the previous best point, and the new best point. The graph shows about 22 improvements (red circles) as the algorithm approaches the global optimum. There are stretches with no big improvements, while elsewhere improvements come in quick succession.
This is the Python implementation of the simulated annealing algorithm. Feel free to change the step_size and other inputs to see what you get.
Advantages and limitations of simulated annealing
Some of the advantages worth mentioning are:
- The simulated annealing algorithm can work with arbitrary systems and cost functions.
- It is relatively easy to code, even when the problem at hand is complex.
- By occasionally accepting worse solutions, it avoids getting stuck in local optima, giving a good chance of finding a high-quality (near-optimal) solution.
Some limitations of this technique are:
- This technique cannot tell whether it has found the optimal solution or not.
- In problems with few local minima, this method is unnecessary; gradient descent would do the job.
- It tends to be a very time-consuming procedure.