The simulated annealing algorithm is a global search optimization algorithm inspired by the annealing technique in metallurgy. In this article, let’s understand the exact algorithm behind simulated annealing and then implement it in Python from scratch.
First, What is Annealing?
In simple terms, ‘annealing’ is a technique in which a metal is heated to a high temperature and slowly cooled down to improve its physical properties. When the metal is hot, the molecules randomly re-arrange themselves at a rapid pace.
As the metal starts to cool down, the re-arranging process occurs at a much slower rate. In the end, the result is a desirable, workable metal. Time, and the metal’s energy at a particular time, govern the entire process.
In machine learning, the simulated annealing algorithm mimics this process and is used to find optimal (or most predictive) features during feature selection.
Let’s now try to draw parallels between annealing in metallurgy and simulated annealing for feature selection:
In terms of feature selection,
- Set of features represents the arrangement of molecules in the material(metal).
- No. of iterations represents time. Therefore, as the number of iterations increases, the temperature decreases.
- Change in Predictive performance between the previous and the current iteration represents the change in material’s energy.
Simulated annealing is a stochastic global search optimization algorithm, which means it operates well even on non-linear objective functions, where many local search algorithms do not.
Ok, it sounds somewhat similar to Stochastic hill climbing. What’s the difference?
Stochastic Hill Climbing vs. Simulated Annealing
Simulated annealing is a considerably upgraded version of stochastic hill climbing.
Consider that you are climbing a hill and trying to find the optimal steps to reach the top. The main difference is that in stochastic hill climbing, steps are taken at random and the current point is replaced with the new point only if the new point is an improvement over the previous one.
In simulated annealing, the search works the same way, but sometimes worse points are also accepted, which allows the algorithm to find answers that are eventually better.
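The two acceptance rules can be contrasted in a short sketch (a minimisation setting; the function names here are illustrative, not from any library):

```python
import math
import random

def accept_hill_climb(current_eval, new_eval):
    # Stochastic hill climbing keeps the new point only if it improves
    # (lower evaluation = better, since we are minimising)
    return new_eval < current_eval

def accept_annealing(current_eval, new_eval, temperature, rng=random.random):
    # Simulated annealing always keeps improvements, but may also keep a
    # worse point, with probability exp(-difference / temperature)
    difference = new_eval - current_eval
    return difference < 0 or rng() < math.exp(-difference / temperature)
```

For a worse point (difference > 0) at a high temperature, exp(-difference / temperature) is close to 1, so annealing frequently accepts it; hill climbing never does.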
Simulated annealing algorithm
Let’s go over the exact Simulated Annealing algorithm, step-by-step.
- The initial step is to select a subset of features at random.
- Then choose the number of iterations. An ML model is then built and its predictive performance (otherwise called the objective function value) is calculated.
- A small percentage of features are randomly included/excluded from the model. This is just to ‘perturb’ the features. Then the predictive performance is calculated once again for this new set of features.
Two things can happen here:
- If performance Increases in the new set then the new feature set is Accepted.
- If the new feature set performs worse, then the acceptance probability (otherwise called the Metropolis acceptance criterion) is calculated. (You will see the formula and its significance shortly. Stay with the flow for now.)
Once the acceptance probability is calculated, generate a random number between 0 and 1, and:
- If the random number > acceptance probability, the new feature set is rejected and the previous feature set will continue to be used.
- If the Random Number < Acceptance Probability then the new feature set is Accepted.
This element of randomness helps simulated annealing avoid getting stuck in local optima while searching for the global optimum.
Keep doing this for the chosen number of iterations.
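The steps above can be sketched in Python. Everything below is illustrative: the toy score function, the perturbation size, and the exact acceptance formula are assumptions for demonstration, not a production feature selector.

```python
import math
import random

def sa_feature_selection(score, n_features, iterations, c=1.0, seed=42):
    """Sketch of simulated annealing for feature selection. `score` is any
    function mapping a set of feature indices to a value to be maximized."""
    rng = random.Random(seed)
    # 1. Start from a random subset of features
    current = set(rng.sample(range(n_features), k=max(1, n_features // 2)))
    current_score = score(current)
    best, best_score = set(current), current_score
    for i in range(1, iterations + 1):
        # 2. Perturb: randomly include/exclude a small fraction of features
        candidate = set(current)
        for f in rng.sample(range(n_features), k=max(1, n_features // 10)):
            candidate ^= {f}
        if not candidate:  # never evaluate an empty feature set
            candidate.add(rng.randrange(n_features))
        candidate_score = score(candidate)
        if candidate_score > current_score:
            # Improvement: always accept
            current, current_score = candidate, candidate_score
        else:
            # 3. Metropolis-style acceptance: the probability of keeping a
            # worse feature set shrinks as the iteration count i grows
            prob = math.exp(i * (candidate_score - current_score)
                            / (c * max(abs(current_score), 1e-9)))
            if rng.random() < prob:
                current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = set(current), current_score
    return best, best_score

# Toy score (an assumption, not a real model): features {0, 1, 2} are
# informative; every extra feature costs a small penalty
def toy_score(features):
    return len(features & {0, 1, 2}) / 3 - 0.01 * len(features - {0, 1, 2})

best, best_score = sa_feature_selection(toy_score, n_features=10, iterations=300)
print(best, best_score)
```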
Now, how is all of this related to the ‘annealing’ concept of a cooling temperature, you might wonder?
The ‘acceptance probability’ takes care of that. The formula for acceptance probability is designed in such a way that, as the number of iterations increase, the probability of accepting bad performance comes down. As a result, fewer changes are accepted.
Let’s look at the formula now.
Formula for acceptance probability (a.k.a. Metropolis acceptance criterion)
The formula for acceptance probability is as follows:

Acceptance Probability = exp( i × (new − old) / (c × old) )

where:
- i = no. of iterations,
- c = a constant that controls the amount of perturbation that can happen,
- old = old score,
- new = new score.
The acceptance probability can be understood as a function of time and change in performance with a constant ‘c’, which is used to control the rate of perturbation happening in the features. Usually, ‘c’ is set to be 1.
Working example of acceptance probability formula:
Consider that the problem at hand is to optimize the accuracy of a machine learning model. Assume that the previous solution scores 77% (0.77) and the current solution scores 73% (0.73). Plugging these into the formula with c = 1:

- at iteration i = 1: exp(1 × (0.73 − 0.77) / 0.77) ≈ 0.9494
- at iteration i = 5: exp(5 × (0.73 − 0.77) / 0.77) ≈ 0.7713
- at iteration i = 10: exp(10 × (0.73 − 0.77) / 0.77) ≈ 0.5948

As you can see, after 10 iterations the acceptance probability came down to roughly 0.59. Thus, as the number of iterations increases, the chance of accepting a worse solution decreases.
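As a sketch, the criterion can be written as a small function. The function name and the exact form used here, exp(i × (new − old) / (c × old)) with c = 1, are assumptions in the standard Metropolis style, not code from any particular library:

```python
import math

def acceptance_probability(old, new, i, c=1.0):
    # Probability of accepting a worse score `new` over `old` at iteration i.
    # The exponent is negative when new < old, so the result is below 1,
    # and it shrinks further as i grows.
    return math.exp(i * (new - old) / (c * old))

for i in (1, 5, 10):
    print(i, round(acceptance_probability(0.77, 0.73, i), 4))
# → 1 0.9494
# → 5 0.7713
# → 10 0.5948
```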
Now, why does simulated annealing accept worse-performing feature sets?
If the algorithm only ever accepted the best-performing feature sets, the probability of getting stuck in a local optimum would be very high. By occasionally accepting poorer-performing feature sets, the algorithm gains a better chance of finding a global optimum that may exist elsewhere. And since the acceptance probability decreases with time (iterations), the search gradually settles back toward the best-known solution, so the chance of ending on a worse-performing result diminishes.
When the temperature is high, the chance of a worse-performing feature set being accepted is high; as the number of iterations goes up, the temperature decreases, which in turn decreases that chance.
The intent here is that when the temperature is high, the algorithm moves freely in the search space, and as the temperature decreases the algorithm is nudged to converge towards the global optimum.
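A quick illustration of this effect, using the Metropolis criterion with a decaying temperature (the fixed worsening of 0.04 and the schedule values here are assumptions for demonstration):

```python
import math

def metropolis(difference, t):
    # Probability of accepting a move that worsens the objective by
    # `difference`, at temperature t
    return math.exp(-difference / t)

initial_temperature = 12
for i in (0, 10, 100, 1000):
    t = initial_temperature / (i + 1)  # temperature decays with iterations
    print(f"iteration {i:4d}  temperature {t:8.4f}  acceptance {metropolis(0.04, t):.4f}")
```

Early on, the acceptance probability for this worsening is near 1 (free movement); by the late iterations it is close to 0 (convergence).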
Implementing simulated annealing from scratch in Python
Consider the problem of hill climbing. A person named ‘Mia’ is trying to climb to the top of the hill, i.e. the global optimum. For this search hunt towards the global optimum, the required attributes will be:
- The area of the search space; let’s say the area is [-6, 6].
- A start point where ‘Mia’ can begin her search hunt.
- The step_size ‘Mia’ is going to take.
- The number of attempts ‘Mia’ is going to make; in the algorithm, this is the number of iterations.
- The steeps and slopes she climbs as she tries to reach the top/global optimum; in the algorithm, this maps to the temperature.
Another thing to note here is that both the temperature and no. of iterations arguments will be predefined.
Now, how would ‘Mia’ know whether her step is an improvement over the previous one?
Her steps are validated by a function called objective. The objective function here is the square of the step taken. This is because the steps ‘Mia’ takes are totally random within the bounds of the specified area, which means there is a chance of landing on negative values too; squaring makes every evaluation positive and comparable. If the new step is an improvement, she continues on that path.
If her step is not good: The acceptance probability/Metropolis acceptance criterion is calculated. After that, a random number will be generated using rand().
If random_number > Acceptance probability: Reject the new step Else: Accept the new step
Just an overview: in this code, the steps taken by ‘Mia’ will be random, not user-fed values. Each time there is an improvement in the steps taken towards the global optimum, the improved objective evaluation gets saved into a list called outputs.
The initial step is to import necessary libraries.
```python
from numpy import asarray, exp
from numpy.random import randn, rand, seed
from matplotlib import pyplot
```
Let’s define the objective function to evaluate the steps taken by Mia.
```python
# Objective function is the square of the step taken
def objective(step):
    return step ** 2
```
Now that the objective function is defined, ‘Mia’ needs a start point for her search hunt, right? Only with a start point can she progress towards the global optimum.
The code cell below gives us a random start point within the range of the area of the search space, and also evaluates it with the objective function.
```python
seed(1)
area = asarray([[-6.0, 6.0]])
# area[:, 0] = -6.0, area[:, 1] = 6.0
# rand(len(area)) generates a random number within [0, 1); len(area) is 1
start_point = area[:, 0] + rand(len(area)) * (area[:, 1] - area[:, 0])
print(start_point)
print('start_point=', start_point)
print('objective function evaluation of start point=', objective(start_point))
```
```
[-0.99573594]
start_point= [-0.99573594]
objective function evaluation of start point= 0.9914900693154707
```
The printed values show the start point and its objective function evaluation.
seed(1) seeds the pseudorandom number generator. By using seed(1), the same random numbers are generated each time the code cell is run, which makes the results reproducible.
Let’s now define the simulated annealing algorithm as a function. The parameters needed are:
- The objective function.
- The area of the search space.
- The number of iterations.
- The step size.
- The temperature.
Inside the function, the start_point is initialized and then evaluated by the objective function; the result is stored in start_point_eval.
```python
def sa(objective, area, iterations=1200, step_size=0.1, temperature=12):
    # Generate a random start point for the search hunt
    start_point = area[:, 0] + rand(len(area)) * (area[:, 1] - area[:, 0])
    # Evaluate the start_point
    start_point_eval = objective(start_point)
```
The objective function evaluation of the start point (start_point_eval) needs to be stored so that the progress can be seen each time an improvement happens.
```python
    # Store the start point and its objective evaluation into
    # mia_start_point and mia_start_eval
    mia_start_point, mia_start_eval = start_point, start_point_eval
    # This empty list will be updated over time once looping starts
    outputs = []
```
‘Mia’s start point and its evaluation are stored into mia_start_point and mia_start_eval. outputs is an empty list that will get updated over time once looping starts. As of now, ‘Mia’ has started at a point and evaluated that point. Now she has to take her first step of the search hunt, and to do so, a for loop is defined ranging from 0 to the iteration number we specify.
```python
    # Loop from 0 to the iteration number we specify
    for i in range(iterations):
        # First step taken by Mia
        mia_step = mia_start_point + randn(len(area)) * step_size
```
The step follows a Gaussian distribution, where the mean is the current point and the standard deviation is the step_size. In a nutshell, this means almost all steps taken will land within 3 * step_size of the current point.
```python
        # Evaluate the step
        mia_step_eval = objective(mia_step)
```
This new point must be checked against the current best point: if it is better, replace the best point with the new point and append the new evaluation to the outputs list. When the new point is better, the following get printed: (i) iteration count, (ii) previous best, (iii) new best.
```python
        # Check whether the new step is better than the current best.
        # If better, the best point is replaced with the new point.
        if mia_step_eval < start_point_eval:
            start_point, start_point_eval = mia_step, mia_step_eval
            # The evaluation gets appended into the list
            outputs.append(start_point_eval)
            # Print the iteration number, best_so_far and new_best
            print('iteration Number = ', i, ' ', 'best_so_far = ', start_point,
                  ' ', 'new_best = ', start_point_eval)
```
If the new point isn’t a promising solution, the difference between the objective function evaluation of the new step (mia_step_eval) and the current working solution (mia_start_eval) is calculated. One of the popular ways of calculating the temperature is the “fast simulated annealing” schedule, which is as follows:
temperature = initial_temperature / (iteration_number + 1)
```python
        difference = mia_step_eval - mia_start_eval
        # Temperature is calculated with the fast annealing schedule
        t = temperature / float(i + 1)
```
difference gives us the gap between the old point and the new point, which is needed to calculate the acceptance probability/Metropolis acceptance criterion, i.e. the probability of accepting a point that performs worse than the current one.
```python
        # Acceptance probability is calculated
        mac = exp(-difference / t)
```
Then a random number is generated using rand(). If the random number is greater than the acceptance probability, the new point is rejected; if it is less than the acceptance probability, the new point is accepted.
```python
        if difference < 0 or rand() < mac:
            # Store the values
            mia_start_point, mia_start_eval = mia_step, mia_step_eval
    # The return statement is indented at function level because it
    # belongs to sa(), not to the loop
    return [start_point, start_point_eval, outputs]
```
The last step is to pass values to the parameters of the simulated annealing function.
```python
seed(1)
# define the area of the search space
area = asarray([[-6.0, 6.0]])
# initial temperature
temperature = 12
# define the total no. of iterations
iterations = 1200
# define the maximum step_size
step_size = 0.1
# perform the simulated annealing search
start_point, start_point_eval, outputs = sa(objective, area, iterations, step_size, temperature)
```
Putting all of this together into a single code cell, this is how the final code looks:
```python
from numpy import asarray, exp
from numpy.random import randn, rand, seed
from matplotlib import pyplot

# Define objective function
def objective(step):
    return step ** 2.0

# Define simulated annealing algorithm
def sa(objective, area, iterations, step_size, temperature):
    # Create and evaluate the initial point
    start_point = area[:, 0] + rand(len(area)) * (area[:, 1] - area[:, 0])
    start_point_eval = objective(start_point)
    # Assign the previous solution and its evaluation
    mia_start_point, mia_start_eval = start_point, start_point_eval
    outputs = []
    # Initialise mac so the first print cannot fail before it is computed
    mac = 1.0
    for i in range(iterations):
        # Step taken by Mia
        mia_step = mia_start_point + randn(len(area)) * step_size
        mia_step_eval = objective(mia_step)
        if mia_step_eval < start_point_eval:
            start_point, start_point_eval = mia_step, mia_step_eval
            # Append the new best value into the outputs list
            outputs.append(start_point_eval)
            print('Acceptance Criteria = %.5f' % mac, ' ', 'iteration Number = ', i,
                  ' ', 'best_so_far = ', start_point, ' ',
                  'new_best = %.5f' % start_point_eval)
        difference = mia_step_eval - mia_start_eval
        t = temperature / float(i + 1)
        # Calculate the Metropolis Acceptance Criterion / acceptance probability
        mac = exp(-difference / t)
        # Check whether the new point is acceptable
        if difference < 0 or rand() < mac:
            mia_start_point, mia_start_eval = mia_step, mia_step_eval
    return [start_point, start_point_eval, outputs]

seed(1)
# define the area of the search space
area = asarray([[-6.0, 6.0]])
# initial temperature
temperature = 12
# define the total no. of iterations
iterations = 1200
# define the maximum step_size
step_size = 0.1
# perform the simulated annealing search
start_point, start_point_eval, outputs = sa(objective, area, iterations, step_size, temperature)

# plot the improvements found during the search
pyplot.plot(outputs, 'ro-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Objective Function Evaluation')
pyplot.show()
```
```
Acceptance Criteria = 0.76232   iteration Number = 37    best_so_far = [-0.95340594]   new_best = 0.90898
Acceptance Criteria = 0.87733   iteration Number = 39    best_so_far = [-0.91563305]   new_best = 0.83838
Acceptance Criteria = 1.44710   iteration Number = 40    best_so_far = [-0.85680363]   new_best = 0.73411
Acceptance Criteria = 1.42798   iteration Number = 41    best_so_far = [-0.8221177]    new_best = 0.67588
Acceptance Criteria = 1.22608   iteration Number = 42    best_so_far = [-0.68541443]   new_best = 0.46979
Acceptance Criteria = 2.09273   iteration Number = 43    best_so_far = [-0.61804282]   new_best = 0.38198
Acceptance Criteria = 1.16478   iteration Number = 50    best_so_far = [-0.42564785]   new_best = 0.18118
Acceptance Criteria = 0.81414   iteration Number = 66    best_so_far = [-0.35632231]   new_best = 0.12697
Acceptance Criteria = 1.74595   iteration Number = 67    best_so_far = [-0.33780667]   new_best = 0.11411
Acceptance Criteria = 1.02155   iteration Number = 72    best_so_far = [-0.3368772]    new_best = 0.11349
Acceptance Criteria = 1.20897   iteration Number = 75    best_so_far = [-0.29671621]   new_best = 0.08804
Acceptance Criteria = 1.47133   iteration Number = 88    best_so_far = [-0.29346186]   new_best = 0.08612
Acceptance Criteria = 1.78859   iteration Number = 89    best_so_far = [-0.2525718]    new_best = 0.06379
Acceptance Criteria = 0.29369   iteration Number = 93    best_so_far = [-0.20893326]   new_best = 0.04365
Acceptance Criteria = 1.68933   iteration Number = 94    best_so_far = [-0.12724854]   new_best = 0.01619
Acceptance Criteria = 1.24284   iteration Number = 95    best_so_far = [-0.00883141]   new_best = 0.00008
Acceptance Criteria = 0.99774   iteration Number = 102   best_so_far = [0.00581387]    new_best = 0.00003
Acceptance Criteria = 1.05301   iteration Number = 118   best_so_far = [0.00137051]    new_best = 0.00000
Acceptance Criteria = 0.99611   iteration Number = 138   best_so_far = [0.0009528]     new_best = 0.00000
Acceptance Criteria = 1.33458   iteration Number = 158   best_so_far = [-0.00047896]   new_best = 0.00000
Acceptance Criteria = 0.50157   iteration Number = 473   best_so_far = [0.00045742]    new_best = 0.00000
Acceptance Criteria = 1.02387   iteration Number = 482   best_so_far = [0.00010925]    new_best = 0.00000
```
This output shows in which iteration each improvement happened, the previous best point, and the new best point. The graph shows about 22 improvements (red circles) as the algorithm approaches the global optimum. There are stretches with no big improvements, while elsewhere improvements come in quick succession.
This is the Python implementation of the simulated annealing algorithm. Feel free to change the step_size and other inputs to see what you get.
Advantages and limitations of simulated annealing
Some of the advantages worth mentioning are:
- The simulated annealing algorithm can work with arbitrary systems and cost functions.
- It is relatively easy to code, even when the problem at hand is complex.
- By occasionally accepting worse solutions, it avoids getting stuck in local optima, giving a good chance of finding a high-quality (near-optimal) solution.
Some limitations of this technique are:
- This technique cannot tell whether it has found the optimal solution or not.
- In problems with few local minima, this method is unnecessary; gradient descent would do the job.
- It tends to be a very time-consuming procedure.