Welcome to QUICKLB’s documentation!
QUICKLB is a very efficient library that sacrifices optimality for speed. It shines in distributed applications that have unpredictable unbalanced computational peaks in their domains. To begin using QUICKLB we refer you to the Installation section and the example in the Example usage section.
API documentation can be found in the Loadbalancer section
Installation
Quick and easy installation can be done with conda by creating an environment from the included conda_env.yml file and installing quicklb into this environment:
$ conda create -c conda-forge -n quicklb --file conda_env.yml
$ conda activate quicklb
$ python setup.py install
Manual installation
Make sure you have the folling packages available:
cmake >= 3.22
gfortran >= 11.2 or ifort >= 2022.1
openmpi >= 4.1.2
For python functionality:
python >= 3.9
numpy >= 1.20
mpi4py >= 3.1.1
matplotlib >= 3.5.1
scipy >= 1.8
The library can be created through cmake directly, the CMakeFile honors the CONDA_PREFIX if present, and tries to install there instead of the default install location:
$ mkdir build
$ cd build
$ cmake ../ -DFortran_COMPILER=mpif90 -DCMAKE_BUILD_TYPE={RELEASE,DEBUG} -DPYTHON_INTERFACE={off,on}
$ make
$ make install
Or it can be created throught the python setuptools:
$ python setup.py install
Example usage
To use QUICKLB it has to be imported first, if this fails please go to the Installation section.
import quicklb
Unbalanced reaction diffusion with QUICKLB
In this example we will create a reaction diffusion simulation which uses a loadbalancer for the chemistry calculation part (which is local), and a normal non-loadbalanced process for the diffusion process.
The easies and quickest way to get started is with the Loadbalancer object.
from quicklb import Loadbalancer
We have to create our domain, and work arrays and divide them over the number of processors, to reduce the amount of communication code necessary for the non-loadbalanced part we use shared-memory array (which are a bit more complex than normal numpy arrays). It also means that the example will only work if all processors have acces to the same memoryspace.
import numpy as np
from mpi4py import MPI
import matplotlib.pyplot as plt
import scipy.integrate
comm = MPI.COMM_WORLD
def allocate_shared_array(size):
nbytes = np.prod(size) * MPI.DOUBLE.Get_size()
if comm.Get_rank() != 0:
nbytes = 0
win = MPI.Win.Allocate_shared(nbytes, MPI.DOUBLE.Get_size(),comm=comm)
buf, _ = win.Shared_query(0)
return np.ndarray(size,dtype=np.float64,buffer=buf), win
# Using shared array we cheat a bit to make it simpler
size = 30
A, winA = allocate_shared_array((size,size))
B, winB = allocate_shared_array((size,size))
# Calculate which part of the shared array is "local" to this processor
remainder = size%comm.Get_size()
slice_size = size//comm.Get_size()
start = int(slice_size*comm.Get_rank() + min(comm.Get_rank(), remainder))
end = int(slice_size*(comm.Get_rank() + 1) + min(comm.Get_rank()+1, remainder))
A_local = A[start:end,:]
This gives us a slice on each processor which we fill up with an initial condition:
if comm.Get_rank() == 0:
A[:,:] = 1.
B[:,:] = 0.
A[size//2-3:size//2+3,size//2-10:size//2+10] = 0.50
Secondly we need a structure to encapsulate these points such that the loadbalancer knows which data to communicate, how to encapsulate, and how to compute an interation
class Cell():
def __init__(self,AA=None,BB=None):
self.A = np.array([float('nan')],dtype=np.float64)
self.B = np.array([float('nan')],dtype=np.float64)
if AA != None:
self.A = AA
if BB != None:
self.B = BB
def compute(self):
a = self.A[0]
b = self.B[0]
soln =scipy.integrate.solve_ivp(Cell.deriv,(0,1),(a,b),
method='BDF', rtol=0.0001)
self.A[0] = soln.y[0][-1]
self.B[0] = soln.y[1][-1]
def deriv(t,y):
a, b = y
abb = a*b*b
adot = - abb + 0.0545 * ( 1 - a )
bdot = + abb - ( 0.0545 + 0.062 ) * b
return adot, bdot
def serialize(self):
return np.concatenate( (self.A.view(dtype=np.byte),
self.B.view(dtype=np.byte)))
def deserialize(self, buffer):
self.A[0] = buffer[0:8].view(dtype=np.float64)
self.B[0] = buffer[8:16].view(dtype=np.float64)
This Cell class can either point to data when local, or store data when offloaded, furthermore a compute function is provided
We must create a list of cells that we want loadbalanced
cells = [Cell(A_local[i,j:j+1], B_local[i,j:j+1])
for j in range(size) for i in range(A_local.shape[0])]
Then we can create the Loadbalancer:
lb = Loadbalancer(cells, "SORT", 0.01, 0.01, 3)
As an intermezzo we also need a diffusion operator which is separate from the loadbalancer:
A_diff = np.empty(shape=(A_local.shape[0]+2,A_local.shape[1]+2))
B_diff = np.empty(shape=(B_local.shape[0]+2,B_local.shape[1]+2))
def diffuse():
# Perform diffusion over the local field + boundaries (this is where we cheat with shared memory arrays
A_diff[1:-1,1:-1] = A_local
# local periodic boundaries
A_diff[0,1:-1] = A[start-1,:]
A_diff[-1,1:-1] =A[end%A.shape[0],:]
# "communicate" boundaries,
A_diff[1:-1,0] = A_local[:,-1]
A_diff[1:-1,-1] =A_local[:,0]
B_diff[1:-1,1:-1] = B_local
# local periodic boundaries
B_diff[0,1:-1] = B[start-1,:]
B_diff[-1,1:-1] =B[end%B.shape[0],:]
# "communicate" boundaries,
B_diff[1:-1,0] = B_local[:,-1]
B_diff[1:-1,-1] =B_local[:,0]
LA = A_diff[ :-2, 1:-1] + \
A_diff[1:-1, :-2] - 4*A_diff[1:-1, 1:-1] + A_diff[1:-1, 2:] + \
+ A_diff[2: , 1:-1]
LB = B_diff[ :-2, 1:-1] + \
B_diff[1:-1, :-2] - 4*B_diff[1:-1, 1:-1] + B_diff[1:-1, 2:] + \
+ B_diff[2: , 1:-1]
return LA,LB
And with this loadbalancer + diffusion operator we can iterate to our heart’s content:
number_of_iterations = 100
lb.partition()
for i in range(number_of_iterations):
winA.Sync()
winB.Sync()
comm.Barrier()
LA,LB = diffuse()
# Chemistry
if i%10 == 1:
lb.partition()
lb.partitioning_info()
if i%10 == 0:
if comm.Get_rank() == 0:
plt.imshow(B)
plt.show(block=False)
plt.pause(0.03)
lb.iterate()
A_local[:,:] = A_local + 0.1*LA
B_local[:,:] = B_local + 0.05*LB
Lastly we can create visualise this quite easily with matplotlib!
winB.Sync()
comm.Barrier()
if comm.Get_rank() == 0:
plt.imsave("gray_scott_rd.png",B)
plt.show()
Timings |
Normal |
Load Balanced |
---|---|---|
2 MPI procs |
50.31 s |
51.63 s |
3 MPI procs |
46.02 s |
41.25 s |
4 MPI procs |
39.54 s |
33.04 s |

Loadbalancer
- class quicklb.Loadbalancer(objects, algorithm, max_abs_li, max_rel_li, max_it)
- __init__(objects, algorithm, max_abs_li, max_rel_li, max_it)
Create a Loadbalancer object
- Parameters
objects (list<Cell>) – A list of objects which implement all functions specified in the mock Cell object.
algorithm (string) – Specifies the loadbalancing algorithm, valid values are: - “GREEDY” - “SORT” - “SORT2”
max_abs_li (float) – specify the maximum absolute load-imbalance for the partitioning algorithm. When this threshold is reached the partitioning concludes 0.0 would be no load imbalance, 1. would be a load imbalance of 100%
max_rel_li (float) – specify the maximum relative load-imbalance for the partitioning algorithm. Is not used in the GREEDY partitioning scheme. When this threshold is reached the partitioning concludes.
max_it (int) – Maximum number of iterations for the loadbalancing algorithm. Set this somewhere between 1 and the number of processors used (or larger, it is limited to the maximum number of processors anyway)
- Returns
A very fresh loadbalancing object !
- Return type
- iterate()
Perform a single iteration, with computation offloading. this eventually calls compute on every single cell
- Return type
None
- partition()
Call this function to (re)-partition the cells over the processors, it is recommended to call this once before iterate()
- Return type
None
- partitioning_info(detailed=False)
Writes out partitioning info to the loadbalance.info file
- Parameters
detailed (bool) – When set to True write detailed information about every cell to loadbalance.info
- Return type
None
Abstract Cell Class
this is a possible implementation of a cell class that needs to be fed to the loadbalancer
import struct
class Cell():
data: float = 42.
def compute(self):
self.data = 1./data
def serialize(self):
return struct.pack('d',self.data)
def deserialize(self, buffer):
self.data = struct.unpack('d',buffer.tobytes())
- class quicklb.Cell
Example Cell for the Loadbalancer class that needs some work to be done
- compute()
Function which is called whenever work needs to be done, as this cell can be offloaded at the time of computation, it is recommended to only use (de)serialized data for the computation, otherwise results will be undefined
- deserialize(buffer)
Function which deserializes the object.
- Parameters
buffer (1D numpy bytes array) –
- Return type
None
- serialize()
Function which serializes the object.
- Return type
Bytes-like object