Distributed Simulation of Statevectors and Density Matrices (Jones et al. 2023)
I. Introduction
A. Notation
Bit Extraction Notation
bit: 3 2 1 0
1 1 0 1Negated Bit Notation
If
Flipped Index Notation
If
Multiple-bit Flip Notation
Similar to flipped index notation, except
Statevector Notation
For a
Computational bases are represented using binary decimal notation. For a
And it is stored as
This is the statevector, where each index corresponds to the basis state
Prefix & Suffix Bits
Consider a node to be "one computer/machine participating in the simulation. Distributed simulation defines
Every amplitude index
Some of those bits identify the node
The remaining bits identify the amplitudes position within that node
Suppose
q6 q5 q4 q3 q2 q1 q0Now suppose
r1 r0 q4 q3 q2 q1 q0The remaining suffix bits are local index bits, defined as
Suffix/local bits can be manipulated entirely within one node
Prefix/rank bits requires communication with other nodes, usually reassigning the statevector to a new node
global index
Moreover, global amplitude index
i = (r << lambda) + jSymbol | Meaning |
|---|---|
value of bit | |
! | flipped value of bit |
index after flipping bit | |
index after flipping bits | |
amplitude stored at index | |
node rank | |
local index | |
number of nodes | |
number of suffix/local bits | |
prefix bits | encoded in rank |
suffix bits | encoded in local index |
II. Local Statevector Algorithms
A. Local Costs
"Before we can discuss the challenges of distributed simulation, we must first understand the simpler problem of local (i.e. non-distributed), shared memory simulation." (Section II, first paragraph)
Recall that the simulation stores a full-state vector
This notation gives a complete description of a quantum state, but the memory costs grow exponentially as
Considering C/C
++ stores complex doubles as16 bytes, forN qubits we would need
Dense mixed states are even worse
Full statevector simulators are useful for researchers because resource utilization and runtime costs are roughly the same across circuits of equal size, irregardless of state properties, making resource prediction trivial
But there are two main drawbacks of full state simulation, which is that it requires simultaneously
Storing
2 complex doubles(α) Simulating
n qubit operators, which requiredO(2) floating point operations
Thus, the authors use four local performance metrics to measure these costs
Basic Operations bops
Bitwise operations, integer arithmetic, logical checks, indexing, etc.
i ^ (1ULL << targ)
(i >> targ) & 1
if (ctrlBit == 1)Floating Point Operations flops
Arithmetic operations on real or complex values
m00*beta + m01*gamma;
m10*beta + m11*gamma;NOTE: The paper distinguishes bops from flops because complex arithmetic is much for expensive than logical or bit manipulation operations
Memory memory
Memory overhead, i.e. the size of temporary data structures created during an algorithm
This excludes persistent pre-allocated memory dedicated to the quantum state representation
one-target gate:
O(1) extra memorymany-target gate:
O(2) temporary array
Writes writes
The number of writes to heap/main memory, especially writes to statevectors
Moving amplitudes to and from memory is more often the bottleneck as opposed to performing arithmetic operations
Thus if an algorithm writes to fewer amplitudes, it may be much faster than other algorithms even if the math looks the same
Cost tags are summarized as:
[a bops][b flops][c memory][d writes]
NOTE: These metrics of course do not capture all local simulation considerations. Presented algorithms are optimized to avoid branching, enable auto-vectorization, cache efficiently, and avoid false sharing when possible in multithreaded settings. (Section
B. One Target Gate
Let
!
This final equations reveals that applying a one target gate operator to a full-statevector linearly combines pairs of amplitudes. Thus, when applying the operator
Since the pairs of amplitudes are