Distributed Simulation of Statevectors and Density Matrices (Jones et al. 2023)

I. Introduction

A. Notation

Bit Extraction Notation

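i[t] denotes bit t of the binary representation of basis index i, counting from 0 at the least significant (rightmost) bit. Each bit of i labels one qubit, so i[t] is the computational-basis value of qubit t

i = 13 = 1101 (binary)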

bit:  3 2 1 0
      1 1 0 1

i[2]=1

Negated Bit Notation

If i[t]=1 is the value of bit t in index i, then !i[t] denotes its logical negation, so that !i[t]=0, and vice versa

Flipped Index Notation

$i_{\neg t}$ denotes the index obtained from $i$ by flipping bit $t$ and leaving every other bit unchanged. For example, with $i = 13 = 1101$, $i_{\neg 1} = 1111 = 15$

Multiple-bit Flip Notation

Similar to flipped index notation, except $i_{\neg\{t_1,t_2\}}$ denotes the index obtained by flipping both bits $t_1$ and $t_2$ of $i$
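
The four notations above map directly onto bitwise operations. The following is a minimal sketch, not code from the paper; the alias `Index` and the helper names `getBit`, `getNegatedBit`, `flipBit` and `flipBits` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

using Index = std::uint64_t;  // hypothetical alias for a basis-state index

// i[t]: value (0 or 1) of bit t in index i.
inline int getBit(Index i, int t) {
    assert(t >= 0 && t < 64);
    return static_cast<int>((i >> t) & 1ULL);
}

// !i[t]: the negated value of bit t.
inline int getNegatedBit(Index i, int t) {
    return 1 - getBit(i, t);
}

// i with bit t flipped (flipped index notation).
inline Index flipBit(Index i, int t) {
    assert(t >= 0 && t < 64);
    return i ^ (1ULL << t);
}

// i with every listed bit flipped (multiple-bit flip notation).
inline Index flipBits(Index i, std::initializer_list<int> targets) {
    for (int t : targets) i = flipBit(i, t);
    return i;
}
```

With i = 13 = 1101, `getBit(13, 2)` gives 1, `flipBit(13, 1)` gives 1111 = 15, and `flipBits(13, {0, 3})` gives 0100 = 4.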

Statevector Notation

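For an N-qubit state

$$|\psi\rangle_N = \sum_{i=0}^{2^N-1} \alpha_i |i\rangle_N$$

Computational basis states are labelled by the decimal value of their bit string. For a 3-qubit state

$$|0\rangle_3 = |000\rangle,\ |1\rangle_3 = |001\rangle,\ |2\rangle_3 = |010\rangle,\ |3\rangle_3 = |011\rangle,\ \cdots,\ |7\rangle_3 = |111\rangle$$

And it is stored as

$$|\psi\rangle_N = [\alpha_0, \alpha_1, \cdots, \alpha_{2^N-1}]$$

This is the statevector, where the amplitude at array index i belongs to basis state $|i\rangle_N$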
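
As a rough illustration of this storage layout, the sketch below allocates all $2^N$ amplitudes in one contiguous array and initialises them to $|0\rangle_N$. The names `Amp`, `Statevector` and `makeZeroState` are assumptions made for this example and do not come from the paper.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;      // one amplitude alpha_i
using Statevector = std::vector<Amp>;  // element i holds alpha_i

// Allocate all 2^N amplitudes, initialised to the basis state |0>_N.
Statevector makeZeroState(int numQubits) {
    assert(numQubits >= 0 && numQubits < 63);  // keeps the shift below well defined
    Statevector psi(std::size_t{1} << numQubits, Amp(0.0, 0.0));
    psi[0] = Amp(1.0, 0.0);                    // alpha_0 = 1, every other alpha_i = 0
    return psi;
}
```

For example, `makeZeroState(3)` returns an array of 8 amplitudes whose only non-zero entry is at index 0.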

Prefix & Suffix Bits

Consider a node to be "one computer/machine participating in the simulation". Distributed simulation uses W nodes and assumes

$W = 2^w$ (the number of nodes is a power of 2)

Every amplitude index i has N bits total:

Some of those bits identify the node
The remaining bits identify the amplitude's position within that node

Suppose N=6. The bits of an amplitude index then look like

q5 q4 q3 q2 q1 q0

Now suppose W=8. That is, we have 8 nodes to work with. Since $W = 2^w = 2^3$, we need w=3 bits to identify the node, i.e. the prefix bits. These prefix bits are the rank bits

r2 r1 r0 q2 q1 q0

The remaining suffix bits are the local index bits, of which there are $\lambda = N - w$ (here $\lambda = 3$). The key difference is

Suffix/local bits can be manipulated entirely within one node
Prefix/rank bits require communication with other nodes, because flipping a prefix bit changes the rank, so the paired amplitude lives on a different node

global index = (prefix bits)(suffix bits)

Moreover, global amplitude index i can be decomposed into a combination of the node rank r bit-shifted to the left by λ and the local-index j

i = (r << lambda) + j
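
The decomposition can be sketched in both directions. This is a minimal illustration under the power-of-2 assumption above; all function names (`globalIndex`, `rankOf`, `localOf`, `isPrefixQubit`, `pairRank`) are hypothetical. `pairRank` relies only on the observation that flipping bit t of i, when t is at least lambda, flips bit t - lambda of the rank r.

```cpp
#include <cassert>
#include <cstdint>

using Index = std::uint64_t;  // hypothetical alias for an amplitude index

// Global index from node rank r and local index j: i = (r << lambda) + j.
inline Index globalIndex(Index rank, Index local, int lambda) {
    assert(lambda >= 0 && lambda < 64 && (local >> lambda) == 0);
    return (rank << lambda) + local;
}

// Prefix bits: the rank of the node holding amplitude i.
inline Index rankOf(Index i, int lambda) { return i >> lambda; }

// Suffix bits: the position of amplitude i within its node.
inline Index localOf(Index i, int lambda) { return i & ((1ULL << lambda) - 1); }

// A qubit at or above bit lambda is a prefix (rank) qubit.
inline bool isPrefixQubit(int t, int lambda) { return t >= lambda; }

// Rank of the node holding the amplitude paired through prefix bit t.
inline Index pairRank(Index rank, int t, int lambda) {
    assert(isPrefixQubit(t, lambda));
    return rank ^ (1ULL << (t - lambda));
}
```

With N=6 and W=8 (lambda=3), index 45 = 101101 sits on rank 5 at local index 5, and a flip of qubit 4 pairs rank 5 with rank 7.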

Symbol	Meaning
$i_t$ (or i[t])	value of bit t in index i
$!i_t$	negated value of bit t
$i_{\neg t}$	index after flipping bit t
$i_{\neg\{t_1,t_2\}}$	index after flipping bits $t_1$, $t_2$
$\alpha_i$	amplitude stored at index i
r	node rank
j	local index
$W = 2^w$	number of nodes (w prefix/rank bits)
$\lambda = N - w$	number of suffix/local bits
prefix bits	encoded in rank
suffix bits	encoded in local index

II. Local Statevector Algorithms

A. Local Costs

"Before we can discuss the challenges of distributed simulation, we must first understand the simpler problem of local (i.e. non-distributed), shared memory simulation." (Section II, first paragraph)

Recall that the simulation stores a full statevector $|\psi\rangle = \sum_{i=0}^{2^N-1} \alpha_i |i\rangle$ as an array

$$\psi = \{\alpha_0, \alpha_1, \cdots, \alpha_{2^N-1}\}$$

This notation gives a complete description of a quantum state, but the memory costs grow exponentially as N increases

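Since C/C++ stores a double-precision complex number in 16 bytes (two 8-byte doubles), an N-qubit statevector needs

$2^N$ amplitudes × 16 bytes $= 2^{N+4}$ bytes

Dense mixed states (density matrices) are even worse

$2^N \times 2^N$ amplitudes × 16 bytes $= 2^{2N+4}$ bytes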
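
To make the exponential growth concrete, the sketch below evaluates both memory formulas. The function names are hypothetical, and the 16-byte amplitude size is checked as an explicit assumption rather than taken for granted.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>

// Assumption: one amplitude is a complex double occupying 16 bytes.
static_assert(sizeof(std::complex<double>) == 16, "expected 16-byte complex doubles");

// Bytes for a dense N-qubit statevector: 2^N amplitudes.
inline std::uint64_t statevectorBytes(int numQubits) {
    assert(numQubits >= 0 && numQubits < 60);  // 2^N * 16 must fit in 64 bits
    return (std::uint64_t{1} << numQubits) * sizeof(std::complex<double>);
}

// Bytes for a dense N-qubit density matrix: 2^N x 2^N amplitudes.
inline std::uint64_t densityMatrixBytes(int numQubits) {
    assert(numQubits >= 0 && numQubits < 30);  // 2^(2N) * 16 must fit in 64 bits
    return (std::uint64_t{1} << (2 * numQubits)) * sizeof(std::complex<double>);
}
```

A 30-qubit statevector needs 2^34 bytes (16 GiB), which is the same footprint as a 15-qubit density matrix.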

Full statevector simulators are useful for researchers because resource utilization and runtime are roughly the same across circuits of equal size, regardless of state properties, making resource prediction trivial

But full-state simulation has two main drawbacks, since it simultaneously requires

Storing $2^N$ complex doubles ($\alpha_i$)
Simulating n-qubit operators, which for a general dense operator requires $O(2^{N+n})$ floating-point operations

Thus, the authors use four local performance metrics to measure these costs

Basic Operations `bops`

Bitwise operations, integer arithmetic, logical checks, indexing, etc.

i ^ (1ULL << targ)
(i >> targ) & 1
if (ctrlBit == 1)

Floating Point Operations `flops`

Arithmetic operations on real or complex values

m00*beta + m01*gamma;
m10*beta + m11*gamma;

NOTE: The paper distinguishes bops from flops because complex arithmetic is much more expensive than logical or bit-manipulation operations

Memory `memory`

Memory overhead, i.e. the size of temporary data structures created during an algorithm

This excludes persistent pre-allocated memory dedicated to the quantum state representation. For example:
- one-target gate: O(1) extra memory
- many-target gate (n targets): $O(2^n)$ temporary array

Writes `writes`

The number of writes to heap/main memory, especially writes to the statevector

Moving amplitudes to and from memory is more often the bottleneck than the arithmetic itself
Thus, an algorithm that writes fewer amplitudes may be much faster than another even if the math looks the same

Cost tags are summarized as:

[a bops][b flops][c memory][d writes]

NOTE: These metrics of course do not capture all local simulation considerations. Presented algorithms are optimized to avoid branching, enable auto-vectorization, cache efficiently, and avoid false sharing when possible in multithreaded settings. (Section II, pg. 5)

B. One Target Gate

Let $M_t$ denote a single-qubit gate operating on qubit t, described by a complex matrix

$$M_t = \begin{pmatrix} m_{00} & m_{01} \\ m_{10} & m_{11} \end{pmatrix}$$

$M_t$ gets applied to the target qubit t as follows

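$$M_t|\psi\rangle = \sum_{i=0}^{2^N-1} \alpha_i\, |i_{N-1}\rangle \cdots \left(M_t|i_t\rangle\right) \cdots |i_0\rangle$$

$$M_t|i_t\rangle = \begin{cases} m_{00}|0\rangle + m_{10}|1\rangle, & i_t = 0 \\ m_{01}|0\rangle + m_{11}|1\rangle, & i_t = 1 \end{cases}$$

$!i_t$ here means the logical bit flip of $i_t$, which I will also denote $j_t$. Both cases above then read $M_t|i_t\rangle = m_{i_t i_t}|i_t\rangle + m_{j_t i_t}|j_t\rangle$. Putting it together:

$$M_t|\psi\rangle = \sum_{i=0}^{2^N-1} \alpha_i\, |i_{N-1}\rangle \cdots \left(m_{i_t i_t}|i_t\rangle + m_{j_t i_t}|j_t\rangle\right) \cdots |i_0\rangle$$

$$M_t|\psi\rangle = \sum_{i=0}^{2^N-1} \left(\alpha_i\, m_{i_t i_t}\, |i\rangle + \alpha_i\, m_{j_t i_t}\, |i_{\neg t}\rangle\right)$$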

This final equation reveals that applying a one-target gate to a full statevector linearly combines pairs of amplitudes. Thus, when applying the operator $M_t$, the amplitudes of $|\psi\rangle$ are simultaneously modified to become (with $j_t = \,!i_t$)

$$\alpha_i \rightarrow \alpha_i\, m_{i_t i_t} + \alpha_{i_{\neg t}}\, m_{i_t j_t}$$

Since the paired amplitudes are $(\alpha_i, \alpha_{i_{\neg t}})$, and since $i_{\neg t} = i \pm 2^t$, paired amplitudes are always a distance $2^t$ apart
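
The pairwise update can be sketched as a loop over amplitude pairs separated by a stride of $2^t$. This is a deliberately simple, single-threaded illustration and not the paper's optimized algorithm; the names `Amp`, `Matrix2` and `applyOneTargetGate`, and the `m[row][col]` layout, are assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;
using Matrix2 = std::array<std::array<Amp, 2>, 2>;  // m[row][col], hypothetical layout

// Apply the 2x2 matrix m to target qubit t of a locally stored statevector.
void applyOneTargetGate(std::vector<Amp>& psi, int t, const Matrix2& m) {
    const std::size_t dim = psi.size();              // 2^N amplitudes
    assert(dim >= 2 && (dim & (dim - 1)) == 0);      // size must be a power of two
    assert(t >= 0 && t < 63);
    const std::size_t stride = std::size_t{1} << t;  // paired amplitudes are 2^t apart
    assert(stride < dim);                            // target qubit must exist

    // Each block of 2*stride consecutive indices holds `stride` pairs.
    for (std::size_t block = 0; block < dim; block += 2 * stride) {
        for (std::size_t k = 0; k < stride; ++k) {
            const std::size_t i0 = block + k;    // index with bit t = 0
            const std::size_t i1 = i0 + stride;  // same index with bit t = 1
            const Amp a0 = psi[i0];              // read both before overwriting
            const Amp a1 = psi[i1];
            psi[i0] = m[0][0] * a0 + m[0][1] * a1;
            psi[i1] = m[1][0] * a0 + m[1][1] * a1;
        }
    }
}
```

Both amplitudes of a pair are read before either is written, so the update is in place and needs only O(1) extra memory. For instance, applying the Pauli-X matrix to qubit 1 of the 2-qubit state |00> moves the unit amplitude from index 0 to index 2.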