High Resolution 3D Ultrasonic Breast Imaging by Time-Domain Full Waveform Inversion

Ultrasound tomography (UST) scanners allow quantitative images of the human breast's acoustic properties to be derived with potential applications in screening, diagnosis and therapy planning. Time domain full waveform inversion (TD-FWI) is a promising UST image formation technique that fits the parameter fields of a wave physics model by gradient-based optimization. For high resolution 3D UST, it holds three key challenges: Firstly, its central building block, the computation of the gradient for a single US measurement, has a restrictively large memory footprint. Secondly, this building block needs to be computed for each of the $10^3-10^4$ measurements, resulting in a massive parallel computation usually performed on large computational clusters for days. Lastly, the structure of the underlying optimization problem may result in slow progression of the solver and convergence to a local minimum. In this work, we design and evaluate a comprehensive computational strategy to overcome these challenges: Firstly, we exploit a gradient computation based on time reversal that dramatically reduces the memory footprint at the expense of one additional wave simulation per source. Secondly, we break the dependence on the number of measurements by using source encoding (SE) to compute stochastic gradient estimates. Also we describe a more accurate, TD-specific SE technique with a finer variance control and use a state-of-the-art stochastic LBFGS method. Lastly, we design an efficient TD multi-grid scheme together with preconditioning to speed up the convergence while avoiding local minima. All components are evaluated in extensive numerical proof-of-concept studies simulating a bowl-shaped 3D UST breast scanner prototype. Finally, we demonstrate that their combination allows us to obtain an accurate 442x442x222 voxel image with a resolution of 0.5mm using Matlab on a single GPU within 24h.

Ultrasound tomography (UST) scanners allow quantitative images of the human breast's acoustic properties to be derived with potential applications in screening, diagnosis and therapy planning. Time domain full waveform inversion (TD-FWI) is a promising UST image formation technique that fits the parameter fields of a wave physics model by gradient-based optimization. For high resolution 3D UST, it holds three key challenges: Firstly, its central building block, the computation of the gradient for a single US measurement, has a restrictively large memory footprint. Secondly, this building block needs to be computed for each of the 10 3 − 10 4 measurements, resulting in a massive parallel computation usually performed on large computational clusters for days. Lastly, the structure of the underlying optimization problem may result in slow progression of the solver and convergence to a local minimum. In this work, we design and evaluate a comprehensive computational strategy to overcome these challenges: Firstly, we exploit a gradient computation based on time reversal that dramatically reduces the memory footprint at the expense of one additional wave simulation per source. Secondly, we break the dependence on the number of measurements by using source encoding (SE) to compute stochastic gradient estimates. Also we describe a more accurate, TD-specific SE technique with a finer variance control and use a state-of-the-art stochastic LBFGS method. Lastly, we design an efficient TD multi-grid scheme together with preconditioning to speed up the convergence while avoiding local minima. All components are evaluated in extensive numerical proof-of-concept studies simulating a bowl-shaped 3D UST breast scanner prototype. Finally, we demonstrate that their combination allows us to obtain an accurate 442 × 442 × 222 voxel image with a resolution of 0.5 mm using Matlab on a single GPU within 24 hours.

Introduction
Screening using X-ray-based mammography has saved many lives by catching early-stage breast cancer. Nevertheless, there remain concerns related to tissue superposition, overdiagnosis, the effect of ionising radiation, the lower sensitivity in dense breasts, and the pain caused by breast compression [65]. Alternative breast imaging approaches have therefore been explored, including MRI, which is expensive and time-intensive [47], optical approaches such as photoacoustic tomography [63] and diffuse optical tomography [101] which are limited by the significant scatting of light by tissue, and ultrasound (US). In conventional US imaging, high frequency acoustic waves are transmitted into the breast from a hand-held linear array of typically 128 detection elements. The back-scattered waves are detected by the same array and used to form qualitative 2D images in real time. It is often used as an adjunct modality to supplement mammography [76,44], but the operator-dependence means that hand-held US requires an expert user. It is also time-consuming when imaging a whole breast, making it unsuitable as a screening modality.
In an attempt to overcome these drawbacks, automated breast US (ABUS) has been introduced, which uses a machine to scan a linear array perpendicular to its length over the breast to form a qualitative 3D reflection image [91] but it does not give a quantitative image of tissue properties. In contrast, a principal aim in ultrasound tomography (UST) of the breast [17,23] is to obtain accurate high resolution quantitative images of the tissue's acoustic properties. To achieve this, a pulse of US is transmitted into a pendant breast and measurements are made of the resulting time-varying acoustic field at multiple detector positions around the breast. This is repeated for multiple source positions (or, more generally, source distributions). The set of acoustic time series thus measured will incorporate both the unscattered (directly transmitted) and scattered parts of the acoustic field. To facilitate practical measurement times, large arrays are typically used, consisting of many hundreds of elements. In recent years, a number of such systems have been developed for UST breast imaging, including ones based on ring arrays [24,1,70], rotating planar or linear arrays [62,3,48], and bowl-shaped detector arrays [36,2]. Once the data has been measured, the challenge is to use it to form accurate quantitative images of the tissue's acoustic properties.

Image Reconstruction for Ultrasound Tomography
Typically three types of images have been produced from UST data: quantitative images of the acoustic attenuation and the speed of sound, and qualitative images indicating the 'strength' of the scattering at each point in the image [20]. To obtain these images requires a nonlinear inverse scattering problem to be tackled. The various approaches to solving this, which can be characterised by the approximations and assumptions they make, represent different trade-offs between computational efficiency and imaging accuracy. UST systems and measurement protocols are therefore often designed and optimized for a particular image formation method, unlike other imaging modalities where it is typically the other way around. There is an extensive literature on UST reconstruction, and it shares much common ground with other areas of wave-based imaging, such as in seismology and non-destructive testing. This is not the place for a comprehensive account of that history, so we give just a few examples of typical reconstruction approaches, focusing on quantitative reconstructions of sound speed as that is the topic of this paper.
Travel-time or time-of-flight tomography, widely used in geophysics [19], assumes the wave propagation can be accurately described in terms of rays: either straight rays [74,64,53,12,70], thereby neglecting scattering, diffraction and refraction, or bent rays [54,5,58,59,22,52], which can account for refraction to some extent. The measured travel times are linearly related to the integral of the slowness (reciprocal of the sound speed) along the rays, and the system matrix, once constructed, can be inverted using standard linear algebra approaches. Ray-based approaches can lead to robust and computationally efficient algorithms to recover smoothed sound speed images, but usually fine details are lost. Because the significance of diffraction depends on the relative sizes of the scatterer and wavelength, the ray approximation becomes more accurate the higher the frequency used. However, at higher frequencies the sources and detectors become more directional, reducing the possible ray paths across the imaged region, and the acoustic attenuation of biological tissues is greater.
Another approach, sometimes called diffraction tomography, considers the material heterogeneities to be weakly scattering perturbations over a background and uses Born or Rytov-type linearisations of the nonlinear scattering problem to arrive at linear methods of image formation. The free-space Green's function is used to estimate the effect of the perturbations on the field. In general for breast tissue, the weak scattering approximation does not hold, so adaptations of these linearisations are required to develop techniques suitable for quantitative breast imaging [66,49].
To solve the full nonlinear inverse scattering tomography problem requires an iterative approach. One way is to iteratively update both the forward operator and the unknown parameter in the linearised approximations above. In bent-ray reconstructions, for example, both the rays and the unknown sound speed are updated at each iteration [52]. In the Distorted Born Iterative Method [46,16], the Green's function and the unknown parameter are updated. This works well for low contrast media, but suffers from the failure of the Born series to converge for large contrast heterogeneities.
Before we introduce the particular approach we follow in this work in more detail, we remark that as in many fields, there has been a recent flurry of interest into the question of whether deep neural networks can be utilized for ultrasonic imaging [89], see, e.g., [21,4,29,30,61,28,31,32,55,56,100,89] for approaches to improve the speed and accuracy of 2D UST reconstructions. While it seems likely that neural networks will find wide application in UST, there are reasons to think that, as in other imaging modalities, they will complement rather than supersede existing approaches [6]. In particular in high resolution 3D settings, neural network components that link the object to be imaged to the measured data can hardly be fully learned. Efficient and accurate implementations will always contain components derived from physical models accurately describing wave propagation in biological tissue according to the best of our understanding.

Full Waveform Inversion
An approach that has gained increasing attention in recent years has been to formulate the image formation as a optimization problem, essentially fitting the measured data to a forward model of the acoustic propagation. This way of tackling the inverse scattering tomography problem is known as waveform tomography, model-based inversion or full waveform inversion (FWI). Because the framework of FWI allows all the measured data to be used in the inversion -the whole time series, or all the frequenciesit can in principle produce quantitatively correct, high resolution images of the acoustic properties with many fewer measurements than is required for travel-time tomography (which essentially reduces each time series to a single value). There is a sizable literature on FWI in the seismic community, see [86] for a recent review.
FWI constitutes a large-scale, non-convex, PDE-constrained optimization problem. In addition to the intrinsic difficulties of such optimization problems, iterative optimization schemes need to run numerous numerical wave simulations to compare artificial UST measurements caused by the current guess of the acoustic parameters to the real US data obtained. Depending on the scan design and the spatial resolution desired, this can present a very significant computational challenge.
Dedicated hardware design can circumvent this bottleneck to some extent. For instance, when using a 2D ring-array focused in the plane, 2D FWI can be used to reconstruct the 3D volume slice-by-slice [58,24,38], but the risk of out-of-plane artefacts and the poor image resolution perpendicular to the plane makes this solution less than ideal [73]. The approach taken by Wiskin et al. [94,93,95,96] has been to design the hardware with a planar geometry so that a computationally-efficient approximation to the wave simulation can be exploited.
There will be a highest usable frequency dictated by inherent attenuation through the breast and the sensitivity of the hardware, and this will limit the achievable image resolution. If this frequency is too high for the computations to be practicable then reducing the highest frequency, at the loss of the achievable image resolution, may be a compromise it is necessary to make. Goncharsky et al. [38,39,37,41,40,42] follow this idea and consider using low-frequency transducers. They further combine with a "2.5"D sliceby-slice reconstruction approach.
However, there is not always the freedom to choose the hardware. For instance, in the case we are interested in, the scanner has been designed to be optimal for photoacoustic tomography (PAT) of the breast [2] and uses a hemispherical array with unfocused, broadband US transducers. UST, which will be an adjunct modality, will be performed using the same array and we aim to obtain the same, isotropic sub-millimeter resolution across the whole breast volume. Another example of a fully 3D UST system, also a bowl array, is the KIT 3D USCT system [36,35], which was designed for obtaining both highquality reflectivity images and quantitative tissue images. To realize FWI for such UST scenarios, large high-performance computing clusters are typically used [38,13,40] and/or the computation takes days, which severely limits the range of clinical applications for which these methods are viable. A notable exception is given by the work presented in [8], which we were made aware of during the writing of this article.

Paper Scope and Structure
The key focus and contribution of our work is to develop and demonstrate a comprehensive computational strategy that achieves accurate high-resolution, 3D FWI for breast UST with a hemispherical array using only moderate computational resources (at least one GPU) and a self-imposed time limit of 24 hours. This would allow the method to fit into clinical trajectories for diagnosis and therapy planing at competitive costs compared to other modalities such as MRI.
In Section 2 we summarize the mathematical modeling of UST and FWI, and describe the state of the art. Section 3 illustrates several separate computational challenges of FWI for high-resolution 3D UST, and describes how novel and established solutions can be combined into a comprehensive inversion strategy. Extensive numerical studies in Section 4 first validate solutions to the separate sub-problems before the results of the whole scheme are demonstrated in Section 5. Finally, we discuss the results of our work and point to future directions of research in Section 6 before closing with conclusions in Section 7. Table 1 lists all commonly occurring abbreviations for reference.

Full Waveform Inversion for Ultrasound Tomography
This section presents the model used to describe ultrasound propagation in soft tissue (Sec. 2.1) and the mathematical formulation of the inverse problem of UST (Sec. 2.2). The FWI approach to tackling this problem, the progress that has been made, and the challenges that remain, are then described (Secs. 2.3 and 2.4).

Modelling Ultrasound Propagation in Soft Biological Tissues
It is common to model soft biological tissues as compressible fluids and describe acoustic propagation therein by linearising the equations of fluid dynamics derived from conservation laws and combining them with an empirical equation of state. Following this approach, a sufficiently low amplitude ultrasonic wave can be modelled by the following system of first order equations [82]: where v is the acoustic particle velocity (the time derivative of the acoustic displacement d), p and ρ are the acoustic pressure and acoustic density fluctuations, ρ 0 and c 0 are the ambient density and sound speed, and m is a mass source term. L models acoustic absorption and dispersion using fractional Laplacians: where τ and η denote absorption and dispersion proportionality coefficients, α 0 is the power law prefactor, and y is the power law exponent [81]. Equations (1)-(3) can be combined into the following lossy second-order wave equation for a heterogeneous medium: where we introduced a lossy d'Alembert operator A. As initial conditions we have p = 0 and ∂ t p = 0. Suitable data pre-processing typically allows us to model the propagation as if it occurs in an unbounded domain, i.e., no explicit boundary conditions are required (reflections from experimental equipment can be time-gated out and waves propagating deeper into the body are absorbed). See [80,82,7] for more detailed discussions on the mathematical modeling and [10,43,79] for recent extensions of FWI to UST involving hard tissues such as bone.

The Inverse Problem of Ultrasound Tomography
In most UST systems, movable arrays of US transducers are placed around the sample Γ ⊂ R 3 and acoustically coupled with it. For a single recording, a subset of transducers emit an acoustic pressure signal of length T s while the others are receiving for a time T > T s , chosen long enough such that the acoustic pressure remaining inside the scanner is not measurable anymore. A whole scan consists of n s such recordings, between which the transducer arrays may be moved. Here, we model a UST scan in the following way: we have n s temporal sources s i (x, t), with supp(s i ) ⊂ Γ c × [0, T s ] (in practice, each source is only defined over a small region corresponding to the front face of the transducer). The measurement process is modelled by applying a linear operator M i to the pressure fields p i (x, t), t ∈ [0, T ] resulting from (6), i.e., we have Here, u denotes the unknown acoustic material properties inside the sample Γ, in the extreme case u = (c 0 , ρ 0 , α 0 , y), where c 0 , ρ 0 and α 0 depend on x. We assume that all other properties are sufficiently well-known to ignore their modeling error. The general inverse problem of UST is to recover u (or features of it) given a noisy data set {f δ i } ns i=1 . A general discussion of the uniqueness and stability of this problem can be found in [78]. As mentioned in the Introduction, many different approaches have been used to tackle this problem.

Full Waveform Inversion
In FWI, we assume that for a given u, we can solve (7) to simulate data, i.e., f i (u) := M i A(u) −1 s i . Then, we try to optimize u such that the discrepancies between simulated and measured data become small: where D(f, g) is a loss function (see [98,25] for a discussion of suitable loss functions), and U is a set of constraints on u, e.g., bound constraints. First-order optimization schemes solve (8) using only the gradient ∇J (u), which is given by the sum over terms of the form ∇ u D M A −1 (u)s, f δ . Such expressions can be computed efficiently using the adjoint state method [71], which we will summarize here in a short, instructive way. One starts by differentiating both sides of (7) wrt u: Here, A −T is the operator solving the adjoint wave equation [71]. For instance, for u = c 0 and L = 0, we get with ∂A ∂c0 = − 2 where q * is the adjoint pressure field obtained by solving A(u)q * = s * , and s * (x, t) is the time-reversed data discrepancy gradient ∂D ∂f [71]. We will use D(f, g) = 1 2 f − g 2 2 , for which we have ∂D ∂f = f − g. Corresponding formulas for ρ 0 , α 0 and y are listed in Appendix A.
The focus of this work is the practical feasibility of computing a sufficiently accurate and spatially resolved approximation to the solution of the core problem (8) for 3D breast UST within a reasonable amount of time and with reasonable computing resources. For this reason, we will only include bound constraints on u via U in the main studies. To embed more sophisticated a-priori knowledge on u and account for noise models and model discrepancies in f δ , one needs to extend (8) by adding regularization functionals (e.g., to penalize unwanted spatial features of u) or additional constraints (see [67,26] and references therein). The additional noise stability study in Appendix F will demonstrate the use of a simple regularization strategy.
We introduced FWI in the time domain (TD) here, which is typically also the domain the measurements are obtained in. One can also formulate FWI in the frequency domain (FD). Then, computing J (u) and its gradient requires solving a direct and an adjoint Helmholtz equation for each frequency. If the emitted waves are narrow band and/or the transducers are only sensitive within a narrow band (modelled by M i ), this can be very advantageous but typically leads to high memory requirements if high resolution in 3D is desired (see [73] and references therein).
Depending on the scanning geometry, transducer characteristics, and measurement protocol and whether TD-FWI or FD-FWI is pursued, different numerical schemes are advantageous for solving the wave or Helmholtz equation. Direct methods such as finite-difference, pseudospectral, finite/spectral element methods or discontinous Galerkin methods aim to solve the PDEs directly while integral equation methods such as boundary element methods first transform them. Asymptotic or approximate methods such as geometrical optics or Gaussian beams only capture a part of the wave-matter interaction while typically being computationally much more efficient. See [33,50] for overviews on computational wave propagation methods for seismic FWI imaging.

Challenges of High Resolution 3D Time Domain Full Waveform Inversion
Many UST systems are designed to capture 3D information and emit unfocused, broadband waves to obtain sub-millimeter isotropic spatial resolutions over the whole breast volume [36,2]. For instance, within the PAMMOTH project [2], we built a scanner with a bowl-shaped transducer array and aim to reach a resolution of ≤0.5 mm. Even when using the adjoint state method, solving TD-FWI (8) for such scenarios is computationally challenging: For the PAMMOTH setup, even an efficient k -space pseudospectral method [80] takes at least 10 minutes to solve a single wave simulation on a recent GPU. As J (u) involves the sum over n s sources (8), computing ∇J (u) requires n s parallel gradient computations (10) each involving two wave simulations. On a single GPU, this would take 14 days for n s = 1024. While using a cluster with many computational nodes could reduce this [13,40], a closer look at (10) reveals a problem of the TD adjoint state method: Ideally, the field p(x, t) needs to be kept in GPU memory or at least in the main memory of the computational node while the adjoint field q * (x, t) is computed. For the PAMMOTH setup, this would require up to 150 GB working memory. Current GPU cards are limited to 48 GB and typical nodes in computational clusters are not equipped with that much main memory, either. In addition to these difficulties, first-order optimization methods for FWI (8) suffer from slow convergence and may get stuck in suboptimal local minima of the non-convex function J (u).
In the following section, we describe techniques to circumvent each of these problems and combine them into a comprehensive computational strategy to achieve high resolution 3D TD-FWI for ultrasonic breast imaging.

Memory-Efficient Gradient Computation Using Time Reversal
First, we describe an approach to circumvent storing p(x, t), x ∈ Γ, t ∈ [0, T ]. Note that in UST, all sources and sensors are outside Γ and p(x, 0) = ∂ t p(x, 0) = 0. We first consider the case of no absorption, i.e., L = 0. By Huygens' principle, the field inside Γ is uniquely determined by its trace on the boundary ∂Γ × [0, ∞]. The theory of time reversal (TR) describes under which conditions the time reversal symmetry of the wave equation (6) allows p(x, t) to be reconstructed by simulating the following wave equation for the time-reversed pressure p (x, t): if Γ is convex, u is non-trapping and T is chosen large enough such that p(x, T ) and ∂ t p(x, T ) are small and monotonically decaying inside Γ [34,97,18,57]. In 3D UST of the breast, these conditions are fulfilled: Γ can be chosen as a convex set including the breast (e.g., in the PAMMOTH system, the breast is placed in a plastic cup with convex shape). The acoustic parameter variations are such that no waves get trapped in the sample and after all acoustic energy has entered the breast, it decays exponentially fast, see Appendix E. Choosing T = T is a safe choice as we chose T large enough to ensure the pressure inside the scanner is not measurable any more (cf. Section 2.2). Note that (11) is not the same as the adjoint wave equation [7].
TR was developed for focusing ultrasound waves through inhomogenous media [34,97,18] and is used for image reconstruction in Photoacoustic Tomography (PAT ) [57,9]. Here, we use it as a numerical trick to approximately replay the forward field p(x, t) backwards in time, in parallel to solving the adjoint wave equation. This way, we do not need to keep the pressure field p(x, t) in the sample's volume Γ in memory, only its trace on the boundary ∂Γ, which reduces the memory footprint considerably. Note that similar techniques have been developed in the seismic literature. See [68] and references therein for the idea to shift the storage onto the boundary for our type of boundary conditions.
The integral in (10) can directly be accumulated during the parallel time stepping scheme solving the both TR and adjoint wave equations: The state variables in each computation are exactly what is needed to compute the contribution to the integral. For instance, the first time step of the TR wave simulation results in p (x, 0), which is approximately p(x, T ), while the first time step of the adjoint wave simulation results in q(x, 0). With these variables, the contribution to the integral (10) for time t = T can be computed. The price we pay for the heavily reduced memory footprint is that we have to run three instead of two wave simulations to compute (10). However, the simulation domain of the TR wave simulation only needs to enclose Γ and can therefore often be chosen smaller. Appendix B describes the implementation of the TR-based gradient computation using the k -space pseudospectral method implemented in the k-Wave Matlab Toolbox [80] and Section 4.3 validates it numerically.
In the case of absorption, i.e., L = 0, [85] describes how to modify TR to recover the pressure fields (essentially, the sign of the absorption needs to be switched). Note however, that compared to the application in PAT image reconstruction considered in [85], no additional regularization is needed in our case as the boundary time series are not contaminated with measurement noise.

Tuneable Stochastic Gradient Optimization Using Delayed Source Encoding
In incremental gradient or ordered sub-set methods [11], one chooses a different subset S k in each iteration k in a predetermined way. If the subsets are chosen at random, g S (u) becomes a stochastic estimator of ∇J (u) and stochastic gradient descent (SGD) schemes need to be used. Fortunately, supervised training of deep neural networks are also instances of finite sum minimization problems (empirical risk minimization) and the recent success of deep learning techniques has stimulated research into efficient SGD schemes [14]. In Section 4.4, we will demonstrate the benefits of using a recently developed stochastic L-BFGS (SLBFGS ) method (see [27] and Appendix D) for UST over earlier schemes. Besides being unbiased, i.e., E [g(u)] = ∇J (u), a desirable property of a gradient estimator g(u) is that it is both computationally efficient (fast to compute) and stochastically efficient (small variance/error). Computing g S (u) with a subset of size one takes only two wave simulations (or three if TR is used). It turns out that due to the structure of (8), there are gradient estimators with the same computational but higher stochastic efficiency. In this work, we assume ,f δ is an unbiased estimator of ∇J (u). Practically, we run two "super-wave" simulations where all sources are fired simultaneously. This technique exploits the linearity of the wave equation and is called source encoding, see [45] for the general theory, [92] for the form we use it here (but we apply it in 3D), and Appendix E for a more detailed description. Choosing w i = ±1 with equal probability (Rademacher distribution) minimizes the variance of g E and will be used from now on (note that one could also define w such that g S is recovered). The case M i = M will be examined in forthcoming work that applies the techniques discussed here to concrete experimental data scenarios.
While SGD methods need to reduce the variance of the update progressively in order to converge at all, tailored variance reduction schemes can improve the convergence rate to that of deterministic methods (cf. Section 5 in [14]). If the variance of the gradient estimator can not be influenced in a direct way, variance reduction can only be achieved via reducing the step size or averaging iterates or gradients. Having fine control over the variance of g(u) is more advantageous. For g S (u), one can only increase the size of S; for g E , one would average multiple realizations. In both cases, we would have a sudden increase of the computational effort which means a rather coarse control over the variance [88]. If multiple GPUs are available, one can perform these computations in parallel, which we will investigate in Section 4.5. However, if different GPU architectures are used in this way, the slowest one limits the overall efficiency, cf. Section 4.6. To overcome these problems, we propose a novel source encoding scheme g τ E with fine variance control that also exploits the time-invariance of the wave equation and the fact that all waves eventually leave the region of interest (which is essential for the sequential scanning of UST, cf. Section 2.2): Let d i ∼ U [0, 1] be random, τ ≥ 0 a fixed maximal delay and definê For τ = 0, we recover the conventional source encoding where all sources are fired simultaneously while for τ > 0, we activate the sources with random delays. While this necessitates running the super-wave simulation for a longer time T + τ , it dampens the cross-talk between the pressure fields coming from different sources, which is the origin of the error of conventional source encoding. For n s = 2 and τ = 2T , the average delay between the two sources is T and we essentially fire sources with the delay we deemed necessary for regarding the measurements as completely separate (cf. Section 2.2). More generally, g τ E converges to ∇J (u) for growing τ for any realization of Rademacher weights w and d. More details can be found in Appendix E. Note that our approach extends the 2D FWI random time-delay approach described in [99] by combining it with the rigorous stochastic analysis from [45]. Recently, another interesting TD approach to reduce the cross-talk between different sources by running longer "super-wave" simulations has been presented in [87,8]: The encoding assigns a monochromatic time course to each source i, chosen from a set of regularly spaced frequencies. The forward and adjoint wave simulations are run long enough to produce the steady-state pressure fields from which the contributions of each source can be decoded by Fourier decomposition. In Section 4.4, we will validate the proposed gradient estimator numerically. On a single GPU, this technique allows one to increase the gradient estimator's precision in arbitrarily small amounts while on heterogeneous multi-GPU environments, it allows faster GPUs to compute more precise gradient estimates using tailored τ > 0 while the slowest GPUs use τ = 0.

Improving Convergence Using Coarse-To-Fine Schemes and Preconditioning
FWI (8) is a non-convex optimization problem. Iterative optimization techniques based on local descent directions like gradient descent schemes may experience slow convergence or even get stuck in local minima that only explain parts of the data. In seismic imaging, the latter is known as cycle skipping [98,25,86]. A common way to avoid it in FD-FWI schemes is to restrict the frequency range to the lower frequencies first and and increase it progressively [94]. In TD schemes, this corresponds to solving the FWI on a coarse spatio-temporal grid first, which is then progressively refined. In the most basic case, the interpolation of the solution computed on the coarse grid is used to initialize the optimization scheme on the finer grid but more sophisticated multi-grid schemes can be derived [51]. In the k -space pseudospectral method used here, all grids are regular and implementing correct up and down-sampling is easy, see Appedix C. In 3D, a coarsening of the spatio-temporal grid by factor of η = 2 (i.e., dx c = η dx f , dt c = η dt f ,...) leads to a reduction of the computational operations by a factor of η 4 = 16. We will illustrate in Section 4.7 that it is essential to exploit this advantage to obtain fast computational schemes.
Preconditioning techniques try to improve the convergence of numerical optimization schemes by reformulating the original problem. For instance, instead of using u = c 2 0 (x) in (8), we could solve for u = 1/c 0 (x) (slowness), u = 1/c 2 0 (x) (squared slowness) or a weighted version thereof, u = w(x)/c 2 0 (x). The weights w(x) could be a function of the average distance of x to the transducer locations. Due to the non-convex nature of J (u) and the way SLBFGS constructs low-rank quadratic approximations to J (u) based on stochastic function and gradient evaluations, this choice is not trivial. In Section 4.8, we will examine different possibilities.  , which leads to T /∆t = 3912 time steps. The anatomically realistic numerical breast phantom used is part of the OA-Breast Database 1 , which was designed for (photo-)acoustic simulations and is described in [60]. It is based on the tissue segmentation of a clinical contrast enhanced MRI of a breast in prone, free-hanging position. The original segmentation is fitted and interpolated into our setup. The tissue-dependent speed of sound values we use are the same as in [60] and can be found in the caption of Figure 1. The mass density ρ 0 was assumed constant at 1000 kg/m 3 for simplicity here.
We distribute 2048 transducer locations over the half-sphere by the golden section method. Half of them will be emitting (sources, n s = 1024) while the other half will be receiving (sensors). Simulated data of this artificial source-sensor setup is an array of size 1024 × 1024 × 3912 which corresponds to 15.28 GB in single precision. We assume idealized spatio-temporal transducer characteristics here, i.e., we model them as point-like in space and their temporal impulse response is a delta function. The sources emit a single broadband pressure pulse ("click") smoothed in time to remove temporal frequencies not supported by the computational grid (the grid supports up to 1.5 MHz at c 0 =1500 m/s [80]). More details can be found in Appendix C.
Note that the setup we use here was primarily designed to validate the accuracy and efficiency of the computational solutions we described in the previous section in the best possible way. For this reason, it reproduces the key properties of 3D FWI for breast imaging but omits modeling features of real-world systems that would complicate the interpretation of our results. It is furthermore not designed to examine the condition of the inverse problem 7.

Baseline Reconstructions
Before we can examine the different aspects discussed in Section 3 in depth, we need to illustrate some basic features of the reconstructions. Figure 2a shows the speed of sound reconstruction of the data simulated at 0.5 mm resolution when reconstructed using a grid size of 2 mm. This reconstruction was computed using SLBFGS with source encoding gradient estimator g E for 128 iterations (computing time ∼30 min) and could be the multi-grid based initialization of a FWI run on the finer resolution. For this reason, we will call it c ini 0 and take it as a reference point in the following studies. Note that here and in the following, we only reconstruct the speed of sound inside the breast region Γ, while the speed of sound outside is known and fixed. Figure 2b plots the reconstruction error c ini 0 − c † 0 locally, where c † 0 is the ground-truth speed of sound inside the breast. We will measure the global error of a variable a by their relative 2 distances to a reference b computed as rel 2 (a, b) = a − b 2 / b 2 , and expressed in percent, e.g., we have that rel 2 (c ini 0 , c † 0 ) = 2.18%. First, we run SLBFGS initialized at c ini 0 with the source encoding gradient estimator g E . Figure 3 shows the iterates after 32/64/128 gradient evaluations (eval ), the corresponding rel 2 errors are 1.53%/1.29%/1.07%. Note that we deliberately chose to not add measurement noise to the data (see above), the noisy appearance of the reconstructions are due errors of the gradient estimator (gradient noise), a purely numerical phenomena that we need to highlight. On the difference plots in Figure 3d-3f, one can see that the error is larger around tissue interfaces and close to the chest wall. The former comes from the finite spatial grid spacing: the k -space pseudospectral method cannot propagate the highest spatial frequencies present for a given spatial grid which means that we can at best hope to reconstruct a slightly smoothed version of c † 0 . The higher errors close to the chest wall are caused by the insufficient sensor coverage of the half-spherical sensor array (cf. Figure 1). To illustrate this effect in more detail, we computed error statistics for each depth slice. For slices deeper than 2cm into the bowl, the errors are almost normally distributed, with zero mean and a standard deviation that decays as the SLBFGS advances as can be seen in Figure 4a. We gain an accuracy of ∼5 m/s with each doubling of number of iterations, i.e., the computational effort.

Accuracy of Time Reversal Based Gradient Computation
First, we validate the TR based gradient computation described in Section 3.1 when implemented using a k -space pseudospectral method (cf. Appendix B). In principle, this schemes relies on regular spatial grids and is not well suited for representing arbitrarily shaped boundaries ∂Γ, cf. (11). To record or impose pressure values on ∂Γ, we need to instead record and impose them on a layer of grid points lying on the inner surface of Γ. Determining the thickness of this layer is a trade-off: A thin layer leads to a small memory footprint when storing p(x, t) on them, which was our original motivation to introduce the TRbased gradient computation. A large layer will lead to a more accurate reproduction of the forward field, i.e., p (x, t) is a better approximation of p(x, T − t). To examine the effects of this, we first compute the gradient (10) at c ini 0 for a single source storing the complete field p(x, t) in Γ × [0, T ]. Then, we compute it using the TR formulation using boundary layers of thickness 1/2/4/8 voxels. Table 2 shows the average relative errors of the TR gradients computed over 16 randomly picked sources and the working memory requirements in our setup. The errors decay very fast as a function of boundary layer thickness. However, we need to check that the errors in the TR gradients do not accumulate during an iterative reconstruction: Table 2 shows the reconstruction errors when running SLBFGS with 32 gradient evaluations initialized at c ini 0 . Interestingly, even using a boundary layer of just one voxel, the error between reconstructions with normal gradient and TR-based gradient computations are small. Furthermore, Figure 4b shows that the error appears noise-like with no clear pattern visible. This suggests that the errors rather cancel than accumulate over the course of the iteration. The difference in computational time will be examined in   Section 4.6. For the rest of the studies, a TR based gradient computation will be used with a boundary layer of 8 voxels.

Accuracy of Stochastic Gradient Optimization
First, we demonstrate the impact the choice of the stochastic gradient method has and compare SLBFGS to plain SGD and SGD with inertia/momentum [14] in Figure 5. All methods use source encoding as a gradient estimator and a maximum of 100 gradient evaluations. Furthermore, once an increase in the estimated energy is detected, they do not return the iterates u k directly but a weighted averagē u k := ( k l l 3 u l )/ k l l 3 to reduce variance, cf. Appendix D. SLBFGS reaches the final rel 2 error of plain SGD already after 55 evaluations, so twice as fast. This number increases to 76 when we add inertia of β = 0.5 [14] to SGD. However, we see that while SGD with inertia converges as fast as SLBFGS in the beginning, the decay levels off after ∼ 60 evaluations.
Second, we examine the impact of using different gradient estimators: Figure 6a shows the full gradient computed at c ini 0 , which took a week of computing time on a server with 4 GPUs (cf. Section 4.6). Then, we computed the different stochastic estimates of this gradient, g S , g E , g τ E (cf. Section 3.2) and examined how their precision increases by investing more computational effort. Figure 6b shows the results averaged over 16 i.i.d. gradient realizations. One can see that using source encoding (g E , g τ E ) indeed leads to a more accurate gradient estimation at the same computational effort. We will use source encoding for the rest of the studies. Also, we can see that using the novel delayed source encoding scheme g τ E not only gives a finer variance control compared to increasing the precision of g E via averaging, it also provides a more accurate gradient estimate with the same computational effort. Within a stochastic gradient method, the question is whether to do more steps with a less accurate but computationally cheaper gradient estimator vs. less steps with a more accurate but computationally more costly estimator. The former is typically more beneficial in the beginning of the iteration, while the latter becomes more beneficial towards the end of it [14]. To illustrate this, we first initialized SLBFGS with c ini 0 and then with the result shown in Figure 3c, which corresponds to 128 gradient evaluations (roughly 4 days computing time on a TITAN RTX GPU). In both cases, we ran it for another 72 hours using g τ E with τ = b (i − 1) for different values of b and with and without starting iterate averaging after an increase in the energy estimate in the case b = 0 (so no delay). Figure 7a shows that in the low accuracy regime, all methods perform rather similar for the first 24 hours. In the case b = 0, iterate averaging is only activated after 66 hours (the point where the first two plots diverge). The delayed source encoding methods b > 0 eventually perform better than b = 0, but are slower at the start. Figure 7b shows the results for the high accuracy regime: While using iterate averaging stabilizes the iteration for b = 0, using increasing values of b leads to faster convergence in terms of computing time.

Multi-GPU Acceleration
Computing platforms with multiple GPUs can be used in different ways to accelerate the image reconstruction. In the TR based gradient computation described in Section 3.1, the adjoint and TR wave simulation can be run on two separate GPUs in parallel, which would lower the computational cost to that of the standard gradient computation again. To reconstruct 3D images on higher resolutions, one  can implement 3D domain decomposition methods to distribute the computations over several GPUs [69,83,84]. A straightforward way which does not need sophisticated implementation is to average statistically independent gradient estimates each computed on a different GPU. This reduces the variance of the gradient estimator at least by a factor corresponding to the number of GPUs, cf. Section 3.2: Figure 8 shows the reconstruction errors of SLBFGS initialized at c ini 0 when using a growing number of GPUs in this way (using source encoding without time shifting). One can see that the convergence is in general faster. However, Figure 8a highlights that due to the non-convexity of J (u), there is not a simple relationship between the accuracy of the gradient estimator and the convergence curve. The reconstruction error obtained after running SLBFGS using 2/4/8 GPUs for 32 gradient evaluations is the same as running it on a single GPU for 51/89/103 gradient evaluations. Comparing Figures 4a and 8b shows also the decrease in standard deviation saturates with a growing number of GPUs. GeForce RTX 2080 Ti GeForce GTX 1070  memory  24 GB  24 GB  11 GB  8 GB  fwd  19m 54s  10m 59s  11m 42s  30m 27s  grad  46m 9s  26m 44s  27m 34s  n/a  TR grad  61m 48s  34m 22s  36m 15s  1h 34m   Table 3.: Comparison of different Nvidia GPUs. Shown are the computation times for simulating data generated by a single source (fwd ), computing a gradient for it via the standard approach (10) (grad ) and the TR based computation (cf. Section 3.1) (TR grad ). N/A indicates that the computer the GPU was installed in did not have insufficient working memory (RAM).  Table 3 compares the performance of different GPUs. One can see that the run times of normal and TR-based gradient computations are more alike than expected based on the number of computational operations (flops). One reason is that the huge memory footprint of the normal gradient computation requires a lot of working memory operations, which are no longer negligible compared to the computational operations.

Acceleration by Coarse-To-Fine Multigrid Initilization
Next, we illustrate that a coarse-to-fine initialization strategy can speed up the convergence and avoid local minima as described in Section 3.3. We use SLBFGS with source encoding without time shifting and initialize it with the sound speed value of water (1500 m/s) everywhere. The first scheme runs on the finest grid (dx = 0.5mm) only. The second scheme has two levels: It starts on dx = 2mm before switching to dx = 0.5mm, so η = 4. The third scheme has three levels dx = 2, 1, 0.5mm (η = 2), and the forth dx = 2, 1.41, 1, 0.70, 0.5mm (η = √ 2). Figure 9a shows the error plots vs computational effort and shows how essential it is to use multi-grid schemes in UST: When the 2-level scheme switches to the finest level after 23 minutes, it has already reached a lower error than the single-level scheme will reach after 16 hours. Figure 9b shows that using the non-linear slowness and squared-slowness preconditioning transformations discussed in Section 3.3 do indeed speed up the convergence and provide clear advantages over the standard parameterization. All spatial error plots clearly indicate that the errors increase with decreasing depth, i.e., closer to the chest wall where the sensor coverage is getting worse, cf. Figures 1-4. This observation suggests that introducing a local weighting w(x) that depends on depth or sensor coverage could be beneficial. We tried different variants of such weightings but could not find significant differences. One explanation for this is that quasi-Newton schemes like SLBFGS have low sensitivity to linear preconditioning transformations.

Full Inversion Scheme Demonstration
In this section, we combine all the techniques described in the previous section to demonstrate the results one can obtain within a computing time limit of 24 hours. While this limit may seem arbitrary at this point, it is motivated by the perspective of adding UST to clinical work-flows for breast-cancer diagnosis and treatment planning: After a positive screening result, several examinations have to be scheduled and carried out and before a treatment decision can be made. While health care providers try to shorten the overall time for this as much as possible, it can rarely be done in a single day [75]. Within this context, a computing time of 24 hours would allow UST to fit into clinical trajectories. Similar considerations hold for the computational resources needed: A main advantage of UST over MRI could be lower operating costs for a single examination. Having to use or rent a large computational cluster over several days to compute the results of a single examination may diminish this advantage.
To appreciate the computational advantages described in this work, we first show the reconstruction result of applying a computational technique similar to the one used in [72] in Figure 10a. One can see that even after 24 hours the result can not be used to reliably identify anatomic structures. Then, we use • TR based gradient computation with a 8 voxel boundary layer as examined in Section 4.3.
• SLBFGS with source encoding and weighted iterate averaging as examined in Section 4.4.
• Squared slowness preconditioning as examined in Section 4.8.
The results of computing on a single GeForce RTX 2080 Ti installed in a conventional desktop PC (total price ∼ 5Ke) are shown in Figures 11a/11d. While one can still perceive stochastic gradient noise and structured errors in the reconstruction, it already provides a useful high-resolution estimate of the underlying anatomy. Figures 11b/11e and 11c/11f show how these results improve using a server with 4 GeForce RTX 2080 Ti (total price ∼ 20Ke) or a cluster with 16 TITAN RTX (total price ∼ 100Ke). Figure 10b shows the corresponding depth distribution of the errors. At this point, we remind the reader that the images shown only illustrate a sub-set of the whole reconstruction domain, cf. Figure 1.
As discussed earlier, we did not add measurement noise to the data in our studies to allow for an easier visual assessment of the convergence of the different computational schemes as the stochastic gradient error also manifests as noise-like image features. To round off the numerical studies, Appendix F examines the sensitivity to noise.

Discussion and Outlook
Our work was motivated by the specifications of the PAMMOTH scanner that is currently being developed and tested [2]. The main contribution was to combine a number of techniques to design a comprehensive computational strategy to realize high resolution 3D TD-FWI for UST of the human breast with moderate computational resources within a day of computing time. The results were validated with extensive numerical proof-of-concept studies on different computational platforms.
TR-based gradient computation For the biggest breast cup designed for PAMMOTH, storing p(x, t) throughout the breast would require 146.81 GB at 0.5 mm spatial resolution. This memory footprint rules out keeping it in the memory of any currently available GPU. If kept in CPU memory, it severely limits the number of parallel source simulations one can perform, e.g., running parallel simulations on a server with 16 GPUs like showcased in Section 5 would require more than 2 TB CPU memory. In addition, memory operations start to dominate the total computation time (cf. Table 3). The time-reversal-based gradient computation presented in Section 3.1 can overcome this bottleneck by shifting the storage from the breast volume to its boundary: With a moderate increase of computation time (cf. Table 3), the memory requirement can be lowered by a factor 28.79 to as little as 5.10 GB without compromising accuracy of the final solution by more than 0.25%, cf. Section 4.3. This allows us to utilize multiple GPUs without memory restrictions, which is examined in detail in Section 4.5 and demonstrated in Section 5, cf. Figures 8, 11.
Stochastic optimization The second crucial ingredient of our strategy is to combine novel stochastic optimization techniques with efficient stochastic gradient estimators and preconditioning. In Section 4.4, we demonstrate the benefits of SLBFGS over plain SGD and show that source encoding schemes lead to much more accurate gradient estimators at the same computational cost as sub-sampling methods, cf. Figure 6. Furthermore, we presented a novel source encoding scheme that exploits the time-invariance of the wave equation by introducing random time delays to increase the accuracy of the gradient estimate at the expense of longer computation times. This offers a much finer variance control than estimate averaging as well as being more accurate, cf. Figure 6b. Embedded in SLBFGS, we showed that using delayed source encoding will be beneficial over iterate averaging in the high accuracy regime, beyond the fixed computation time budget of 24 hours we set ourselves in this work, cf. Figure 7. Future work will examine this topic more carefully, e.g., by developing adaptive delay choice rules and by examining how to balance delays to optimally utilize heterogeneous multi-GPU computing environments. The encoding scheme presented here relies on the transient nature of the wave fields to reduce cross-talk. It would be interesting to compare it to the approach presented in [8], which obtains impressive results by encoding a harmonic steady state instead. Finally, one can observe that the error of the source encoding gradient estimates appears to be noise-like (cf. Figures 3, 11). As such, it is worth examining whether coupling source encoding with preconditioning that performs image denoising will further accelerate the convergence.

Multigrid schemes
The results presented in Section 4.7 reveal that using coarse-to-fine multi-grid schemes are another essential component of our strategy: Even the simple multi-grid initialization technique presented here accelerates the overall convergence massively. In the future, we will examine more sophisticated multi-grid schemes.

Further Extensions
• The methods presented here were exclusively implemented in Matlab, using the GPU support offered by the Parallel Computing Toolbox. Implementing it using optimized CUDA code will lead to further performance gains [69,83,84].
• While we presented the material in terms of a general acoustic parameter u, we only showcased the methods for u = c 0 . The extension to acoustic density ρ 0 and attenuation (α 0 , y) is described in Appendix A and will be examined numerically in future work.
• In this numerical proof-of-concept study, we used idealized assumptions on the transducer locations and properties. In forthcoming work, we will address realistic transducer and scan modeling.
• Finally, we will demonstrate the application of this work to real data from experimental phantoms, healthy volunteers and breast-cancer patients acquired by the PAMMOTH scanner in the future.

Conclusions
In this work, we described and evaluated a comprehensive computational strategy for high resolution 3D TD-FWI for UST of the human breast. With moderate computational resources, accurate results can be obtained within less than a day of computation time. The extent of the improvements presented here can best be appreciated by comparing Figure 10a, which corresponds to our starting point [72] with Figure  11, which showcases our current results. Similarly promising results were presented in [8], which was developed in parallel to the work described here. While the computing time and costs of our scheme are still too high for screening applications, it may well already fit into clinical trajectories for diagnosis and therapy planing.

A. Extension to Density and Absorption
This section extends the formulas in Section 2.3 and lists the remaining derivatives. First, including absorption with L = 0, we get as additional terms for the derivative with respect to c 0 . See [72] for a discussion of their relative importance. For u = ρ 0 we get For u = α 0 we get For u = y, we remark that functions of the self-adjoint operator −∇ 2 , are understood in terms of its spectral decomposition, i.e., in Fourier space (cf. [81]) and get

B. k-Space Pseudospectral Time Domain Implementation of Time-Reversal-Based Gradient Computation
Here we sketch how to implement the proposed memory efficient gradient computation (11) using a k-space TD scheme to solve the underlying first order system of equations (1)- (3). It follows the same approach implemented in k-Wave [80]. For a more detailed description, we refer to the k-Wave manual 2 . The k-space TD method is a collocation scheme that interpolates between the collocation points using a truncated Fourier series. This allows gradients to be calculated using the fast Fourier transform (FFT ) which leads to fast implementations on GPU architectures. The 'k-space' in the name refers to a correction κ applied in the Fourier domain to account for the finite difference approximation of the time derivative. This correction is exact for a homogeneous medium [77] and reduces the errors for acoustically heterogeneous media. The spatial domain Ω has to be embedded into a rectangular region, which is then discretised by a regular grid of N collocation points with grid size ∆x. To mimic free-space propagation, a Perfectly Matched Layer (PML) absorbing boundary is wrapped around the box to damp outgoing waves without reflecting them. In this work, the size of the PML layer is automatically chosen to be between 10 − 20 voxels by minimizing the the highest prime factor of the total grid size (cf. function getOptimalPMLSize.m in k-Wave). This will lead to efficient fft executions. We discretize the measurement time interval [0, T ] by t n = n∆t, n = 0, . . . , N t , ∆t = T /N t . The discrete pressure is denoted by p and the ξ-component of the discrete particle velocity by v ξ (ξ stands for one of the spatial components x, y, z of these vector fields). As the terms containing ∇ρ 0 (x) in (1)-(3) cancel out, they are omitted [82]. The scalar density ρ is split into three parts, ρ ξ . This is nonphysical but allows anisotropic absorption to be included within the PML. The k-space derivative and the k-space operator κ are defined as where k ξ ∈ R N is the discrete wavevector in ξ direction and all multiplications between N -dimensional vectors are understood as componentwise. c ref is a reference sound speed, chosen to ensure stability. For consistency during the FWI inversion, we fix c ref = c max 0 . Grid-staggering is incorporated into the calculation of the gradients. This means that a spatial translation of ±∆ξ/2 is introduced as Staggering in time will be included by interleaving the gradient and updates steps. Multiplication operators Λ ξ and Λ s ξ implement the PML on the normal and staggered grid, respectively. The measurement operator M , i.e., the receiving US transducers, is modeled by a linear mapping between spatial grid and measurement channels and is also denoted by M . This assumes that it acts instantaneously, i.e., in iteration n, we approximate f (t n ) = M p (otherwise, the pressure fields at the sensor locations need to be buffered). In the same way, a source operator S will embed the discrete pressure source s(t n ) into the spatial grid. After this, Ss(t n ) is divided by c 0 pointwise and scaled by 2/(3∆x) to convert it into a mass source, cf. Section 2.4 in the k-Wave manual. To implement the TR-based gradient (11), we need to extract the pressure on the boundary of sample, ∂Γ, during the forward wave simulation. In our collocation scheme, we will simply extract the pressure in a layer of grid points with a defined thickness as examined in Section 4.3, which we will denote as p(∂Γ) for simplicity. The scheme to compute the TR-based gradient consists of three wave simulations, forward, adjoint, and time-reversal of which the latter two are interleaved. For simplicity, we only consider u = c 0 here (as in the numerical studies) and any scaling factors like ∆t have been omitted if they cancel out in the end. Algorithm 1 describes the whole iteration.
Step 13 corrects for the fact that the first order scheme effectively uses the time derivative of the mass source term provided, not the term itself, cf. (6). Code will be released as part of a more general Matlab toolbox for ultrasonic and photoacoustic image reconstruction.

C. Multigrid for k-Space Pseudospectral Schemes
In the Fourier collocation scheme described above, the discrete, distributed field parameters (p, v, ρ) implicitly represent band-limited interpolants of the corresponding infinite dimensional variables (cf. Section 2.8. in the k-Wave manual). As such, Fourier/trigonometric interpolation is the natural way to transfer variables from one spatial grid into another. For regular spatial grids, this can be implemented efficiently using FFTs (cf. function interpftn.m in k-Wave), and can be combined with smoothing in Fourier space. To implement the coarse-to-fine initialization strategy described in Section 3.3 and examined in Section 4.7, we first need to down-sample the temporal dimension of data f δ and our model of the corresponding (mass) source m to a coarser temporal grid. This is done via a combination of Fourier interpolation and low-pass filtering with a Kaiser window (cf. function filterTimeSeries.m in k-Wave) to remove frequencies not supported by the numerical scheme. The data additionally needs to be scaled by η d−1 . Figure 12 illustrates the results. After the FWI solution u c is computed on the coarse grid, we need to up-sample it to initialize the FWI on the next, finer grid. For this, we use a combination of Fourier interpolation and smoothing with a Blackman window (cf. Section 2.8. in the k-Wave manual): are the dimensions of the fine spatial grid.

D. L-BFGS with Stochastic Gradient Estimates
We want to minimize an energy J (u) as in (8) while only having access to an unbiased estimator g(u, θ, n) of (J (u), ∇J (u)). Here, θ ∈ [0, 1] determines the approximation strength of g where θ = 0 corresponds to no approximation, e.g., g(u, 0, ·) = J (u), and θ = 1 to maximal approximation, e.g., by using a data sub-set S in g S of size one. In our implementation, θ can be a function of the iteration count i. The last input n ∈ N in g(u, θ, n) controls the the stochastic approximation performed by g. For instance, it could seed the random generator used to select S in g S . Important for the algorithm is only that it leads to the same type of approximation for different u, e.g., for a given θ, n, g S (·, θ, n) always uses the same subset of data S. Algorithm 2 describes the stochastic L-BFGS (SLBFGS) scheme used in this work, which is based on the one presented in [27]. See Section 6.2. in [14] and references therein for a detailed discussion of stochastic quasi-Newton schemes. The key idea is that for computing the vectors used to construct the low-rank approximation of the inverse Hessian, H k , only iterate pairs evaluated with the same stochastic approximation should be used. This increases the number of gradient computations to at least two per iteration (more if line-search is used) but the second gradient evaluation can be utilized to compute a second update. In [27], no bound constraints u min ≤ u ≤ u max were considered. Here, we include them but assume that they are not active at the vicinity of the solution. This means that they do not stabilize the image reconstruction problem. Rather, they merely safeguard against running the numerical wave simulation scheme with values of the parameter fields u that would lead to instability. For this reason, we enforce them by a simple projection of the updated variable u k at the end of each iteration (step 23). We use progressive iterate averaging with w(i) = i 3 , but only start averaging after the estimated energy increases for the first time (step 31). We use history size m h = 64 throughout the paper. Code will be released as part of a more general Matlab toolbox for deterministic and stochastic reconstruction.

E. Delayed Source Encoding for Time-Invariant Systems
Here, we examine the conventional and delayed source encoding described in Section 3.2 in more detail. It is crucial that we assumed that the acoustic properties are known everywhere outside the sample Γ (e.g., supp(u) ⊂ Γ), and that p(x, t) is not practically measurable for t > T anymore. Letp andq * be the forward and adjoint wave fields corresponding to the delayed encoded sourcesŝ and measurementsf δ , F u , G u ← g(u, θ(i), i) evaluate estimator 10: n evl ← n evl + 1 update function evaluation count 11: if do_linesearch then 13: . With this, we get (cf. Section 2.3 and (10)): With E [w] = 0, Cov[w] = I, it is easy to see now that E ∇ u D f (u),f δ = ∇J (u). If we choose the Rademacher distribution (w i = ±1 with equal probability), we have that w 2 i = 1 and is the cross talk between sources i and j. To examine the impact of delays (τ > 0) vs. conventional source encoding (τ = 0), we assumed = d j − d i > 0 without loss of generality and shift the time in the integral by d i τ to get For τ ≥ T /d, the delay between both sources is bigger than T , and from our assumptions, it follows that X i,j (τ ) = 0. To obtain generic estimates for τ < T /d, one needs to estimate how fast the field power decays inside Γ. For such decay estimates, see Section "Local Energy Decay Estimates" in [57] and references therein. In 3D, there is a T d such that P i (t) < ae −bt with a, b > 0 for all t > T d . In Figure 13, we plot P i (t) for 8 different sources in the scenario used in the numerical studies to show that practically, the decay is quite rapid, which is also confirmed by the results in Section 4.4, cf. Figure 6.

F. Noise Sensitivity
In this section, we demonstrate the impact of adding measurement noise to the simulated data. The noise characteristics of UST systems can differ quite a lot from conventional linear array US systems, e.g., both the US transducers and the data acquisition (DAQ) system of the PAMMOTH scanner were custom designed and a detailed noise characterization will be included in forthcoming work. Here, we will assume that an additive Gaussian noise model is adequate to summarize the different noise contributions and further that a decorrelation transform will render its covariance matrix sufficiently close to a scaled identity: In the following, we will use values of σ determined by the signal-to-noise (SNR) level defined in terms of the root mean square (RMS) of the clean signal f as σ = RM S(f ) 10 −SN R/20 . Figure 14a shows an example of a clean and noisy time trace for SNR 5. Without measurement noise (ε = 0), all noise-like features in the reconstructed images were a result of using stochastic gradient estimators in the optimization schemes to solve the FWI problem (8), i.e., a purely numerical phenomena. This "gradient-noise" vanishes as the iteration progresses (cf. Figure 3) or if more accurate gradient estimators are used (cf. Figure 11). From the way that ∇J (u) is computed as presented in Section 2.3 is is clear that any measurement noise ε > 0 will be back-propagated into the image alongside with the data residual. As the iteration progresses, the norm of this type of image noise will grow while its spectral content changes. This phenomena arises from the fact that in principle, we use a Landweber-type iteration to solve an ill-conditioned inverse problem (7). See [90,15] for a general introduction into this topic. Here, we illustrate these two opposing effects by first using exactly the same settings as in Section 5 on a single GPU for different SNR values. The first column of Figures 15, 16 show that while SNR 30 is visually indistinguishable from clean data result ( Figure 11a) the image noise increases towards SNR 5. Then we change the number of gradient evaluations on finest spatial level from 32 to 64. The second column of Figures 15,16 and the first two plots in Figure 14b show the results and confirm that for the highest noise level (SNR = 5), doing more iterations now indeed leads to a worse result. To prevent this, we will need to add a regularization functional to the data fidelty ns i D i (u) in (8). For simplicity, we use a smoothed version of the popular Total Variation (TV) functional as described in [90] here: where we set β to 1 m/s and fix α > 0 to an illustrative value that ensures observable impact of the regularization (for optimal results, one would need to choose it in dependence of σ). The third and forth columns of Figures 15, 16 and the second two plots in Figure 14b show that now the errors do not increase with increasing iteration count. Note however that it is not straight forward to predict the overall impact of adding regularization to FWI and in particular the question which type of functionals are advantageous for breast UST needs further research.