Choose Your Path Wisely: Gradient Descent in a Bregman Distance Framework

We propose an extension of a special form of gradient descent—in the literature known as linearized Bregman iteration—to a larger class of nonconvex functions. We replace the classical (squared) two norm metric in the gradient descent setting with a generalized Bregman distance, based on a proper, convex, and lower semicontinuous function. The algorithm's global convergence is proven for functions that satisfy the Kurdyka–Łojasiewicz property. Examples illustrate that features of different scale are being introduced throughout the iteration, transitioning from coarse to fine. This coarse-to-fine approach with respect to scale allows us to recover solutions of nonconvex optimization problems that are superior to those obtained with conventional gradient descent, or even projected and proximal gradient descent. The effectiveness of the linearized Bregman iteration in combination with early stopping is illustrated for the applications of parallel magnetic resonance imaging, blind deconvolution, as well as image classification with neural networks.

In this paper, we follow a different approach of incorporating nonsmoothness into first-order methods for nonconvex problems. We present a direct generalization of gradient descent, first introduced in [10], where the usual squared two-norm metric that penalizes the gap between two subsequent iterates is replaced by a potentially nonsmooth distance term. This distance term is given in the form of a generalized Bregman distance [20, 22, 66], where the underlying function is proper, lower semicontinuous, and convex, but not necessarily smooth. If the underlying function is a Legendre function (see [73, section 26] and [7]), the proposed generalization basically coincides with the recently proposed nonconvex extension of the Bregman proximal gradient method [17]. In the more general case, the proposed method is a generalization of the so-called linearized Bregman iteration [33, 83, 25, 24] to nonconvex data fidelities.
Motivated by inverse scale space methods (cf. [21, 22, 66]), the use of nonsmooth Bregman distances for the penalization of the iterates gap allows us to control the scale of features present in the individual iterates. Replacing the squared two-norm, for instance, with a squared two-norm plus the Bregman distance w.r.t. a one-norm leads to very sparse initial iterates, with iterates becoming more dense throughout the course of the iteration. This control of scale, i.e., the slow evolution from iterates with coarse structures to iterates with fine structures, can help to overcome unwanted minima of a nonconvex objective, as we are going to demonstrate with an example in section 2. This is in stark contrast to many of the nonsmooth, nonconvex first-order approaches mentioned above, where the methods are often initialized with random inputs that become more regular throughout the iteration.
The main contributions of this paper are the generalization of the linearized Bregman iteration to nonconvex functions, a detailed convergence analysis of the proposed method, and the presentation of numerical results demonstrating that coarse-to-fine scale space approaches in the context of nonconvex optimization can lead to superior solutions.
The outline of the paper is as follows. Based on the nonconvex problem of blind deconvolution, we first give a motivation in section 2 of why a coarse-to-fine approach in terms of scale can indeed lead to superior solutions of nonconvex optimization problems. Subsequently, we define the extension of the linearized Bregman iteration for nonconvex functions in section 3. Then, motivated by the informal convergence recipe of Bolte, Sabach, and Teboulle [16, section 3.2], we show a global convergence result in section 4, which concludes the theoretical part. We conclude with the modeling of the applications of parallel magnetic resonance imaging (MRI), blind deconvolution, and image classification in section 5, followed by corresponding numerical results in section 6 as well as conclusions and outlook in section 7.

2. Motivation.
We want to motivate the use of the linearized Bregman iteration for nonconvex optimization problems with the example of blind deconvolution. In blind (image) deconvolution the goal is to recover an unknown image u from a blurred and usually noisy image f. Assuming that the degradation is the same for each pixel, the problem of blind deconvolution can be modeled as the minimization of the energy E_1 in (2.1) with respect to the arguments u ∈ R^n and h ∈ R^r. Here * denotes a discrete convolution operator, and χ_C is the characteristic function defined over the simplex constraint set.

Even with data f in the range of the nonlinear convolution operator, i.e., f = û * ĥ for some û ∈ R^n with ĥ ∈ C, it is usually still fairly challenging to recover û and ĥ as solutions of (2.1). A possible reason for this could be that (2.1) is an invex function on R^n × C, where every stationary point is already a global minimum. If we simply try to recover û and ĥ via projected gradient descent, we usually require an initial point in the neighborhood of (û, ĥ) in order to converge to that point. We want to illustrate this with a concrete example. Assume we are given an image û and a convolution kernel ĥ as depicted in Figure 2.1, and f = û * ĥ is as shown in Figure 2.1(b). Minimizing (2.1) via projected gradient descent leads to the iteration (2.2), where proj_C denotes the projection onto the convex set C. If we initialize with u^0 = (0, ..., 0)^T and h^0 = (1, ..., 1)^T / r, set τ_0 = 1, update τ_k via backtracking to ensure a monotonic decrease of the energy E_1, and iterate (2.2) for 3500 iterations, we obtain the reconstructions visualized in Figure 2.1(c). Even without any noise present in the data f, the algorithm converges to a solution very different from û and ĥ. This is not necessarily surprising, as we do not impose any regularity on the image.

We can try to overcome this issue by modifying (2.1) as in (2.3). Here TV denotes the discretized total variation, i.e., TV(u) := ‖|∇u|‖_1, where ∇ : R^n → R^{2n} is a (forward) finite difference discretization of the gradient operator, |·| the Euclidean vector norm, ‖·‖_1 the one-norm, and α is a positive scalar. The minimization of (2.3) can easily be carried out by the proximal gradient descent method, also known as forward-backward splitting [54], which is a minor modification of the projected gradient method [41, 42, 13] to more general proximal mappings. In the context of minimizing (2.3), the proximal gradient method reads as (2.4), where (I + α∂TV)^{-1} denotes the proximal mapping [58, 59] with respect to the total variation, i.e., (I + α∂TV)^{-1}(z) := arg min_u { (1/2)‖u − z‖² + α TV(u) } (2.5). It is straightforward to solve (2.5) for a given argument with numerical methods such as the (accelerated) primal-dual hybrid gradient method (cf. [84, 67, 37, 28, 29]) up to sufficient numerical accuracy. If we then evaluate 3000 iterations of (2.4) for α ∈ {10^{-3}, 10^{-4}} with the same initial values that we used for the projected gradient method, we obtain the results visualized in Figure 2.1. We observe that for the larger choice of α = 10^{-3} we obtain a better reconstruction of the convolution kernel, but at the cost of a reconstructed image that is very cartoon-like. Reducing the parameter to α = 10^{-4} reduces the impact of the total variation regularization; however, the reconstructed image then remains fairly blurry and the reconstructed convolution kernel is closer to a Dirac delta.
The reason for this is that the total variation-based model (2.3) is basically not suitable for deconvolution tasks. Blurred images generally have a smaller total variation compared to their sharp counterparts; hence it is easier to minimize the energy in (2.3) by recovering a kernel close to a Dirac delta and a smoothed version of the blurry image in order to reduce the total variation.
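To make the baselines above concrete, the following NumPy sketch mimics the projected gradient iteration (2.2) under simplifying assumptions: the data fit is taken to be E_1(u, h) = (1/2)‖u * h − f‖², the convolution is realized with periodic boundary conditions via the FFT, and the stepsize is kept fixed instead of being chosen by backtracking. All function names (fftconv, fftcorr, project_simplex, projected_gradient) are ours and purely illustrative; this is not the authors' MATLAB implementation.

import numpy as np

def fftconv(a, b):
    # circular (periodic) convolution of two arrays of identical shape
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def fftcorr(a, b):
    # adjoint of the circular convolution with b, i.e., circular correlation
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))

def project_simplex(h):
    # Euclidean projection onto {h : h >= 0, sum(h) = 1} (sorting-based algorithm)
    v = np.sort(h.ravel())[::-1]
    css = np.cumsum(v)
    rho = np.nonzero(v - (css - 1.0) / (np.arange(v.size) + 1.0) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(h - theta, 0.0)

def projected_gradient(f, kernel_shape, tau=1.0, iters=3500):
    u = np.zeros_like(f)                                   # u^0 = 0
    h = np.ones(kernel_shape) / np.prod(kernel_shape)      # h^0 = constant kernel on the simplex
    for _ in range(iters):
        h_full = np.zeros_like(f)
        h_full[:kernel_shape[0], :kernel_shape[1]] = h     # embed the kernel in the image grid
        r = fftconv(u, h_full) - f                         # residual u * h - f
        grad_u = fftcorr(r, h_full)                        # gradient w.r.t. the image
        grad_h = fftcorr(r, u)[:kernel_shape[0], :kernel_shape[1]]  # gradient w.r.t. the kernel
        u = u - tau * grad_u                               # plain gradient step on u
        h = project_simplex(h - tau * grad_h)              # projected gradient step on h
    return u, h

Replacing the plain gradient step on u with the proximal map of α TV would turn this sketch into the proximal gradient iteration (2.4).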
We therefore want to use an alternative approach that is different from the two approaches presented above. We do observe from the proximal gradient example that a larger regularization parameter seems to work better for a more accurate reconstruction of the convolution kernel (at the cost of a rather cartoon-like image). The explanation for this is that image features at a relatively coarse scale have to be adjusted to minimize the data fit, forcing the convolution kernel to correct for this. It therefore seems reasonable to find a minimizer of (2.1) with a scale-space approach, changing from coarse to fine scales over the course of the iteration. Specifically, we propose to use a variant of the linearized Bregman iteration adapted to the minimization of nonconvex problems such as the minimization of the function E_1 as defined in (2.1). For the choice of E_1 in (2.1), this method reads as (2.6). Here q^k ∈ ∂TV(u^k) denotes a subgradient of TV at u^k, α ≥ 0 is a scalar, and D_TV^{q^k}(u^{k+1}, u^k) is the generalized Bregman distance [20] with respect to the total variation, i.e., D_TV^{q^k}(u^{k+1}, u^k) := TV(u^{k+1}) − TV(u^k) − ⟨q^k, u^{k+1} − u^k⟩. Note that (2.6) reduces to the projected gradient method (2.2) for the choice α = 0. Replacing the total variation seminorm in (2.4) with its Bregman distance yields an iterative scale-space method that changes the influence of the total variation regularization throughout the course of the iteration. With a larger parameter α, the initial iterates have a very low total variation and contain only coarse features. Throughout the iteration, features of finer and finer scale are introduced. We have visualized several iterates of (2.6) for the choice α = 0.05 in Figure 2.2 to demonstrate this phenomenon.
We observe that this modification of projected gradient descent enables us to converge to minimizers of E_1 as defined in (2.1) that are fairly close to the original choices of û and ĥ. Hence, the choice of Bregman distance strongly affects the outcome of the iteration procedure and can be used to guide the iterates toward more desirable outcomes.
Obviously, real data is never in the range of the forward model, and in that case we do not want to converge to a minimizer of E_1. However, we can still apply the linearized Bregman iteration in combination with early stopping in order to produce superior results compared to projected or proximal gradient descent, which we will further demonstrate in sections 5 and 6.
Prior to this, we provide a comprehensive convergence analysis of the linearized Bregman iteration in sections 3 and 4.
3. Linearized Bregman iteration for nonconvex problems. We are interested in the minimization of functions E ∈ S_L, where S_L is the class of L-smooth functions as defined in Definition A.8 in the appendix. We want to emphasize that the function E does not necessarily have to be convex. In order for the minimization of E to make sense, we have to introduce some additional assumptions for this function first. From now on we assume E ∈ Ψ_L, where, in addition to E ∈ S_L, membership in Ψ_L requires that sequences {u^k}_{k∈N} are already bounded if the sequences {E(u^k)}_{k∈N} are bounded, that an infimum exists, and that the set of critical points crit(E) := {u ∈ R^n : ∇E(u) = 0} is nonempty.
We want to minimize E iteratively in a way that allows us to follow solution paths of different regularity. This regularity will be induced by an additional function J ∈ Γ_0, where Γ_0 is defined in the appendix. Precisely, we approach the minimization of E via the linearized Bregman iteration (3.2), for k ∈ N, a sequence of positive parameters {τ_k}_{k∈N}, and initial values u^0 and p^0 with p^0 ∈ ∂J(u^0). Here ∂J denotes the subdifferential; we refer to the appendix for its definition. Note that (3.2b) is simply the optimality condition of (3.2a). If J is differentiable, ∂J is single-valued and we do not have to compute (3.2b), as we do not need to pick a specific element from the set. However, if ∂J is multivalued, (3.2) guarantees p^{k+1} ∈ ∂J(u^{k+1}) for all k ∈ N. This general form of linearized Bregman iteration for the minimization of nonconvex functions is summed up in Algorithm 3.1.
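Since the displayed updates of (3.2) did not survive extraction, the following is a hedged reconstruction in LaTeX, consistent with the surrounding description (in particular with (3.2b) being the optimality condition of (3.2a), and with p^{k+1} ∈ ∂J(u^{k+1})); it should be read as a plausible form of the update rather than a verbatim reproduction of the original display:

\begin{align}
  u^{k+1} &\in \arg\min_{u \in \mathbb{R}^n} \left\{ \tau_k \langle \nabla E(u^k), u - u^k \rangle + D_J^{p^k}(u, u^k) \right\}, \tag{3.2a}\\
  p^{k+1} &= p^k - \tau_k \nabla E(u^k) \in \partial J(u^{k+1}), \tag{3.2b}
\end{align}

where D_J^{p}(u, v) := J(u) - J(v) - \langle p, u - v \rangle denotes the generalized Bregman distance of J.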
Note that for the choice J = (1/2)‖·‖², iteration (3.2) (and therefore also Algorithm 3.1) reduces to classical gradient descent. Hence, the linearized Bregman iteration is indeed a generalization of gradient descent.
Based on what has become known as the Bregman iteration [27, 76, 36, 46, 65], the linearized Bregman iteration was initially proposed in [33] for the computation of sparse solutions of underdetermined linear systems of equations. It has been extensively studied in this context (cf. [83, 25, 24]) and also in the context of the minimization of more general convex functions (see [82]). It is also closely linked to (linearized variants of) the alternating direction method of multipliers [39], as well as generalizations to nonquadratic Bregman distances [79]. It has further been analyzed in the context of nonlinear inverse problems in [5]. In [10], the linearized Bregman iteration has been studied in the context of minimizing general smooth but nonconvex functions. Algorithm 3.1 allows us to control the scale of the iterates, depending on the choice of J. Note that we can also reformulate (3.2a) as in (3.3). In order to ensure that a solution of update (3.3) (respectively, (3.2a)) exists, we choose J such that J(u) + τ_k⟨u^*, u⟩ is coercive for all u^* ∈ R^n. In particular, we choose J to be of the form that leads to the updates (3.4), with q^k ∈ ∂R(u^k). Note that (3.4b) can be rewritten as (3.5), and hence for constant stepsize τ_k = τ, (3.4a) simplifies to (3.6). Equations (3.4) are summarized in Algorithm 3.2. Note that both Algorithm 3.2 and equation (3.6) demonstrate that this specialized linearized Bregman iteration is indeed different from proximal gradient descent, for which one iterate consists of a gradient step on E followed by the proximal map of R. Instead, from (3.4a) we observe that one computes a subgradient descent step on E − R, followed by an application of the proximal map with respect to R.
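As an illustration of the structure of Algorithm 3.2, here is a minimal NumPy sketch under the following assumptions: each iteration performs a subgradient step on E − R followed by the proximal map of R, with the dual variable updated so that q^{k+1} ∈ ∂R(u^{k+1}), as described above; R is taken to be α‖·‖_1 purely for illustration (so that the proximal map is soft-thresholding); and the stepsize is kept fixed. The helper names and the toy objective in the usage example are hypothetical and not taken from the paper.

import numpy as np

def soft_threshold(v, lam):
    # proximal map of lam * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def linearized_bregman(grad_E, u0, alpha=0.1, tau=1e-3, iters=500):
    u = u0.copy()
    q = np.zeros_like(u0)                         # q^0 in alpha * d||u^0||_1, valid if u0 = 0
    for _ in range(iters):
        g = grad_E(u)
        v = u + tau * (q - g)                     # subgradient step on E - R
        u_next = soft_threshold(v, tau * alpha)   # proximal step with respect to R
        q = q - g - (u_next - u) / tau            # keeps q in the subdifferential of R at u_next
        u = u_next
    return u, q

# usage sketch with a hypothetical nonconvex toy objective E(u) = 0.25 * ||A u - f||^4
rng = np.random.default_rng(0)
A, f = rng.standard_normal((20, 50)), rng.standard_normal(20)
grad_E = lambda u: np.linalg.norm(A @ u - f) ** 2 * (A.T @ (A @ u - f))
u_hat, q_hat = linearized_bregman(grad_E, np.zeros(50))

Early iterates of this sketch are very sparse and become denser as q accumulates gradient information, which mirrors the coarse-to-fine behavior described in section 2.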
In the following we prove decrease properties and a global convergence result for Algorithm 3.2.

Algorithm 3.2. Specialized linearized Bregman iteration for minimizing E
Initialize {τ_k}_{k∈N}, u^0, and q^0 ∈ ∂R(u^0).

4. A global convergence result for Algorithm 3.2. The convergence analysis is inspired by the global convergence recipe of [16]. It is an extension to a class of nonsmooth surrogate functions, for which a tailored convergence analysis is presented that utilizes the convexity of R. We begin our analysis of Algorithm 3.2 by showing a sufficient decrease property of the surrogate function and a subgradient bound by the (primal) iterates gap. In order to do so, we first define the following surrogate function for E.
Here R * denotes the convex conjugate of R as defined in Definition A.
for any z ∈ ∂R^*(y), which implies F(x, y) ≥ E(x) for all x, y ∈ R^n. Before we continue, we want to introduce the concise notation s^k := (u^k, q^{k−1}) for all k ∈ N, such that F(s^k) = F(u^k, q^{k−1}). With the following lemma we prove a sufficient decrease property of the surrogate energy (4.1) for subsequent iterates.
Lemma 4.2 (sufficient decrease property). Assume E ∈ Ψ_L and R ∈ Γ_0. Further, suppose that the stepsize τ_k satisfies the condition (4.2) for some ρ_1 > 0 and all k ∈ N. Then the iterates of Algorithm 3.2 satisfy the descent estimate (4.3) for s^k := (u^k, q^{k−1}) and F as defined in (4.1). In addition, we observe (4.4).

Proof. First of all, we compute the optimality condition of (3.4a), which is also the rearranged update formula (3.4b) as mentioned earlier (for q^{k+1} ∈ ∂R(u^{k+1})). Taking the inner product with u^{k+1} − u^k therefore yields (4.5). Due to the Lipschitz-continuity of the gradient of E we can use (A.4) from the appendix to estimate further. Together with (4.5) and the stepsize bound (4.2), and after adding the appropriate term to both sides of the resulting inequality, we can conclude (4.6). Due to the nonnegativity of D_R^{q^{k+1}}(u^k, u^{k+1}) and D_R^{q^{k−1}}(u^k, u^{k−1}), we have verified (4.3). Moreover, summing up (4.6) over k = 0, ..., N yields (4.4), due to ρ_1 > 0.
Remark 2. As Lemma 4.2 implies the monotonic decrease F(s^{k+1}) ≤ F(s^k), we already know that the sequence {F(s^k)}_{k∈N} is bounded from above. It is also bounded from below, since F(s^k) ≥ E(u^k) ≥ inf_u E(u) > −∞. It is worth mentioning that the name "sufficient decrease" can be misleading in the context of Algorithm 3.2, as it is not unusual for specific choices of R that the function value of E does not change for several iterations.
Our next result is a bound for the subgradients of the surrogate energy at the iterates computed with Algorithm 3.2. Note that the subdifferential of the surrogate objective can be computed explicitly, which can, for example, be deduced from [74]. With q^{k+1} ∈ ∂R(u^{k+1}), and the fact that q^k ∈ ∂R(u^k) is equivalent to u^k ∈ ∂R^*(q^k) (Lemma A.4 in the appendix), we obtain a specific sequence of subgradients {r^k}_{k∈N}. Subsequently, we want to show that the norm of this sequence of subgradients is bounded by the iterates gap of the primal variable.
Together with (3.4b) we therefore estimate where we have made use of the Lipschitz-continuity of the gradient of E.
Remark 3. We want to point out that the Lipschitz-continuity of ∇E is not necessary if R ≡ 0. In that case it is easy to see that we can obtain the estimate ‖∇E(u^k)‖ ≤ (1/τ_min)‖u^{k+1} − u^k‖ instead of (4.8) (see also [10]), without the use of Lipschitz-continuity. For the sufficient decrease theorem, Lemma 4.2, it is already enough to choose τ_k such that G := (1/2)‖·‖² − τ_k E is convex for all arguments and all k ∈ N. This observation has already been made and exploited in [6, 10, 17]. We also want to emphasize that the requirement of Lipschitz continuity can potentially be relaxed if backtracking strategies are incorporated into Algorithm 3.2.
To conclude our convergence analysis we prove global convergence of Algorithm 3.2 with the help of the Kurdyka–Łojasiewicz (KL) property as defined in the appendix in Definition A.11. In order to apply the KL property, we have to verify some properties of the set of limit points. Let {s^k}_{k∈N} = {(u^k, q^{k−1})}_{k∈N} be a sequence generated by Algorithm 3.2 from starting points u^0 and q^0 with q^0 ∈ ∂R(u^0). The set of limit points is defined as

ω(s^0) := { s = (u, q) ∈ R^n × R^n : there exists an increasing sequence of integers {k_j}_{j∈N} such that lim_{j→∞} u^{k_j} = u and lim_{j→∞} q^{k_j} = q }.
Before we continue, we want to emphasize that the current assumptions on E and R are not sufficient to guarantee convergence of the dual variable, which we want to demonstrate with a simple counterexample.
Remark 4. It is obvious that E ∈ Ψ_1 and that the only critical point of E is û = −1. However, Algorithm 3.2 can never converge to that point but will converge to u = 0 due to the choice of R. This can be seen, for instance, for the choices u^0 > 0, q^0 = 0, and τ_k = 1. Then the subsequent iterates are u^k = 0 and q^k = u^0 − k; thus u^k → 0 and q^k → −∞.
For convex, quadratic fidelity terms (such as E in the example above) it is sufficient to satisfy a source condition of the form ∂R(û) ≠ ∅ (which in Remark 4 is clearly violated) in order to guarantee boundedness of the subgradients; see, for instance, [38]. For general, nonconvex terms E it is not straightforward to adapt the concept of source conditions, which is why we are going to assume local boundedness of the subgradients instead.
Definition 4.4 (locally bounded subgradients). We say that R has locally bounded subgradients if for every compact set U ⊂ R^n there exists a constant C ∈ (0, ∞) such that for all v ∈ U and all q ∈ ∂R(v) we have ‖q‖ ≤ C.
Boundedness is not a very restrictive requirement, as it is, for instance, satisfied for the large class of Lipschitz-continuous functions.

Proof of Proposition 4.5. From the convexity of R we observe ⟨q, h⟩ ≤ R(v + h) − R(v) ≤ L‖h‖ for any U ⊂ R^n and any h, v ∈ U with v + h ∈ U and q ∈ ∂R(v). Taking the supremum over h with ‖h‖ ≤ 1 shows ‖q‖ ≤ L, which proves the assertion.
Remark 5. Note that every continuously differentiable function is already locally Lipschitz continuous and therefore has locally bounded gradients according to Proposition 4.5.
Before we show global convergence of Algorithm 3.2 to a critical point of E, we need to verify that the surrogate function converges to E on ω(s^0), that ω(s^0) is a nonempty, compact, and connected set, and that its primal limiting points form a subset of the set of critical points of E. The following lemma guarantees that for a sequence converging to a limit point we also know that the surrogate objective converges to the objective evaluated at this limit point.

Lemma 4.6. Suppose E ∈ Ψ_L, R ∈ Γ_0, and let s ∈ ω(s^0). Then we already know that (4.9) holds, i.e., lim_{k→∞} F(s^k) = F(s) = E(u).

Proof. Since s is a limit point of {s^k}_{k∈N} we know that there exists a subsequence {s^{k_j}}_{j∈N} with lim_{j→∞} s^{k_j} = s. Hence, we immediately obtain lim_{j→∞} F(s^{k_j}) = E(u), due to the continuity of E and lim_{j→∞} D_R^{q^{k_j−1}}(u^{k_j}, u^{k_j−1}) = 0 as a result of Lemma 4.2. Since {F(s^k)}_{k∈N} is also monotonically decreasing and bounded from below according to Remark 2, we can further conclude (4.9) as a consequence of the monotone convergence theorem.
In addition to Lemma 4.6, the following lemma states that ω(s^0) is a nonempty, compact, and connected set and that the objective F is constant on that set.

Lemma 4.7 ([16, Lemma 5]). Suppose E ∈ Ψ_L and that R ∈ Γ_0 has locally bounded subgradients. Then the set ω(s^0) is a nonempty, compact, and connected set, the surrogate objective F is constant on ω(s^0), and we have lim_{k→∞} dist(s^k, ω(s^0)) = 0.

We can further verify that the set of primal limiting points is a subset of the set of critical points of the energy E.

Lemma 4.8. Suppose E ∈ Ψ_L, and that R ∈ Γ_0 has locally bounded subgradients. Then we have u ∈ crit(E) for every s = (u, q) ∈ ω(s^0).
Proof. We prove this assertion by contradiction to the boundedness of the subgradients. Let s := (u, q) ∈ ω(s^0), which means lim_{k→∞} u^k = u. Assume that ∇E(u) ≠ 0 and let c := ‖∇E(u)‖ > 0. We use the subgradient update (3.5) and the reverse triangle inequality ‖a + Σ_i a_i‖ ≥ ‖a‖ − Σ_i ‖a_i‖. As u^k → u, there exists K ∈ N such that for all n ≥ K the bounds ‖u^n − u‖ ≤ cτ_min/8 and ‖∇E(u^n) − ∇E(u)‖ ≤ c/4 hold. Thus, we obtain a lower bound for all n ≥ K, and a bound with a constant independent of k for all k ∈ N. Combining these two estimates yields lim_{k→∞} ‖q^k‖ = ∞, which is a contradiction to the boundedness of {q^k}_{k∈N}. Thus, ∇E(u) = 0, which means u ∈ crit(E).

Now we have all the necessary ingredients to show the following global convergence result for Algorithm 3.2.

Theorem 4.9 (finite length property). Suppose that F is a KL function in the sense of Definition A.11. Further, assume R ∈ Γ_0 with locally bounded subgradients. Let {s^k}_{k∈N} = {(u^k, q^{k−1})}_{k∈N} be a sequence generated by Algorithm 3.2. Then the sequence {u^k}_{k∈N} has finite length, i.e., Σ_{k=0}^{∞} ‖u^{k+1} − u^k‖ < ∞.

Proof. We follow the steps of the proof of [16, Theorem 1], but with nontrivial modifications.
The sequence {u^k}_{k∈N} is bounded, which follows from the assumption E ∈ Ψ_L and the monotonic decrease. Thus, we know that there exists a convergent subsequence {u^{k_j}}_{j∈N} and u ∈ R^n with lim_{j→∞} u^{k_j} = u. As a consequence of Lemma 4.6 we further know that lim_{k→∞} F(s^k) = F(s) = E(u). If there exists an index l ∈ N with F(s^l) = E(u), the results follow trivially. If there does not exist such an index, we observe that for any η > 0 there exists an index k_1 such that F(s^k) < E(u) + η for all k > k_1. In addition, for any ε > 0 there exists an index k_2 with dist(s^k, ω(s^0)) < ε for all k > k_2, due to Lemma 4.7. Hence, if we choose l := max(k_1, k_2), we know that u^k is in the set (A.5) for all k > l, according to Lemma A.12 in the appendix.
By Lemma 4.7, ω(s^0) satisfies all the assumptions of Lemma A.12 and we have the KL inequality (4.11) for all k > l. This inequality makes sense due to F(s^k) > E(u) for all k.
From the concavity of ϕ we know that ϕ(x) − ϕ(y) ≥ ϕ′(x)(x − y) holds for all x, y ∈ [0, η) with x > y, which we will use for the specific choices x = F(s^k) − E(u) and y = F(s^{k+1}) − E(u). Combining the latter with Lemma 4.2 and abbreviating the resulting quantities yields the estimate (4.12). Inserting (4.12) and the subgradient bound (4.8) into the KL inequality (4.11) leads to a bound on the squared iterates gap. Taking the square root, multiplying by 2, and using Young's inequality of the form 2√(ab) ≤ a + b, we can sum the resulting estimate over k and obtain the finite length property by taking the limit N → ∞.
Corollary 4.10 (convergence).Under the same assumptions as Theorem 4.9, the sequence {u k } k∈N converges to a critical point of E.
Proof. As in the proof of [16, Theorem 1(ii)], the finite length property of Theorem 4.9 implies that the partial sums Σ_{k=l}^{N} ‖u^{k+1} − u^k‖ converge for N → ∞, so their tails vanish. Thus, for any s ≥ r ≥ l we have ‖u^s − u^r‖ ≤ Σ_{k=r}^{s−1} ‖u^{k+1} − u^k‖, which can be made arbitrarily small. This shows that {u^k}_{k∈N} is a Cauchy sequence and, thus, is convergent. According to Lemma 4.8 its limit is a critical point of E.

4.1. Global convergence in the absence of locally bounded subgradients. In the previous section we have made the assumption that the subgradients of R have to be locally bounded in order to guarantee convergence of the primal iterates to a critical point of E. In Remark 4 we have seen an example for which the subgradients of R diverge, but the primal iterates still converge, just not to a critical point of E. This leaves us with two open questions: (1) could we prove convergence of the primal iterates without boundedness of the dual iterates, and (2) would the limit (if it exists) be a critical point of some other energy? It might be possible to answer the first question by slightly modifying Definition A.11 and Lemma A.12 in the appendix, as well as Lemma 4.7, to accommodate the fact that the surrogate function is also constant on the set of limiting points that only depends on the primal variable (which we denote by ω(u^0) for convenience). A potential modification of (A.5) in Lemma 4.7 could, for instance, be formulated in terms of u ∈ ω(u^0). Note that this modification would not affect the finite length proof of Theorem 4.9 and therefore would still imply global convergence, but not necessarily to a critical point of E. Remark 4 leaves room for speculation whether an answer to the second question is that the primal iterates converge to a critical point of E + χ_dom(R), where χ_dom(R) denotes the characteristic function over the effective domain of R. Proving this, however, is beyond the scope of this paper.

4.2. Limitations of the convergence analysis and possible remedies. The convergence analysis presented in this paper relies on the fact that the function E satisfies E ∈ S_L, which is often restrictive for practical applications. Even simple functions such as the blind deconvolution data fidelity term from section 2 are not globally L-smooth. A remedy is the use of an alternating version of Algorithm 3.2 in the spirit of [16], which makes use of local smoothness of the functions with fixed variables. For two variables u_1 and u_2, such a scheme alternates Bregman-type updates in u_1 and u_2 (a possible realization is sketched below). Here ∇_1 and ∇_2 refer to the partial gradients of E with respect to u_1 and u_2, and the corresponding dual variables are subgradients in the subdifferentials of J_1 and J_2, respectively. The analysis of such a scheme should be relatively straightforward but is beyond the scope of this work.
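The following sketch spells out one possible realization of such an alternating scheme, under the assumption that each block update has the same prox/subgradient structure as Algorithm 3.2; grad1, grad2, prox_R1, and prox_R2 are user-supplied placeholders rather than the authors' notation.

def alternating_bregman(grad1, grad2, prox_R1, prox_R2, u1, u2, q1, q2, tau1, tau2, iters):
    # grad1/grad2: partial gradients of E; prox_Ri(v, tau): proximal map of tau * R_i
    for _ in range(iters):
        g1 = grad1(u1, u2)
        u1_next = prox_R1(u1 + tau1 * (q1 - g1), tau1)   # Bregman update in the first block
        q1 = q1 - g1 - (u1_next - u1) / tau1
        u1 = u1_next
        g2 = grad2(u1, u2)                               # uses the freshly updated u1
        u2_next = prox_R2(u2 + tau2 * (q2 - g2), tau2)   # Bregman update in the second block
        q2 = q2 - g2 - (u2_next - u2) / tau2
        u2 = u2_next
    return u1, u2, q1, q2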
Another limitation in terms of convergence analysis that becomes obvious from the motivating example in section 2 is the use of characteristic functions. If we incorporate them in the function R, we run into the issues outlined in section 4.1. If we add them to the objective function E, we lose continuity and differentiability. A remedy for the blind deconvolution example (and many similar examples) in section 2 is that for the convolution kernel the additional Bregman function R is simply zero, so that the algorithm merely has to perform a proximal point step in the direction of the convolution kernel. The convergence analysis in such a setting is straightforward, but we did not include it in order not to complicate notation. Alternatively, one could replace the characteristic function with its Moreau–Yosida envelope.
This concludes the theoretical analysis of Algorithm 3.2. In the following two sections we are going to discuss three applications, their mathematical modeling in the context of Algorithm 3.2, and their numerical results.

5. Applications.
We demonstrate the capabilities of the linearized Bregman iteration by using it to approximately solve several nonconvex minimization problems. We say approximately, as we do not exactly minimize the corresponding objective functions but rather compute iteratively regularized solutions to the associated inverse problems via early stopping of the iteration.
5.1. Parallel magnetic resonance imaging. In (standard) MRI the goal is to recover the spin-proton density from subsampled Fourier measurements that were obtained with a single radio-frequency (RF) coil. In parallel MRI, multiple RF coils are used for taking measurements, thus allowing one to recover the spin-proton density from more measurements compared to the standard case. This, however, comes at the cost of having to model the sensitivities of the individual RF coils w.r.t. the measured material. We basically follow the mathematical modeling of [70, 77] and describe the recovery of the spin-proton density and the RF coil sensitivities as the minimization of the energy function (5.1). Here F ∈ C^{n×n} is the (discrete) Fourier transform, S ∈ {0, 1}^{m×n} is a subsampling operator, K is the nonlinear operator K(u, b_1, ..., b_s) = (ub_1, ub_2, ..., ub_s)^T, u denotes the spin-proton density, b_1, b_2, ..., b_s denote the s coil sensitivities, f_1, ..., f_s denote the corresponding subsampled k-space data, and ε > 0 is a scalar parameter that ensures bounded level-sets of E. Since C has the same topology as R × R, we can formally treat all variables as variables in R^{2n}. Note that E as defined in (5.1) is not globally L-smooth, which is why we also assume that we choose parameters and initial values such that, along the sequence {u^k}_{k∈N} of primal variables generated by Algorithm 3.2, the gradient of E is Lipschitz continuous in the sense of Definition A.7 with constants L_k. Furthermore, we assume that the sequence {L_k}_{k∈N} is bounded from above, i.e., L_k ≤ L for all k ∈ N, and consequently E ∈ Ψ_L. It is not necessarily straightforward to prove existence of L a priori, but it is relatively easy to validate it a posteriori. Note that, alternatively, one could use an alternating version of Algorithm 3.2 as discussed in section 4.2.
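As an illustration of the forward model, the following NumPy sketch evaluates a data fit of the assumed form (1/2) Σ_j ‖S F(u b_j) − f_j‖² together with gradients obtained from the adjoint operations; the FFT normalization, the real/complex identification, and the omission of the additional ε-dependent terms of (5.1) are our simplifications, not the authors' exact formulation.

import numpy as np

def pmri_energy_and_grads(u, b, f, S):
    # u: complex image, b: list of coil sensitivities, f: list of k-space data, S: binary mask
    E = 0.0
    grad_u = np.zeros_like(u)
    grad_b = [np.zeros_like(bj) for bj in b]
    for j, (bj, fj) in enumerate(zip(b, f)):
        r = S * np.fft.fft2(u * bj) - fj          # residual of coil j in k-space
        E += 0.5 * np.linalg.norm(r) ** 2
        back = np.fft.ifft2(S * r) * u.size       # adjoint of S∘F for numpy's unnormalized fft2
        grad_u += np.conj(bj) * back              # (Wirtinger-type) gradient w.r.t. u
        grad_b[j] = np.conj(u) * back             # gradient w.r.t. the j-th coil sensitivity
    return E, grad_u, grad_b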
The inverse problem of parallel MRI has been the subject of numerous research publications [71, 48, 12]. We follow a different methodology here and apply Algorithm 3.2 to approximately minimize (5.1) with the following configuration. We choose the function R to be of the following form. Here ∇ denotes a discrete finite forward difference approximation of the gradient, |·| is the Euclidean vector norm, C denotes the discrete two-dimensional cosine transform, {w_l}_{l∈{1,...,n}} is a set of weighting coefficients, and α_0, ..., α_{s+1} are positive scaling parameters. Note that all functions are chosen to be semialgebraic, and semialgebraic functions and their additive compositions are KL functions (see [2, 3, 4]). Iterating Algorithm 3.2 for too long may lead to unstable minimizers of (5.1) in case the k-space data f_1, ..., f_s are noisy, which is why we are going to apply Morozov's discrepancy principle [60] as a stopping criterion to stop the iteration early (see also [65, 40, 56], and [75, 5, 45] in the context of nonlinear inverse problems); i.e., we stop the iteration as soon as the corresponding discrepancy inequality is satisfied, for some η > 0. Usually η depends on the variance of the normal-distributed noise.
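A minimal sketch of this early-stopping rule, assuming a user-supplied single iteration of Algorithm 3.2 and a residual function that evaluates the data fit (both hypothetical placeholders):

def run_with_discrepancy_stop(step, residual, u, q, eta, max_iters=10000):
    # step(u, q): one iteration of Algorithm 3.2; residual(u): data fidelity value
    for k in range(max_iters):
        if residual(u) <= eta:      # Morozov's discrepancy principle: stop early
            break
        u, q = step(u, q)
    return u, q, k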
5.2. Blind deconvolution. Blind deconvolution is extensively discussed in the literature, e.g., [49, 30, 26] and the references therein, with several approaches for which the convergence proofs also rely on the KL inequality [15, 72, 32]. We follow the same setting as in section 2 (with additional regularization as in (5.1) in order to guarantee bounded level-sets) and make the assumptions that the blur-free image u has low total variation and that the kernel h satisfies a simplex constraint, i.e., all entries are nonnegative and sum up to one. The assumption of low total variation can, for instance, be motivated by [31], but as we have seen in section 2, minimizing E with some additional total variation regularization often does not lead to visually satisfactory results. We therefore apply Algorithm 3.2 with R : R^n × R^r → R chosen in analogy to section 2 (a total variation term on the image, scaled by α ≥ 0). All functions are semialgebraic, and we make the same local smoothness assumption as in section 5.1. In case of noisy data, we will proceed as in section 5.1 and stop the iteration via the discrepancy principle.

5.3. Classification.
The last application that we want to discuss is the classification of images. Given a set D ∈ R^{s×r} of r training images (with s pixels each) in column vector form, we want to train a neural network to classify those images. We do so by learning the parameters (A_1, ..., A_l) of the l-layer neural network ρ(x) := ρ_1(A_1 ρ_2(A_2 ··· ρ_l(A_l x) ···)) in a supervised fashion. Here the parameters A_j ∈ R^{m_j×n_j} are matrices of different size, and the functions {ρ_j}_{j=1}^{l} are so-called activation functions of the neural net. Typical choices for activation functions are max- and min-functions, also known as rectifiers. However, due to their nondifferentiability it is common to approximate them, for instance, with a pointwise smooth max function. Note that if each function ρ_j(A_j x) is chosen to be semialgebraic, the composition ρ is also semialgebraic; see [1, Proposition 2.2.10]. If we choose ρ_j(y) := min(1, max(0, y)) for all j ∈ {1, ..., l}, for instance, we can then show that ρ is also semialgebraic.
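The smooth-max display did not survive extraction; as a stand-in, the following sketch implements a standard log-sum-exp smoothing of the rectifier max(0, x) with parameter β, which recovers the rectifier as β → ∞. This choice is our assumption, consistent with the remark on β in section 6.3, and not necessarily the authors' exact formula.

import numpy as np

def smooth_relu(x, beta=10.0):
    # numerically stable (1/beta) * log(1 + exp(beta * x)); tends to max(0, x) as beta -> infinity
    return np.logaddexp(0.0, beta * x) / beta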
Defining a nonlinear operator K(A_1, A_2, ..., A_l) := ρ_1(A_1 ρ_2(A_2 ··· ρ_l(A_l D) ···)) for a given matrix D and a given label matrix Y ∈ R^{m_1×r}, we aim to minimize the energy (5.3), where D : R^{m_1×r} × R^{m_1×r} → R denotes a function that measures the distance between its arguments in some sense. Our choice for D is simply the squared Frobenius norm of the difference of its arguments, but other choices are possible. As mentioned earlier, the whole objective E can be made a KL function if, for instance, D and ρ are chosen to be semialgebraic, as their composition will also be semialgebraic.
As in the previous sections, we aim to minimize (5.3) with Algorithm 3.2 and make the same local smoothness assumption as before. This time we choose R(A_1, ..., A_l) = Σ_{j=1}^{l} α_j ‖A_j‖_*. Here {α_j}_{j=1}^{l} is a set of positive scaling parameters, and ‖X‖_* := Σ_{i=1}^{rank(X)} σ_i is the one-norm of the singular values {σ_i}_{i=1}^{rank(X)} of the argument X, also known as the nuclear norm. The rationale behind this choice for R is that we can create iterates where the ranks of the individual matrices are steadily increasing. This way we control the number of effective parameters and do not fit all parameters right from the start.
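Running Algorithm 3.2 with this choice of R requires the proximal map of the nuclear norm, which is realized by soft-thresholding the singular values (singular value thresholding). The sketch below is our illustration of that standard operation, not code from the paper.

import numpy as np

def prox_nuclear(X, lam):
    # proximal map of lam * ||X||_*: soft-threshold the singular values
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

Early iterates therefore have many singular values thresholded to zero, i.e., the matrices A_j start out with low rank and gain rank as the iteration proceeds.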
6. Numerical results. We demonstrate the particular properties and idiosyncrasies of Algorithm 3.2 by computing several numerical solutions to the problems described in section 5. All results have been computed with MATLAB R2017b. The code for the following examples is available at https://doi.org/10.17863/CAM.16931 and can be used under the Creative Commons Attribution (CC BY) license once the article is accepted for publication.
Notably, all regularization parameters that ensure boundedness of the level-sets are set to the smallest possible value (ε = machine accuracy) in practice. Since we do not use explicit Lipschitz constants, we employ a naïve backtracking strategy for the variable stepsize {τ_k}_{k∈N}. We start with an initial stepsize τ_0 > 0 and check after each iteration whether E(u^{k+1}) ≤ E(u^k) + ε is satisfied. Here, ε > 0 is a small constant that accounts for numerical rounding errors that may cause E(u^{k+1}) > E(u^k) when E(u^{k+1}) ≈ E(u^k). If the decrease is satisfied, we set τ_{k+1} = τ_k; otherwise we set τ_{k+1} = (3τ_k)/4 and backtrack again until we get a decrease. We want to emphasize that more sophisticated backtracking approaches can be used; we found, however, that the naïve strategy that we use already works well for the computational results shown in the following subsections.
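A sketch of this naïve backtracking rule, with bregman_update standing in for one iteration of Algorithm 3.2 at a given stepsize (a hypothetical helper):

def backtracked_step(E, bregman_update, u, q, tau, eps=1e-12, max_shrinks=50):
    # repeat the step with tau <- 3*tau/4 until E decreases up to the tolerance eps
    for _ in range(max_shrinks):
        u_next, q_next = bregman_update(u, q, tau)
        if E(u_next) <= E(u) + eps:
            return u_next, q_next, tau
        tau *= 0.75
    return u_next, q_next, tau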
6.1. Parallel MRI. We compute parallel MRI reconstructions from real k-space data. We use data from a T2-weighted TSE scan of a transaxial slice of a brain acquired with a four-channel head-coil in [47]. A reconstruction from fully sampled data is taken as a ground truth. The spiral subsampling is simulated by pointwise multiplication of the k-space data with the spiral pattern visualized in Figure 5.1(d). We initialize with u^0 = 2 × 1_{65536×1} and b_j^0 = 1_{65536×1} for j ∈ {1, ..., 4} and compute a q^0 ∈ ∂R(u^0).

6.2. Blind deconvolution. To simulate blurring of a gray-scale image f_orig ∈ R^{424×640} we subtract its mean, normalize it, and subsequently blur f_orig with a motion-blur filter h ∈ R^{9×31}. The filter was obtained with the MATLAB command fspecial('motion', 30, 15), and we assume periodic boundary conditions for the blurring process. Subsequently we add normally distributed noise with mean zero and standard deviation σ = 10^{-4} to obtain a blurry and noisy image f with ground truth f_orig. Both f_orig and f, as well as h, are visualized in Figure 6.1.
We use f as our input image for Algorithm 3.2. We initialize Algorithm 3.2 with u^0 = 0 and q^0 = 0. We choose h^0 = 1/(r^2) × 1_{r×r} for r = 35 to ensure that h^0 satisfies the simplex constraint. We set τ_0 = 2 and pick α ∈ {10^{-1}, 10^{-3}}. We then iterate Algorithm 3.2 until the discrepancy principle is violated for η = (1.2σ^2)/(2√(424 × 640)). The inner total variation subproblem is solved with the primal-dual hybrid gradient method [84, 67, 37, 28, 29]. The results are visualized in Figure 6.1.

6.3. Classification. We test the proposed framework for the classification of images of handwritten digits. We use the well-known MNIST dataset [51] as the basis for our classification. Ten example images of each class are visualized in Figure 6.2, and the data is split for cross validation. We model our classifier as a two-level neural network as described in section 5.3. We choose the original rectifier activation functions for the network's architecture, in order to ensure that the composition is semialgebraic and that the KL condition is satisfied. We overcome the nondifferentiability by setting the derivatives to zero at the nondifferentiable points. This is consistent with the smooth-max approximation of the rectifier for β → ∞. We choose D to be the squared Frobenius norm and set the scaling parameters to α_1 = α_2 = 0.2. The stepsize is initialized with τ_0 = 10^{-3}. Subsequently, we run Algorithm 3.2 for 10000 iterations. The prediction results of the classifier and the rank of the trained matrices are visualized in Figure 6.

Figure 2.1. Standard approaches for blind deconvolution. Figure 2.1(a) shows the image û of Pixel the Gambian pouched rat, courtesy of Monique Boddington. Figure 2.1(b) shows a motion-blurred version f of that same image; the corresponding convolution kernel ĥ is depicted in the bottom left corner. Figure 2.1(c) visualizes the reconstruction of the image and the convolution kernel obtained with the projected gradient descent method (2.2). In Figure 2.1(d) we see the result of the gradient descent method (2.4) for α = 10^{-3}, whereas Figure 2.1(e) shows the result of (2.4) for the choice α = 10^{-4}.


Figure 2.2. Proposed approach for blind deconvolution. Several iterates of the linearized Bregman iteration (2.6) for the choice α = 0.05. The strong initial effect of the total variation regularization enables the algorithm to converge to a solution close to û and ĥ.

Lemma 4.3 (a subgradient lower bound for the iterates gap). Let the same assumptions hold true as in Lemma 4.2 and τ_k ≥ τ_min := inf_k τ_k > 0. Then the iterates of Algorithm 3.2 satisfy the subgradient bound (4.8).

Proposition 4.5. Let R ∈ Γ_0 be a (globally) Lipschitz continuous function in the sense of Definition A.7 in the appendix. Then R has locally bounded subgradients.

Figure 5.1. Figure 5.1(a) shows a log-plot of the modulus of the fully sampled k-space data of the first coil taken from [48]. Figure 5.1(b) shows the reconstruction of the spin proton density from the data visualized in Figure 5.1(a) via gradient descent, whereas Figure 5.1(c) shows the reconstruction of the spin proton density from the same data but via Algorithm 3.2. In Figure 5.1(d) we see roughly 25% of the k-space data visualized in Figure 5.1(a), sampled on a spiral on a cartesian grid [11]. Figure 5.1(e) shows the reconstruction of the spin proton density from this subsampled k-space data with gradient descent, while Figure 5.1(f) shows the reconstruction of the spin proton density from the same data but with Algorithm 3.2. Figures 5.1(g) and 5.1(h) show the convergence plots in terms of energy decrease per iteration for the reconstructions that are obtained from the k-space data shown in Figure 5.1(a), respectively, in Figure 5.1(d).


Figure 5.2. Figures 5.2(a)-(d) show the reconstructions of the coil sensitivities from the fully sampled data. Figures 5.2(e)-(h) show the reconstructions of the same quantities from the subsampled data.
Reconstructions from the fully sampled data are shown in Figures 5.1(c) and 5.2(a)-(d). In Figures 5.1(f) and 5.2(e)-(h) we show the results of the reconstructions from subsampled data using the subsampling scheme in Figure 5.1(d).

Figure 6.1. Figure 6.1(a) shows an image of Pixel the Gambian pouched rat. Figure 6.1(b) shows a motion-blurred version of that image, together with some added normally distributed noise. The corresponding convolution kernel is depicted in the bottom left corner. Figure 6.1(c) visualizes the reconstruction of the image and the convolution kernel with Algorithm 3.2 for the choice α = 10^{-1}. Figure 6.1(d) shows the reconstructions of the same quantities for the choice α = 10^{-3}. We clearly see that a larger choice of α results in a regular solution, whereas a smaller α will mimic traditional gradient descent with almost no additional regularity of the reconstruction.
