{"title": "Gaussian sampling by local perturbations", "book": "Advances in Neural Information Processing Systems", "page_first": 1858, "page_last": 1866, "abstract": "We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. Beyond Gaussian models, the proposed algorithm is also very useful for handling highly non-Gaussian continuously-valued MRFs such as those arising in statistical image modeling or in the first layer of deep belief networks describing real-valued data, where the non-quadratic potentials coupling different sites can be represented as finite or infinite mixtures of Gaussians with the help of local or distributed latent mixture assignment variables. The Bayesian treatment of such models most naturally involves a block Gibbs sampler which alternately draws samples of the conditionally independent latent mixture assignments and the conditionally multivariate Gaussian continuous vector and we show that it can directly benefit from the proposed methods.", "full_text": "Gaussian sampling by local perturbations

George Papandreou
Department of Statistics
University of California, Los Angeles
gpapan@stat.ucla.edu

Alan L. Yuille
Depts. of Statistics, Computer Science & Psychology
University of California, Los Angeles
yuille@stat.ucla.edu

1 Introduction

Using Markov random fields (MRFs) one can capture global statistical properties in large scale probabilistic networks while only explicitly modeling the interactions of neighboring sites.
First introduced in statistical physics, MRFs and related models such as Boltzmann machines have proved particularly successful in computer vision and machine learning tasks such as image segmentation, signal recovery, texture modeling, classification, and unsupervised learning [1, 3, 5]. Drawing random samples from MRFs and juxtaposing them with real data allows one to directly assess the model quality. Sampling of MRFs also plays an important role within algorithms for model parameter fitting [7], signal estimation, and in image analysis for texture synthesis or inpainting [16, 19, 37]. The simplest but typically very slow way to draw random samples from MRFs is through single-site Gibbs sampling, a Markov chain Monte-Carlo (MCMC) algorithm in which one visits each node in the network and stochastically updates its state given the states of its neighbors [5].

Gaussian Markov random fields (GMRFs) are an important MRF class describing continuous variables linked by quadratic potentials [3, 22, 29, 33] – see Sec. 2. They are very useful both for modeling inherently Gaussian data and as building blocks for constructing more complex models. In this paper we study a technique which allows drawing exact samples from a GMRF in a single shot by first perturbing it and then computing the least-energy configuration of the perturbed model. The perturbation involved amounts to independently injecting noise to each of the Gaussian factors/potentials in a fully distributed manner, as discussed in detail in Sec. 3. This reduction of sampling to quadratic energy minimization allows us to employ as black-box GMRF simulator any existing algorithm for MAP computation which is effective for a particular Gaussian graphical model.

The reliability of the most likely solution in a Gaussian model is characterized by the marginal variances. Marginal variances also arise in computations within non-linear sparse Bayesian learning and compressed sensing models [11, 26, 32]. However, their computation can be very challenging and a host of sophisticated techniques have been developed for this purpose, which often only apply to restricted classes of models [12, 24, 25, 28]. Being able to efficiently sample from a GMRF makes it practical to employ the generic sample-based estimator for computing Gaussian variances, as discussed in Sec. 4. This estimator, whose accuracy is independent of the problem size, is particularly attractive if only relatively rough variance estimates suffice, as is often the case in practice.

Gaussian models have proven inadequate for image modeling as they fail to capture important aspects of natural image statistics such as the heavy tails in marginal histograms of linear filter responses. Nevertheless, much richer statistical image tools can be built if we also incorporate into our models latent variables or allow nonlinear interactions between multiple Gaussian fields, and thus the GMRF sampling technique we describe here is very useful within this wider setting [10, 16, 19, 34]. In Sec. 5 we discuss the integration of our GMRF sampling algorithm in a block-Gibbs sampling context, where the conditionally Gaussian continuous variables and the conditionally independent latent variables are sampled alternately. The most straightforward way to capture the heavy-tailed histograms of natural images is to model each filter response with a Gaussian mixture expert, thus using a single discrete assignment variable at each factor [16, 23].
However, our efficient GMRF algorithm can also be used in conjunction with Gaussian scale mixture (GSM) models for which the latent scale variable is continuous [2]; we demonstrate this in the context of Bayesian signal restoration by sampling from the posterior distribution under a total variation (TV) prior, employing the GSM characterization of the Laplacian density. Further, our sampling technique also applies when the latent variables are distributed, with each hidden variable affecting multiple experts. An interesting case we examine is the recently proposed factored Gaussian restricted Boltzmann machine (GRBM) of [18], which takes into account residual correlations among visible units by modeling them as a multivariate GMRF, conditional on the distributed state of an adjacent layer of discrete hidden units. We show that we can effectively replace the hybrid Monte Carlo sampler used by [18] with a block-Gibbs sampler in which the visible conditionally Gaussian units are sampled collectively by local perturbations, potentially allowing extension of the current patch-based model to a full-image factored GRBM, as has been recently done for the fields of independent experts model [19, 23].

Our GMRF sampling algorithm relies on a property of Gaussian densities (see Sec. 3) which, in a somewhat different form, has appeared before in the statistics literature [21, 22]. However, [21, 22] emphasize direct matrix factorization methods for solving the linear system arising in computing the Gaussian mean, which cannot handle the large models we consider here, and do not discuss models with hidden variables. Variations of the sampling technique we study here have also been used in the image modeling work of [16] and very recently of [23]. However, the sampling technique in these papers is used as a tool and not studied by itself.
Apart from highlighting the power and versatility of the efficient GMRF sampling algorithm and drawing the machine learning community's attention to it, our main novel contributions in this paper are: (1) our interpretation of the Gaussian sampling algorithm as local factor perturbation followed by mode computation, which highlights its distributed nature and implies that any Gaussian mean computation routine can be equally effectively employed for GMRF sampling; (2) the application of the efficient sampling algorithm in rapid sampling and variance estimation of very large Gaussian models; and (3) the demonstration that, in the presence of hidden variables, it can be effectively integrated in a block-Gibbs sampler not only in discrete but also in continuous GSM models, and in conjunction not only with local but also with distributed latent assignment representations.

2 Gaussian graphical models

2.1 The linear Gaussian model

We are working in the context of linear Gaussian models [20], in which a hidden vector x ∈ R^N is assumed to follow a prior distribution P(x) and noisy linear measurements y ∈ R^M of it are drawn with likelihood P(y|x). Specifically:

P(x) ∝ N(Gx; μ_p, Σ_p) ∝ exp(−(1/2) x^T J_x x + k_x^T x)
P(y|x) = N(y; Hx + c, Σ_n) ∝ exp(−(1/2) x^T J_{y|x} x + k_{y|x}^T x − (1/2) y^T Σ_n^{−1} y)   (1)

where N(x; μ, Σ) = |2πΣ|^{−1/2} exp(−(1/2) (x − μ)^T Σ^{−1} (x − μ)) denotes the multivariate Gaussian density on x with mean μ and covariance Σ. It is convenient to express the prior and likelihood Gaussian densities on x in Eq. (1) in information form; the respective parameters are

J_x = G^T Σ_p^{−1} G ,  k_x = G^T Σ_p^{−1} μ_p  and  J_{y|x} = H^T Σ_n^{−1} H ,  k_{y|x} = H^T Σ_n^{−1} (y − c) .   (2)

We recall that the information form of the Gaussian density N_I(x; k, J) ∝ exp(−(1/2) x^T J x + k^T x) employs the precision matrix J and the potential vector k [13]. If J is invertible, then the standard and information representations are equivalent, with μ = J^{−1} k and Σ = J^{−1}, but the information form with J symmetric positive semidefinite is also convenient for describing degenerate Gaussian densities. Further, the precision matrix directly reveals dependencies between subsets of variables in the network: x_i and x_j are conditionally independent, given the values of the remaining components of x, iff J_{i,j} = 0, while, in general, Σ_{i,j} ≠ 0; this implies that J is typically much sparser than Σ for GMRF models, as further discussed in Sec. 2.2.

By Bayes' rule the posterior distribution of x given y is the product of the prior and likelihood terms and also has Gaussian density

P(x|y) = N(x; μ, Σ) , with μ = J^{−1} (G^T Σ_p^{−1} μ_p + H^T Σ_n^{−1} (y − c))  and  Σ^{−1} = J = G^T Σ_p^{−1} G + H^T Σ_n^{−1} H .   (3)

We assume J = J_x + J_{y|x} to be invertible, although we allow for singular J_x and/or J_{y|x}; in other words, the prior and likelihood jointly define a normalizable Gaussian density on x, although each of them on its own may leave a subspace of x unconstrained.

2.2 Gaussian Markov random fields

The K rows of G = [g_1^T; ...; g_K^T] and the M rows of H = [h_1^T; ...; h_M^T] can be seen as two sets of length-N linear filters. The respective filter responses Gx and Hx determine the prior and likelihood models of Eq. (1). We define the filter set F = [f_1^T; ...; f_L^T], L = K+M, as the union of {g_k} and {h_m} and further assume that any two filter responses are conditionally independent given x or, equivalently, that the covariance matrices in Eq.
(1) are diagonal, Σ_p = diag(Σ_{p,1}, ..., Σ_{p,K}) and Σ_n = diag(Σ_{n,1}, ..., Σ_{n,M}). Also let μ_p = (μ_{p,1}; ...; μ_{p,K}), y = (y_1; ...; y_M), and c = (c_1; ...; c_M). Then the posterior factorizes as a product of L Gaussian experts

P(x|y) ∝ ∏_{l=1}^{L} exp(−(1/2) x^T J_l x + k_l^T x) ∝ ∏_{l=1}^{L} N(f_l^T x; μ_l, Σ_l) ,   (4)

where the variances are Σ_l = Σ_{p,l}, l = 1...K, for the factors that come from the prior term and Σ_l = Σ_{n,l−K}, l = K+1...K+M, for those that come from the likelihood term; the corresponding means are μ_l = μ_{p,l} and μ_l = y_{l−K} − c_{l−K}, respectively. Comparing with Eq. (3), we see that the posterior Gaussian information parameters split additively as J = ∑_{l=1}^{L} J_l and k = ∑_{l=1}^{L} k_l. The individual Gaussian factors have potential vectors k_l = f_l Σ_l^{−1} μ_l and rank-one precision matrices J_l = f_l Σ_l^{−1} f_l^T. Since J is invertible, L ≥ N. We see that there is a one-to-one correspondence between factors and filters; moreover, the (i, j) entry of J_l is non-zero iff both the i and j entries of f_l are non-zero. If the filter has T_l non-zero elements, then the corresponding Gaussian factor will couple the T_l variables in the clique x_[l]. The resulting GMRF is depicted in factor graph form in Fig. 1(a). It is straightforward to jointly model conditionally dependent filter responses by letting Σ_p or Σ_n have block-diagonal structure, yielding multivariate Gaussian factors in Eq. (4).

2.3 Inference: Efficiently computing the posterior mean

Conceptually, the Gaussian posterior distribution is fully characterized by the posterior mean μ and covariance matrix Σ, which are given in closed form in Eq.
(3): μ is the solution of a set of linear equations whose system matrix is the N×N precision matrix J, while Σ = J^{−1}. However, naively computing these quantities can be prohibitively expensive when working with high-dimensional models, requiring O(N^3) computation and O(N^2) space. For example, a typical 1 MP image model involves N = 10^6 variables; the corresponding symmetric covariance matrix Σ is generally dense and occupies as much space as about 5×10^5 equally-sized images.

Thankfully, for the GMRF models mostly used in practice, there exist powerful inference algorithms which avoid explicitly inverting the system matrix J. In certain special cases direct methods are applicable for computing the mode, marginal variances, and samples from the posterior. For example, spatially homogeneous GMRFs give rise to a block-circulant precision matrix and exact computations can be carried out in O(N log N) complexity with DFT-based techniques [10].

Figure 1: (a) The factor graph for the posterior GMRF contains the union f_{1:L} of prior and likelihood factors/filters. An edge between a filter and a site means that the corresponding coefficient is non-zero. The variables connected to each factor comprise a clique of the GMRF. (b) Filterbank implementation of the matrix-vector multiplication Jx arising in CG (φ̄ is the spatial mirror of φ).
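To make the DFT route concrete, here is a minimal 1-D sketch (hypothetical numpy code, not the paper's implementation; the small ridge term added to the precision is an assumption to keep J invertible). The FFT diagonalizes the circulant precision matrix, so the mean solve Jμ = k costs O(N log N):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64

# Circulant precision of a stationary periodic 1-D GMRF: first-difference
# smoothness factors plus a small ridge (0.1) so J is strictly positive definite.
c = np.zeros(N)
c[0], c[1], c[-1] = 2.0 + 0.1, -1.0, -1.0           # first column of J
J = np.array([np.roll(c, i) for i in range(N)]).T   # dense circulant, for checking

k = rng.standard_normal(N)                          # arbitrary potential vector

# The DFT diagonalizes any circulant matrix, so J mu = k is solved element-wise
# in the Fourier domain.
lam = np.fft.fft(c).real                            # eigenvalues of J (real, positive)
mu_fft = np.fft.ifft(np.fft.fft(k) / lam).real

mu_dense = np.linalg.solve(J, k)                    # O(N^3) dense reference
assert np.allclose(mu_fft, mu_dense)
```

The same diagonalization also yields exact marginal variances (the average of 1/lam here) and exact samples in O(N log N), which is why stationary models are a convenient special case.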
Exact inference can also be carried out in chain- or tree-structured GMRFs using O(N) Kalman filter equations which correspond to belief propagation (BP) updates recursively in time or scale [36]. A related direct approach which in the context of GMRFs has been studied in detail by [21, 22] relies on the Cholesky factorization of the precision matrix by efficient sparse matrix techniques, which typically re-order the variables in x so as to minimize the bandwidth W of J. The resulting algorithm has O(W^2 N) speed and O(W N) space complexity, which is still quite expensive for very large scale 2-D lattice image models, since the bandwidth W increases linearly with the spatial extent of the image and the support of the filters.

More generally, for large scale and arbitrarily structured GMRFs one needs to resort to iterative techniques such as conjugate gradients, multigrid, or loopy BP in order to approximately solve the linear system in Eq. (3) and recover the most likely solution μ. Conjugate gradients (CG) [6] are generally applicable in our setup since the system matrix is positive definite. Each CG iteration involves a single matrix-vector multiplication Jx. By Sec. 2.2, this essentially amounts to computing the filter responses z_l = f_l^T x and the backprojection ∑_{l=1}^{L} Σ_l^{−1} z_l f_l, which respectively involves sending messages from the variables to the factors and back in the diagram of Fig. 1(a). The GMRFs arising in image modeling are typically defined on the image responses to a bank of linear filters {φ_ℓ}, ℓ = 1, ..., B; the spatial translation of each filter kernel φ_ℓ induces a subset of factors. In this context, the matrix-vector multiplication Jx in CG corresponds to convolutions and element-wise multiplications, as shown in the filterbank diagram of Fig. 1(b).
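A minimal 1-D numpy sketch (hypothetical code, not from the paper) checks that this filterbank computation of Jx agrees with the explicit factor form J = ∑_l f_l Σ_l^{−1} f_l^T, using a first-difference kernel and a shared factor variance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 32, 0.5
phi = np.array([-1.0, 1.0])              # first-difference kernel
x = rng.standard_normal(N)

# Dense reference: one factor per spatial translation of phi,
# so J = F^T F / sigma2 with rows of F the shifted filters f_l^T.
F = np.zeros((N - 1, N))
for i in range(N - 1):
    F[i, i:i + 2] = phi
Jx_dense = F.T @ (F @ x) / sigma2

# Filterbank version (Fig. 1(b)): correlate to get the responses z_l = f_l^T x,
# scale by Sigma_l^{-1}, then convolve with the same kernel (equivalent to
# correlating with its spatial mirror) to backproject onto the variables.
z = np.correlate(x, phi, mode='valid')
Jx_filt = np.convolve(z / sigma2, phi, mode='full')

assert np.allclose(Jx_dense, Jx_filt)
```

This is exactly the matrix-free operator a CG solver needs: only convolutions and element-wise scalings, never an explicit N×N matrix.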
The time complexity per iteration is thus low, typically O(N) or O(N log N), provided that the filter kernels φ_ℓ have small spatial support or correspond to wavelet or Fourier atoms for which fast discrete transforms exist, while computations can also be carried out on the GPU. The memory overhead is also minimal, O(N), as CG employs only 3 or 4 auxiliary length-N vectors. The convergence rate of CG is largely problem-dependent, but in many cases a relatively small number of iterations suffices to bring us close enough to the solution, especially if an effective preconditioner is used [6]. Multigrid algorithms also apply in certain of the GMRF models we consider, especially those related to physics-based variational energy and PDE formulations [29, 31]. When multigrid applies, as in the example of Sec. 3, it recovers the solution after a fixed number of iterations (independent of the problem size) and has optimal O(N) time and space complexity. Loopy BP is a powerful distributed iterative method for computing μ which is guaranteed to converge for certain GMRF classes [13, 33].

3 Gaussian sampling by independent factor perturbations

Unlike direct methods, the iterative techniques discussed in Sec. 2.3 have typically been restricted to computing the posterior mode μ and considered less suited to posterior sampling or variance computation (but see Sec. 4). However, as the following result shows, exact sampling from a linear Gaussian model can be reduced to computing the mode of a Gaussian model with identical precision matrix J but randomly perturbed potential vector k̃, and thus the powerful iterative methods for recovering the mean can be used unmodified for sampling in large scale GMRFs. Specifically:

Algorithm. A sample x_s from the posterior distribution P(x|y) = N(x; μ, Σ) of Eq.
(3) can be drawn using the following procedure: (1) Perturb the prior mean filter responses μ̃_p ∼ N(μ_p, Σ_p). (2) Perturb the measurements ỹ ∼ N(y, Σ_n). (3) Use the procedure for computing the posterior mode, keeping the same system matrix J and only replacing μ_p and y with their perturbed versions: x_s = J^{−1} (G^T Σ_p^{−1} μ̃_p + H^T Σ_n^{−1} (ỹ − c)).

Indeed, x_s is a Gaussian random vector, as a linear combination of Gaussians, and has the desired mean E{x_s} = μ and covariance E{(x_s − μ)(x_s − μ)^T} = J^{−1} = Σ, as can readily be verified. Clearly, solving the corresponding linear system approximately will only yield an approximate sample. The reduction above implies that posterior sampling under the linear Gaussian model is computationally as hard as mode computation, provided that the structure of Σ_p and Σ_n allows efficient sampling from the corresponding distributions, using, e.g., the direct methods of Sec. 2.3. This algorithm is central to our paper; variations of it have appeared previously [16, 22, 23].

The sampling algorithm takes a particularly simple and intuitive form for the GMRFs discussed in Sec. 2.2. In this case Σ_p and Σ_n are diagonal and thus for sampling we perturb independently the factor means μ̃_l ∼ N(μ_l, Σ_l), l = 1...L, followed by finding the mode of the so-perturbed GMRF in Eq. (4). The perturbation can be equivalently seen in the information parameterization as injecting Gaussian noise into each potential vector by k̃_l = k_l + f_l Σ_l^{−1/2} ε_l, with ε_l ∼ N(0, 1), a simple local operation carried out independently at each factor of the diagram in Fig. 1(a).

To demonstrate the power of this algorithm, we show in Fig. 2 an image inpainting example in which we fill in the occluded parts of a 498×495 image under a 2-D thin-membrane prior GMRF model [12, 29, 31], in which the Gaussian factors are induced by the first-order spatial derivative filters φ_1 = [−1 1] and φ_2 = [−1 1]^T. The shared variance parameter Σ_l for the experts has been matched to the variance of the image derivative histogram. The presence of randomly placed measurements makes the problem non-stationary and thus Fourier domain techniques are not applicable. Finding the posterior mean of this model amounts to solving a quadratic energy minimization problem in which the non-occluded pixels are clamped to their observed values and corresponds to a Laplace PDE problem with non-homogeneous regularization, which can be tackled very efficiently with multigrid techniques [31]. To transform this efficient MAP computation technique into a powerful sampling algorithm for the thin-membrane GMRF, it suffices to inject noise into the factors, only perturbing the linear system's right-hand side. Using a multigrid solver originally developed for solving PDE problems, we can draw about 4 posterior samples per second from the 2-D thin-membrane model of Fig. 2, which is particularly impressive given its size; the multilevel Gibbs sampling technique of [30] is the only other algorithm that could potentially achieve such speed in a similar setup, yet it cannot produce exact single-shot samples as our algorithm can.

Figure 2: Image inpainting by exact sampling from the posterior under a 2-D thin-membrane prior GMRF model, conditional on the image values at the known sites.
From left to right: the masked image (big occluded areas plus 50% missing pixels), the posterior mean, a posterior sample obtained by our perturbed GMRF sampling algorithm, and the sample-based estimate of the posterior standard deviation (square root of the variance) using 20 samples (image values are between 0 and 1).

4 Posterior variance estimation

It is often desirable not only to compute the mode μ but also to recover aspects of the covariance structure in the posterior distribution. As we have discussed in Sec. 2.1, for very large models the fully-dense covariance matrix Σ is impractical to compute or store; however, we might be interested in certain of its elements. For example, the diagonal of Σ contains the variance of each variable and thus, along with the mean, fully describes the posterior marginal densities [29]. Marginal variances also need to be computed in Gaussian subproblems that arise in the context of non-Gaussian sparse Bayesian learning and relevance vector machine models used for regression, classification, and experimental design [11, 26, 32]. For many of these models variance estimation is the main computational bottleneck in applications involving large scale datasets.

A number of techniques have been proposed for posterior variance estimation. One approach has been to employ modified conjugate gradient algorithms which allow forming variance estimates in parallel to computing the posterior mode when solving the linear system in Eq. (3) [15, 24, 27]. These techniques utilize the close connection between conjugate gradients and the Lanczos method for determining eigensystems [6, 15] but unfortunately exhibit erratic numerical behavior in practice, especially when applied to large scale problems: loss of orthogonality due to finite numerical precision requires that one hold in memory the entire sequence of Lanczos vectors and periodically reorthogonalize them as the iteration progresses, significantly increasing the memory and time complexity relative to ordinary CG; the variance estimates typically converge much more slowly than mean estimates; and one often has limited freedom in initializing the iteration and/or selecting the preconditioner. We refer to [25] for further information.

It is well known that belief propagation computes exact variances in tree-structured GMRFs [36]. However, in graphs with cycles its loopy version typically underestimates the marginal variances since it overcounts the evidence, even when it converges to the correct means [13, 33]. The variance estimator of [28] is only applicable to GMRFs for which just a small number of edges violates the graph's tree structure. The method in [12] relies on a low-rank approximation of the N×N unit matrix, carefully adapted to the problem covariance structure, also employing a wavelet hierarchy for models exhibiting long-range dependencies. One then needs to solve as many linear systems as the approximation rank, which in turn increases with the model size ([12] reports a rank of 448 for a relatively smooth model with about 10^6 variables). This technique is thus still relatively expensive and not necessarily generally applicable.

The ability to efficiently sample from the Gaussian posterior distribution using the algorithm of Sec.
3 immediately suggests the following Monte Carlo estimator of the posterior covariance matrix:

Σ̂ = (1/S) ∑_{s=1}^{S} (x_s − μ)(x_s − μ)^T .   (5)

If only the posterior variances are required, one will obviously just evaluate and retain the diagonal of the outer products in the sum; any other selected elements of Σ̂ can similarly be obtained. Clearly, the proposed estimator is unbiased. Its relative variance estimation error follows from the properties of the χ² distribution and is r = Δ(Σ̂_{i,i})/Σ_{i,i} = √(2/S). The error drops quite slowly with the number of samples (S = 2/r² samples are required to reach a desired relative error r), so the technique is best suited if rough variance estimates suffice, which is often the case in practical applications [26]; e.g., 50 samples suffice to reduce r to 20%. A desirable property of the estimator is that its accuracy is independent of the problem size N, in contrast to most alternative techniques. The proposed variance estimation technique can thus be readily applied to every GMRF at a cost of S times that of computing μ. We show in Fig. 2 the result of applying the proposed variance estimator for the thin-membrane GMRF example considered in Sec. 2.3; within only 20 samples (computed in 5 sec.) the qualitative structure of the variance in the model has been captured.

5 Block Gibbs sampling in conditionally Gaussian Markov random fields

Following the intuition behind Gaussian sampling by local perturbations, one could try to inject noise into the local potentials and find the mode of the perturbed model, even in the presence of non-quadratic MRF factors.
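Before turning to the non-Gaussian case, the Gaussian machinery of Secs. 3 and 4 fits in a short dense-matrix sketch (hypothetical numpy code; the dense `np.linalg.solve` stands in for whatever mean solver one would really use on a large model):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, S = 8, 20, 4000

# A random small GMRF in factor form: J = sum_l f_l f_l^T / Sigma_l,
# k = sum_l f_l mu_l / Sigma_l (toy problem, parameters are illustrative).
F = rng.standard_normal((L, N))      # rows are the filters f_l^T
Sig = rng.uniform(0.5, 2.0, L)       # factor variances Sigma_l
mu_fac = rng.standard_normal(L)      # factor means mu_l

J = (F.T / Sig) @ F
k = F.T @ (mu_fac / Sig)
mu = np.linalg.solve(J, k)           # posterior mode = mean

# Perturb-and-solve sampling: jitter each factor mean independently,
# mu_l~ ~ N(mu_l, Sigma_l), then recompute the mode with the SAME J.
def sample():
    mu_tilde = mu_fac + np.sqrt(Sig) * rng.standard_normal(L)
    return np.linalg.solve(J, F.T @ (mu_tilde / Sig))

X = np.array([sample() for _ in range(S)])

# Eq. (5): Monte Carlo variances vs the exact diagonal of J^{-1};
# the relative error behaves like sqrt(2/S).
var_mc = np.mean((X - mu) ** 2, axis=0)
var_exact = np.diag(np.linalg.inv(J))
assert np.allclose(var_mc, var_exact, rtol=0.2)
```

A short covariance calculation confirms why this works: the sample has covariance J^{−1} (F^T diag(Σ)^{−1} F) J^{−1} = J^{−1} J J^{−1} = J^{−1} = Σ, exactly as claimed in Sec. 3.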
Although such a randomization process is interesting in its own right and deserves further study, it is not feasible to design it in a way that leads to single-shot algorithms for exact sampling of non-Gaussian MRFs.

Without completely abandoning the Gaussian realm, we can get versatile models in which some hidden variables q control the mean and/or variance of the Gaussian factors. Conditional on the values of these hidden variables, the data are still Gaussian:

P(x|q) ∝ ∏_{l=1}^{L} N(f_l^T x; μ_{l,q}, Σ_{l,q}) ,   (6)

where we have dropped the dependence on the measurements y for simplicity. Sampling from this model can be carried out efficiently (but not in a single shot any more) by alternately block sampling from P(x|q) and P(q|x), which typically mixes rapidly and is much more efficient than single-site Gibbs sampling [35]. For large models this is feasible because, given the hidden variables, we can update the visible units collectively using the GMRF sampling by local perturbations algorithm, similarly to [16, 23]. We assume that block sampling of the hidden units given the visible variables is also feasible, by considering their conditional distribution independent or tree-structured [16]. One typically employs one discrete hidden variable q_l per factor f_l, leading to mixture-of-Gaussian local experts for which the visible-unit distribution, with the hidden assignments summed out, is P(x) ∝ ∏_{l=1}^{L} ∑_{j=1}^{J_l} π_{l,j} N(f_l^T x; μ_{l,j}, Σ_{l,j}) [4, 16, 23, 34]. Intuitively, the discrete latent unit q_l turns off the smoothness constraint enforced by the factor f_l by assigning a large variance Σ_{l,j} to it when an image edge is detected.

The block-Gibbs sampler leads to a rapidly mixing Markov chain which, after a few burn-in iterations, generates a sequence of samples {{x_1, q_1}, ..., {x_S, q_S}} that explore the joint distribution P(x, q).
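As a concrete sketch of this block-Gibbs scheme (hypothetical numpy code with illustrative parameters, not the paper's experiments), the toy chain below alternates exact sampling of the discrete assignments q_l given x with perturb-and-solve sampling of x given q, using a two-component mixture expert on each first difference:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 40, 200
sigma_n2 = 0.1                         # measurement noise variance
pis = np.array([0.9, 0.1])             # mixture weights per difference factor
vars_j = np.array([0.01, 10.0])        # "smooth" vs "edge" expert variances

# Piecewise-constant ground truth and noisy observations.
x_true = np.where(np.arange(N) < N // 2, 0.0, 3.0)
y = x_true + np.sqrt(sigma_n2) * rng.standard_normal(N)

D = np.eye(N - 1, N, 1) - np.eye(N - 1, N)   # first-difference filter matrix

x, means = y.copy(), []
for t in range(T):
    # --- P(q|x): the factors decouple given x; sample the responsibilities
    # q_l ~ pi_j N(z_l; 0, Sigma_j) for each response z_l = f_l^T x.
    z = D @ x
    logp = np.log(pis) - 0.5 * (z[:, None] ** 2 / vars_j) - 0.5 * np.log(vars_j)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    q = (rng.random(N - 1)[:, None] > np.cumsum(p, axis=1)).sum(axis=1)

    # --- P(x|q): a GMRF; draw it exactly by perturbing every factor mean
    # (zero for the smoothness experts, y for the unary ones) and re-solving.
    v = vars_j[q]
    J = (D.T / v) @ D + np.eye(N) / sigma_n2
    mu_tilde = np.sqrt(v) * rng.standard_normal(N - 1)        # ~ N(0, Sigma_q)
    y_tilde = y + np.sqrt(sigma_n2) * rng.standard_normal(N)  # ~ N(y, sigma_n2)
    x = np.linalg.solve(J, D.T @ (mu_tilde / v) + y_tilde / sigma_n2)
    if t >= T // 2:
        means.append(x)

x_hat = np.mean(means, axis=0)           # sample-based posterior-mean estimate
assert np.mean((x_hat - x_true) ** 2) < np.mean((y - x_true) ** 2)
```

Even this toy chain shows the qualitative behavior described above: the "edge" expert is selected at the jump, so the posterior mean smooths the flat segments without blurring the discontinuity.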
Summarizing the sample sequence into a single estimate x̂ should be problem-dependent. If we strive to minimize the estimation's mean square error, as is typically the case in image denoising, our goal should be to induce the posterior mean from the sample sequence [23]. Apart from the standard sample-based posterior mean estimator x̂_S = (1/S) ∑_{s=1}^{S} x_s, we can alternatively estimate the posterior mean with the Rao-Blackwellized (RB) estimator x̂_RB = (1/S) ∑_{s=1}^{S} E{x|q_s} [16], which offers increased accuracy but requires finding the means of the conditionally Gaussian MRFs P(x|q), typically doubling the cost per step. Beyond MMSE, in applications such as image inpainting or texture synthesis, the posterior mean can be overly smooth and selecting a single sample from the simulation as the solution can be visually more plausible [8], as can be appreciated by comparing the MMSE and sample reconstructions of the textured areas in the inpainting example of Fig. 2.

Figure 3: Signal restoration under a total variation prior model and alternative estimation criteria (PSNR: noisy input 21.9 dB, TV-MAP 29.0 dB, Gibbs sample 28.4 dB, sample mean 30.0 dB, Rao-Blackwellized 30.3 dB).

The heavy-tailed histograms of natural image filter responses are often conveniently approximated by kurtotic continuous parametric distributions [10, 19, 35].
We can still resort to block Gibbs sampling for efficiently exploring the posterior distribution of the signal x if each expert can be represented as a continuous Gaussian scale mixture (GSM) [2], as has been done before for Student-t experts [35]. Motivated by [14, 23], we show here how this can lead to a novel Bayesian treatment of signal restoration under a total variation (TV) prior $P(x) \propto \prod_{l=1}^{N-1} L(\Delta x_l; \alpha)$, which imposes an L1 penalty on the signal differences $\Delta x_l = x_l - x_{l+1}$. We rely on the hierarchical characterization of the Laplacian density $L(z; \alpha) = \frac{1}{2\alpha} \exp(-|z|/\alpha)$ as a GSM in which the variance follows an exponential distribution [2, 17]: $L(z; \alpha) = \frac{1}{2\alpha^2} \int_0^\infty N(z; 0, v) \exp\left(-v/(2\alpha^2)\right) dv$. Thanks to the GSM nature of this representation and assuming a Gaussian measurement model, the conditionally Gaussian visible variables are easy to sample. Further, the latent variances $v_l$ conditionally decouple and have density $P(v_l \mid x) \propto v_l^{-1/2} \exp\left(-|\Delta x_l|^2/(2 v_l) - v_l/(2\alpha^2)\right)$, which can be recognized as a generalized inverse Gaussian distribution for which standard sampling routines exist. The derivation above carries over to the 2-D TV model, with the gradient magnitude at each pixel replacing $|\Delta x_l|$.

We demonstrate our Bayesian TV restoration method in a signal denoising experiment illustrated in Fig. 3. We synthesized a length-1000 signal by integrating Laplacian noise (α = 1/8), also adding jumps of height 5 at four locations (outliers), and subsequently degraded it by adding Gaussian noise (with variance 1). We depict the standard TV-MAP restoration result, as well as plausible solutions extracted from a 10-step block-Gibbs sampling run with our GSM-based Bayesian algorithm: the 10-th sample itself, and the two MMSE estimates outlined above (sample mean and RB).
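The Laplacian-as-GSM identity underlying this construction can be checked numerically. The quadrature sketch below compares the Laplacian density with the exponential mixture of zero-mean Gaussians; the function names are ours:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def laplace_pdf(z, alpha):
    """Laplacian density L(z; alpha) = exp(-|z|/alpha) / (2*alpha)."""
    return np.exp(-abs(z) / alpha) / (2.0 * alpha)

def laplace_via_gsm(z, alpha):
    """Same density written as a Gaussian scale mixture: a zero-mean
    Gaussian whose variance v is exponentially distributed with rate
    1/(2*alpha**2), integrated out by numerical quadrature."""
    rate = 1.0 / (2.0 * alpha ** 2)
    integrand = lambda v: norm.pdf(z, scale=np.sqrt(v)) * rate * np.exp(-rate * v)
    val, _ = quad(integrand, 0.0, np.inf)
    return val
```

In the sampler itself the integral is never evaluated; one instead draws each latent $v_l$ from its generalized inverse Gaussian conditional and then draws x from the resulting GMRF by local perturbations.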
As expected, the two mean estimators are best in terms of PSNR (with the RB one slightly superior). The standard TV-MAP estimator captures the edges more sharply, but has a lower PSNR score and produces staircase artifacts. Although the random sample performs the worst in terms of PSNR, it resembles most closely the qualitative properties of the original signal, capturing its fine structure. These findings shed new light on the critical view of [14] on MAP-based denoising.

We must emphasize that the block Gibbs sampling strategy outlined above, in conjunction with our GMRF sampling by local perturbations algorithm, is equally well applicable when the latent variables are distributed, with each hidden variable affecting multiple experts, as illustrated in Fig. 4(a). This situation arises in the context of unsupervised learning of hierarchical models applied to real-valued data, where it is natural to use a Gaussian restricted Boltzmann machine (GRBM) in the first layer of the hierarchy. Training GRBMs with contrastive divergence [7] requires drawing random samples from the model. Sampling the visible layer given the layer of discrete hidden variables is easy if there are no sideways connections between the continuous visible units, as assumed in [9]. To take into account residual correlations among the visible units, the authors of the factored GRBM in [18] drop the conditional independence assumption, but resort to difficult-to-tune hybrid Monte Carlo (HMC) for sampling. Employing our Gaussian sampling by local perturbations scheme we can efficiently sample the correlated visible units jointly, which allows us to still use the more efficient block-Gibbs sampler in training the model of [18]. To verify this, we have accordingly replaced the sampling module in the publicly available implementation of [18], and have closely followed their setup, leaving their model otherwise unchanged.
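Scaling such conditionally Gaussian updates to correlated visible units means never forming the precision matrix explicitly. The following is a minimal matrix-free sketch of the perturb-then-solve step using conjugate gradients; the operator and function names are ours and not those of any released code:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def perturb_and_solve_cg(apply_F, apply_Ft, mu, var, n, rng, maxiter=50):
    """Approximate GMRF sample by local perturbations, solving the normal
    equations A x = F^T W mu_tilde with matrix-free conjugate gradients.
    Only products with F and F^T are needed, so memory stays linear in n."""
    mu_tilde = mu + np.sqrt(var) * rng.standard_normal(mu.shape)

    def matvec(x):
        x = np.ravel(x)                          # guard against (n,1) input
        return apply_Ft(apply_F(x) / var)        # A = F^T diag(1/var) F

    A = LinearOperator((n, n), matvec=matvec)
    x, info = cg(A, apply_Ft(mu_tilde / var), maxiter=maxiter)
    return x
```

A handful of CG iterations already gives a usable approximate sample; CG here plays the same role as any other solver for the symmetric positive definite perturbed normal equations.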
For conditionally Gaussian sampling of the correlated visible units we have used our local perturbation algorithm, coupled with 5 iterations of conjugate gradients running on the GPU. Contrastive divergence training was done on the dataset accompanying their code, which comprises 10240 16 × 16 color patches randomly extracted from the Berkeley dataset and statistically whitened. The receptive fields learned by this procedure are depicted in Fig. 4(b) and look qualitatively the same as those reported in [18], while computation time was reduced by a factor of two. Besides this moderate computational gain, the main interest of perturbed Gaussian sampling in this setup lies in its scalability, which offers the potential to move beyond the patch-based representation and sample from whole-image factored GRBM models, similarly to what has recently been achieved in [23] for the field of independent experts model [19].

Figure 4: (a) Graphical model with hidden units q_1, ..., q_J, factors f_1, ..., f_L, and visible units x_1, ..., x_N. Each hidden unit can control a single factor (such as q_1 above) or it can affect multiple experts, resulting in models with distributed latent representations. (b) The visible-to-factor filters arising in the factored GRBM model of [18], as learned using block Gibbs sampling.

Acknowledgments

This work was supported by the NSF award 0917141 and the AFOSR grant 9550-08-1-0489.

References

[1] D. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann machines. Cogn. Science, 9(1):147–169, 1985.
[2] D. Andrews and C. Mallows. Scale mixtures of normal distributions. JRSS (B), 36(1):99–102, 1974.
[3] J. Besag. Spatial interaction and the statistical analysis of lattice systems. JRSS (B), 36(2):192–236, 1974.
[4] D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. IEEE Trans. Image Process., 4(7):932–946, 1995.
[5] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6(6):721–741, 1984.
[6] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
[7] G. Hinton. Training products of experts by minimizing contrastive divergence. Neur. Comp., 14(8):1771–1800, 2002.
[8] A. Kokaram. Motion Picture Restoration. Springer, 1998.
[9] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. ICML, 2009.
[10] S. Lyu and E. Simoncelli. Modeling multiscale subbands of photographic images with fields of Gaussian scale mixtures. IEEE Trans. PAMI, 31(4):693–706, Apr. 2009.
[11] D. MacKay. Bayesian interpolation. Neur. Comp., 4(3):415–447, 1992.
[12] D. Malioutov, J. Johnson, M. Choi, and A. Willsky. Low-rank variance approximation in GMRF models: Single and multiscale approaches. IEEE Trans. Signal Process., 56(10):4621–4634, Oct. 2008.
[13] D. Malioutov, J. Johnson, and A. Willsky. Walk sums and belief propagation in Gaussian graphical models. J. of Mach. Learning Res., 7:2031–2064, 2006.
[14] M. Nikolova. Model distortions in Bayesian MAP reconstruction. Inv. Pr. and Imag., 1(2):399–422, 2007.
[15] C. Paige and M. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. on Math. Software, 8(1):43–71, 1982.
[16] G. Papandreou, P. Maragos, and A. Kokaram. Image inpainting with a wavelet domain hidden Markov tree model. In Proc. ICASSP, pages 773–776, 2008.
[17] T. Park and G. Casella. The Bayesian lasso. J. of the Amer. Stat. Assoc., 103(482):681–686, 2008.
[18] M. Ranzato, A. Krizhevsky, and G. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In Proc. AISTATS, 2010.
[19] S. Roth and M. Black. Fields of experts. Int. J. of Comp. Vis., 82(2):205–229, 2009.
[20] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neur. Comp., 11:305–345, 1999.
[21] H. Rue. Fast sampling of Gaussian Markov random fields. JRSS (B), 63(2):325–338, 2001.
[22] H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall, 2005.
[23] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on MRFs in low-level vision. In CVPR, 2010.
[24] M. Schneider and A. Willsky. Krylov subspace estimation. SIAM J. Sci. Comp., 22(5):1840–1864, 2001.
[25] M. Seeger and H. Nickisch. Large scale variational inference and experimental design for sparse generalized linear models. Technical Report TR-175, MPI for Biological Cybernetics, 2008.
[26] M. Seeger, H. Nickisch, R. Pohmann, and B. Schölkopf. Bayesian experimental design of magnetic resonance imaging sequences. In NIPS, pages 1441–1448, 2008.
[27] J. Skilling. Bayesian numerical analysis. In W. Grandy and P. Milonni, editors, Physics and Probability, pages 207–221. Cambridge Univ. Press, 1993.
[28] E. Sudderth, M. Wainwright, and A. Willsky. Embedded trees: Estimation of Gaussian processes on graphs with cycles. IEEE Trans. Signal Process., 52(11):3136–3150, Nov. 2004.
[29] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. Int. J. of Comp. Vis., 5(3):271–301, 1990.
[30] R. Szeliski and D. Terzopoulos. From splines to fractals. In Proc. ACM SIGGRAPH, pages 51–60, 1989.
[31] D. Terzopoulos. The computation of visible-surface representations. IEEE Trans. PAMI, 10(4):417–438, 1988.
[32] M. Tipping. Sparse Bayesian learning and the relevance vector machine. J. of Mach. Learning Res., 1:211–244, 2001.
[33] Y. Weiss and W. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neur. Comp., 13(10):2173–2200, 2001.
[34] Y. Weiss and W. Freeman. What makes a good model of natural images? In CVPR, 2007.
[35] M. Welling, G. Hinton, and S. Osindero. Learning sparse topographic representations with products of Student-t distributions. In NIPS, 2002.
[36] A. Willsky. Multiresolution Markov models for signal and image processing. Proc. IEEE, 90(8):1396–1458, 2002.
[37] S. Zhu, Y. Wu, and D. Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. Int. J. of Comp. Vis., 27(2):107–126, 1998.