Call Now: (800) 537-1660  
The Algebra Buster
The Algebra Buster


February 11th









February 11th

Frechet derivative

Slide 2
Before I start to explain what adjoint is, what Frechet derivative is
and other issues, let me first try to explain what is a functional. In
short, a functional is a function of a function, it mapps
a function to a scalar.

For example, the generalized radon transform that we talked about last time is a
functional of the image, it mapps an image, which is a function of the
spatial coordinates, to a prediction of the observation at a given
observation parameter.

The difference between a functional and a function is that for a functional the
independent variables are no longer the location in our coordinate
system, the independent variables are now the values of the input
function on a fixed set of locations in our coordinate system . Let's say
those location are x1, x2, xi, xi+1, and so on. If we let this set of
points become infinitly dense, we can just drop those subscripts and
write it as x. But we must bear in mind that R is a function of the
values of the image over a fixed space, rather than a function of the
location in our coordinate system. Once we understand that, it is not
difficult to understand Frechet derivatives.

To solve the imaging problem, we will formulate our inverse problem as
an optimization problem, and the objective function is actually a
functional of the image and we usually call it the misfit functional.
We look for an image that can minimize this misfit functional. This
misfit functional is usually defined as a certain norm of the misfit
measurements. For example, we can use the L2 norm and in this case,
the misfit functional can be written as an inner product of the misfit
measurements. For discrete observation parameter p, this inner product
means summation over those observation parameters. For continuous p,
this summation will become an integration.

Now, to solve the optimization problem using a certain gradient
descent method, we need to ask ourselves the
question "if I change the values of the image a little bit, say delta I, what is
the change in the misfit functional, say delta chi?" The answer to
this question gives the definition of the gradient of the misfit
functional with respect to the image, which is an example of what we
call a "Frechet derivative". A Frechet derivative is not a derivative
with respect to the location in our coordinate system, it is with
respct to the values of the input function at a given location.

How do we obtain this Frechet derivative? A perturbation of chi equals
to the inner product of a perturbation to the misfit measurement with
Delta d. Now, recall our definition of Delta d, which is the
discrepancy between the raw observation and the model-predicted
observation. The raw observation does not change, the only thing that
can change is the model-predicted observation. The perturbation to
Delta d can be expressed as a linear operator acting upon the
perturbation of the image. If we can obtain another operator, the
adjoint operator of M, which acts upon Delta d, so that the inner
product of the image perturbation with the result of M* equals to the
original inner product, we obtain the Frechet derivative of chi with
respect to the image. This inner product in x is an integ ration over x.

This is essentially the big picture. Now let's get into the details.

Slide 3
We'll first talk about how to define and construct the adjoint
operator. Then we'll talk about how to define first and second-order
Frechet derivative. The reason that we need the second-order Frechet
derivative is that sometimes we'll need not only the gradient of the
misfit functional, but also its Hessian. After that, we'll be ready to
talk about iterative reconstruction based on Newton's algorithm and
the steepest descent algorithm. In the last part, we'll look at how to
solve the imaging problem if the forward problem is the Radon
transform.

Slide 4
Suppose we have an inner product, which is defined as the summation of
the product of in dividual components . The same idea can be extended to
functions. If we express a function of t as a vector whose
components are the values of the function at different t , an inner
product is simply a summation of the product of individual
components. If we make our sampling of t dense enough, this summation
becomes an integration over t.

Now let's look at an operator. Suppose C is a matrix or what we call
an operator, applying this operator to a vector means multiplying this
vector with this matrix and the usuall matrix-vector multiplication
rule holds . The result of this multiplication is another vector, in
other words, our operator transforms one vector into another. For a
function, what is an operator? We'll consider two types of
operators, one type involves differentiation operations, the other
type involves integration operations. The result of applying an
operator on a function is another function. For
example, let's consider a first-order differentiation operator. We
can consider a differentiation as the limit of the finit-difference as
Delta t goes to zeros. So the value of v at ti is approximated as the
difference between the values of u at ti+1 and u at ti divided by the
difference beteen ti+1 and ti. As the difference between ti+1 and ti
goes to zero, this approximation becomes exact. We can try to convert
this differetiation operation into a matrix multiplication operation
if we consider our functions as vectors. In this case, the derivative
operator becomes a matrix with all its diagonal elements to be -1 and
the off-diagonal elements to be 1. In the limit that Delta t goes to
zero, this matrix-vector multiplication gives us the same result as
differentiating u directly.

Slide 5
Let's look at an example of an integration operator. This
integration operator has a constant integration kernel, which is
unity, and an upper limit t. The result of this integration is a
function of t. We can convert this integral to another integral with
infinity as the integral limits. Now the integration kernel is given
by a box function, which is unity from zero to t and zero everywhere
else. This box function can be constructed from two Heaviside
functions, which are just step functions with their jumping points at
zero and t. If we want to write this integration as a matrix
multiplication, the matrix is triangular with its lower half to be
unity and upper half to be zero.

The examples above show us that there are some kind of similarity or
correspondence between operations on vectors and operations on
functions. The same is true for the transpose operation.

How do we define a transpose for a matrix? If these two inner products
are exactly the same for all vectors u and v, we say D is a transpose
of C. How do we represent any u and v? We represent a vector as a
linear combination of a set of basis vectors. So this equality holds
for any u and v is equivalent to this equality holds for all basis
vectors. If we choose ui to be the basis vector that is unity on the
i-th component and zero for all other components and vj to be a basis
vector that is unity on jth component and zero for all other
component, we obtain the equality between the element of the jth row
and ith column of C and the element at the jth column and ith row of D.
How do we extend the same idea to linear operators acting on
functions? One way to do that is again to think of a linear operator
as a matrix and then transpose that matrix. In order to implement this
idea, we'll need to discretize our operators and our functions
first. We've seen how to do discretization using
finite-difference. But there are many different means to do
discretization.

The effect of transposing a discretized differential operator is
equivalent to integration by parts. For example, let's take a look at
the first-order differentiation. We can write out the inner product as
an integration over t, integrate by parts, we transfer the derivative
to v. The second term is the inner product between u and dv, with a
minus sign. The whole expression will become an inner product if the
first term disappears. The first term depends on the boundary values
of u and v. If one of them is zero at the boundary, we get the
operator that is adjoint to the first-order differentiation. What if
neither of the two functions is zero at the boundaris? For most of the
functions that we will consider, they are usually compactly supported,
which means they are only nonzero on a finite domain. If our function
is not compactly supported, we can multiply the original function with
a cut-off function, which is unity in a finite domain of our interest
and zero everywhere else. Sometimes we will need to make sure that the
transition from unity to zero is smooth. In that case, we can still
construct the adjoint for the modified function.

Suppose u and v are compactly supported, what is the adjoint of the
second-order differentiation? We can carry out integration by
parts twice and we will see that the adjoint of L equals to L
itself. We call L self-adjoint.

Slide 7
Why is the adjoint operator important? It allows us to construct the
gradient or the Frechet derivative of the misfit functional with respect
to the image, which is needed for solving this optimization problem
using gradient descent methods.

We've introduced most of these symbols in our last lecture. The misfit
functional is defined in terms of a certain norm of all individual
misfit measurements. The individual misfit measurements depend upon
the observation parameter p, the misfit functional involves a
summation or integration over p, so the misfit functional does not
depend on p anymore.

Slide 8
The misfit measurement deserves some additional explanation. We talked
about the forward modeling operator of Delta d based on generalized
radon transform last time. We defined Delta d as some sort of
discrepancy between the raw observation comming from the observation instrument
and a model-predicted observation. How do we define such a
discrepancy? How do we define a model-predicted observation?

We define another forward modeling operator F, which mapps an image to
an observation. This forward modeling operator is different from the
linearized forward modeling operator that we talked about in the last
lecture, which mapps a discrepancy in image to a discrepancy in
observation. There are two important differences between F and the
linearized forward modeling operator. One is that given an input
image, the output of F is not an approximation to d, it is equal to
d. The other difference is that F is now allowed to be nonlinear with
respect to I.

What is the raw observation now? We assume that the raw
observation comes from the forward modeling operator F when the input
is the true image, which we do not know and try to find. One example
of this raw observation is the displacement field sampled at some
observation parameter p, which might denote a particular receiver
location, source location and sampling time. The raw observation can
also be other type of things such as travel time or amplitude, but
other types of observations can usually be derived from the
displacement through some operations, which are also included in the
possibly nonlinear forward modeling operator F.

What is the model-predicted observation? It is simply the output of
our forward modeling operator F from a presumed model, which we
already know and is probably not the same as the true model.

What is the discrepancy between the raw observation and the model
predicted observation? We define this Delta d quite broadly, we only
require that it is zero when the presumed model is equal to the true
model. For example, we can define the discrepancy through
subtraction. In this case, we just subtract the prediction from the
raw observation. Another example is division followed by taking the
log. It is essentially a subtraction of the log. My point is that
there are different ways to parameterize the discrepancy, some of them
might be nonliear with respect to the displacement. Those nonlinearity
introduced through the measurement process might be good or might be
bad depending on if it can cancel out the nonlinearity between the
displacement and the image.

The basic idea behind the optimization is to keep changing our presumed
model until the discrepancy between the model-predition and the raw
observation cannot get any smaller. The discrepancy measurements are
made at many different observation parameters, we measure the size of
the overall discrepancy through some kind of averaging, which could be
a squared summation.

Slide 9
There are many different algorithms for optimization. Those
optimization algorithms can be loosely classified into three
categories:

1. Searching algorithms, in which we try to cover all
possible choices of the input model or image, compute the misfit
functional for those input images and select the one that
gives the minimum misfit functional.

2. Descent algorithms based on gradient of the misfit functional only,
steepest descent and conjugate gradient are too examples.

3. Descent algorithms based on both the gradient information and the
second-order derivatives or Hessian of the misfit functional. This
type of methods are usually named as Newton's method.

Let's motivate our discussion using one example of applying steepest
descent to solve a minimization problem of an ordinary function. The
extension to minimization of a functional is straight forward except
that we need to replace ordinary derivatives with Frechet derivatives.

Before we talk about steepest descent, let's talk about Newton's
method. In Newton's method, we first Taylor expand the objective
function about a starting point, say x_i-1, to second order. We
evaluate the first and second-order derivatives in this Taylor
expansion. Then we ask ourselves: "Where is the minimum of this
second-order Taylor expansion?" This is a quadratic function in delta
x, its minimum is found by taking derivative with respect to delta x
and set that derivative to zero. By solving this equation, we obtain
the delta x, which is a perturbation to the starting point, that is
needed to reach the minimum of the quadratic second-order Taylor
expansion. If the objective function itself is a quadratic function,
this second-order Taylor expansion is exact. No matter where our
starting point is, one iteration of the Newton's method will bring us
to the minimum. If the objective function is not a quadratic function,
we will need multiple iterations of the Newton's method. But the
number of iterations needed is less for functions that are closer to a
quadratic function, which means the higher-order terms ignored in this
Taylor expansion is smaller. There are two situations under which the
higher-order terms are small. One is when we are already very close to
the minimum, the other is when those misfit measurements are
approximately linear with respect to the image and we are using a
square summation of those individual misfit measurements as our misfit
functional.

The problem with Newton's method is that the second-order derivative
is sometimes very hard to evaluate. We will see why it is so hard to
compute the second-order derivatives for our imaging problems once
we've introduced the Born series. So we would like to avoid evaluating
the Hessian or second-order derivative. If we look at the perturbation
obtained in the Newton's method, it is simply a scaled version of the
gradient or first-order derivative. In Newton's method, the scaling is
given by the inverse of the Hessian. If we just choose an arbitrary constant
as our scaling factor, will the method still work? Yes, it will still
work given that the scaling factor is not too big. What do we mean by
"too big"? The value of the objective function at the next point
should not be larger than its current value. If the scaling factor is
too big, we risk the possibility of over-shooting. The scaling factor
should not be too small either, because if it is too small, we'll
need to take many small steps to reach the minimum. What is the
optimal scaling factor? It is the inverse of the Hessian. When the
scaling factor is just a constant, we call our optimization method
"steepest descent", usually it is less efficient than Newton's
method.

Slide 10
In order to solve our imaging problems using steepest descent or
Newton's methods, we will need to take derivatives of the misfit
functional with respect to the image at some reference image. We call
those derivatives Frechet derivatives.

Suppose we have an ordinary function with many independent variables,
we denote them as x1, x2, xi, etc. For an ordinary function, we can do
a Taylor expansion at some reference point, we denote it as
, etc. The Taylor expansion to the first order can be expressed as
the function value at the reference point plus the inner product
between the gradient of the function evaluated at this reference point
with a perturbation to the reference point. This gradient is actually
a vector, the elements of this vector are partial derivatives of the
function with respect to individual independent variables. This
summation over i is actually an inner product.

We can do the same thing to the misfit functional at
some reference image The misfit functional is a function of the
values of the image at all spatial points, we denote those spatial
points as x1, x2, xi and so on. Now we need to take the derivatives of
the misfit functional with respect to those image values at all
spatial points. Those derivatives are evaluated at a certain reference
image value. Again, the first-order Taylor expansion can now be
expressed as the misfit functional evaluated at a certain reference
image plus a summation over i. If we make our spatial locations, x1,
x2, xi, and so on, infinitly dense, this summation over i becomes an
integration over all spatial locations. This integration is an
inner product between the integration kernel and a perturbation to the
image. This integration kernel represents the partial derivative of the misfit
functional with respect to the image value at spactial location x. In
an ordinary function, the derivative is a derivative with respect to
the spatial location x.

Now, let's come back to the question that we asked ourselves at the
beginning of this class. If we change the input image by a small
amount say, delta I, how much will the misfit functional change? To
the first order, the change in misfit functional is given by this
integral. Second- and higher-order terms of delta I are included in
the small o term. We call this integral a "Frechet derivative" and the
integration kernel "Frechet kernel".

Slide 11
Now let's look at how to take second-order Frechet derivatives. We
will need them if we want to use the Newton's method for optimization.

We Taylor expand our misfit functional to the second order at a
reference image I0. The second-order Frechet derivative involves the
second-order partial derivatives of the misfit functional with respect
to the image values at two locations, which could be the same or
different. To represent the summation over all possible combinations,
we need to introduce two indexes, i and j, both xi and xj run through
all spatial points. If we make those spatial points dense enough, the
two summations become two integrals. To distinguish that the partial
derivatives are taken with respect to the image values at two possibly
different spatial locations, we use two different spatial variables x
and y. Again, we must emphasize that this integration kernel is the
partial derivative of the misfit functional with respect to the values
of the image at two spatial locations.

The second-order Frechet derivative of the misfit functional is
usually called Hessian.

Slide 12
There are different notations of the Frechet derivative. We call it a
"derivative", but it is actually an integration operator, the
integration is carried out over the location x. So we can also write this
integratioin as an inner product between the integration kernel and a
perturbation to the image. Sometimes, we also denote it using the same
partial derivative notation as we do in ordinary calculus. The Frechet
derivative of the misfit functional with respect to the image gives us
the gradient that we need for using the steepest descent method.

Slide 13
Once we've obtained the first and second-order Frechet derivatives,
extending the steepest descent or the Newton's method for minimizing a
functional is straight-forward. For the steepest descent method, the
formula for updating the image is simply the first-order Frechet kernel of the
misfit functional times a small constant and then put a minus sign in
front. We can write this scalar multiplication as a convolution with a
delta function with a constant weight. The reason for writing a simple
formula into a complicated form is that this way we can make our
notation later on more consistent.

If we want to use Newton's method, the formula for updating the model
becomes integrating the first-order Frechet kernel against the inverse of the
second-order Frechet kernel. Recall that the integration kernel for

the Hessian of the misfit functional is a function of two spatial
variables, integrating over x' gives us a function of x.

The difference between steepest descent and the Newton's method is
essentially in the integration kernel, whether it is a delta function
or the inverse of the Hessian.

In fact, by changing the form of this integration kernel, we can
obtain a varity of optimization methods, including the Gauss-Newton method,
the Conjugate-gradient method and many more.

Albert Tarantola calls this integration kernel the
"preconditioner". We'll use the same termino logy in this course . The
point is that by carefully choosing the appropriate preconditioner, we
can greatly improve the convergence rate of our optimization procedure, which
means less number of iterations to obtain the minimum of the misfit
functional.

Slide 14
Now let's look at how adjoint comes into our optimization
procedure.

Let's recall our definition of the inner product. For
observations, the inner product is represented using this square
bracket, which is an integration or summation over all observation
parameters. We put a conjugate on top just in case our observation is
a complex number. This pointed bracket represents an inner product
between images, which is an integration over all spatial
locations.

Suppose there is a linear operator that mapps image I to observation
f, the adjoint of the linear operator E is defined so that this inner
products in observation parameters is equal to the inner product in
spatial locations.

Slide 15
Now we let both f and g be our misfit measurements. The inner product
in observation parameter gives us a squared summation of all misfit
measurements, which is our misfit functional.

A perturbation to our misfit functional now becomes a summation or
integration of the perturbation to the square of individal misfit
measurements over all observation parameters. This perturbation can be
evaluated to become the misfit measurement times a perturbation to the
misfit measurement. The integration or summation over p now becomes an
inner product between the perturbations to individual misfit
measurements and those misfit measurements.

Suppose we can write our misfit measurement as a subtraction between
the raw observation and the model predicted observation, the raw
observation does not change, and the perturbation to Delta d equals to
the perturbation to model-predicted observation with a minus sign. A
perturbation to the model-predicted observation equals to the Frechet
derivative of our possibly-nonlinear forward modeling operator F
evaluated at a certain reference image I0 applied on a perturbation to
the reference image.

We bring the expression for delta d into the expression for delta chi,
the inner product over p can be expressed between the Frechet
derivative of the forward modeling operator applied on the image
perturbation and the misfit measurements.

One point we need to make here is that even though the forward
modeling operator F might be nonlinear with respect to the input image
I, the Frechet derivative of F evaluated at a given reference image I0
does not depend upon the image perturbation. So this is a linear
operator.

If we can find the adjoint of this linear operator, we can express the
inner product over p using an inner product over spatial location x
and this inner product gives the exact expression for the Frechet
derivative of the misfit functional with respect to the image.
What have we achieved through this derivation? We expressed the
Frechet derivative of the misfit functional with respect to the image
evaluated at a certain reference image using the adjoint of the
Frechet derivative of the forward modeling operator evaluated at the
reference image.

Slide 16
We will talk about how to construct the Frechet derivative of a
nonlinear forward modeling operator when we come to the Born
approximation. For linear forward modeling operators, such as a
generalized Radon transform, the Frechet derivative of the forward
modeling operator has exactly the same integration kernel as the
integration kernel in the generalized Radon transform.

If you do not see what I mean, let's try to write the generalized
Radon transform into a summation over all spatial locations. The
Frechet kernel or the integration kernel of the Frechet derivative
operator is given by the partial derivative of the misfit measurement
with respect to the image value at a given spatial location. This
partial derivative is exactly the integration kernel of the
generalized Radon transform evaluated at that spatial location. The
Frechet derivative is then obtained by summing or integrating over all
spatial locations.

The perturbation to the misfit functional can be expressed as an inner
product between a perturbation to the model-predicted observation and
the misfit measurement. The perturbation to the model-predicted
observation can be expressed as the Frechet derivative applied on a
perturabtion to the image. This Frechet derivative has the same
integration kernel as the generalized Radon transform used for
forward modeling.

We write out the inner product over p as an integration or summation
over p, and we switch the order of these two integrations. For now, we
can ignore the conjugate over C since all our examples so far only
involve real C. The integration over x is now moved to the outside and
we can write it as an inner product between the image perturbation and
an integral over p. If we define this integration operator as the
adjoint of the generalized Radon transform, we can verify that it
satisfies the definition of the adjoint.

Prev Next
 
Home    Why Algebra Buster?    Guarantee    Testimonials    Ordering    FAQ    About Us
What's new?    Resources    Animated demo    Algebra lessons    Bibliography of     textbooks
 

Copyright © 2009, algebra-online.com. All rights reserved.