gradient – Symbolic Differentiation
Symbolic gradient is usually computed from gradient.grad(), which offers a more convenient syntax for the common case of wanting the gradient of some scalar cost with respect to some input expressions. The grad_sources_inputs() function does the underlying work, and is more flexible, but is also more awkward to use when gradient.grad() can do the job.
Gradient related functions
Driver for gradient calculations.
-
class theano.gradient.DisconnectedGrad
-
R_op(inputs, eval_points)
This method is primarily used by tensor.Rop.
Suppose the op outputs
[ f_1(inputs), …, f_n(inputs) ]
- Parameters
inputs (a Variable or list of Variables) –
eval_points – A Variable or list of Variables with the same length as inputs. Each element of eval_points specifies the value of the corresponding input at the point where the R op is to be evaluated.
- Returns
rval[i] should be Rop(f=f_i(inputs), wrt=inputs, eval_points=eval_points)
- Return type
list of n elements
-
-
exception theano.gradient.DisconnectedInputError
Raised when grad is asked to compute the gradient with respect to a disconnected input and disconnected_inputs='raise'.
-
class theano.gradient.DisconnectedType
A type indicating that a variable is the result of taking the gradient of c with respect to x when c is not a function of x. It is a symbolic placeholder for 0 that also conveys the extra information that this gradient is 0 because it is disconnected.
-
filter(data, strict=False, allow_downcast=None)
Required: Return data or an appropriately wrapped/converted data.
A subclass implementation should raise a TypeError exception if the data is not of an acceptable type.
If strict is True, the data returned must be the same as the data passed as an argument. If it is False, and allow_downcast is True, filter may cast it to an appropriate type. If allow_downcast is False, filter may only upcast it, not lose precision. If allow_downcast is None (default), the behaviour can be Type-dependent, but for now it means only Python floats can be downcast, and only to floatX scalars.
- Raises
MethodNotDefined – Subclass doesn’t implement this function.
-
-
exception theano.gradient.GradientError(arg, err_pos, shape, val1, val2, abs_err, rel_err, abs_tol, rel_tol)
This error is raised when a gradient is computed but is found to be incorrect.
-
theano.gradient.Lop(f, wrt, eval_points, consider_constant=None, disconnected_inputs='raise')
Computes the L operation on f with respect to wrt at eval_points.
Mathematically, this is the Jacobian of f with respect to wrt, left-multiplied by the eval points.
- Parameters
f (Variable or list of Variables) – f stands for the output of the computational graph to which you want to apply the L operator
wrt (Variable or list of Variables) – variables for which you compute the L operator of the expression described by f
eval_points (Variable or list of Variables) – evaluation points for each of the variables in f
- Returns
Symbolic expression such that L_op[j] = sum_i (d f[i] / d wrt[j]) eval_point[i], where the indices in that expression are magic multidimensional indices that specify both the position within a list and all coordinates of the tensor element in the last. If f is a list/tuple, then return a list/tuple with the results.
- Return type
Variable or list/tuple of Variables depending on type of f
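To make the left-multiplication concrete, here is a minimal plain-Python sketch (not Theano code) that applies it to a hand-written Jacobian. The function f, its Jacobian, and the helper names jacobian_of_f and lop_sketch are hypothetical and exist only for this illustration.

```python
# Plain-Python illustration of a vector-Jacobian product.
# All names here are hypothetical; nothing below calls Theano.

def jacobian_of_f(x):
    """Explicit Jacobian of f(x) = [x0 * x1, x0 + x1, x1 ** 2]."""
    x0, x1 = x
    return [
        [x1, x0],         # d f[0] / d x
        [1.0, 1.0],       # d f[1] / d x
        [0.0, 2.0 * x1],  # d f[2] / d x
    ]


def lop_sketch(jac, eval_points):
    """Left-multiply: result[j] = sum_i eval_points[i] * jac[i][j]."""
    n_inputs = len(jac[0])
    return [
        sum(v_i * row[j] for v_i, row in zip(eval_points, jac))
        for j in range(n_inputs)
    ]


jac = jacobian_of_f([2.0, 3.0])
v = [1.0, 10.0, 100.0]  # one evaluation point per output of f
lop_result = lop_sketch(jac, v)
print(lop_result)  # [13.0, 612.0]
```

The result has one entry per element of wrt, which is why the evaluation points must match the outputs of f rather than its inputs.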
-
theano.gradient.Rop(f, wrt, eval_points, disconnected_outputs='raise', return_disconnected='zero')
Computes the R operation on f with respect to wrt at eval_points.
Mathematically, this is the Jacobian of f with respect to wrt, right-multiplied by the eval points.
- Parameters
f (Variable or list of Variables) – f stands for the output of the computational graph to which you want to apply the R operator
wrt (Variable or list of Variables) – variables for which you compute the R operator of the expression described by f
eval_points (Variable or list of Variables) – evaluation points for each of the variables in wrt
disconnected_outputs (str) –
Defines the behaviour if some of the variables in f have no dependency on any of the variables in wrt (or if all links are non-differentiable). The possible values are:
'ignore': consider the gradient on these parameters to be zero.
'warn': consider the gradient zero, and print a warning.
'raise': raise DisconnectedInputError.
return_disconnected ({'zero', 'None', 'Disconnected'}) –
'zero': If wrt[i] is disconnected, return value i will be wrt[i].zeros_like()
'None': If wrt[i] is disconnected, return value i will be None
'Disconnected': returns variables of type DisconnectedType
- Returns
Symbolic expression such that R_op[i] = sum_j (d f[i] / d wrt[j]) eval_point[j] where the indices in that expression are magic multidimensional indices that specify both the position within a list and all coordinates of the tensor element in the last. If wrt is a list/tuple, then return a list/tuple with the results.
- Return type
Variable or list/tuple of Variables depending on type of f
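The right-multiplication can be read as a directional derivative of f along eval_points. The following plain-Python sketch (not Theano code) checks that reading on a small hand-written example; f, jacobian_of_f, and rop_sketch are hypothetical names, and the central finite difference is only a sanity check chosen for this illustration.

```python
# Plain-Python illustration of a Jacobian-vector product.
# All names here are hypothetical; nothing below calls Theano.

def f(x):
    x0, x1 = x
    return [x0 * x1, x0 + x1, x1 ** 2]


def jacobian_of_f(x):
    """Explicit Jacobian of f, one row per output."""
    x0, x1 = x
    return [[x1, x0], [1.0, 1.0], [0.0, 2.0 * x1]]


def rop_sketch(jac, eval_points):
    """Right-multiply: result[i] = sum_j jac[i][j] * eval_points[j]."""
    return [sum(j_ij * v_j for j_ij, v_j in zip(row, eval_points)) for row in jac]


x = [2.0, 3.0]
v = [1.0, -1.0]  # one evaluation point per input in wrt
rop_result = rop_sketch(jacobian_of_f(x), v)

# Sanity check: directional derivative of f along v by central differences.
h = 1e-6
x_plus = [xi + h * vi for xi, vi in zip(x, v)]
x_minus = [xi - h * vi for xi, vi in zip(x, v)]
finite_difference = [(a - b) / (2 * h) for a, b in zip(f(x_plus), f(x_minus))]

print(rop_result)  # [1.0, 0.0, -6.0]
```

Here the result has one entry per output of f, the mirror image of the L operator.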
-
class theano.gradient.UndefinedGrad
-
R_op(inputs, eval_points)
This method is primarily used by tensor.Rop.
Suppose the op outputs
[ f_1(inputs), …, f_n(inputs) ]
- Parameters
inputs (a Variable or list of Variables) –
eval_points – A Variable or list of Variables with the same length as inputs. Each element of eval_points specifies the value of the corresponding input at the point where the R op is to be evaluated.
- Returns
rval[i] should be Rop(f=f_i(inputs), wrt=inputs, eval_points=eval_points)
- Return type
list of n elements
-
-
class theano.gradient.ZeroGrad
-
R_op(inputs, eval_points)
This method is primarily used by tensor.Rop.
Suppose the op outputs
[ f_1(inputs), …, f_n(inputs) ]
- Parameters
inputs (a Variable or list of Variables) –
eval_points – A Variable or list of Variables with the same length as inputs. Each element of eval_points specifies the value of the corresponding input at the point where the R op is to be evaluated.
- Returns
rval[i] should be Rop(f=f_i(inputs), wrt=inputs, eval_points=eval_points)
- Return type
list of n elements
-
-
theano.gradient.consider_constant(x)
DEPRECATED: use zero_grad() or disconnected_grad() instead.
Consider an expression constant when computing gradients.
The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, it will not be backpropagated through. In other words, the gradient of the expression is truncated to 0.
- Parameters
x – A Theano expression whose gradient should be truncated.
- Returns
The expression is returned unmodified, but its gradient is now truncated to 0.
New in version 0.7.
-
theano.gradient.disconnected_grad(x)
Consider an expression constant when computing gradients.
Gradients are effectively not backpropagated through it.
The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, it will not be backpropagated through. This is effectively equivalent to truncating the gradient expression to 0, but is executed faster than zero_grad(), which still has to go through the underlying computational graph related to the expression.
-
theano.gradient.format_as(use_list, use_tuple, outputs)
Formats the outputs according to the flags use_list and use_tuple.
If use_list is True, outputs is returned as a list (if outputs is not a list or a tuple, it is converted into a one-element list). If use_tuple is True, outputs is returned as a tuple (if outputs is not a list or a tuple, it is converted into a one-element tuple). Otherwise (if both flags are False), outputs is returned.
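The rules above can be sketched in a few lines of plain Python. This is a hypothetical re-implementation written from the description only, named format_as_sketch to keep it distinct from the library function; the real implementation may behave differently in edge cases, and raising ValueError when both flags are set is an assumption of this sketch.

```python
def format_as_sketch(use_list, use_tuple, outputs):
    """Hypothetical sketch of the documented formatting rules."""
    if use_list and use_tuple:
        # Assumption: the two flags are treated as mutually exclusive.
        raise ValueError("use_list and use_tuple cannot both be True")
    is_sequence = isinstance(outputs, (list, tuple))
    if use_list:
        return list(outputs) if is_sequence else [outputs]
    if use_tuple:
        return tuple(outputs) if is_sequence else (outputs,)
    # Both flags are False: return outputs as described above.
    return outputs


print(format_as_sketch(True, False, 5))        # [5]
print(format_as_sketch(False, True, [1, 2]))   # (1, 2)
print(format_as_sketch(False, False, 5))       # 5
```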
-
theano.gradient.grad(cost, wrt, consider_constant=None, disconnected_inputs='raise', add_names=True, known_grads=None, return_disconnected='zero', null_gradients='raise')
Return symbolic gradients of one cost with respect to one or more variables.
For more information about how automatic differentiation works in Theano, see gradient. For information on how to implement the gradient of a certain Op, see grad().
- Parameters
cost (Variable scalar (0-dimensional) tensor variable or None) – Value that we are differentiating (that we want the gradient of). May be None if known_grads is provided.
wrt (Variable or list of Variables) – Term[s] with respect to which we want gradients
consider_constant (list of variables) – Expressions not to backpropagate through
disconnected_inputs ({'ignore', 'warn', 'raise'}) –
Defines the behaviour if some of the variables in wrt are not part of the computational graph computing cost (or if all links are non-differentiable). The possible values are:
'ignore': consider the gradient on these parameters to be zero.
'warn': consider the gradient zero, and print a warning.
'raise': raise DisconnectedInputError.
add_names (bool) – If True, variables generated by grad will be named (d<cost.name>/d<wrt.name>) provided that both cost and wrt have names
known_grads (OrderedDict, optional) – An ordered dictionary mapping variables to their gradients. This is useful in the case where you know the gradient on some variables but do not know the original cost.
return_disconnected ({'zero', 'None', 'Disconnected'}) –
'zero': If wrt[i] is disconnected, return value i will be wrt[i].zeros_like()
'None': If wrt[i] is disconnected, return value i will be None
'Disconnected': returns variables of type DisconnectedType
null_gradients ({'raise', 'return'}) –
Defines the behaviour if some of the variables in wrt have a null gradient. The possible values are:
'raise': raise a NullTypeGradError exception
'return': return the null gradients
- Returns
Symbolic expression of gradient of cost with respect to each of the wrt terms. If an element of wrt is not differentiable with respect to the output, then a zero variable is returned.
- Return type
variable or list/tuple of variables (matches wrt)
-
theano.gradient.grad_clip(x, lower_bound, upper_bound)
This op acts as a view in the forward pass, but clips the gradient in the backward pass.
This is an elemwise operation.
- Parameters
x – The variable whose incoming gradient should be clipped
lower_bound – The lower bound of the gradient value
upper_bound – The upper bound of the gradient value.
Examples
>>> import theano
>>> from theano.gradient import grad_clip
>>> x = theano.tensor.scalar()
>>> z = theano.tensor.grad(grad_clip(x, -1, 1)**2, x)
>>> z2 = theano.tensor.grad(x**2, x)
>>> f = theano.function([x], outputs = [z, z2])
>>> print(f(2.0))
[array(1.0), array(4.0)]
Note
We register an optimization in tensor/opt.py that removes the GradClip op, so it has zero cost in the forward pass and only does work when the gradient is computed.
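The numbers in the example can be traced by hand. The plain-Python sketch below (not Theano code) mimics the two passes with hypothetical helper functions: the forward pass returns x unchanged, and the backward pass clips whatever gradient arrives from the expression built on top of it.

```python
# Hand-written forward/backward pair; names are hypothetical, not Theano API.

def grad_clip_forward(x):
    """Forward pass: behaves like the identity."""
    return x


def grad_clip_backward(upstream_grad, lower_bound, upper_bound):
    """Backward pass: clip the incoming gradient to [lower_bound, upper_bound]."""
    return max(lower_bound, min(upper_bound, upstream_grad))


x = 2.0
y = grad_clip_forward(x)

# Gradient of y ** 2 with respect to y is 2 * y = 4.0; it is clipped on the
# way back through the op, so the gradient that reaches x is 1.0.
upstream = 2.0 * y
clipped = grad_clip_backward(upstream, -1.0, 1.0)

# Without the op, the gradient of x ** 2 at x = 2.0 is simply 4.0.
unclipped = 2.0 * x

print(clipped, unclipped)  # 1.0 4.0
```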
-
theano.gradient.grad_not_implemented(op, x_pos, x, comment='')
Return an un-computable symbolic variable of type x.type.
If any call to tensor.grad results in an expression containing this un-computable variable, an exception (NotImplementedError) will be raised indicating that the gradient on the x_pos’th input of op has not been implemented. Likewise if any call to theano.function involves this variable.
Optionally adds a comment to the exception explaining why this gradient is not implemented.
-
theano.gradient.grad_scale(x, multiplier)
This op scales or inverts the gradient during backpropagation.
- Parameters
x – The variable whose incoming gradient should be scaled
multiplier – Scale of the gradient
Examples
>>> import theano
>>> from theano.gradient import grad_scale
>>> x = theano.tensor.fscalar()
>>> fx = theano.tensor.sin(x)
>>> fp = theano.tensor.grad(fx, wrt=x)
>>> fprime = theano.function([x], fp)
>>> print(fprime(2))
-0.416...
>>> f_inverse = grad_scale(fx, -1.)
>>> fpp = theano.tensor.grad(f_inverse, wrt=x)
>>> fpprime = theano.function([x], fpp)
>>> print(fpprime(2))
0.416...
-
theano.gradient.grad_undefined(op, x_pos, x, comment='')
Return an un-computable symbolic variable of type x.type.
If any call to tensor.grad results in an expression containing this un-computable variable, an exception (GradUndefinedError) will be raised indicating that the gradient on the x_pos’th input of op is mathematically undefined. Likewise if any call to theano.function involves this variable.
Optionally adds a comment to the exception explaining why this gradient is not defined.
-
theano.gradient.hessian(cost, wrt, consider_constant=None, disconnected_inputs='raise')
- Parameters
cost (Scalar (0-dimensional) variable) –
wrt (Vector (1-dimensional tensor) Variable or list of vector (1-dimensional tensor) Variables) –
consider_constant – a list of expressions not to backpropagate through
disconnected_inputs (string) –
Defines the behaviour if some of the variables in wrt are not part of the computational graph computing cost (or if all links are non-differentiable). The possible values are:
'ignore': consider the gradient on these parameters to be zero.
'warn': consider the gradient zero, and print a warning.
'raise': raise an exception.
- Returns
The Hessian of the cost with respect to (elements of) wrt. If an element of wrt is not differentiable with respect to the output, then a zero variable is returned. The return value is of same type as wrt: a list/tuple or TensorVariable in all cases.
- Return type
Variable or list/tuple of Variables
-
theano.gradient.jacobian(expression, wrt, consider_constant=None, disconnected_inputs='raise')
Compute the full Jacobian, row by row.
- Parameters
expression (Vector (1-dimensional) Variable) – Values that we are differentiating (that we want the Jacobian of)
wrt (Variable or list of Variables) – Term[s] with respect to which we compute the Jacobian
consider_constant (list of variables) – Expressions not to backpropagate through
disconnected_inputs (string) –
Defines the behaviour if some of the variables in wrt are not part of the computational graph computing expression (or if all links are non-differentiable). The possible values are:
'ignore': consider the gradient on these parameters to be zero.
'warn': consider the gradient zero, and print a warning.
'raise': raise an exception.
- Returns
The Jacobian of expression with respect to (elements of) wrt. If an element of wrt is not differentiable with respect to the output, then a zero variable is returned. The return value is of same type as wrt: a list/tuple or TensorVariable in all cases.
- Return type
Variable or list/tuple of Variables (depending upon wrt)
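"Row by row" means that row i of the Jacobian is the gradient of the scalar expression[i] with respect to wrt. The plain-Python sketch below illustrates that structure numerically; it is not how Theano computes the result (Theano works symbolically), and numeric_gradient, jacobian_row_by_row, and the test function f are hypothetical names introduced for this illustration.

```python
# Numerical stand-in for a row-by-row Jacobian; names are hypothetical.

def numeric_gradient(scalar_fn, x, h=1e-6):
    """Central-difference gradient of a scalar function of a list of floats."""
    grad = []
    for k in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[k] += h
        x_minus[k] -= h
        grad.append((scalar_fn(x_plus) - scalar_fn(x_minus)) / (2 * h))
    return grad


def jacobian_row_by_row(vector_fn, x):
    """Build the Jacobian one row at a time: row i is the gradient of output i."""
    n_outputs = len(vector_fn(x))
    return [
        numeric_gradient(lambda p, i=i: vector_fn(p)[i], x)
        for i in range(n_outputs)
    ]


def f(x):
    return [x[0] * x[1], x[0] + x[1], x[1] ** 2]


jac = jacobian_row_by_row(f, [2.0, 3.0])
for row in jac:
    print([round(entry, 4) for entry in row])
```

For this f at (2, 3) the exact Jacobian is [[3, 2], [1, 1], [0, 6]], and the numerical rows agree with it up to finite-difference error.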
-
class theano.gradient.numeric_grad(f, pt, eps=None, out_type=None)
Compute the numeric derivative of a scalar-valued function at a particular point.
-
static abs_rel_err(a, b)
Return absolute and relative error between a and b.
The relative error is a small number when a and b are close, relative to how big they are.
- Formulas used:
abs_err = abs(a - b)
rel_err = abs_err / max(abs(a) + abs(b), 1e-8)
The denominator is clipped at 1e-8 to avoid dividing by 0 when a and b are both close to 0.
The tuple (abs_err, rel_err) is returned.
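The two formulas translate directly into code. The sketch below applies them to plain Python floats; the name abs_rel_err_sketch is hypothetical, chosen to keep it distinct from the library method, which operates on arrays.

```python
def abs_rel_err_sketch(a, b):
    """Apply the documented formulas to two Python floats."""
    abs_err = abs(a - b)
    # The denominator is clipped at 1e-8 so that a == b == 0 does not divide by 0.
    rel_err = abs_err / max(abs(a) + abs(b), 1e-8)
    return abs_err, rel_err


print(abs_rel_err_sketch(1.0, 1.5))  # abs_err 0.5, rel_err 0.5 / 2.5
print(abs_rel_err_sketch(0.0, 0.0))  # (0.0, 0.0)
```

Note that the relative error is normalized by abs(a) + abs(b), not by either value alone, so it is symmetric in its arguments.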
-
abs_rel_errors(g_pt)
Return the absolute and relative errors of the gradient estimate g_pt.
g_pt must be a list of ndarrays of the same length as self.gf, otherwise a ValueError is raised.
Corresponding ndarrays in g_pt and self.gf must have the same shape or ValueError is raised.
-
max_err(g_pt, abs_tol, rel_tol)
Find the biggest error between g_pt and self.gf.
What is measured is the violation of relative and absolute errors, wrt the provided tolerances (abs_tol, rel_tol). A value > 1 means both tolerances are exceeded.
Return the argmax of min(abs_err / abs_tol, rel_err / rel_tol) over g_pt, as well as abs_err and rel_err at this point.
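The scoring rule is easier to see on a small example. The sketch below is a simplified, hypothetical version that works on flat lists of floats instead of lists of ndarrays; max_err_sketch and its return layout are assumptions made for this illustration, not the library's interface.

```python
def max_err_sketch(analytic, numeric, abs_tol, rel_tol):
    """Return (index, score, abs_err, rel_err) for the worst-scoring element.

    score = min(abs_err / abs_tol, rel_err / rel_tol), so a score above 1
    means the element violates both tolerances at once.
    """
    worst = None
    for idx, (a, b) in enumerate(zip(analytic, numeric)):
        abs_err = abs(a - b)
        rel_err = abs_err / max(abs(a) + abs(b), 1e-8)
        score = min(abs_err / abs_tol, rel_err / rel_tol)
        if worst is None or score > worst[1]:
            worst = (idx, score, abs_err, rel_err)
    return worst


worst_index, worst_score, worst_abs, worst_rel = max_err_sketch(
    analytic=[1.0, 2.0, 3.0],
    numeric=[1.0, 2.5, 3.0],
    abs_tol=0.1,
    rel_tol=0.1,
)
print(worst_index, worst_score > 1.0)  # 1 True
```

Taking the minimum of the two ratios means an element passes as long as it satisfies at least one of the two tolerances.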
-
-
theano.gradient.subgraph_grad(wrt, end, start=None, cost=None, details=False)
With respect to wrt, computes gradients of cost and/or from existing start gradients, up to the end variables of a symbolic digraph. In other words, computes gradients for a subgraph of the symbolic theano function. Ignores all disconnected inputs.
This can be useful when one needs to perform the gradient descent iteratively (e.g. one layer at a time in an MLP), or when a particular operation is not differentiable in theano (e.g. stochastic sampling from a multinomial). In the latter case, the gradient of the non-differentiable process could be approximated by a user-defined formula, which could be calculated using the gradients of a cost with respect to samples (0s and 1s). These gradients are obtained by performing a subgraph_grad from the cost or previously known gradients (start) up to the outputs of the stochastic process (end). A dictionary mapping gradients obtained from the user-defined differentiation of the process, to variables, could then be fed into another subgraph_grad as start with any other cost (e.g. weight decay).
In an MLP, we could use subgraph_grad to iteratively backpropagate:
import numpy as np
import theano

x, t = theano.tensor.fvector('x'), theano.tensor.fvector('t')
w1 = theano.shared(np.random.randn(3, 4))
w2 = theano.shared(np.random.randn(4, 2))
a1 = theano.tensor.tanh(theano.tensor.dot(x, w1))
a2 = theano.tensor.tanh(theano.tensor.dot(a1, w2))
cost2 = theano.tensor.sqr(a2 - t).sum()
cost2 += theano.tensor.sqr(w2.sum())
cost1 = theano.tensor.sqr(w1.sum())

# Walk backwards one layer at a time: first w2 (stopping at a1), then w1.
params = [[w2], [w1]]
costs = [cost2, cost1]
grad_ends = [[a1], [x]]

next_grad = None
param_grads = []
for i in range(2):
    param_grad, next_grad = theano.subgraph_grad(
        wrt=params[i], end=grad_ends[i],
        start=next_grad, cost=costs[i]
    )
    # Gradients at the end variables seed the next (earlier) layer.
    next_grad = dict(zip(grad_ends[i], next_grad))
    param_grads.extend(param_grad)
- Parameters
wrt (list of variables) – Gradients are computed with respect to wrt.
end (list of variables) – Theano variables at which to end gradient descent (they are considered constant in theano.grad). For convenience, the gradients with respect to these variables are also returned.
start (dictionary of variables) – If not None, a dictionary mapping variables to their gradients. This is useful when the gradients on some variables are known. These are used to compute the gradients backwards up to the variables in end (they are used as known_grads in theano.grad).
cost (Variable scalar (0-dimensional) variable) – Additional costs for which to compute the gradients. For example, these could be weight decay, an l1 constraint, MSE, NLL, etc. May optionally be None if start is provided.
Warning
If the gradient of cost with respect to any of the start variables is already part of the start dictionary, then it may be counted twice with respect to wrt and end.
details (bool) – When True, additionally returns the list of gradients from start and of cost, respectively, with respect to wrt (not end).
- Returns
Returns lists of gradients with respect to wrt and end, respectively.
- Return type
Tuple of 2 or 4 Lists of Variables
New in version 0.7.
-
theano.gradient.undefined_grad(x)
Consider the gradient of this variable undefined.
This will generate an error message if its gradient is taken.
The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, an error message will be generated stating that this gradient is not defined.
-
theano.gradient.verify_grad(fun, pt, n_tests=2, rng=None, eps=None, out_type=None, abs_tol=None, rel_tol=None, mode=None, cast_to_output_type=False, no_debug_ref=True)
Test a gradient by Finite Difference Method. Raise error on failure.
Raises an Exception if the difference between the analytic gradient and numerical gradient (computed through the Finite Difference Method) of a random projection of the fun’s output to a scalar exceeds the given tolerance.
Examples
>>> verify_grad(theano.tensor.tanh,
...             (np.asarray([[2, 3, 4], [-1, 3.3, 9.9]]),),
...             rng=np.random)
- Parameters
fun (a Python function) – fun takes Theano variables as inputs, and returns a Theano variable. For instance, an Op instance with a single output.
pt (list of numpy.ndarrays) – Input values, points where the gradient is estimated. These arrays must be either float16, float32, or float64 arrays.
n_tests (int) – number of times to run the test
rng (numpy.random.RandomState, optional) – random number generator used to sample the output random projection u; the gradient of sum(u * fun) is tested at pt
eps (float, optional) – step size used in the Finite Difference Method (Default None is type-dependent). Raising the value of eps can raise or lower the absolute and relative errors of the verification depending on the Op. Raising eps does not lower the verification quality for linear operations. It is better to raise eps than to raise abs_tol or rel_tol.
out_type (string) – dtype of output, if complex (i.e., ‘complex32’ or ‘complex64’)
abs_tol (float) – absolute tolerance used as threshold for gradient comparison
rel_tol (float) – relative tolerance used as threshold for gradient comparison
cast_to_output_type (bool) – if the output is float32 and cast_to_output_type is True, cast the random projection to float32. Otherwise it is float64. float16 is not handled here.
no_debug_ref (bool) – Don’t use DebugMode for the numerical gradient function.
Note
This function does not support multiple outputs. In tests/test_scan.py there is an experimental verify_grad that covers that case as well by using random projections.
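The idea of checking a gradient through a random scalar projection can be sketched without Theano. The plain-Python example below is a simplified illustration under stated assumptions: the analytic gradient is supplied by hand, a central difference is used, and the projection weights, step size, and tolerance are arbitrary choices for this sketch rather than the library's defaults. All function names are hypothetical.

```python
import math
import random


def check_gradient_sketch(fun, analytic_grad, pt, n_tests=2, eps=1e-6, tol=1e-5, seed=0):
    """Compare a hand-written gradient of sum(u * fun(x)) with finite differences."""
    rng = random.Random(seed)
    for _ in range(n_tests):
        # Random projection u turns the vector output into a scalar cost.
        u = [rng.random() + 0.5 for _ in fun(pt)]

        def cost(x):
            return sum(u_i * y_i for u_i, y_i in zip(u, fun(x)))

        analytic = analytic_grad(pt, u)
        for k in range(len(pt)):
            x_plus = list(pt)
            x_minus = list(pt)
            x_plus[k] += eps
            x_minus[k] -= eps
            numeric = (cost(x_plus) - cost(x_minus)) / (2 * eps)
            if abs(numeric - analytic[k]) > tol:
                raise AssertionError(
                    "gradient mismatch at position %d: %r vs %r" % (k, analytic[k], numeric)
                )
    return True


def tanh_elemwise(x):
    return [math.tanh(v) for v in x]


def tanh_grad(x, u):
    # d/dx_k sum_i u_i * tanh(x_i) = u_k * (1 - tanh(x_k) ** 2)
    return [u_k * (1.0 - math.tanh(x_k) ** 2) for x_k, u_k in zip(x, u)]


print(check_gradient_sketch(tanh_elemwise, tanh_grad, [2.0, 3.0, -1.0, 3.3]))  # True
```

Projecting onto a random vector u reduces the check to a scalar cost, so a single gradient comparison covers every output element at once.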
-
theano.gradient.zero_grad(x)
Consider an expression constant when computing gradients.
The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, it will be backpropagated through with a value of zero. In other words, the gradient of the expression is truncated to 0.
List of Implemented R op
See the gradient tutorial for the R op documentation.
- list of ops that support R-op:
- with test [most are in tensor/tests/test_rop.py]
SpecifyShape
MaxAndArgmax
Subtensor
IncSubtensor (set_subtensor too)
Alloc
Dot
Elemwise
Sum
Softmax
Shape
Join
Rebroadcast
Reshape
Flatten
DimShuffle
Scan [In scan_module/tests/test_scan.test_rop]
- without test
Split
ARange
ScalarFromTensor
AdvancedSubtensor1
AdvancedIncSubtensor1
AdvancedIncSubtensor
Partial list of ops without support for R-op:
All sparse ops
All linear algebra ops.
PermuteRowElements
Tile
AdvancedSubtensor
TensorDot
Outer
Prod
MulWithoutZeros
ProdWithoutZeros
CAReduce (for max, … done for MaxAndArgmax op)
MaxAndArgmax (only for matrix on axis 0 or 1)