CNTK matrix product.
A * B
Times (A, B, outputRank=1)
TransposeTimes (A, B, outputRank=1)
Parameters
- A: first argument of matrix product. Can be a time sequence.
- B: second argument of matrix product. Can be a time sequence.
- outputRank (default: 1): number of axes of A that constitute the output dimension. See 'Extended interpretation for tensors' below.
Return Value
Resulting matrix product (tensor). This is a time sequence if either input was a time sequence.
Description
The Times() function implements the matrix product, with extensions for tensors. The * operator is a short-hand for it. TransposeTimes() transposes the first argument.
If A and B are matrices (rank-2 tensors) or column vectors (rank-1 tensors), A * B computes the ordinary matrix product, just as one would expect.
TransposeTimes (A, B) computes the matrix product A^T * B, where ^T denotes transposition. TransposeTimes (A, B) has the same result as Transpose (A) * B, but it is more efficient as it avoids a temporary copy of the transposed version of A.
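The equivalence between TransposeTimes (A, B) and Transpose (A) * B can be checked with a small plain-Python sketch. This is not CNTK code: the helper names `times`, `transpose`, and `transpose_times` are hypothetical, and the sketch only mirrors the arithmetic, not CNTK's actual implementation or memory behavior.

```python
# Plain-Python sketch of the semantics only; this is not CNTK code.
# Matrices are lists of rows; all helper names are hypothetical.

def times(a, b):
    """Ordinary matrix product: (m x k) * (k x n) -> (m x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Build a transposed copy of a."""
    return [list(col) for col in zip(*a)]

def transpose_times(a, b):
    """Compute a^T * b by swapping a's indices; no transposed copy is built."""
    return [[sum(a[p][i] * b[p][j] for p in range(len(a)))
             for j in range(len(b[0]))]
            for i in range(len(a[0]))]

A = [[1, 2], [3, 4], [5, 6]]       # 3 x 2
B = [[1, 0], [0, 1], [2, 2]]       # 3 x 2

via_copy = times(transpose(A), B)  # goes through a temporary transposed copy
direct = transpose_times(A, B)     # same result without the temporary
print(direct)                      # [[11, 13], [14, 16]]
```

Both paths yield the same 2 x 2 result; the direct version simply reads A with its indices swapped.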
Time sequences
Both A and B can be either single matrices or time sequences. A common case for recurrent networks is that A is a weight matrix, while B is a sequence of inputs.
Note: If A is a time sequence, the operation is not efficient, as it will launch a separate GEMM invocation for every time step. The exception is TransposeTimes() where both inputs are column vectors, for which a special optimization exists.
Sparse support
Times() and TransposeTimes() support sparse matrices. The result is a dense matrix unless both inputs are sparse. The two most important use cases are:
- B being a one-hot representation of an input word (or, more commonly, an entire sequence of one-hot vectors). Then, A * B denotes a word embedding, where the columns of A are the embedding vectors of the words. The following is the recommended way of realizing embeddings in CNTK:

  ```
  Embedding (x, dim) = Parameter (dim, 0/*inferred*/) * x
  e = Embedding (input, 300)
  ```

- A being a one-hot representation of a label word. The popular cross-entropy criterion and the error counter can be written using TransposeTimes() as follows, respectively, where z is the input to the top-level Softmax() classifier, and L the label sequence which may be sparse:

  ```
  CrossEntropyWithSoftmax (L, z) = ReduceLogSum (z) - TransposeTimes (L, z)
  ErrorPrediction (L, z) = BS.Constants.One - TransposeTimes (L, Hardmax (z))
  ```
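Both use cases can be illustrated numerically with a plain-Python sketch (not CNTK code). It assumes that ReduceLogSum (z) denotes log(sum(exp(z))); the helper names and the tiny example values are hypothetical and chosen only for illustration.

```python
import math

# Plain-Python sketch of the semantics only; helper names are hypothetical.

def times(a, b):
    """Ordinary matrix product on lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose_times(a, b):
    """a^T * b without building a transposed copy."""
    return [[sum(a[p][i] * b[p][j] for p in range(len(a)))
             for j in range(len(b[0]))]
            for i in range(len(a[0]))]

def reduce_log_sum(z):
    """log(sum(exp(z_i))) over a column vector (assumed meaning of ReduceLogSum)."""
    return math.log(sum(math.exp(row[0]) for row in z))

# Use case 1: multiplying by a one-hot column selects one column of E.
E = [[0.1, 0.2, 0.3],              # embedding dimension 2, vocabulary of 3 words
     [0.4, 0.5, 0.6]]
one_hot = [[0], [1], [0]]          # word with index 1
embedded = times(E, one_hot)       # column 1 of E

# Use case 2: cross-entropy with a one-hot label.
z = [[1.0], [2.0], [3.0]]          # classifier input (pre-softmax)
label = [[0], [0], [1]]            # true class is index 2
ce = reduce_log_sum(z) - transpose_times(label, z)[0][0]

# Reference value: negative log of the softmax probability of the true class.
softmax_true = math.exp(z[2][0]) / sum(math.exp(row[0]) for row in z)
reference = -math.log(softmax_true)
```

Here `embedded` equals the second column of `E`, and `ce` agrees with `reference` up to floating-point rounding, since TransposeTimes (L, z) with a one-hot L simply picks out the entry of z for the true class.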
Multiplying with a scalar
The matrix product cannot be used to multiply a matrix with a scalar; attempting to do so produces an error about mismatching dimensions. To multiply with a scalar, use the element-wise product .* instead. For example, the weighted average of two matrices could be written like this:
z = Constant (alpha) .* x + Constant (1-alpha) .* y
Multiplying with a diagonal matrix
If your input matrix is diagonal and stored as a vector, do not use Times() but an element-wise multiplication (ElementTimes() or the .* operator).
For example:
dMat = ParameterTensor {(100:1)}
z = dMat .* v
This leverages broadcasting semantics to multiply every element of v with the respective row of dMat.
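As a sanity check on why the element-wise form is sufficient, the following plain-Python sketch (not CNTK code; all names and values are hypothetical) compares a product with an explicitly built diagonal matrix against a broadcast element-wise product with the diagonal stored as a column.

```python
# Plain-Python sketch of the semantics only; names are hypothetical.

def times(a, b):
    """Ordinary matrix product on lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

d = [[2.0], [3.0], [4.0]]                        # diagonal stored as a 3 x 1 column
v = [[1.0, 10.0], [1.0, 10.0], [1.0, 10.0]]      # 3 x 2 input

# Explicit 3 x 3 diagonal matrix, mostly zeros.
n = len(d)
full_diag = [[d[i][0] if i == j else 0.0 for j in range(n)] for i in range(n)]
via_times = times(full_diag, v)

# Element-wise product: the single column of d is broadcast across v's columns.
via_elementwise = [[d[i][0] * v[i][j] for j in range(len(v[0]))]
                   for i in range(len(v))]
```

Both results are identical, but the element-wise form never touches the zeros off the diagonal.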
Extended interpretation of matrix product for tensors of rank > 2
If A and/or B are tensors of higher rank, the * operation denotes a generalized matrix product where all but the first dimension of A must match with the leading dimensions of B, and are interpreted by flattening. For example a product of a [I x J x K] and a [J x K x L] tensor (which we will abbreviate henceforth as [I x J x K] * [J x K x L]) gets reinterpreted by reshaping the two tensors as matrices as [I x (J * K)] * [(J * K) x L], for which the matrix product is defined and yields a result of dimension [I x L]. This makes sense if one considers the rows of a weight matrix to be patterns that activation vectors are matched against. The above generalization allows these patterns themselves to be multi-dimensional, such as images or running windows of speech features.
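The flattening interpretation can be sketched in plain Python (not CNTK code). The function names are hypothetical, and the flattening order used below is just one consistent choice applied to both operands; it makes no claim about how CNTK lays out tensors in memory.

```python
# Plain-Python sketch of [I x J x K] * [J x K x L] -> [I x L].
# Tensors are nested lists indexed a[i][j][k] and b[j][k][l]; names are hypothetical.

def tensor_times(a, b):
    """Sum over both shared axes j and k."""
    J, K = len(b), len(b[0])
    L = len(b[0][0])
    return [[sum(a[i][j][k] * b[j][k][l] for j in range(J) for k in range(K))
             for l in range(L)]
            for i in range(len(a))]

def times(a, b):
    """Ordinary matrix product on lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

a = [[[1, 2], [3, 4]],
     [[0, 1], [1, 0]]]                 # I=2, J=2, K=2
b = [[[1, 0], [1, 1]],
     [[1, 0], [2, 1]]]                 # J=2, K=2, L=2

direct = tensor_times(a, b)            # [I x L]

# Same computation after flattening (j, k) into one axis on both sides.
a_flat = [[x for row in a_i for x in row] for a_i in a]    # [I x (J*K)]
b_flat = [b_jk for b_j in b for b_jk in b_j]               # [(J*K) x L]
flattened = times(a_flat, b_flat)
print(direct)                          # [[14, 6], [2, 1]]
```

Each row of `a_flat` is one multi-dimensional "pattern" unrolled into a vector, which is what lets the ordinary matrix product handle the tensor case.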
It is also possible to have more than one non-matched dimension in B. For example, [I x J] * [J x K x L] is interpreted as the matrix product [I x J] * [J x (K * L)], which yields a result of dimensions [I x K x L]. This makes it possible, for instance, to apply a matrix to all vectors inside a rolling window of L speech features of dimension J.
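A plain-Python sketch of this case (not CNTK code; names and values are hypothetical) shows that the result is the matrix applied independently to every J-dimensional vector found at a fixed position (k, l) of B.

```python
# Plain-Python sketch of [I x J] * [J x K x L] -> [I x K x L].
# b is indexed b[j][k][l]; all names are hypothetical.

def matrix_times_tensor(a, b):
    """Contract a's second axis with b's first axis; keep k and l."""
    J = len(b)
    K, L = len(b[0]), len(b[0][0])
    return [[[sum(a[i][j] * b[j][k][l] for j in range(J))
              for l in range(L)]
             for k in range(K)]
            for i in range(len(a))]

a = [[1, 2],
     [0, 1]]                           # I=2, J=2
b = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]                 # J=2, K=2, L=2

out = matrix_times_tensor(a, b)        # I=2, K=2, L=2
print(out)                             # [[[11, 14], [17, 20]], [[5, 6], [7, 8]]]

# Check one position by hand: the vector at (k=1, l=0) is [b[0][1][0], b[1][1][0]].
vec = [b[0][1][0], b[1][1][0]]         # [3, 7]
applied = [sum(a[i][j] * vec[j] for j in range(2)) for i in range(2)]
```

The hand-computed `applied` matches `out[i][1][0]` for each i, which is the "apply the matrix to every vector" reading of the product.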
If the result of the product should have multiple dimensions (such as arranging a layer's activations as a 2D field), then instead of using the * operator, one must say Times (A, B, outputRank=m) where m is the number of dimensions in which the 'patterns' are arranged, and which are kept in the output. For example, Times (tensor of dim [I x J x K], tensor of dim [K x L], outputRank=2) will be interpreted as the matrix product [(I * J) x K] * [K x L] and yield a result of dimensions [I x J x L].
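The outputRank=2 case can be sketched the same way in plain Python (not CNTK code). The function name `times_output_rank2` is hypothetical and only mirrors the index arithmetic described above: the first two axes of A are kept, and only its last axis is summed against B.

```python
# Plain-Python sketch of Times (A, B, outputRank=2):
# [I x J x K] * [K x L] -> [I x J x L]. Names are hypothetical.

def times_output_rank2(a, b):
    """Keep a's first two axes (i, j); contract its last axis k with b's first."""
    K, L = len(b), len(b[0])
    return [[[sum(a[i][j][k] * b[k][l] for k in range(K))
              for l in range(L)]
             for j in range(len(a[0]))]
            for i in range(len(a))]

a = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]                 # I=2, J=2, K=2
b = [[1, 0, 1],
     [0, 1, 1]]                        # K=2, L=3

out = times_output_rank2(a, b)         # I=2, J=2, L=3
print(out)  # [[[1, 2, 3], [3, 4, 7]], [[5, 6, 11], [7, 8, 15]]]
```

The output keeps the 2D arrangement [I x J] of A's "patterns" and appends B's free axis L, matching the [I x J x L] shape stated above.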