<<< back

Streaming Reduction Circuit

Publication

This page summarizes the paper Streaming Reduction Circuit, and gives a short visualization of the algorithm in this paper. The publication can be cited as:

- Marco Gerards, J. Kuper, A. Kokkeler, and B. Molenkamp. Streaming reduction circuit. In Proceedings of the 12th EUROMICRO Conference on Digital System Design, Architectures, Methods and Tools, Patras, Greece, pages 287-292, 2009. [ DOI | http ]

Proceed to the demonstration of the algorithm

Problem statement

The publication describes an algorithm and hardware implementation that supports floating point sparse matrix-vector multiplication (SM×V) on an FPGA.

To demonstrate the complexity of the problem, we start with a short example of such an SM×V calculation. A sparse matrix is a matrix that contains many zero elements. For matrix multiplication, these elements are skipped for efficiency reasons. The following calculation determines the product of a sparse matrix and a vector:

For each row in the result entry, the multiplications with zero elements are skipped. This reduces the required calculations significantly. The number of non-zero elements in the a sparse matrix row is assumed to be variable, hence the number of multiplications and additions for each row are variable as well. For this example, triples that contain the matrix value, vector value and result row index are created:

(1,1,1), (1,5,1), (1,2,1), (1,5,2), (4,2,2), (1,2,3), (2,3,3), (1,3,4), (3,1,5), (1,8,5), (1,8,6)

The values are streamed into a floating point multiplier that gives results in a stream of output values:

(1,1), (5,1), (2,1), (5,2), (8,2), (6,3), (3,4), (3,5), (8,5), (8,6)

It remains to add these values using an off-the-shelf floating point adder. Due to pipelining of floating point adders, the output appears p clock cycles after the inputs were fed into the adder. For example, to calculate 1.0+2.0+5.0, first 1.0 and 2.0 are placed at the input, and after p clock cycles, the adder output 3.0 and 5.0 together enter the adder such that the result 8.0 is produced after another p clock cycles. As stalling the input until the result is available is not efficient, we designed an algorithm that schedules the adder pipeline and implemented this algorithm in hardware. The hardware overhead of this scheduler is less than an additional adder.

Our solution

The algorithm works as follows. Each clock cycle, a value is placed in the input FIFO (leftmost buffer in the picture below). The algorithm begins when the second value is in the input buffer. Whenever a result appears at the output of the adder, it is placed in the output buffer, i.e, a shift register (rightmost buffer in the picture below). The floating point pipeline is scheduled using a few simple rules:
  1. If the row of the current adder output matches a value in the output buffer, they are added; else,
  2. If the row of the first element in the input buffer matches that of the current adder output, they are added; else,
  3. If the input buffer contains at least two elements, and the top two elements have the same row index, they are added; else,
  4. If the input buffer contains at least two elements, the top element enters the adder together with 0.0; else,
  5. No operation takes place, and a dummy addition is performed.
Using these five rules, and by using proper (but rather small) buffers, the input can be streamed into the reduction circuit and the adder is scheduled with high utilization. The paper proves mathematically that the algorithm works and never overflows the buffers, and describes a hardware implementation.

Demonstration of the algorithm

This demonstration shows how to add values from the same row (by color). Rows appear consecutively at the input and are scheduled to be added using a single pipelined floating point adder. The pipeline depth and input rows can be configured below. Click on "Restart simulation" to restart after changing these parameters.

Pipeline depth (p):
Length of input rows:


Last updated: 2015-11-27