| While hardware designers have long exploited pipelining and caches to improve performance, these proven methods go largely unexploited by programmers at the software level. When multiple processors are utilized, pipelining and caching in software at the procedural level provides performance benefits in the same way that pipelining and caching in hardware does. Furthermore, complex control logic and cache memories too difficult to implement in hardware are feasible in software, making high level strategies and abstractions for pipelining and prefetching possible. This paper discusses these methods in the context of exploiting parallelism, and are illustrated in the design and implemention of a prefetching and pipelined movie player, Hum. |
|
Precedents For Pipelining |
The precedents for pipelining in computing comes largely from hardware,
where pipelined designs are used in processor functional units, memory
hierarchies, networks, graphics cards, and numerous other subsystems.
Many of the historically significant performance improvements to these
devices can be credited to designing circuits which utilize pipelining.
Pipelining has become so prevalent in computer hardware that the term pipe
has become inextricably tied to processor functional units and networks.
Despite the proven benefits and ubiquity of pipelining in computing,
application programmers rarely implement them.
The primary advantage of pipelining is the latency hiding it can exhibit by masking the time of stages of the pipeline. Performing each stage of the pipeline concurrently exposes as many degrees of parallelism as there are segments. While sequentially the total time of performing a sequence of $T$ tasks on $D$ units of data is $T * D$, pipelined concurrent tasks use only $T + (D - 1)$ time. In this way, pipelines can be utilized to substantially reduce the processing time of data. Performance anomalies for pipelines include stalls and bubbles. The former occurs when the next pipeline segment is still busy when new data becomes available, and the latter occurs in a segment which is ready for data but has not recieved any from the previous segment. In the following example, three units of data are processed by a four stage pipeline. Data is injected every other timestep (steps 1, 3, and 5), but segment B's time complexity varies with data so that data block 0 requires one period, block 1 requires three periods, and block 2 two periods. Such a sequence would progress as:
This example also illustrates the importance of having balanced pipeline stages in order to achieve high utilization. |
|
Prefetching, Queuing, and Caching |
Pipelines are most effective at maximizing resource utilization when
they are kept full, and they reduce latency most when started early.
Both these goals can be met by speculatively issuing work to pipelines.
Prefetching, in concert with queuing and caching, can be used to
implement a tunable, and adaptive strategy for improving average
case latency and resource utilization. When operating at full
efficiency, images can be produced as quickly as they are consumed
and have zero latency.
Hum's cache stores decoded images for immediate display, provides a place for speculatively loaded images to reside until they are displayed, and stores recently used images which may need to be redisplayed soon. The cache is arranged as lines each large enough to hold all the components (f.e.: stereo pairs and vector overlay) of a single frame. The smallest part of the cache which can be allocated is a cache line, and there must be at least one cache line for each pipeline to provide a destination for the output of each pipeline. |
|
Implementation of Hum |
Hum implements pipelining, prefetching, queueing and caching to exploit parallelism and reduce latency of display of moving pictures. This section describes how Hum implements the task of displaying a sequence of images using these features, and discusses how these methods are similar and dissimilar to their hardware counterparts. |
|
Pipelining in Software |
In software, the term pipelining often refers to a technique that seeks
to overlap loop iterations by partitioning the loop body into stages with
zero or more instructions in each stage. Software pipelining of loops
without a recurrence is often performed automatically by an optimizing
compiler. Another form of pipelining in software is data pipelining
in which large data is streamed through small memory. This approach
is also referred to as the out-of-core method, and usually exploits no
parallelism.
While software pipelining can be done automatically by an optimizing compiler, it is only currently practical on small, relatively simple loops. Procedural pipelining is similar to software pipelining in that it is segmenting a process into stages which will be performed concurrently but takes place at the level of procedures rather than processor instructions. Because procedural pipelining is applied at an abstraction level higher than automatic optimization, pipelining almost always presents an unexploited optimization opportunity. Procedural pipelining can be used by itself, or in conjunction with other types of parallelism to maximize the number of degrees of available parallelism in an application. |
|
Load - Decompress - Decode Pipeline |
Hum uses a procedural pipeline:
The size of the pipeline segments is configurable at runtime to allow tuning for installation specific disk and computational characteristics. Adjusting the buffer size of a segment changes the amount of work that a segment can do before synchronizing with the previous segment to get more data. The size of the input stage of the pipeline should be chosen to be a multiple of the disk block size to maximize the efficiency of the filesystem. Subsequent stage sizes can be chosen to match the first segment, or may be smaller to facilitate data reuse. Segments which are larger than their predecessors never fill any further than the size of the smallest previous segment. Synchronization between the pipeline segment threads is implemented using mutual exclusion locks. A single lock is associated with the input buffer of each segment. Segments which still have data in their input buffers hold the lock which prevents the previous segment from updating the input buffer of the busy segment. The first segment has no input buffer since it is reading the data from disk rather than getting data from a previous segment, but it still has a mutual exclusion lock which controls when the first segment can start to perform work. This lock is controlled by the dequeueing and display routines which load frames resulting from speculation or display. Hum allows unique buffer sizes for each segment, and a future modification will make it more efficient by adding fixed buffer sizes with double buffering. |
|
Prefetch, Queue, and Dequeue Processes |
Hum is implemented using independent prefetching, queuing, and dequeing
processes to increase the amount of potential concurrency while
enjoying the benefits of reduced latency and higher machine utilization.
Prefetching can be initiated by any of the image display threads when
a displayed image is being replaced by another image. Zero
or more images which are not already in the cache, queue or
currently in a pipeline, can be requested
in advance of their display. The prefetching mechanism can
also work in a synchronous fashion, blocking until the
requested images are ready for display.
Prefetch requests from all display threads are aggregated in nondeterministic order into a single queue from which frames are dequed by a seperate thread. It is the deque thread's responsibility to dispatch requests from the queue to available resources. The deque thread also implements Hum's gang scheduling functionality. Multi-component movies, such as stereo or vector overlay movies, have all the components of the frames gang scheduled to ensure affinity towards completing a single frame before starting a new frame. Components may be issued concurrently or are tailgated as pipelines become available. Hum implements a mechanism to suppress worst case performance of the prefetching mechanism by coupling the cache to the queue, causing a flushing of the queue when cache replacements are unreferenced. This coupling of the cache to the queue prevents a sequence of likely bad guesses (assumed because the cache is full of unused data) from filling the queue, and thus increases the chances for urgently needed frames to enter the queue at the front. |
|
Decoded Image Cache |
Decoded images are stored in a cache of user specified size using a
modified least recently used replacement cache line strategy.
In addition to the
least recently used list, each image stored in the cache has counters to
indicate the number of outstanding references and the number of times it
has been referenced. The former counter indicates that the frame is currently
being displayed and could be redrawn/reused at any time, and the latter
indicates if a frame has been loaded but not shown yet. These counters
are used to optimize cache utilization by preventing a frame currently
being displayed from being evicted (number of outstanding references is
greater than zero) and from being evicted before it is used (number of
reads equal to zero). Hum's order of preference for cache replacement is:
A cache line consists of buffer space for all the components of a single frame. The components are not contiguous in memory, but are allocated by the decoder after the sizes of the components are discovered. The first component of a frame to enter a decode segment causes the reclamation of a cache line which the frame is then assigned to. Subsequent components of a frame entering the decode segment use the cache line assigned to the first component. High cache pressure can cause the cache to interact with the queue, flushing all pending requests if the cache is full of unreferenced data. Flushing the queue has the desireable side effect of temporarily superceding the prefetching and queing configurations with a best-case configuration. |
|
Miscellanious Parallelism |
Hum invokes a seperate thread for every screen on the display.
On systems which have dedicated hardware for each screen, Hum is
able to drive all the screens concurrently so the number of
screens which Hum can support is scalable. Multiple windows
on a single screen are drawn sequentially, ensuring there are
no unnecessary context switches on the graphics hardware.
All the controls across windows and screens can be linked to
allow panning and zooming in a power wall type multiscreen
display.
Each window has a semitransparent user interface made up of iconifiable and adjustable interface tiles which display state, performance, a probe, and various controls. The entire interface can be toggled off and on. Hum can also open borderless windows which occupy the entire display, allowing seamless tiling or full screen presentation. Processing of input events is computationally insignificant relative to the other tasks Hum is doing, nevertheless, input events are processed by a seperate thread allowing good interactive response under all but the most pathological conditions. In all, the total number of threads executing in an invocation of Hum is: $Npipes_{read} [+ Npipes_{Decompress}] + Npipes_{Decode} + Nscreens_{OpenGL} + InputEventHandler + Deque$ |
|
Tuning Hum |
There are several tuning parameters for Hum, including
pipeline segment buffer sizes, number of pipelines,
and frame rate.
Effective tuning can be achieved simply by changing the
number of pipelines used.
Segment buffer sizes can be changed to accomodate filesystem
physical block sizes and processing capability.
Minor extensions would make
it possible to dynamically prioritize each pipeline
segments relative to any other segment in any pipeline,
allowing more urgently needed frames to take precedence
when the machine is oversubsribed.
The size of the cache determines the useful extents of prefetching and queuing parameters, and should be chosen to accomodate meaningful prefetching and queuing paramaters, not vice versa. When the requested playback rate exceeds either the rate at which requests can be dequeued or the rate which frames can be redrawn, speculation is likely to be imperfect. In this situation, Hum tends to prefetch near misses. This can be used to the user's advantage for tuning: a deep queue will favor clustered guessing without side effects. The deep queue either reduces latency by sucessfully prefetching the correct frame, or induces the flushing of the queue resulting in the frame's retrieval in pipelined time using a blocking re-request. Ultimately, fine-tuning is dataset-installation specific, and the size of the image cache will limit the amount of latency which can be hidden by Hum. Tuning is further confounded by cutoffs and limitations in the runtime system and underlying hardware architecture. It is common for a runtime or hardware scheduling mechanism to fail nonlinearly, such that a miniscule change in playback rate can dramatically change the suitability of a particular tuning strategy. |
|
Testing Procedures |
Both performance and correctness tests are used on
Hum continuously during development and when our machines
are not otherwise in use. Our goal is to use every available
cycle on Hum testing, and we routinely run instrumented versions
of Hum for hundreds of hours on all our machines.
Once correctness is established,
performance is measured under a variety of ideal and pathological
conditions. Modifications which diminish performance are either
corrected or rejected.
There are two aspects to performance testing: measuring the amount of unnecessary work, and measuring the unutilized resources which would have improved performance. Hum must balance the incompatible goals of improving performance by speculatively issuing as much work as possible, while eliminating as much work as possible to allow resources to be dedicated to relevant work. The simple metric of frames per second can be used to measure both, but if performance is poor the cache miss rate and pipeline utilization must be examined to determine what is at fault. Correctness testing includes extensive exception testing and portability tests. All conditional branches in Hum have an {\it else} or {\it default} clause which act as correctness assertions. If an exception case is a potentially valid state, synthetic tests are created which cause the exception case to be exercised. This is usually done by inducing a load imbalance at appropriate places in the code. Hum is implemented strictly adhering to ANSI-C, POSIX, X11, and OpenGL standards to ensure portability, and has been ported to Solaris, Intel and PPC Linux, IRIX, and DEC OSF Unix. The lack of machine-dependent code serves as an additional correctness test in that different compilers, operating systems, and computers will stress exactly the same source code in different ways, exposing weaknesses in the software not present on all systems. Hum's source code makes extensive use of automatic code generation using the macro language M4. Unlike the C Preprocessor, M4 is a complete macro language featuring recursion, variable argument lists, and whitespace conservation. Automatic code generation reduces redundancies in the code to be maintained while still allowing explicit inlining of performance critical code. Code generation also provides a means to implement a routine once, but have code generated for multiple intrinsic types. Using M4 improves both the maintainability of the source code and it's robustness via reuse. |
|
Tiling and Wavelets |
Hum implements internal tiling which allows a single image to
be represented internally as a set of smaller tiled images.
This feature is not intended to improve performance, but rather
as a workaround for graphics hardware with limits on the size
of the image which can be rasterized. For example, SGI
InfiniteReality series graphics cards cannot rasterize an
image wider than 4096 pixels, so display of images wider than
that must be tiled into smaller images less than 4096 pixels
wide.
Another type of tiling of interest in Hum is tiled input images. Tiled images can be loaded concurrently reducing the latency of loading a single large image. If all the tiles of an image will not be displayed it is not necessary to load unseen tiles, allowing resources to be dedicated to the data which will be used. The prefetching and gang scheduling mechanisms can be extended to schedule displayed tiles to be processed first, speculatively loading nearby unseen tiles into cache as time allows. The ability to efficiently load large amounts of data in a scalable fashion can potentially result in a bottleneck at the graphics display device(s). While it is possible to dedicate many processes and I/O devices to the task of loading and decoding data, all the data must still funnel through each graphics device. Thus, it is necessary to reduce the size of the images to make balanced loading and display possible. Image formats using wavelet compression methods have the advantage of encoding increasingly higher resolution versions of the image successively in the file, which is synergistic with the pipelined decoder in Hum. Effectively implementing levels of detail via wavelets enables arbitrarily large images to be displayed without overwhelming graphics devices with data. Hum can take advantage of progressively decoded images by displaying partially decoded images using the {\tt -loose} option, which ensures the best possible quality image is always available for display as decoding proceeds. |
|
Experimental Results |
Hum has been tested on a variety of Unix workstations ranging from
a 90MHz Intel Pentium with 24MB of memory to a 32 processor
Origin 2000 with 32GB of memory. On monoprocessor computers, the
heterogenous load of disk, computation, and graphics exploit most
architectures' innate ability to drive external interrupt driven
devices asynchrnously. Speedups of 20-80\% are seen when Hum's
pipelining (and hence, parallelism) is enabled compared to a
sequential run on the same system.
Depending on the dataset and computer, it is possible to completely mask several seconds of latency, allowing very large images to be animated indefinately. Using 18 pipelines on a 24 processor Origin, Hum was able to animate 4096x2304 images to three screens using lossless compression at 8.7 frames per second, using approximately 130MB/s of disk bandwidth. Furthermore, the 5 gigabyte memory cache made it possible to mask approximately 7 seconds of latency. Since this experiment, enhancements to Hum have likely improved this to over 10 frames per second. At SuperComputing2002, ROTANG animated 8192x4600 images on three displays at 4fps while spooling from disk. |
|
Conclusions |
Procedural pipelining excels at masking the cost of serialism and can be
used to exploit parallelism where data parallelism is not possible,
and can expose heterogeneous parallelism which is
synergistic with modern architectures using asynchronous peripherals.
The relatively small number of degrees of parallelism are also well suited
for relatively small number of processors in shared memory multiprocessor
computers.
Hum's basic architecture requires nothing of the image encoding format other than it does not have inter-frame dependencies. While it is possible to use multi-frame encoding methods, they would cause the pipeline to become very deep, defeating the latency hiding, and could require the decoding of unused frames reducing the effectiveness of the cache and resources used. Hum's architecture and implementation is extendible to other image encoding formats, and we are particularly interested in wavelet methods due to their synergy with pipelining and myriad other benefits. The use of symmetric multiprocessor computers with multiple displays is already well established in scientific visualization and digital cinema. Specialized software such as Hum is needed in order to maximize the potential of these larger systems, as well as address the limitations of computability and the capabilities of hardware. Hum is designed and implemented for arbitrarily large datasets on parallel computers with performance as a primary consideration, and we expect it will serve users well through the next several generations of such comptuer systems. |