HUM

A Parallel Pipelined Motion Picture Program -- By Jace A Mogill


D R A F T
While hardware designers have long exploited pipelining and caches to improve performance, these proven methods go largely unexploited by programmers at the software level. When multiple processors are utilized, pipelining and caching in software at the procedural level provides performance benefits in the same way that pipelining and caching in hardware does. Furthermore, complex control logic and cache memories too difficult to implement in hardware are feasible in software, making high level strategies and abstractions for pipelining and prefetching possible. This paper discusses these methods in the context of exploiting parallelism, and are illustrated in the design and implemention of a prefetching and pipelined movie player, Hum.

Introduction
The metaphor of the pipeline is used pervasively in computing to describe the overlapping of sequential operations in time. The analogy is most commonly used to describe the instruction units in processors and data passing through networks, but can also be applied to software at the procedural level. On multiprocessor computers pipelining in software at the procedural level provides performance benefits in the same way that pipelining processor instructions or network packets in hardware does. Pipelining is most effective and versitile when used in conjunction with prefetching to keep pipelines full, queuing to facilitate gang scheduling, and caching to store speculated results.

The movie player software Hum implements all of these features, and functionalities such as image display and input event processing are also handled by individual threads. The design attempts to maximize the number of degrees of parallelism in a task with little intrinsic parallelism. Further parallelism within segments may be implemented in cases where image format decoding can be done concurrently, or input requests can be decomposed and performed concurrently. Hum is designed to replay arbitrarily long series of large images, even in the presence of imperfect data flow, and allows interactive panning and zooming of images larger than the display device(s). Hum is implemented using Pthreads which allows it to utilize multiple processors in shared memory multiprocessor computers, and presents a heterogeneous concurrent load of computation, disk, and graphics work which is synergistic with modern architectures in which these subsystems are typically asynchronous and interrupt driven.

To illustrate how all the threads work together, this block diagram of Hum depicts each thread as a labeled box. This figure depicts an invocation of ten threads: two pipelines (three threads each), the queue, the two screen drivers, and the user input handler. Execution begins at the right of the figure with a user input or the display of the first image. Initially the cache is empty, so the first frames are queued then dequeued and sent to the pipelines. The decoded data is stored in the cache, and a decision is made whether or not to flush the queue. When the data is available in the cache, the image is displayed immediately, prompting expected frames to be prefetched.

Bulk image data flow is represented by thick lines. The thin lines represent control information, generally just a frame number. Multiplexors serialize flow of concurrent inputs to the queue's input and flushing mechanism. The queue itself operates asynchronously, dequeueing frames as pipelines become available, and flushing when requested.

Hum is intended to be an interactive scientific tool in addition to a passive movie player. A semi-transparent user interface allows control of the frame rate and displays other annotation and performance statistics. Using the mouse the user can zoom and pan around the image. Other features include a location and value probe, duplication of the image for a wrapping effect, and multiple display windows which can be linked together. An input file listing the frame's stereo imagery and vector overlay components and annotation may be specified.



D R A F T

Precedents For Pipelining
The precedents for pipelining in computing comes largely from hardware, where pipelined designs are used in processor functional units, memory hierarchies, networks, graphics cards, and numerous other subsystems. Many of the historically significant performance improvements to these devices can be credited to designing circuits which utilize pipelining. Pipelining has become so prevalent in computer hardware that the term pipe has become inextricably tied to processor functional units and networks. Despite the proven benefits and ubiquity of pipelining in computing, application programmers rarely implement them.

The primary advantage of pipelining is the latency hiding it can exhibit by masking the time of stages of the pipeline. Performing each stage of the pipeline concurrently exposes as many degrees of parallelism as there are segments. While sequentially the total time of performing a sequence of $T$ tasks on $D$ units of data is $T * D$, pipelined concurrent tasks use only $T + (D - 1)$ time. In this way, pipelines can be utilized to substantially reduce the processing time of data.

Performance anomalies for pipelines include stalls and bubbles. The former occurs when the next pipeline segment is still busy when new data becomes available, and the latter occurs in a segment which is ready for data but has not recieved any from the previous segment. In the following example, three units of data are processed by a four stage pipeline. Data is injected every other timestep (steps 1, 3, and 5), but segment B's time complexity varies with data so that data block 0 requires one period, block 1 requires three periods, and block 2 two periods. Such a sequence would progress as:

Time Stage A Stage B Stage C Stage D
1 0
2 Bubble 0
3 1 bubble 0
4 Bubble 1 Bubble 0
5 2 1 Bubble Bubble
6 2(stalled) 1 Bubble Bubble
7 2 Bubble 1
8 2 Bubble 1
9 2 bubble
10 2

This example also illustrates the importance of having balanced pipeline stages in order to achieve high utilization.



D R A F T

Prefetching, Queuing, and Caching
Pipelines are most effective at maximizing resource utilization when they are kept full, and they reduce latency most when started early. Both these goals can be met by speculatively issuing work to pipelines. Prefetching, in concert with queuing and caching, can be used to implement a tunable, and adaptive strategy for improving average case latency and resource utilization. When operating at full efficiency, images can be produced as quickly as they are consumed and have zero latency.

Hum's cache stores decoded images for immediate display, provides a place for speculatively loaded images to reside until they are displayed, and stores recently used images which may need to be redisplayed soon. The cache is arranged as lines each large enough to hold all the components (f.e.: stereo pairs and vector overlay) of a single frame. The smallest part of the cache which can be allocated is a cache line, and there must be at least one cache line for each pipeline to provide a destination for the output of each pipeline.



D R A F T

Implementation of Hum
Hum implements pipelining, prefetching, queueing and caching to exploit parallelism and reduce latency of display of moving pictures. This section describes how Hum implements the task of displaying a sequence of images using these features, and discusses how these methods are similar and dissimilar to their hardware counterparts.
D R A F T

Pipelining in Software
In software, the term pipelining often refers to a technique that seeks to overlap loop iterations by partitioning the loop body into stages with zero or more instructions in each stage. Software pipelining of loops without a recurrence is often performed automatically by an optimizing compiler. Another form of pipelining in software is data pipelining in which large data is streamed through small memory. This approach is also referred to as the out-of-core method, and usually exploits no parallelism.

While software pipelining can be done automatically by an optimizing compiler, it is only currently practical on small, relatively simple loops. Procedural pipelining is similar to software pipelining in that it is segmenting a process into stages which will be performed concurrently but takes place at the level of procedures rather than processor instructions. Because procedural pipelining is applied at an abstraction level higher than automatic optimization, pipelining almost always presents an unexploited optimization opportunity. Procedural pipelining can be used by itself, or in conjunction with other types of parallelism to maximize the number of degrees of available parallelism in an application.

D R A F T

Load - Decompress - Decode Pipeline
Hum uses a procedural pipeline:
  • Read data from a data stream (disk, pipe, socket, etc.).
  • Decompress from a data stream (LZW decompression not complete yet)
  • Decode the image format to a native display format
Using decompression, Hum's pipeline is three segments deep and exposes three degrees of parallelism, without decompression the decompression stage is bypassed and the pipeline exposes two degrees of parallelism. Data encapsulation is complete: the read segment does not know if the stream is compressed or not, or what file format the data is in. Similarly, the decoder does not know if the input stream was compressed, or from where it was read.

The size of the pipeline segments is configurable at runtime to allow tuning for installation specific disk and computational characteristics. Adjusting the buffer size of a segment changes the amount of work that a segment can do before synchronizing with the previous segment to get more data. The size of the input stage of the pipeline should be chosen to be a multiple of the disk block size to maximize the efficiency of the filesystem. Subsequent stage sizes can be chosen to match the first segment, or may be smaller to facilitate data reuse. Segments which are larger than their predecessors never fill any further than the size of the smallest previous segment.

Synchronization between the pipeline segment threads is implemented using mutual exclusion locks. A single lock is associated with the input buffer of each segment. Segments which still have data in their input buffers hold the lock which prevents the previous segment from updating the input buffer of the busy segment. The first segment has no input buffer since it is reading the data from disk rather than getting data from a previous segment, but it still has a mutual exclusion lock which controls when the first segment can start to perform work. This lock is controlled by the dequeueing and display routines which load frames resulting from speculation or display. Hum allows unique buffer sizes for each segment, and a future modification will make it more efficient by adding fixed buffer sizes with double buffering.

D R A F T

Prefetch, Queue, and Dequeue Processes
Hum is implemented using independent prefetching, queuing, and dequeing processes to increase the amount of potential concurrency while enjoying the benefits of reduced latency and higher machine utilization. Prefetching can be initiated by any of the image display threads when a displayed image is being replaced by another image. Zero or more images which are not already in the cache, queue or currently in a pipeline, can be requested in advance of their display. The prefetching mechanism can also work in a synchronous fashion, blocking until the requested images are ready for display.

Prefetch requests from all display threads are aggregated in nondeterministic order into a single queue from which frames are dequed by a seperate thread. It is the deque thread's responsibility to dispatch requests from the queue to available resources. The deque thread also implements Hum's gang scheduling functionality. Multi-component movies, such as stereo or vector overlay movies, have all the components of the frames gang scheduled to ensure affinity towards completing a single frame before starting a new frame. Components may be issued concurrently or are tailgated as pipelines become available.

Hum implements a mechanism to suppress worst case performance of the prefetching mechanism by coupling the cache to the queue, causing a flushing of the queue when cache replacements are unreferenced. This coupling of the cache to the queue prevents a sequence of likely bad guesses (assumed because the cache is full of unused data) from filling the queue, and thus increases the chances for urgently needed frames to enter the queue at the front.

D R A F T

Decoded Image Cache
Decoded images are stored in a cache of user specified size using a modified least recently used replacement cache line strategy. In addition to the least recently used list, each image stored in the cache has counters to indicate the number of outstanding references and the number of times it has been referenced. The former counter indicates that the frame is currently being displayed and could be redrawn/reused at any time, and the latter indicates if a frame has been loaded but not shown yet. These counters are used to optimize cache utilization by preventing a frame currently being displayed from being evicted (number of outstanding references is greater than zero) and from being evicted before it is used (number of reads equal to zero). Hum's order of preference for cache replacement is:
  • Replace the least recently used image which does not have pending references and has already been displayed at least once ($N_{refs} == 0$ and $N_{reads} > 0$)
  • Replace the least recently used image which does not have any pending references ($N_{refs} == 0$)
  • Replace the least recently used image
This caching strategy mimics page demand virtual memory in that it allows more data to be used than can be held in memory. For example, if the decoded image cache is large enough for three images and four windows are open displaying four unique frames, Hum first loads, decodes and displays the three cacheable frames, then evicts the oldest of the three cached frames, replacing it with the fourth frame.

A cache line consists of buffer space for all the components of a single frame. The components are not contiguous in memory, but are allocated by the decoder after the sizes of the components are discovered. The first component of a frame to enter a decode segment causes the reclamation of a cache line which the frame is then assigned to. Subsequent components of a frame entering the decode segment use the cache line assigned to the first component.

High cache pressure can cause the cache to interact with the queue, flushing all pending requests if the cache is full of unreferenced data. Flushing the queue has the desireable side effect of temporarily superceding the prefetching and queing configurations with a best-case configuration.

D R A F T

Miscellanious Parallelism
Hum invokes a seperate thread for every screen on the display. On systems which have dedicated hardware for each screen, Hum is able to drive all the screens concurrently so the number of screens which Hum can support is scalable. Multiple windows on a single screen are drawn sequentially, ensuring there are no unnecessary context switches on the graphics hardware. All the controls across windows and screens can be linked to allow panning and zooming in a power wall type multiscreen display.

Each window has a semitransparent user interface made up of iconifiable and adjustable interface tiles which display state, performance, a probe, and various controls. The entire interface can be toggled off and on. Hum can also open borderless windows which occupy the entire display, allowing seamless tiling or full screen presentation.

Processing of input events is computationally insignificant relative to the other tasks Hum is doing, nevertheless, input events are processed by a seperate thread allowing good interactive response under all but the most pathological conditions.

In all, the total number of threads executing in an invocation of Hum is: $Npipes_{read} [+ Npipes_{Decompress}] + Npipes_{Decode} + Nscreens_{OpenGL} + InputEventHandler + Deque$



D R A F T

Tuning Hum
There are several tuning parameters for Hum, including pipeline segment buffer sizes, number of pipelines, and frame rate. Effective tuning can be achieved simply by changing the number of pipelines used. Segment buffer sizes can be changed to accomodate filesystem physical block sizes and processing capability. Minor extensions would make it possible to dynamically prioritize each pipeline segments relative to any other segment in any pipeline, allowing more urgently needed frames to take precedence when the machine is oversubsribed.

The size of the cache determines the useful extents of prefetching and queuing parameters, and should be chosen to accomodate meaningful prefetching and queuing paramaters, not vice versa. When the requested playback rate exceeds either the rate at which requests can be dequeued or the rate which frames can be redrawn, speculation is likely to be imperfect. In this situation, Hum tends to prefetch near misses. This can be used to the user's advantage for tuning: a deep queue will favor clustered guessing without side effects. The deep queue either reduces latency by sucessfully prefetching the correct frame, or induces the flushing of the queue resulting in the frame's retrieval in pipelined time using a blocking re-request.

Ultimately, fine-tuning is dataset-installation specific, and the size of the image cache will limit the amount of latency which can be hidden by Hum. Tuning is further confounded by cutoffs and limitations in the runtime system and underlying hardware architecture. It is common for a runtime or hardware scheduling mechanism to fail nonlinearly, such that a miniscule change in playback rate can dramatically change the suitability of a particular tuning strategy.



D R A F T

Testing Procedures
Both performance and correctness tests are used on Hum continuously during development and when our machines are not otherwise in use. Our goal is to use every available cycle on Hum testing, and we routinely run instrumented versions of Hum for hundreds of hours on all our machines. Once correctness is established, performance is measured under a variety of ideal and pathological conditions. Modifications which diminish performance are either corrected or rejected.

There are two aspects to performance testing: measuring the amount of unnecessary work, and measuring the unutilized resources which would have improved performance. Hum must balance the incompatible goals of improving performance by speculatively issuing as much work as possible, while eliminating as much work as possible to allow resources to be dedicated to relevant work. The simple metric of frames per second can be used to measure both, but if performance is poor the cache miss rate and pipeline utilization must be examined to determine what is at fault.

Correctness testing includes extensive exception testing and portability tests. All conditional branches in Hum have an {\it else} or {\it default} clause which act as correctness assertions. If an exception case is a potentially valid state, synthetic tests are created which cause the exception case to be exercised. This is usually done by inducing a load imbalance at appropriate places in the code. Hum is implemented strictly adhering to ANSI-C, POSIX, X11, and OpenGL standards to ensure portability, and has been ported to Solaris, Intel and PPC Linux, IRIX, and DEC OSF Unix. The lack of machine-dependent code serves as an additional correctness test in that different compilers, operating systems, and computers will stress exactly the same source code in different ways, exposing weaknesses in the software not present on all systems.

Hum's source code makes extensive use of automatic code generation using the macro language M4. Unlike the C Preprocessor, M4 is a complete macro language featuring recursion, variable argument lists, and whitespace conservation. Automatic code generation reduces redundancies in the code to be maintained while still allowing explicit inlining of performance critical code. Code generation also provides a means to implement a routine once, but have code generated for multiple intrinsic types. Using M4 improves both the maintainability of the source code and it's robustness via reuse.



D R A F T

Tiling and Wavelets
Hum implements internal tiling which allows a single image to be represented internally as a set of smaller tiled images. This feature is not intended to improve performance, but rather as a workaround for graphics hardware with limits on the size of the image which can be rasterized. For example, SGI InfiniteReality series graphics cards cannot rasterize an image wider than 4096 pixels, so display of images wider than that must be tiled into smaller images less than 4096 pixels wide.

Another type of tiling of interest in Hum is tiled input images. Tiled images can be loaded concurrently reducing the latency of loading a single large image. If all the tiles of an image will not be displayed it is not necessary to load unseen tiles, allowing resources to be dedicated to the data which will be used. The prefetching and gang scheduling mechanisms can be extended to schedule displayed tiles to be processed first, speculatively loading nearby unseen tiles into cache as time allows.

The ability to efficiently load large amounts of data in a scalable fashion can potentially result in a bottleneck at the graphics display device(s). While it is possible to dedicate many processes and I/O devices to the task of loading and decoding data, all the data must still funnel through each graphics device. Thus, it is necessary to reduce the size of the images to make balanced loading and display possible. Image formats using wavelet compression methods have the advantage of encoding increasingly higher resolution versions of the image successively in the file, which is synergistic with the pipelined decoder in Hum. Effectively implementing levels of detail via wavelets enables arbitrarily large images to be displayed without overwhelming graphics devices with data. Hum can take advantage of progressively decoded images by displaying partially decoded images using the {\tt -loose} option, which ensures the best possible quality image is always available for display as decoding proceeds.



D R A F T

Experimental Results
Hum has been tested on a variety of Unix workstations ranging from a 90MHz Intel Pentium with 24MB of memory to a 32 processor Origin 2000 with 32GB of memory. On monoprocessor computers, the heterogenous load of disk, computation, and graphics exploit most architectures' innate ability to drive external interrupt driven devices asynchrnously. Speedups of 20-80\% are seen when Hum's pipelining (and hence, parallelism) is enabled compared to a sequential run on the same system.

Depending on the dataset and computer, it is possible to completely mask several seconds of latency, allowing very large images to be animated indefinately. Using 18 pipelines on a 24 processor Origin, Hum was able to animate 4096x2304 images to three screens using lossless compression at 8.7 frames per second, using approximately 130MB/s of disk bandwidth. Furthermore, the 5 gigabyte memory cache made it possible to mask approximately 7 seconds of latency. Since this experiment, enhancements to Hum have likely improved this to over 10 frames per second. At SuperComputing2002, ROTANG animated 8192x4600 images on three displays at 4fps while spooling from disk.

D R A F T

Conclusions
Procedural pipelining excels at masking the cost of serialism and can be used to exploit parallelism where data parallelism is not possible, and can expose heterogeneous parallelism which is synergistic with modern architectures using asynchronous peripherals. The relatively small number of degrees of parallelism are also well suited for relatively small number of processors in shared memory multiprocessor computers.

Hum's basic architecture requires nothing of the image encoding format other than it does not have inter-frame dependencies. While it is possible to use multi-frame encoding methods, they would cause the pipeline to become very deep, defeating the latency hiding, and could require the decoding of unused frames reducing the effectiveness of the cache and resources used. Hum's architecture and implementation is extendible to other image encoding formats, and we are particularly interested in wavelet methods due to their synergy with pipelining and myriad other benefits.

The use of symmetric multiprocessor computers with multiple displays is already well established in scientific visualization and digital cinema. Specialized software such as Hum is needed in order to maximize the potential of these larger systems, as well as address the limitations of computability and the capabilities of hardware. Hum is designed and implemented for arbitrarily large datasets on parallel computers with performance as a primary consideration, and we expect it will serve users well through the next several generations of such comptuer systems.


ROTANG -- ZOOM! NOT BOOM! Find out more about who ROTANG is and how it is powered by M4.
ROTANG produces the world's finest optional software.
This browsing experience is Copyright 2003, ROTANG.