Module 10.2

The Future of Digital Image Processing: Parallel Processing

Nickolas Faust
The Electro-Optics, Environment, and Materials Laboratory
Georgia Tech Research Institute
Georgia Institute of Technology
Atlanta, GA 30332

Direct comments to:  diana@bismarck.gtri.gatech.edu
jairo@bismarck.gtri.gatech.edu
 

Contents

Introduction
Supercomputing Techniques
Parallel Processing Power
Parallel Processing Software
Custom Designed Silicon Chips


Introduction

Characteristics

Major changes have occurred in the computing environment that fundamentally alter the way in which we process image and GIS data. Interactive techniques are replacing complex pattern recognition and enhancement techniques that, in the past, required long processing times. Image enhancement using spatial filters, geometric correction of images, and pattern recognition functions can now be applied interactively with immediate feedback to the image analyst. Techniques that were computationally prohibitive are now becoming reasonable within a digital image processing environment. Spatial image enhancement using Fast Fourier Transforms, the incorporation of expert system inference engines, and the full use of the three-dimensional nature of the real world in spatial analysis are examples of techniques that have previously not been available to the general user, but are becoming more useful as computation speeds rise.

Several commercial vendors will begin providing satellite data of the earth's surface at resolutions as fine as 1 meter within the next year or two. The massive amounts of data that will become available from these sensors will necessitate the use of advanced computing technology to allow processing at such a high resolution. Decisions will need to be made as to the appropriate resolution of multispectral imagery that is necessary to achieve accurate mapping at various scales. NASA will be providing satellite hyperspectral sensing capability within the next two years that will, for the first time, provide a generally available hyperspectral data set for multiple applications. New analysis techniques will be developed using advanced computing technology that will allow the detailed investigation of hyperspectral data and the application of pattern recognition techniques to provide highly accurate spectral discrimination.

GIS has been built upon the concepts of spatial data fusion. The integrated use of GIS data in the image analysis process will be a first step toward the technical capability to extract features from multi-source, multi-sensor data sets. The ability to employ, in automated or semi-automated image analysis, the same spatial reasoning attributes that are present in human interpreters is a research area that is heavily dependent on the implementation of advanced computing techniques and systems.

Rapid Growth in Remote Sensing

Remote sensing techniques and data are becoming more widely used due to a number of factors. People are becoming used to seeing locational information overlaid on top of images. This is occurring increasingly as low-end spatial display products such as ESRI's ARCVIEW and ERDAS' MAPSHEETS are incorporated into decision making at the personal, corporate, and governmental levels. The capability of viewing satellite images via the Internet has greatly enlarged the base of potential users of remote sensing information. This growth is in part due to the major advances in the capability of computer hardware to perform rapid geospatial analysis. That growth in capability is, in turn, partially due to the design of faster and faster central processing units (CPUs). The speed of CPUs depends on the ability of computer vendors to pack more and more capability into a limited amount of space on a computer chip. The fabrication of CPU chips, as well as Application Specific Integrated Circuits (ASICs), is approaching the theoretical physical limits on how closely individual electronic elements such as transistors may be placed within a silicon chip. The advance in processing rate mirrors a rule of thumb known as Joy's Law, which states that CPU speed in terms of millions of instructions per second (MIPS) will double every year.
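
As a rough illustration of what a doubling every year implies, the following sketch projects processor speed forward under Joy's Law as stated above. The function name and the 100 MIPS starting value are hypothetical and chosen only for the example.

```python
def projected_mips(initial_mips: float, years: int) -> float:
    """Speed after `years` years if speed doubles every year (Joy's Law as stated)."""
    return initial_mips * 2 ** years


if __name__ == "__main__":
    # Hypothetical starting point of 100 MIPS, projected over five years.
    for year in range(6):
        print(year, projected_mips(100.0, year))
```

Under this rule a processor is eight times faster after three years and thirty-two times faster after five, which is why a limit on chip density matters so much to the projection.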
 

 
Advanced special purpose architectures such as ASICs may be used to create dedicated, very high speed functionality on a computer chip without the chip having to implement the generalized capability necessary for a CPU. ASICs have been developed for a number of image processing and computer graphics functions such as image convolution and three-dimensional perspective scene rendering.

Advances in computer networks have been made that allow multiple CPUs on different workstations to operate together in the solution of a well designed problem. Many problems may be separated into a number of discrete steps that may be sent to individual CPUs tied together by a network. For an image processing task such as pattern recognition, a large image may be processed by multiple CPUs, each considering only a small part of the total area to be processed. The raw data from the separate geographic areas and the results from the individual processors have to be sent via the network to or from the CPU that is managing the whole classification task. A UNIX protocol has been developed that allows a master CPU process to send queries to other CPUs on a network to determine which of them are available for assignment to a part of a multi-CPU analysis. Once an available CPU is found, part of the overall process is downloaded, along with its necessary data, to the remote CPU. The resulting processed data are sent back to the originating CPU when the remote process is completed. These tools have become known as the Parallel Virtual Machine (PVM) system.

Parallel processing is generally more complex than multi-processing, in that complex algorithms are analyzed and decomposed into discrete parts which can run simultaneously on multiple CPUs to achieve major speedups in performance. Early parallel processing systems required extensive user interaction and essentially the rewriting of code to take advantage of the parallel systems. While some tailoring of algorithms still has the potential for greater speedups, optimizing compilers have been developed that will accept FORTRAN and C code, analyze it, and create code that takes advantage of the parallel hardware. This will be described in more detail later.

Another major factor in the advances in the use of remote sensing data is access to low cost digital storage. CDROMs have become part of most PC systems and UNIX workstations, and CDROM writers have become available at relatively low prices for people who wish to write their own CDROMs. New technologies are becoming available that will greatly increase the storage capacity and access rates available on the desktop. As memory and disk storage continue to get cheaper, the bottlenecks that used to constrain the use of remote sensing data to large systems are disappearing.

Desktop computing has become the term for the long awaited merger of personal computer and workstation capability. The development of the Pentium processor and other advanced CPUs has provided substantial raw CPU capability at the desktop level. This power is harnessed by image processing software to provide a very effective tool for image analysis on a low cost platform. All image processing functions that have been implemented on workstations will also exist on the PC platforms. The price structure of the PC market will force digital image processing vendors to create lower cost products that may be used effectively in the desktop market. Image processing functions at the desktop should be as available as word-processing and spreadsheet functions and should allow the direct and simple inclusion of enhanced images in documents.

New advances in networking and communications will greatly affect the way remote sensing data are accessed and perceived. The availability of ISDN lines at the home will for the first time allow relatively high speed transfer of image data from archival storage at an office to a PC or workstation at the home. The rapid expansion of Internet use will provide a flexible and robust source for image data along with the capability of collaborative analysis of image data.
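
The master/worker decomposition described earlier in this section, in which a managing CPU hands pieces of a large image to other CPUs and collects the results, can be sketched as follows. This is only an illustration of the idea and not the PVM interface: threads in one Python process stand in for networked CPUs, and `split_rows`, `classify_tile`, `classify_image`, and the threshold classifier are all hypothetical names and choices made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(image, n_tiles):
    """Split a list-of-rows image into at most n_tiles contiguous row blocks."""
    rows = len(image)
    size = max(1, -(-rows // n_tiles))  # ceiling division, at least one row
    return [image[i:i + size] for i in range(0, rows, size)]


def classify_tile(tile, threshold=128):
    """Toy per-pixel classifier: 1 if the pixel value >= threshold, else 0."""
    return [[1 if px >= threshold else 0 for px in row] for row in tile]


def classify_image(image, n_workers=4):
    """Master: hand tiles to workers, then reassemble the results in order."""
    tiles = split_rows(image, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(classify_tile, tiles))  # map preserves tile order
    return [row for tile in results for row in tile]


if __name__ == "__main__":
    image = [[0, 200], [130, 5], [255, 127], [128, 128], [1, 2]]
    print(classify_image(image, n_workers=2))
```

Because this toy classifier looks at one pixel at a time, the tiled result is identical to classifying the whole image at once; operations that look at neighboring pixels need extra care at tile boundaries.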

Software Innovations

Advances in operating systems for lower-end computers have greatly improved the ability to perform image processing tasks in a desktop environment. Windows 95 and Windows NT are very strong multi-processing operating systems that are being exploited by image processing vendors to provide capabilities and interactions very similar to those available on UNIX boxes. Linux is a UNIX derivative for PCs that allows a workstation environment with X Windows support to be implemented on a relatively robust PC hardware system. Unfortunately, one cannot run Linux simultaneously with Windows. A user could have both operating systems on the same disk, however, and boot to one or the other on startup.

The speed of computing also allows access to advanced software techniques such as expert systems that, prior to this time, were too computationally intense to be available on low cost workstations or PCs. Investigations into the application of expert system tools to the process of image classification are ongoing and potentially will lead to commercial products with built-in knowledge bases. Inference engines will be available that will allow for analysis and assessment of confidence in the classification process.

Sophisticated operating systems are being developed that optimize FORTRAN and C code to take advantage of parallel processing opportunities. Optimizing compilers on several workstation systems look for operations that may be performed in parallel (i.e., that do not have dependencies on one another) and create optimized code that distributes the processing to numerous CPUs.
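
The kind of dependency a parallelizing compiler looks for can be illustrated with two small loops. This is a sketch written in Python for readability rather than FORTRAN or C, and all function names are hypothetical. In the first loop every iteration touches only its own element, so the iterations can run in any order or on different CPUs; in the second, each iteration needs the result of the previous one.

```python
def scale(values, factor):
    """Independent iterations: out[i] depends only on values[i]."""
    out = [0] * len(values)
    for i in range(len(values)):
        out[i] = values[i] * factor
    return out


def scale_any_order(values, factor, order):
    """The same loop body executed in an arbitrary iteration order."""
    out = [0] * len(values)
    for i in order:
        out[i] = values[i] * factor
    return out


def running_sum(values):
    """Loop-carried dependency: out[i] needs out[i - 1], so order matters."""
    out = [0] * len(values)
    for i in range(len(values)):
        out[i] = values[i] + (out[i - 1] if i > 0 else 0)
    return out


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(scale(data, 3) == scale_any_order(data, 3, [3, 1, 0, 2]))
    print(running_sum(data))
```

A loop like `scale` is a candidate for automatic distribution across processors, while a loop like `running_sum` would have to be restructured before it could be split.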



Supercomputing Techniques

Vector Processing

Vector processing has been the basis for most “supercomputer” systems in the last decade (Cheng, 1989). Scientific computing has a need for fast computation using floating point or greater precision. A vector processor is generally a single controlling processor which sequences a long data vector through a number of “pipelined” stages. A pipeline operation, as the name suggests, is the process by which a complex operation is broken into a number of independent sub-steps that can be implemented sequentially. Each step in a pipeline operation may be handled by a dedicated processing element (an adder, multiplier, etc.) and the results passed as input directly to the next processing element. Each step within an operation such as a vector multiply may also be broken down into simpler functions such as fetch, add, and store.

Each operation in this sequential process has an inherent execution time, and the total time for processing the first element of a vector is the sum of the individual execution times. However, once the first vector element has exited the first sequential step and entered the second step, the second vector element is entered into the first step. After the sequential pipeline is full, the time for processing each additional vector element is equal only to the time taken by the longest of the individual sub-steps. For long vectors, therefore, the pipeline processor gives significant speedups. For short vectors, however, the speedups may be much less and, in some cases, may not justify the use of a vector pipeline operation.

Vector pipelining assumes that the same operation is being applied to a large amount of data. The control and execution of the next instruction is sequential in nature, but the CPU must know whether the last batch of data has passed through the total pipeline. A timing interrupt or message passing strategy must provide this information.
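
The timing argument above can be made concrete with a simple model. The sketch below assumes an idealized pipeline with no stalls, in which the first element takes the sum of the stage times and each later element takes the time of the slowest stage. The function names and the stage times (1, 2, and 1 arbitrary units for fetch, add, and store) are hypothetical.

```python
def serial_time(stage_times, n_elements):
    """Time with no pipelining: every element passes through every stage in turn."""
    return n_elements * sum(stage_times)


def pipelined_time(stage_times, n_elements):
    """Fill the pipeline once, then emit one element per slowest-stage interval."""
    if n_elements == 0:
        return 0
    return sum(stage_times) + (n_elements - 1) * max(stage_times)


def speedup(stage_times, n_elements):
    """Ratio of unpipelined to pipelined time for a vector of n_elements."""
    return serial_time(stage_times, n_elements) / pipelined_time(stage_times, n_elements)


if __name__ == "__main__":
    stages = [1, 2, 1]  # hypothetical fetch, add, store times
    for n in (1, 4, 1000):
        print(n, pipelined_time(stages, n), speedup(stages, n))
```

With these stage times a one-element vector gains nothing, while the speedup for a long vector approaches the sum of the stage times divided by the slowest stage time (here 4 / 2 = 2). Balancing the stage times is therefore as important as adding stages.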

Vector Supercomputing

Vector pipelining was introduced in the late 1960’s and the early 1970’s on the Control Data STAR 100, the Texas Instruments Advanced Scientific Computer (TI-ASC), and the Cray-1 (Cragon, 1989; August, 1989). These systems were legitimately known as supercomputers (Rau, 1989; Jones, 1989). To measure the effectiveness of such computers, Dongarra (1987) developed a floating point performance measure based on a set of computer programs known as Linpack, with the result expressed in millions of floating point operations per second (MFLOPS). The Cray 1S supercomputer was evaluated as having a performance of 12 MFLOPS. Vector architectures have since greatly expanded the power of floating point computation, with the Cray X-MP having a performance measure of 235 MFLOPS per processor with up to four processors. Other supercomputer systems have the performance measures listed in Table 1. The currently available workstations, with 300 to 400 MIPS, normally have a floating point performance of between 100 and 300 MFLOPS. These are the systems that will be applied most directly to GIS/RS problems in the near future. The acquisition of Cray by SGI in 1996 will likely reduce any distinction between high power workstations and supercomputing.
 

Table 1

SYSTEM              MFLOPS per processor   MAX PROCESSORS
Cray-2              488                    4
Cray Y-MP           333                    8
CDC/ETA 10g         133                    8
IBM 3090s           1710                   1
Hitachi S-820-80    3000                   1
NEC SX 2            1300                   1

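
For readers unfamiliar with the unit, the sketch below shows how an MFLOPS figure like those above is derived from a timed run. It assumes the conventional Linpack operation count of 2n^3/3 + 2n^2 floating point operations for solving an n-by-n linear system; the function names are hypothetical, and the 12 MFLOPS rate in the example is the Cray 1S figure quoted in the text.

```python
def linpack_flop_count(n):
    """Nominal floating point operations to solve an n-by-n system (Linpack convention)."""
    return (2.0 * n ** 3) / 3.0 + 2.0 * n ** 2


def mflops(flop_count, seconds):
    """Millions of floating point operations per second for a timed run."""
    return flop_count / seconds / 1.0e6


if __name__ == "__main__":
    n = 100
    flops = linpack_flop_count(n)
    seconds_at_12_mflops = flops / 12.0e6  # run time implied by a 12 MFLOPS machine
    print(flops, seconds_at_12_mflops, mflops(flops, seconds_at_12_mflops))
```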
Vector Array Processors

One method that can be used to add more floating point performance to a workstation or stand-alone CPU is the addition of an attached vector array processor. Common array processing systems include those by Floating Point Systems, Mercury Data Systems, CSPI, and SKY. Normally, these systems are attached either by plugging directly into the bus of the workstation or through a parallel input/output channel. For maximum performance, a direct memory access (DMA) interface is necessary to minimize data transfer bottlenecks. Array processors normally rely on intensive vector pipelining, so only problems that can be approached in a way that guarantees long vectors will be efficiently implemented. If short vectors are used, there is a danger of spending more time transferring data than actually operating on the data. The efficiency of an array processor implementation of a particular problem is inversely proportional to the amount of time that the array processor spends idle.



Parallel Processing Power

When multiple CPUs or vector processors are linked together, the major differentiation between systems relates to the method of synchronization between the various processors and their memory. For a synchronous system, all operations are coordinated through a timing clock. The vector processing architectures described above depend explicitly on timing to send the input data stream to multiple processors and to the various subparts of a pipeline. Multiple processors may operate through local memories or through a global memory that is shared by all processors. If a processing algorithm needs only data that are not needed by any other processor, local memory may be used because communication across processors is minimized. If, however, an algorithm requires that data be shared between processors, a complex addressing scheme may be needed to avoid collisions in memory access and update. If four processors need to access the same memory location, then care must be taken to lock out the other processors during the instant that data are being read and to allow the next processor to read the memory as soon as it is available. If the algorithm is allowed to modify the contents of memory, the relative access order of the multiple processors could determine the final value. This result would be clearly undesirable. Dasgupta (1990) has developed a taxonomy for computing which represents serial and parallel processing alternatives.
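
The lockout requirement described above can be sketched with threads sharing one memory location. This is a minimal illustration using Python's standard `threading.Lock`, not a description of any particular hardware scheme; `run_updates` and its parameters are hypothetical names.

```python
import threading


def run_updates(n_threads=4, increments=10000, use_lock=True):
    """Have several threads add to one shared counter; return the final value."""
    total = {"value": 0}          # the shared memory location
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            if use_lock:
                with lock:        # lock out other threads during read-modify-write
                    total["value"] += 1
            else:
                total["value"] += 1   # unprotected update; may lose increments

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


if __name__ == "__main__":
    print(run_updates(use_lock=True))   # always n_threads * increments
```

With the lock in place the final value is always the number of threads times the number of increments. Without it, the result depends on how the updates interleave, which is the order-dependent outcome the text warns against; whether updates are actually lost in a given run depends on the interpreter and on timing.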

SIMD

Another kind of synchronous parallel processing environment is the single instruction, multiple data (SIMD) system (Duncan, 1990). For this type of system, multiple processors are required to execute the same instruction on multiple data streams. Image processing systems have been designed to take advantage of this architecture, and the same kind of SIMD system may be applied to most GIS/RS data sets. Synchronous timing is used to move data in and out of a SIMD system and to move data within the system. Two-dimensional and three-dimensional arrays of processors can be assembled with mesh and crossbar methods of memory addressing. For a two-dimensional array, a single processor may be assigned to operate on each picture element (pixel) of an input image and to write the results into an output pixel array, provided there are no data dependencies on output values held by another processor. If no such data dependencies exist, the problem is applicable to SIMD synchronous processing. Image array processors have been implemented in a SIMD mode in which multiple processors operate independently on a number of individual image pixels (Hogan, 1990).

If the operation to be performed involves not only point operations (those which act on a single x,y image pixel with multiple layers of information) but also area operations (in which the output pixel’s value depends not only on the x,y pixel’s value but also on its neighbors’ values), a memory access system must be employed to assure that no memory address conflicts arise. If the data to be processed are only available to individual processors through local memory, a mechanism must be employed to avoid boundary effects due to local memory boundaries. If the memory available to all processors is global, then only direct memory conflicts will potentially cause access problems. The experimental CLIP7A image processor uses SIMD elements with a certain level of autonomy (Fountain, 1988).
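
One common mechanism for avoiding boundary effects when each processor sees only its own block of data is to pad each block with a copy of its neighbors' edge values. The sketch below shows this for a one-dimensional 3-point mean, the simplest area operation. It is an illustrative assumption and not the scheme of any system cited above, and all names are hypothetical.

```python
def smooth(signal):
    """3-point mean over the whole signal; the signal's own ends are clamped."""
    n = len(signal)
    return [(signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3
            for i in range(n)]


def smooth_in_chunks(signal, chunk):
    """Process chunks independently, each padded with one neighbor sample per side."""
    n = len(signal)
    out = []
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        left = signal[max(start - 1, 0)]    # sample borrowed from the left neighbor
        right = signal[min(stop, n - 1)]    # sample borrowed from the right neighbor
        local = [left] + signal[start:stop] + [right]   # this processor's local memory
        out.extend((local[j - 1] + local[j] + local[j + 1]) / 3
                   for j in range(1, len(local) - 1))
    return out


if __name__ == "__main__":
    data = [4, 8, 15, 16, 23, 42, 7]
    print(smooth_in_chunks(data, 3) == smooth(data))
    # Without the borrowed samples, values at chunk edges come out wrong.
    naive = [v for s in range(0, len(data), 3) for v in smooth(data[s:s + 3])]
    print(naive == smooth(data))
```

A point operation would need no borrowed samples at all, which is why point operations map so directly onto one processor per pixel.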

MIMD

Multiple instruction, multiple data (MIMD) computer systems are the classic case that most of us think of when parallel computing is discussed. MIMD systems have multiple processors which may operate on different instructions and different data. MIMD machines do not have synchronous timing with the same instruction being performed; therefore, a sophisticated intercommunication scheme is necessary to tell each processor when to execute its instruction and on which data set to operate. Asynchronous processing allows each processor to perform a number of different operations on its own local data without regard to the neighboring processors. Each processor acts alone, but when it finishes its process, it must notify the other processors. Message passing between adjacent nodes of the architecture is normally the method by which one processor talks to its neighbors. This is normally considered loose coupling between the processors and memory (Hornstein, 1986). Because these systems employ local memory only, local memory to local memory transfers are necessary to update the state of the overall process. The topology of the MIMD architecture may employ a ring, mesh, tree, or hypercube structure.

MIMD processing is especially applicable to “coarse grained” parallelism, in which an application may be broken down into functional sub-units that can then be implemented on a number of processors. This is a high level parallel structure that may have one processor performing totally different operations from another processor. For example, in image processing, one set of processors may perform edge enhancement and detection while another performs edge chaining for polygons. The vector chaining procedure in this case cannot happen before the edge operations have been completed. A message must be passed from the processor performing the edge operations to all other processors saying that the edge results are available for further processing.
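
The edge detection and chaining example above can be sketched as two independent workers connected by a message channel. This is only an illustration of the message passing pattern: threads and a `queue.Queue` stand in for separate processors and their interconnect, the "chaining" step is reduced to collecting the positions of strong edges, and all names and the threshold value are hypothetical.

```python
import queue
import threading


def edge_stage(rows, outbox):
    """Stage 1: horizontal gradient magnitude per row; send each result on."""
    for index, row in enumerate(rows):
        edges = [abs(row[i + 1] - row[i]) for i in range(len(row) - 1)]
        outbox.put((index, edges))   # message: edge data for this row is ready
    outbox.put(None)                 # sentinel: no more messages will follow


def chain_stage(inbox, threshold, results):
    """Stage 2: wait for messages, then record positions of strong edges."""
    while True:
        message = inbox.get()        # blocks until stage 1 has sent something
        if message is None:
            break
        index, edges = message
        results[index] = [i for i, e in enumerate(edges) if e >= threshold]


def run_pipeline(rows, threshold=50):
    """Run both stages concurrently and return {row index: strong edge positions}."""
    channel = queue.Queue()
    results = {}
    producer = threading.Thread(target=edge_stage, args=(rows, channel))
    consumer = threading.Thread(target=chain_stage, args=(channel, threshold, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([[0, 0, 100, 100], [10, 90, 90, 10]]))
```

The second stage never polls shared memory; it simply blocks until a message arrives, which is the loose coupling the text describes.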



Parallel Processing Software

One of the greatest challenges in parallel computing is the development of the software that will allow full use of the capabilities of the new hardware systems (Prasanna-Kumar, 1989). Most of the new systems are trying to avoid the development of special languages for implementation of applications code and instead rely on optimization of FORTRAN and C code. SIMD systems have been developed with reasonably efficient parallelizing compilers because the synchronization between processors is a vital part of the architecture (Little, 1988). MIMD systems, on the other hand, often require special tailoring of the application program to achieve speedups. For example, in MIMD programs, the software developer must identify variables and sections of the code that must be kept in global memory with sophisticated access lockout protection. Other portions of the code may have variables in local memory that do not affect the other processors.

Transputers were designed along with the OCCAM programming language to allow efficient concurrent execution of loosely coupled processes with extensive message passing along one-way paths. A transputer may only address other transputers to which it has a direct connection. OCCAM is a special purpose parallel language which seeks to optimize the performance of loosely coupled parallel systems (Dinning, 1989).

A parallel system may have “coarse grained” or “fine grained” parallelism in its software implementation. Coarse grained parallelism involves the identification of whole code segments such as functions, expressions, or do loops that can be assigned to multiple processors. Fine grained parallelism, on the other hand, requires the definition of individual variables that must be shared between processors. MIMD machines require a detailed understanding of the application code, and often require a rewrite of the applications to adequately take advantage of the parallel hardware.

The implementation of an algorithm on parallel hardware may be divided into a number of processes. A heavyweight process such as the operating system may occupy resources and have high priority, while “threads”, or lightweight processes such as message passing, may also be implemented. There may be a number of threads within a process, each sharing memory and resources. Threads are one implementation of fine grained parallelism (Feitelson, 1990).



Custom Designed Silicon Chips

Another method of obtaining significant speedups in operation involves the design and implementation of special purpose chips that may be dedicated to a specific processing task. These chips, known as ASICs (application specific integrated circuits), have been designed for a number of applications including image processing, communications, and synthetic scene rendering. Silicon Graphics developed their proprietary “shading engine” and “transformation engine” for application in the simulation and visualization markets. We have seen above that the VITEK image processor has a dedicated, programmable, custom chip for performance of high speed computations on image data (Keller, 1990). The technology advances in recent years have allowed great expansion of the number of integrated circuits that may be placed on a chip. Very large scale integration (VLSI) and a new DoD-mandated VHSIC technology are being used for implementation of more and more complex processes.
 



Notes:

Geographic Information Systems and Remote Sensing Future Computer Environment.

Faust Nickolas L., Anderson William H., Star Jeffrey L.

1991 American Society for Photogrammetry and Remote Sensing.

