# A LARGE SCALE IMAGE PROCESSING SYSTEM TIP-4 PROTOTYPE

Yoshihiro Fujita Masao Iwashita Tsutomu Temma

C&C Information Technology Research Laboratories NEC Corporation

4-1-1, Miyazaki, Miyamae, Kawasaki, Kanagawa, 213 Japan

### ABSTRACT

The architecture of a large scale image processing system, TIP-4, and performance results for its prototype system are presented.

TIP-4 is a data flow computer that has 512 static data flow VLSI processors, ImPPs (Image Pipelined Processor: $\mu$PD7281). An ImPP is capable of performing 5 million instructions per second (5 MIPS), so TIP-4 has a maximum processing speed of 2560 million instructions per second (about 2.5 GIPS). The prototype system TIP-4P, which has actually been developed, has 64 ImPPs, so its peak computation rate is 320 MIPS.

To use 512 ImPPs efficiently, TIP-4 has the following hardware and software characteristics. (1) Processors are divided into clusters. Each cluster consists of 8 ImPPs, an interface LSI (MAC) and a 512K word local memory. The ImPPs and MAC are connected by a pipeline token ring bus. (2) A pipeline token ring bus also runs between clusters to support inter-cluster communication. (3) There is a large global memory that holds a large image and can exchange data with a local memory through a very fast block transfer data bus. (4) To process a set of programs efficiently, a data flow process scheduling monitor schedules processes and resources. A catalog language has also been developed to define parallel and serial process execution.

### INTRODUCTION

A static data flow VLSI processor, ImPP, was developed in 1984. Figure 1 shows an ImPP block diagram. An ImPP is a VLSI processor that combines a flexible pipeline architecture with a data flow architecture. The data flow architecture makes it easy to set up a multiprocessor configuration, and ImPPs are designed to be connected sequentially in an array. Figure 2 shows a TIP-3 system block diagram as an example of an ImPP multiprocessor system. TIP-3 consists of 8 ImPPs, MAGIC (Memory Access and General Bus Interface Chip), memory and a host computer. MAGIC is a peripheral LSI that supports memory access, host computer access and program loading. TIP-3 has a 40 MIPS maximum processing speed. However, this processing power is not enough to process larger images or video images. The authors therefore decided to build TIP-4, which uses TIP-3 as a cluster, with multiple clusters working together.

This paper first describes the TIP-4 hardware architecture that enables 512 ImPPs to work together. Second, the software environment is shown. Then a performance evaluation for a prototype system is presented.

### SYSTEM ARCHITECTURE

Figure 3 shows a TIP-4 cluster. The cluster IPU (Image Processing Unit) consists of 8 ImPPs, an interface LSI MAC (Memory Access Controller), local memory and a few interface buses.

Figure 4 shows the basic TIP-4 configuration. Clusters are connected by the main bus, image bus and inter-cluster pipeline ring bus.
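The peak rates and cluster counts quoted for TIP-3, TIP-4 and TIP-4P follow directly from the per-ImPP rate and the cluster size. The short sketch below only restates that arithmetic; the constant and function names are illustrative and do not come from the TIP-4 software.

```python
# Figures taken from the text; all names here are illustrative only.
IMPP_MIPS = 5            # peak rate of one ImPP (5 MIPS)
IMPPS_PER_CLUSTER = 8    # one cluster holds 8 ImPPs


def peak_mips(num_impps: int) -> int:
    """Peak instruction rate in MIPS for a system with num_impps processors."""
    return num_impps * IMPP_MIPS


def cluster_count(num_impps: int) -> int:
    """Number of 8-ImPP clusters needed for num_impps processors."""
    return num_impps // IMPPS_PER_CLUSTER


cluster_peak = peak_mips(IMPPS_PER_CLUSTER)   # one cluster, same as TIP-3
tip4_peak = peak_mips(512)                    # full TIP-4
tip4p_peak = peak_mips(64)                    # prototype TIP-4P
tip4_clusters = cluster_count(512)
tip4p_clusters = cluster_count(64)
```

Under these figures the full system has 64 clusters and the prototype has 8, which matches the cluster counts used in the performance measurements.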





IAPR Workshop on CV - Special Hardware and Industrial Applications OCT.12-14, 1988, Tokyo



Figure 2. TIP-3 Image Processing System





<u>a)MAC</u> MAC is a new interface LSI, which works like MAGIC ($\mu$PD9305) with extended functions, such as inter-cluster communication and block data transfer control between local and global memory. It is a 10K gate CMOS gate array with 208 pins.

<u>b)Inter-cluster Pipeline Ring Bus</u> The inter-cluster pipeline ring bus is used to synchronize processes in different clusters and for small data transfers between clusters. It has 8 bit data lines and transfers a 40 bit token divided into 5 stages. Standard tokens for ImPPs are 32 bits long. When an inter-cluster token request comes from an ImPP to MAC, MAC adds an 8 bit destination cluster number to that token to make a 40 bit inter-cluster token and sends it to the inter-cluster pipeline ring bus. When MAC receives a token from the inter-cluster pipeline ring bus, the cluster number in that token is compared with the cluster's own number. The token is brought into the cluster if the cluster numbers match.

<u>c)Local Memory</u> When computing power increases, memory access usually becomes a processing bottleneck, especially in a multiprocessor system. To avoid such a memory bottleneck, TIP-4 has a distributed local memory in each cluster. The processors can only access their own local memory. Each cluster has a 512K word (1 word = 18 bits) local memory.

<u>d)Global Image Memory</u> The global image memory is not directly accessed by ImPPs. It is used to exchange data between local memories, or between a local memory and the host computer. The host computer can access the global image memory through the main bus. DMA (Direct Memory Access) data transfer is also provided. The prototype system has an 8M word image memory.
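The token handling described for the inter-cluster ring can be sketched in a few lines. This is a minimal illustration, not the MAC logic itself: the text gives only the field widths (32 bit token, 8 bit cluster number, five 8 bit stages), so the position of the cluster number inside the 40 bit token and the byte order on the bus are assumptions, and all function names are hypothetical.

```python
from typing import List, Optional

TOKEN_BITS = 32      # standard ImPP token length (from the text)
CLUSTER_BITS = 8     # destination cluster number length (from the text)
STAGE_BITS = 8       # ring bus data width (from the text)


def make_intercluster_token(token: int, dest_cluster: int) -> int:
    """Prepend the destination cluster number to a 32 bit token.

    Assumption: the cluster number occupies the top 8 bits.
    """
    if not 0 <= token < (1 << TOKEN_BITS):
        raise ValueError("token must fit in 32 bits")
    if not 0 <= dest_cluster < (1 << CLUSTER_BITS):
        raise ValueError("cluster number must fit in 8 bits")
    return (dest_cluster << TOKEN_BITS) | token


def to_stages(token40: int) -> List[int]:
    """Split a 40 bit token into five 8 bit stages (assumed MSB first)."""
    return [(token40 >> shift) & 0xFF for shift in range(32, -1, -STAGE_BITS)]


def from_stages(stages: List[int]) -> int:
    """Reassemble a 40 bit token from its five stages."""
    value = 0
    for stage in stages:
        value = (value << STAGE_BITS) | stage
    return value


def accept(token40: int, my_cluster: int) -> Optional[int]:
    """Return the 32 bit token if addressed to this cluster, else None."""
    if (token40 >> TOKEN_BITS) != my_cluster:
        return None          # not for this cluster: the token stays on the ring
    return token40 & ((1 << TOKEN_BITS) - 1)


token40 = make_intercluster_token(0x12345678, dest_cluster=5)
stages = to_stages(token40)
```

A receiving MAC in this sketch calls `accept(from_stages(stages), my_cluster)` and forwards the token unchanged when the result is `None`.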



Figure 4. Basic TIP-4 Configuration

<u>e)Main Bus</u> The main bus is used for image bus arbitration, access between MAC and the host computer, and as the block transfer address bus. It has 27 bit address lines, 18 bit data lines and other control lines.

<u>f)Image Bus</u> The image bus is a high speed block transfer bus between local and global image memory. It has 144 bit data lines with a 50 ns cycle, so the transfer rate is 320M bytes per second.

<u>g)Memory Architecture</u> To achieve a fast block transfer rate, dual port memory, which has a 4 bit serial port and a 4 bit parallel port, is used for both local and global memory. The parallel port of the local memory is accessed by ImPPs. The parallel port of the global image memory is accessed by the host computer. The serial ports for both local and global memory are connected to the image bus and are used for block transfer.

<u>h)Block Transfer Control</u> Block transfer between local and global memory is also controlled by ImPP under the data flow scheme, with support from MAC. When an ImPP program requires a data transfer, it sends a block transfer request token to MAC. MAC then controls the block transfer under the direction of the bus arbiter. When the block transfer is completed, MAC sends a return token to the ImPP, so the ImPP can use the data immediately.

<u>i)Display Module</u> A display module was developed which makes it possible to display 1024x1024 pixel, 24 bit full color images. It has dual frame buffers, so that it can display a processing result in real time (60 images per second). Synchronization between the process and the display is controlled by a cluster specially designed for this purpose. MAC has a sync input and a double buffer select output to support this function. When a sync signal is fed to the sync input, MAC can send a sync token to ImPP. When a double buffer select token comes from ImPP, MAC switches the buffer.

<u>j)Video Input Module</u> A video input module was developed which can capture 640x480 pixel or 512x512 pixel, 24 bit full color images from NTSC video input at video rate. Synchronization between the process and video input is controlled by a cluster, in the same way as for the display module.
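The quoted 320M bytes per second can be checked against the bus width and cycle time. The text does not say how bytes are counted on an 18 bit word bus, so the sketch below tries two readings: counting every line as data, and counting only 16 data bits in each 18 bit word. The second reading is an assumption; it is shown because it reproduces the stated figure.

```python
BUS_WIDTH_BITS = 144        # image bus data lines (from the text)
CYCLE_NS = 50               # image bus cycle time (from the text)
WORD_BITS = 18              # memory word size (from the text)
DATA_BITS_PER_WORD = 16     # assumption: 16 of the 18 bits count as image data

cycles_per_second = 1_000_000_000 // CYCLE_NS      # 20 million cycles per second
words_per_cycle = BUS_WIDTH_BITS // WORD_BITS      # 8 words move per cycle

# Reading 1: every one of the 144 lines carries data.
raw_mbytes_per_s = BUS_WIDTH_BITS * cycles_per_second // 8 // 1_000_000

# Reading 2: only 16 bits of each 18 bit word are counted.
data_mbytes_per_s = (
    words_per_cycle * DATA_BITS_PER_WORD * cycles_per_second // 8 // 1_000_000
)
```

Reading 1 gives 360M bytes per second and reading 2 gives 320M bytes per second.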
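The request and return token exchange for a block transfer can be modelled as a small event sequence. The sketch below is a toy model under stated assumptions: the class and method names are hypothetical, the arbiter is assumed to grant the image bus in first come, first served order, and the data movement itself is omitted.

```python
from collections import deque


class BusArbiter:
    """Grants the image bus to one cluster at a time (FIFO order assumed)."""

    def __init__(self):
        self._waiting = deque()

    def request(self, cluster_id):
        self._waiting.append(cluster_id)

    def next_grant(self):
        return self._waiting.popleft() if self._waiting else None


class Mac:
    """Toy MAC: queues transfer requests and emits return tokens."""

    def __init__(self, cluster_id, arbiter):
        self.cluster_id = cluster_id
        self.arbiter = arbiter
        self.pending = deque()
        self.return_tokens = []      # tokens sent back to the requesting ImPP

    def on_request_token(self, transfer_id):
        # An ImPP program asked for a block transfer.
        self.pending.append(transfer_id)
        self.arbiter.request(self.cluster_id)

    def on_bus_granted(self):
        # The block would move over the image bus here (not modelled).
        transfer_id = self.pending.popleft()
        self.return_tokens.append(("transfer_done", transfer_id))


def run_transfers(arbiter, macs):
    """Serve all queued requests and return the order of bus grants."""
    order = []
    while True:
        cluster_id = arbiter.next_grant()
        if cluster_id is None:
            break
        macs[cluster_id].on_bus_granted()
        order.append(cluster_id)
    return order


arbiter = BusArbiter()
macs = {0: Mac(0, arbiter), 1: Mac(1, arbiter)}
macs[0].on_request_token("strip-0")
macs[1].on_request_token("strip-1")
grant_order = run_transfers(arbiter, macs)
```

In this model the ImPP program simply waits for the `transfer_done` token, which mirrors the data flow rule that an instruction fires only when its input token arrives.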

The overall block diagram of TIP-4P is shown in Figure 5.

### DATA FLOW PROCESS SCHEDULING

To execute multiple processes efficiently, a data flow process scheduling monitor was developed. It supports series and parallel process execution, cluster allocation to processes, and conditional branch and loop control. One cluster is used for the process scheduling monitor. To describe the process sequence, a catalog language and its compiler were developed. Figure 6 shows an example of a data sequence with series and parallel dependencies between processes. Figure 7 shows a catalog language example, which describes the processes in Fig. 6. The catalog language is compiled into ImPP assembler and executed in the process scheduling monitor cluster. The language also supports 'FOR' and 'IF' statements.

As mentioned above, in the TIP-4 system all processing is controlled by tokens, including block transfer of data and programs, display sync signal synchronization and process scheduling.
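The series and parallel ordering that the monitor enforces can be illustrated with a small dependency-driven scheduler. This is a sketch only: it assumes that a trailing `*pN` entry in the Figure 7 listing names a process that must finish first, it groups ready processes into waves rather than firing them token by token as the real monitor does, and the function name is hypothetical.

```python
def schedule(dependencies):
    """Group processes into waves; every process in a wave may run in parallel.

    dependencies maps a process label to the labels it must wait for.
    """
    remaining = {proc: set(deps) for proc, deps in dependencies.items()}
    finished = set()
    waves = []
    while remaining:
        # A process is ready once everything it waits for has finished.
        ready = sorted(p for p, deps in remaining.items() if deps <= finished)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(ready)
        finished.update(ready)
        for proc in ready:
            del remaining[proc]
    return waves


# Dependencies as read from the Figure 7 listing (the reading is an assumption).
figure7 = {
    "p1": [],
    "p2": [],
    "p3": ["p1", "p2"],
    "p4": ["p3"],
    "p5": ["p3"],
    "p6": ["p3"],
}
waves = schedule(figure7)
```

With this reading, p1 and p2 can run on separate clusters at once, p3 runs alone after both, and p4 to p6 can again be spread over three clusters.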



Figure 5. TIP-4 Prototype System




Figure 6. Series and Parallel Process Dependency

### PERFORMANCE

Before building the full TIP-4 system, a prototype system, TIP-4P, was developed and some performance data were measured.

<u>a)Inter-cluster Pipeline Ring Bus</u> The inter-cluster pipeline ring bus cycle time was 0.9µs. Therefore passing 8 MACs takes 7.2µs and passing 64 MACs takes 57.6µs.

The inter-cluster pipeline ring bus becomes more important as more clusters are used, and this result is considered sufficient.

<u>b)Block Transfer</u> The block transfer of 64K words of data between local memory and global image memory takes $614\mu$s for 1 cluster and $447\mu$s for $2\sim7$ clusters. The reason is that block transfers by more than one cluster hide the block transfer overhead from each other.

<u>c)Implementation Overhead</u> When seven programs, each of which needs six start tokens, are carried out under the control of a host computer through the main bus, the implementation overhead is up to $13010\mu$s. When the same thing is done by an ImPP program in one cluster, through the inter-cluster pipeline ring bus, $151\mu$s are required. This result shows the importance of the data flow process scheduling monitor.

<u>d)Spatial Filter</u> A spatial filter operation with a 3x3 kernel on 256x256 and 512x512 pixel images with 16 bit data takes the times shown in Table 1, together with the speed up ratio. Almost linear speed up was obtained. The image is distributed among the clusters, and the same program is run in each cluster.
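The ring delays above are the measured cycle time multiplied by the number of MACs a token passes. A minimal sketch of that calculation, with an illustrative function name:

```python
RING_CYCLE_NS = 900   # measured ring bus cycle time, 0.9 microseconds


def ring_delay_us(num_macs: int) -> float:
    """Time in microseconds for a token to pass num_macs MACs on the ring."""
    return num_macs * RING_CYCLE_NS / 1000


prototype_delay = ring_delay_us(8)    # all 8 clusters of TIP-4P
full_delay = ring_delay_us(64)        # all 64 clusters of the full TIP-4
```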
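The block transfer and overhead figures are easier to compare as rates and ratios. The sketch below derives them from the measured times; it assumes that 64K words means 65536 words, and the variable names are illustrative.

```python
BLOCK_WORDS = 64 * 1024          # assumption: 64K words = 65536 words

SINGLE_CLUSTER_US = 614          # measured, 1 cluster
MULTI_CLUSTER_US = 447           # measured, 2 to 7 clusters

# Words per microsecond equals millions of words per second.
single_mwords_per_s = BLOCK_WORDS / SINGLE_CLUSTER_US
multi_mwords_per_s = BLOCK_WORDS / MULTI_CLUSTER_US

HOST_OVERHEAD_US = 13010         # seven programs started by the host computer
MONITOR_OVERHEAD_US = 151        # same work started by an ImPP program

overhead_ratio = HOST_OVERHEAD_US / MONITOR_OVERHEAD_US
```

Under that assumption the effective transfer rate rises from roughly 107 to roughly 147 million words per second when transfers from several clusters overlap, and starting programs from a cluster is about 86 times faster than starting them from the host.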
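One way to distribute a 3x3 filter over clusters is to give each cluster a band of image rows and run the same routine on every band. The paper does not state how the image is partitioned or how borders are treated, so the row-band split and the unchanged border pixels below are assumptions, and the function names are hypothetical. For brevity each band reads its neighbouring rows from the shared image, whereas a real cluster would need those rows copied into its local memory.

```python
def filter3x3_rows(image, kernel, row_start, row_end):
    """Apply a 3x3 kernel to rows [row_start, row_end) of image.

    Border pixels are copied unchanged (an assumption).
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(row_start, row_end):
        row = []
        for x in range(width):
            if y in (0, height - 1) or x in (0, width - 1):
                row.append(image[y][x])
                continue
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
            row.append(acc)
        out.append(row)
    return out


def filter_by_clusters(image, kernel, clusters):
    """Split the image into equal row bands, one per cluster, and filter each."""
    height = len(image)
    if height % clusters:
        raise ValueError("image height must be divisible by the cluster count")
    rows_per_cluster = height // clusters
    result = []
    for c in range(clusters):
        start = c * rows_per_cluster
        result.extend(filter3x3_rows(image, kernel, start, start + rows_per_cluster))
    return result


# Small synthetic image and a box-sum kernel, used only to compare the splits.
image = [[(3 * x + 5 * y) % 17 for x in range(8)] for y in range(8)]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
whole = filter_by_clusters(image, kernel, 1)
split = filter_by_clusters(image, kernel, 4)
```

Because each output pixel depends only on its 3x3 neighbourhood, the banded result equals the single-cluster result, which is what allows the same program to run unchanged in every cluster.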

Table 1. 3x3 spatial filter implementation time: (a) 256x256 pixel image, (b) 512x512 pixel image

| Clusters | (a) time (ms) | (a) speed up | (b) time (ms) | (b) speed up |
|----------|---------------|--------------|---------------|--------------|
| 1        | 224.9         | 1.00         | 912.7         | 1.00         |
| 2        | 113.2         | 1.99         | 459.4         | 1.99         |
| 4        | 57.6          | 3.90         | 231.7         | 3.94         |
| 8        | 30.0          | 7.50         | 117.9         | 7.74         |
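The speed up column is the single-cluster time divided by the multi-cluster time, and dividing again by the cluster count gives a parallel efficiency. The sketch below recomputes both from the measured times in Table 1; the efficiency figure is a derived quantity that does not appear in the table, and the function names are illustrative.

```python
# Measured times in milliseconds, copied from Table 1.
TIMES_MS = {
    "256x256": {1: 224.9, 2: 113.2, 4: 57.6, 8: 30.0},
    "512x512": {1: 912.7, 2: 459.4, 4: 231.7, 8: 117.9},
}


def speed_up(times):
    """Speed up relative to one cluster, keyed by cluster count."""
    base = times[1]
    return {n: base / t for n, t in times.items()}


def efficiency(times):
    """Speed up divided by the number of clusters (1.0 is ideal)."""
    return {n: s / n for n, s in speed_up(times).items()}


speed_256 = {n: round(s, 2) for n, s in speed_up(TIMES_MS["256x256"]).items()}
speed_512 = {n: round(s, 2) for n, s in speed_up(TIMES_MS["512x512"]).items()}
eff_256 = efficiency(TIMES_MS["256x256"])
eff_512 = efficiency(TIMES_MS["512x512"])
```

With 8 clusters the efficiency is about 0.94 for the 256x256 image and about 0.97 for the 512x512 image, so the larger image scales slightly better.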

example(value\_1, value\_2, value\_3, value\_4, value\_5)

|     |           |         |         |     |     |
|-----|-----------|---------|---------|-----|-----|
| p1  | process_1 | value_1 | value_2 |     |     |
| p2  | process_2 | value_1 | value_3 |     |     |
| p3  | process_3 | value_2 | value_3 | *p1 | *p2 |
| p4  | process_4 | value_1 | value_4 | *p3 |     |
| p5  | process_5 | value_1 | value_4 | *p3 |     |
| p6  | process_6 | value_1 | value_4 | *p3 |     |
| end |           |         |         |     |     |

Figure 7. Catalog Language Example

### CONCLUSION

An architecture is proposed for a large scale image processing system, TIP-4, with emphasis on its cluster architecture and memory architecture. The cluster architecture enables dynamic process scheduling under the data flow scheme, using the cluster as the processing unit.

The local and global memory configuration and high speed block transfer between them enable efficient memory access and data exchange. A new interface LSI, MAC, is presented, which supports inter-cluster communication and block transfer control. The inter-cluster pipeline ring bus is very important for the TIP-4 system, both for synchronizing processes and for process scheduling control.

A data flow process scheduling monitor was developed which controls scheduling for processes and resources. A catalog language has also been developed to define parallel and series process implementation.

The performance of the prototype system TIP-4P was evaluated. Some basic functions and 3x3 spatial filters on 256x256 and 512x512 images were implemented.

### ACKNOWLEDGEMENTS

The authors would like to thank Kingo Takahashi, Takao Ishiguro, Masao Yamazaki and Yoshinori Kimura for their immense contribution to the work. They are also grateful to the laboratory members for their encouragement and advice.
