I. Introduction
Efficient system architecture design for the discrete wavelet transform (DWT) has received considerable attention recently [2]–[4], [1], [5]–[8], owing to the success of DWT-based techniques in areas as diverse as signal processing, digital communications, numerical analysis, computer vision, and computer graphics [9]. Two important parameters have been used to measure the efficiency of practical DWT system designs: 1) the memory required for the DWT computation (mostly in sequential algorithms) and 2) the communication overhead incurred by parallel DWT algorithms. Memory efficiency is a major design factor for wavelet-based image compression in printers, digital cameras, and space-borne instruments, where large memories lead to high cost and demand more chip area [1], [10], [11]. Similarly, communication efficiency is critical to the success of parallel DWT systems built upon networks of workstations (NOWs) or local area multicomputers (LAMs), since these systems use cheaper but slower communication links than dedicated parallel systems [12], [8], [13], [14].