
An Efficient Graphics Processing Unit Scheme for Complex Geometry Simulations Using the Lattice Boltzmann Method



Abstract:

The lattice Boltzmann method is fully discretized in space, time, and velocity; its inherent parallelism makes it outstanding for accelerated computation by graphics processing units in large-scale simulations of fluid dynamics. When the lattice Boltzmann method is used to simulate a fluid system with complex geometry, the flow field is usually compressed to reduce memory consumption, and fluid nodes are accessed indirectly to improve computational efficiency. We designed a pointer array that is the same size as the flow field, based on the unified memory technology of the Compute Unified Device Architecture platform. The addresses of the fluid nodes are stored in this array, and the other, unallocated nodes are marked as null. To recover the coordinates of the fluid nodes in the original flow field, we store the addresses of the pointer array cells whose values are not null as part of the lattice attributes, at the end of the lattice attribute array, forming a cyclic pointer structure that tracks geometric information. We validated the feasibility of this addressing scheme with a simulation of aqueous humor in the anterior segment of the eye and tested its performance on graphics processing units of the Pascal, Volta, and Turing architectures. The present method carefully distributes data to generate fewer memory transactions and to reduce the number of global memory accesses, achieving approximately 18% performance improvement.
Cyclic pointer addressing scheme for Complex Geometry Simulations Using the Lattice Boltzmann Method.
Published in: IEEE Access ( Volume: 8)
Page(s): 185158 - 185168
Date of Publication: 09 October 2020
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Over the past 30 years, the lattice Boltzmann method, which originated from cellular automata, has developed into an alternative and promising numerical scheme for the simulation of complex fluid flows [1]. In the lattice Boltzmann method, the collision step involves only the current node, and the propagation step involves only the neighboring nodes at the next time step; therefore, the lattice Boltzmann method is particularly suitable for large-scale parallel architectures such as the graphics processing unit (GPU) [2]. Nvidia CUDA (Compute Unified Device Architecture) technology has reduced the difficulty of GPU-oriented programming, and the efficient implementation of various lattice Boltzmann methods on GPUs has been explored.

Because of the power consumption, interconnect delay, and design complexity of traditional single-core processors, integrating ever more transistors can no longer be relied upon to provide stronger performance. This was the essential reason for the birth of multicore processors. Multicore processor designs can be roughly divided into three categories: 1) bus or switch as the basic interconnect architecture; 2) stream processors and graphics processors; 3) network-interconnected processors. Among them, the path represented by stream processors and general-purpose computing on graphics processing units (GPGPU) is a completely new design that bypasses traditional processor design and learns from other dedicated processors, such as digital signal processors (DSP), for new applications. Traditional applications contain many program loops, jumps, and irregular memory accesses, but with the ubiquity of computing technology in people's work and life, another computing mode has become prominent: large-scale parallel data computing. Typical applications include image processing, video processing, and physical model simulations (such as computational fluid dynamics). Data parallelism, the application of the same instruction to parallel data, is one of the basic classifications of parallel computing. Single-instruction multiple-data (SIMD) architectures exploit this kind of parallelism. Nvidia GPUs use a single-instruction multiple-thread (SIMT) architecture, which is an extension of SIMD. Therefore, the usefulness of GPUs for accelerating an application depends on the parallelism of the application itself; only applications with a large amount of regular data parallelism can exploit the great advantages of GPUs. The lattice Boltzmann method is particularly suitable for GPU execution because of its inherent parallelism.

CUDA solves the problem of how to transparently scale parallel-application software to take advantage of the increasing number of processor cores. At its core are three key abstractions: 1) a hierarchy of thread groups; 2) shared memories; 3) barrier synchronization. These abstractions provide fine-grained data parallelism and thread parallelism nested in coarse-grained data parallelism and task parallelism. They guide programmers to divide problems into coarse-grained sub-problems that can be solved with thread blocks, each of which is further divided into fine-grained parts executed by the threads within the thread block. Each thread block can be scheduled to any available multiprocessor in the GPU, in arbitrary order, concurrently or sequentially, so a compiled CUDA program can execute on any number of multiprocessors; only the runtime system needs to know the number of physical multiprocessors. Nvidia's GPU architecture is built around scalable arrays of multithreaded stream multiprocessors (SMs), and the more multiprocessors a GPU has, the more thread blocks it can schedule. Therefore, simply expanding the number of multiprocessors accelerates applications. Optimization of CUDA applications mainly revolves around three basic strategies: 1) maximizing parallelism; 2) optimizing memory access; 3) optimizing instruction usage.
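As a point of reference, the following minimal CUDA sketch illustrates the thread hierarchy and transparent scaling described above; the kernel, its name, and the launch sizes are illustrative and not taken from the paper's code.

// Minimal illustration of the CUDA thread hierarchy and transparent scaling
// (kernel name, data, and launch sizes are illustrative, not the paper's code).
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    // Each thread handles one element; blockIdx/threadIdx locate it within the
    // coarse-grained (block) and fine-grained (thread) decomposition.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // The runtime maps these blocks onto however many SMs the device has,
    // which is why the same binary scales across GPUs.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}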

Simulation applications based on the lattice Boltzmann method are input–output intensive, and the gap between computing and data-access speeds will widen further in the foreseeable future [3]. Therefore, when using CUDA to accelerate lattice Boltzmann method applications, efficient access to memory becomes particularly important, and most optimization algorithms are dedicated to it. For example, a single-step double-grid implementation is recommended [4]: the collision and propagation steps are fused into a single step, and the data are stored on a double grid, namely AB mode. The typical single-grid implementation is AA mode [5], and [6] points out that AB mode is more efficient on GPUs of the newer architectures. Regarding the memory layout, the structure-of-arrays (SoA) layout [7] is recommended over the array-of-structures (AoS) layout, so that the threads in the same warp access a contiguous address space and generate fewer memory transactions. However, a new memory layout with better performance, called the collected structure of arrays, has recently been proposed [8]. It was reported in [9] that misaligned memory reads take less time than misaligned memory writes. The commonly used lattice update method in the lattice Boltzmann method is the push method [10], which reads the distribution function values of the current node and writes them to the neighboring nodes. In [9], using the pull method and shared memory to read the distribution function values of the neighboring nodes and write them to the current node greatly improves computational efficiency.
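The difference between the two layouts can be made concrete with a short sketch; the D3Q19 direction count, the stored node count NF, and the helper names below are illustrative assumptions rather than the paper's implementation.

#define Q 19

// AoS: the 19 directions of one node are contiguous, so threads of a warp that
// read the same direction f of consecutive nodes touch addresses
// Q*sizeof(double) bytes apart and the loads are strided.
struct NodeAoS { double f[Q]; };

__host__ __device__ inline double &aos_at(NodeAoS *nodes, int node, int dir)
{
    return nodes[node].f[dir];
}

// SoA: direction f of all NF stored nodes is contiguous, so consecutive threads
// reading the same direction of consecutive nodes access consecutive addresses
// and the loads coalesce into few memory transactions.
__host__ __device__ inline double &soa_at(double *f, int NF, int node, int dir)
{
    return f[dir * NF + node];
}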

The above optimizations address dense geometry. For complex geometric flow fields, such as those in biomedical and porous-media simulations, misaligned access to GPU global memory takes even longer. In complex geometric flow fields, fluid nodes occupy only a small part of the flow field, and nodes not required by the evolution are intermingled between fluid nodes. Using conventional simulation methods therefore wastes memory and lowers computational efficiency. There are currently two main solutions: 1) indirect addressing [11] and 2) semidirect addressing [12]. Both methods allocate memory only for the fluid nodes. Indirect addressing maintains a list of neighbors for all fluid nodes, usually in the form of a matrix. Semidirect addressing maintains a structure called a fluid index array, in which the indices of the fluid nodes that are actually allocated are stored and the other, unallocated nodes are marked as −1. The auxiliary arrays maintained by both methods are integer arrays, which are not conducive to extension. When fluid nodes and solid nodes in the flow field are transformed into one another, indirect addressing must update the neighbor vectors of the corresponding nodes, whereas semidirect addressing must renumber almost the entire fluid index array. In this paper, we propose a memory addressing scheme with a cyclic pointer structure, based on CUDA unified memory technology, that uses eight-byte pointer variables in place of integer variables. Pointers in two directions, forward and reverse, are used to determine the actual and original coordinates, respectively, of the fluid nodes in the complex geometric flow field. When the flow field changes, only the pointer of the corresponding node needs to be updated. Furthermore, for complex geometric flow fields with multiple node types, simply storing the nodes needed for evolution in sequence causes the threads in a warp to branch and generate more memory transactions when accessing the reverse pointers of each node, reducing program performance. Therefore, we store the reverse pointers by node type to meet the requirement of coalesced access to global memory.

The paper structure is as follows. Section II briefly introduces background knowledge, including the lattice Boltzmann method with double-distribution models and modeling of the anterior segment. Section III contains the principles and implementations of several addressing schemes (including two typical schemes and the methods proposed in this paper). Section IV gives the simulation results and presents comparisons of time and memory space consumption for several algorithms. Section V contains conclusions.

SECTION II.

Background

A. The Lattice Boltzmann Method with Double-Distribution Function

The lattice Boltzmann method was initially applied only to isothermal incompressible flows, whereas real physical phenomena, such as natural convection, often involve temperature changes. Lattice Boltzmann models that include temperature changes are known as nonisothermal lattice Boltzmann models and are generally divided into three categories: 1) multispeed models; 2) double-distribution-function (DDF) models; 3) mixed models combined with difference methods. In this paper, a coupled lattice Bhatnagar-Gross-Krook model based on the Boussinesq approximation, proposed in [13], is used to couple the velocity field and the temperature field through buoyancy.

The evolution equation of the velocity field of the coupled lattice Bhatnagar-Gross-Krook model is
\begin{align*} f_{i}\left({\boldsymbol{x}+\boldsymbol{e}_{i}\Delta t,\,t+\Delta t}\right)-f_{i}(\boldsymbol{x},t)=-\frac{1}{\tau_{f}}\left[{f_{i}(\boldsymbol{x},t)-f_{i}^{eq}(\boldsymbol{x},t)}\right]+\boldsymbol{F}_{i},\tag{1}\end{align*}
where f_{i} is the particle distribution function, \boldsymbol{x} is the discrete space coordinate, t is the discrete time, \boldsymbol{e}_{i} is the discrete velocity, \tau_{f} is the relaxation factor of the velocity field, i indexes the discrete velocity set, and f_{i}^{eq} is the equilibrium distribution function. In the three-dimensional nineteen-velocity (D3Q19) model, the equilibrium distribution function is
\begin{align*} f_{i}^{eq}=\rho\,\omega_{i}\left[{1+\frac{3}{c^{2}}\left({\boldsymbol{e}_{i}\cdot\boldsymbol{u}}\right)+\frac{9}{2c^{4}}\left({\boldsymbol{e}_{i}\cdot\boldsymbol{u}}\right)^{2}-\frac{3}{2c^{2}}\boldsymbol{u}^{2}}\right],\quad i=0,1,\ldots,18,\tag{2}\end{align*}
where \omega_{0}=1/3, \omega_{1,\ldots,6}=1/18, \omega_{7,\ldots,18}=1/36, \rho is the fluid density, \boldsymbol{u} is the fluid velocity, and c is the reference speed, usually taken as 1. The external force term \boldsymbol{F}_{i} [14] is
\begin{equation*} \boldsymbol{F}_{i}=\left({1-\frac{1}{2\tau_{f}}}\right)\omega_{i}\left[{\frac{\boldsymbol{e}_{i}-\boldsymbol{u}}{c_{s}^{2}}+\frac{\left({\boldsymbol{e}_{i}\cdot\boldsymbol{u}}\right)}{c_{s}^{4}}\boldsymbol{e}_{i}}\right]\cdot\left({-\mathrm{g}\beta\left({T-T_{0}}\right)}\right),\tag{3}\end{equation*}
where \mathrm{g} is the acceleration of gravity, \beta is the thermal expansion coefficient, T is the temperature, and T_{0} is the reference temperature.

The evolution equation of the temperature field is
\begin{equation*} g_{i}\left({\boldsymbol{x}+\boldsymbol{e}_{i}\Delta t,\,t+\Delta t}\right)-g_{i}(\boldsymbol{x},t)=-\frac{1}{\tau_{g}}\left[{g_{i}(\boldsymbol{x},t)-g_{i}^{eq}(\boldsymbol{x},t)}\right],\tag{4}\end{equation*}
where g_{i} is the particle distribution function and \tau_{g} is the relaxation factor of the temperature field. In the three-dimensional seven-velocity (D3T7) model, the temperature equilibrium distribution function g_{i}^{eq} is
\begin{equation*} g_{i}^{eq}=\eta_{i}T\left({1+\frac{\boldsymbol{e}_{i}\cdot\boldsymbol{u}}{c_{s}^{2}}}\right),\quad i=0,1,\ldots,6,\tag{5}\end{equation*}
where \eta_{0}=1/4 and \eta_{1,\ldots,6}=1/8. The fluid density, temperature, and velocity are
\begin{align*} \rho&=\sum_{i=0}^{18}f_{i},\tag{6}\\ T&=\sum_{i=0}^{6}g_{i},\tag{7}\\ \rho\boldsymbol{u}&=\sum_{i=0}^{18}\boldsymbol{e}_{i}f_{i}-\frac{\Delta t}{2}\,\mathrm{g}\beta\left({T-T_{0}}\right),\tag{8}\end{align*}
respectively.
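As an illustration of how (6)-(8) translate to device code under an SoA layout, the following sketch computes the moments for one node; the array names, the assumption that buoyancy acts along the z-axis, and the parameter list are ours, not the authors'.

// Sketch of the moment computations (6)-(8) for one node in an SoA layout.
// Array names, NF, and the force parameters are assumptions for illustration.
__device__ void moments_d3q19_d3t7(const double *f,   // velocity field, 19*NF values
                                   const double *g,   // temperature field, 7*NF values
                                   const double *ex, const double *ey, const double *ez,
                                   int NF, int node,
                                   double g_beta,     // g*beta, gravity assumed along z
                                   double T0, double dt,
                                   double *rho, double *T, double *u /* u[3] */)
{
    double r = 0.0, t = 0.0, ux = 0.0, uy = 0.0, uz = 0.0;
    for (int i = 0; i < 19; ++i) {
        double fi = f[i * NF + node];
        r  += fi;                       // Eq. (6): density
        ux += ex[i] * fi;
        uy += ey[i] * fi;
        uz += ez[i] * fi;
    }
    for (int i = 0; i < 7; ++i)
        t += g[i * NF + node];          // Eq. (7): temperature

    // Eq. (8): momentum with half-time-step Boussinesq force correction,
    // buoyancy acting along z only in this sketch.
    uz -= 0.5 * dt * g_beta * (t - T0);

    *rho = r; *T = t;
    u[0] = ux / r; u[1] = uy / r; u[2] = uz / r;
}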

Taking (1) as an example, the standard lattice Boltzmann method solution is divided into two steps:

  1. Collision step,
\begin{equation*} \tilde{f}_{i}(\boldsymbol{x},t)-f_{i}(\boldsymbol{x},t)=-\frac{1}{\tau_{f}}\left[{f_{i}(\boldsymbol{x},t)-f_{i}^{eq}(\boldsymbol{x},t)}\right];\tag{9}\end{equation*}

  2. Propagation step,
\begin{equation*} f_{i}\left({\boldsymbol{x}+\boldsymbol{e}_{i}\Delta t,\,t+\Delta t}\right)=\tilde{f}_{i}(\boldsymbol{x},t);\tag{10}\end{equation*}

where f_{i} and \tilde{f}_{i} are the particle distribution functions before and after collision, respectively.
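A hedged sketch of the fused collision (9) and push propagation (10) for one node in AB (double-grid) mode is given below; the equilibrium values, forcing term, and neighbor lookup are assumed to be supplied by the caller, and all identifiers are illustrative.

// Sketch of the fused collision (9) and push propagation (10) for one fluid
// node in AB (double-grid) mode; feq, Fi, and the neighbour lookup are assumed
// to be provided elsewhere, and all names here are illustrative.
__device__ void collide_and_push(const double *f_src, double *f_dst,
                                 const double *feq,        // feq[i] for this node
                                 const double *Fi,         // forcing term, Eq. (3)
                                 const int *neighbor,      // neighbor[i*NF + node]
                                 double tau_f, int NF, int node)
{
    for (int i = 0; i < 19; ++i) {
        double fi = f_src[i * NF + node];
        // Eq. (9): BGK relaxation towards equilibrium plus the forcing term.
        double fpost = fi - (fi - feq[i]) / tau_f + Fi[i];
        // Eq. (10): write the post-collision value to the neighbour in
        // direction i of the other grid (AB pattern).
        f_dst[i * NF + neighbor[i * NF + node]] = fpost;
    }
}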

B. Three-Dimensional Modeling of Flow Fields of the Anterior Segment

Aqueous humor dynamics mainly studies the flow of aqueous humor in the human eye. Researchers have used numerous methods to study aqueous humor dynamics [15]–[23], mainly clinical medical experiments and traditional numerical simulation methods such as finite element and finite difference analyses. In this paper, the lattice Boltzmann method is used to simulate the flow of aqueous humor. As the basis for studying aqueous humor dynamics, it is generally accepted at present that the temperature difference across the anterior chamber drives the flow of aqueous humor. Therefore, a lattice Boltzmann model with double distribution functions coupling the velocity and temperature fields was used to study the flow of aqueous humor.

The human eye has a complex structure; the structural section of the anterior segment of the eye is shown in Fig. 1(a). It is mainly composed of the cornea, iris, lens, pupil, suspensory ligaments, ciliary body, and trabecular meshwork. There are many types of aqueous humor flow; the model in this paper simulates natural convection in the anterior chamber caused by the temperature difference between the cornea and the iris. We established a geometric model of the anterior segment structure in Fig. 1(a) and used an idealized treatment for each tissue (e.g., cornea, iris, lens), that is, regular geometric modeling. Fig. 1(b) shows the three-dimensional geometric model of the anterior segment, with the coronal plane section at x = X/2 (where x is the x-axis coordinate and X is the length of the flow field). Fig. 1(c) shows the section corresponding to Fig. 1(a). The size of each tissue in the anterior segment is given in Table 1. The lattice ratio was set to 30 (i.e., 1 mm corresponds to 30 lattices), and the dimensions of the entire flow field were 384 \times 384 \times 120.

TABLE 1. Geometric Dimensions of the Anterior Segment
FIGURE 1. Modeling of the anterior segment of the eye.

SECTION III.

Implementation

Simulations based on the lattice Boltzmann method usually adopt a rectangular grid containing all nodes in the flow field, and each node uses a direct addressing scheme to find its neighboring nodes. In the propagation step of the lattice Boltzmann method, the current node determines the coordinates of its neighbors from the discrete velocity set, such as D3Q19, without other auxiliary information; this is referred to as direct addressing. If the direct addressing scheme is used to simulate the natural convection of the aqueous humor in the anterior chamber of the three-dimensional human eye, then approximately 60% of the memory is not used, and in the GPU-based simulation approximately 60% of the threads in each block perform no calculations. Compressing the flow field of the anterior segment for storage improves the data locality of memory accesses and increases computational efficiency, but the direct addressing scheme is then no longer applicable. Here, we first introduce the two main existing solutions.
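The following sketch restates direct addressing in code: the neighbor of a node is found purely from the discrete velocity set and the flow-field dimensions, with no auxiliary array. The identifier names are illustrative.

// Sketch of direct addressing: the neighbour of node (x, y, z) in discrete
// direction i follows from the D3Q19 velocity set alone. DX, DY, DZ are the
// flow-field dimensions (names are illustrative).
__host__ __device__ inline long direct_neighbor(int x, int y, int z, int i,
                                                const int *ex, const int *ey,
                                                const int *ez,
                                                int DX, int DY, int DZ)
{
    int xn = x + ex[i];
    int yn = y + ey[i];
    int zn = z + ez[i];
    // Row-major linear index of the neighbour in the full DX*DY*DZ grid;
    // bounds and periodicity handling are omitted in this sketch.
    return ((long)xn * DY + yn) * DZ + zn;
}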

A. Typical Solutions

Direct addressing tracks local geometry by enumerating the discrete velocity set (the D3Q19 model). After the complex flow field is compressed, the neighbor relationship of each node is broken, and this method is no longer applicable. One typical solution is to store each node's neighbors in a vector, with the neighbor vectors of all nodes forming a matrix [25]. In simulations based on indirect addressing, the coordinates of the current node's neighbors are obtained by accessing the neighbor matrix during the propagation of each time step. The neighbor matrix can have the same memory layout (AoS, SoA, etc.) as the distribution function. Although memory space is required to store the neighbor matrix, this consumption is negligible compared with the space saved by compressing the entire flow field. A schematic diagram of the indirect addressing scheme is shown in Fig. 2, where NF is the number of stored nodes and −1 flags solid nodes. The simulation of natural convection of the aqueous humor in the anterior chamber of the three-dimensional human eye using the indirect addressing scheme is shown in Algorithm 1. Another solution, called semidirect addressing, uses an auxiliary array, called a fluid index array, that is the same size as the flow field to record the actual storage location of each node during compressed storage, as shown in Fig. 3. The semidirect addressing scheme essentially inherits the addressing method of direct addressing, so when using CUDA acceleration we still need to launch threads at the scale of the full matrix, which differs from the indirect addressing scheme. The implementation is shown in Algorithm 2.

Algorithm 1 Implementation of Indirect Addressing Scheme
1: p = current thread id
2: if node p is fluid then
3:   for each f ∈ [0, 18] do
4:     Calculate distribution function coordinate pf from f and p.
5:     Neighbor node pp = Neighbor[pf]
6:     if node pp is fluid then
7:       Execute collide and stream, single-step dual grid.
8:     else if node pp is boundary then
9:       Apply half-way bounce-back boundary condition.
10:    else if node pp is inlet or outlet then
11:      Apply non-equilibrium extrapolation boundary condition.
12:    end if
13:  end for
14: end if

Algorithm 2 Implementation of Semidirect Addressing Scheme
1: tid = current thread id
2: n = Index[tid]
3: if node n is fluid then
4:   for each f ∈ [0, 18] do
5:     pp = (tid/DY/DZ+Ex[f])*DY*DZ+((tid/DZ)%DY+Ey[f])*DZ+(tid%DZ)+Ez[f]
6:     nn = Index[pp]
7:     if node nn is fluid then
8:       Execute collide and stream, single-step dual grid.
9:     else if node nn is boundary then
10:      Apply half-way bounce-back boundary condition.
11:    else if node nn is inlet or outlet then
12:      Apply non-equilibrium extrapolation boundary condition.
13:    end if
14:  end for
15: end if
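The index expression on line 5 of Algorithm 2 (and the analogous offset_stream computation in Algorithm 3 below) decodes a linear offset of the full grid into (x, y, z), shifts it by the discrete velocity, and re-encodes the result. A hedged C-style restatement, with names following the pseudocode rather than the authors' exact code, is:

// Restatement of the neighbour-offset expression used in Algorithms 2 and 3:
// decode the linear offset p of the full DX*DY*DZ grid into (x, y, z), shift by
// the discrete velocity (Ex, Ey, Ez)[f], and re-encode.
__host__ __device__ inline long neighbor_offset(long p, int f,
                                                const int *Ex, const int *Ey,
                                                const int *Ez,
                                                int DY, int DZ)
{
    long x = p / DY / DZ;        // p = (x*DY + y)*DZ + z
    long y = (p / DZ) % DY;
    long z = p % DZ;
    return (x + Ex[f]) * DY * DZ + (y + Ey[f]) * DZ + (z + Ez[f]);
}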

FIGURE 2. Two-dimensional diagram of sparse lattice representation and indirect addressing [11].
FIGURE 3. Semidirect addressing and fluid index array [12].

B. Cyclic Pointer Addressing Scheme

The auxiliary arrays maintained by the indirect and semidirect addressing schemes are integer arrays, and there is a strong coupling between a node and its neighbors. Consider, for example, a simple particle migration program. If the indirect addressing scheme is used, the neighbor matrix must be updated as the particles move through the flow field, and the number of updates is approximately the number of lattices occupied by the particles. If the semidirect addressing scheme is used, this means renumbering the entire fluid index array. Second, because only the neighbors of fluid nodes are stored, it is difficult under indirect addressing to restore the original coordinates of the nodes in the complex geometric flow field without additional auxiliary arrays. For simple boundary conditions that do not involve multiple nodes, an efficient implementation of the indirect addressing scheme has been given in [11].

Unified memory is a component of the CUDA programming model that defines a managed memory space in which all processors see a single coherent memory image with a common address space. Based on CUDA unified memory technology, we store the address of each node of the compressed complex geometric flow field in a pointer array and mark the nodes with unallocated memory as null. When the neighbor information of a node changes, only the pointer at the corresponding position in the pointer array needs to be updated. To avoid launching threads at the scale of the full matrix, we also store the address of each occupied cell of the pointer array as a lattice attribute, forming a cyclic pointer structure. This reverse pointer avoids searching; we verified that a search-based scheme greatly reduces computational efficiency, so it was abandoned. In the simulation of natural convection of the aqueous humor in the anterior chamber of the three-dimensional human eye, in addition to the fluid nodes, the trabecular meshwork inlet nodes and the posterior chamber outlet nodes also need to be stored to adjust the temperature difference. In future work, we may store other boundary nodes to study the flow of aqueous humor. Therefore, in addition to the reverse pointers of the fluid nodes, those of some boundary nodes must also be considered. We carefully arranged the various types of reverse pointers at the end of the lattice attribute array with an SoA layout, as shown in Fig. 4.
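A minimal sketch of how such a structure could be built with CUDA unified memory follows; the attribute array den, the flag array, and all names are illustrative assumptions rather than the authors' code.

// Sketch of the cyclic pointer structure under CUDA unified memory. The
// attribute array "den" (density, used here as the anchor of the compressed
// lattice storage) and all names are illustrative assumptions.
#include <cuda_runtime.h>

double **poiArr;   // one pointer per cell of the full DX*DY*DZ flow field
double  *den;      // compressed lattice attribute array, NF stored nodes
double ***back;    // reverse pointers, one per stored node (SoA, by node type)

void build_cyclic_pointers(const char *flag, long DX, long DY, long DZ, long NF)
{
    long full = DX * DY * DZ;
    cudaMallocManaged(&poiArr, full * sizeof(double *));
    cudaMallocManaged(&den,    NF   * sizeof(double));
    cudaMallocManaged(&back,   NF   * sizeof(double **));

    long n = 0;
    for (long p = 0; p < full; ++p) {
        if (flag[p]) {                 // node needed by the evolution
            poiArr[p] = &den[n];       // forward pointer: full grid -> storage
            back[n]   = &poiArr[p];    // reverse pointer: storage -> full grid
            ++n;
        } else {
            poiArr[p] = nullptr;       // unallocated node
        }
    }
}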

FIGURE 4. Schematic diagram of the cyclic pointer addressing scheme.

This cyclic pointer addressing scheme was used to simulate natural convection of the aqueous humor in the anterior chamber of the three-dimensional human eye; its implementation is shown in Algorithm 3. First, we use the reverse pointer and the current thread id to obtain the coordinates of the node in the original complex geometric flow field. Second, we obtain the neighbor of the current node by enumerating the discrete velocity set, and finally we obtain the actual storage location of the neighbor according to the forward pointer. Because both the forward pointers and the reverse pointers are stored in linear structures, the original coordinate offset of a node in the complex geometric flow field and its actual storage index can be obtained by comparing a pointer with the corresponding starting address. The kernel function for the lattice attribute calculation is completely local and does not require information from neighboring nodes, so it is not listed in the algorithm.
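Under the linear layouts sketched above, getOffset and getIndex in Algorithm 3 can be read as plain pointer differences; the following hedged sketch shows one possible form.

// Possible reading of getOffset/getIndex in Algorithm 3, assuming the linear
// layouts sketched above: both reduce to pointer differences from a base.
__host__ __device__ inline long getOffset(double **poiArr, double **cell)
{
    return cell - poiArr;   // original linear coordinate in the full grid
}

__host__ __device__ inline long getIndex(double *den, double *node)
{
    return node - den;      // actual storage index in the compressed array
}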

Algorithm 3 Cyclic Pointer Addressing Scheme. The variable poiArr is an array of pointers of the same size as the simulated flow field; the member variable back is a pointer to a specific location in the poiArr array.
1: kernel 1 Fluid_collide
2: tid = current thread id
3: offset = getOffset(poiArr, maco.back[tid])
4: for each f ∈ [0, 18] do
5:   indexf = tid + f * NF
6:   offset_stream = (offset/DY/DZ+Ex[f])*DY*DZ+((offset/DZ)%DY+Ey[f])*DZ+(offset%DZ)+Ez[f]
7:   index_stream = getIndex(maco.den, poiArr[offset_stream])
8:   if node index_stream is fluid then
9:     index_streamf = index_stream + f * NF
10:    Execute collide and stream, single-step dual grid.
11:  else if node index_stream is boundary then
12:    Apply half-way bounce-back boundary condition.
13:  end if
14: end for
15: kernel 2 Inoutlet_collide
16: tid = current thread id
17: m = maco.back[NF + tid]
18: offset = getOffset(poiArr, m)
19: for each f ∈ [0, 18] do
20:   Get offset_stream after streaming, as on line 6.
21:   index_stream = getIndex(maco.den, poiArr[offset_stream])
22:   if node index_stream is fluid then
23:     Calculate distribution function coordinate from index_stream and f.
24:     Apply non-equilibrium extrapolation boundary condition.
25:   end if
26: end for

SECTION IV.

Results

The GPUs used in this article are listed in Table 2, and the physical parameters of the aqueous humor used in the simulation are given in Table 3. Initially, the anterior chamber was filled with aqueous humor, the corneal temperature was T_{c}=34^{\circ}\mathrm{C}, the temperature of the remaining tissue was T_{h}=37^{\circ}\mathrm{C}, and the aqueous humor temperature was T_{0}=(T_{h}+T_{c})/2=35.5^{\circ}\mathrm{C}. The velocity field used the half-way bounce-back boundary condition, and the temperature field used nonequilibrium extrapolation.

TABLE 2. GPU Hardware Specifications
TABLE 3. Physical Property of Aqueous Humor (AH)

The results of the experiment to simulate natural convection of the aqueous humor in the anterior chamber of the three-dimensional human eye using the cyclic pointer addressing scheme are shown in Fig. 5. Fig. 5(a) shows the results for the standing position, Fig. 5(b) shows the results for the supine position, Fig. 5(a1) and Fig. 5(b1) show the temperature distribution profile diagrams, and Fig. 5(a2) and Fig. 5(b2) are the aqueous humor velocity streamline diagrams. It can be seen that temperature is asymmetrically distributed when in the standing position and symmetrically distributed when in the supine position. For the velocity distribution of aqueous humor, a vortex is formed in the anterior chamber when standing, and two symmetrical opposite vortices are formed when supine. Two large vortices in opposite directions are formed in the posterior chamber when standing, and multiple vortices resembling petals are formed when supine.

FIGURE 5. Temperature and velocity distribution during natural convection of aqueous humor: (a) standing and (b) supine.

For comparison with the literature, the temperature profile and velocity vector distribution of the aqueous humor in the coronal plane when standing are shown in Fig. 6: Fig. 6(a) is the result of the simulation in this paper, and Fig. 6(b) and Fig. 6(c) are the results from [20] and [16], respectively. The temperature profile and velocity vector distribution trends in this paper are consistent with those reported in the literature. In addition, a comparison of the maximum aqueous humor flow speed when standing is given in Table 4. Because of differences in the models and in the parameter selection, the maximum flow speeds differ, but they remain within a reasonable range.

TABLE 4. Comparison of Maximum Speed of Aqueous Humor When Standing
FIGURE 6. Temperature and velocity vectors of aqueous humor in the coronal plane when standing, compared with the literature: (a) this paper, (b) Heys [20], (c) Zhao [16].

Our experiments use a one-dimensional thread grid and one-dimensional thread blocks because this configuration gave the best performance. The number of threads in each thread block affects the occupancy of the stream multiprocessors and, for a computationally intensive program, the performance of the program. In simulations based on the lattice Boltzmann method, host-device data transfer does not happen frequently and is negligible, and the significant difference between the three addressing schemes is the way neighbor nodes are found; therefore, to compare their performance accurately, we do not count the time consumed by data transfer. CUDA-accelerated lattice Boltzmann simulations are usually most concerned with the memory layout of the distribution function, but the lattice attributes cannot be ignored. In our experiment, the lattice attributes to be stored include density, temperature, and velocity. Since the velocity vector has three components, storing these lattice attributes in an AoS layout causes the threads in a warp to generate thirty-two-byte strides, which is not conducive to coalesced access. Therefore, we also adopted the SoA memory layout for the lattice attributes. Efficient use of registers, including replacing register variables with on-the-fly calculation and keeping frequently used global memory data in registers, can also speed up processing. For convenience, this careful program design is called the optimized scheme, and million lattice updates per second (MLUPS) is chosen as the performance indicator. The comparison of speeds before and after optimization of the indirect addressing scheme is shown in Fig. 7, and the corresponding comparison for the semidirect addressing scheme is shown in Fig. 8. Approximately 8% performance acceleration can be achieved through these minor changes.
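For clarity, the MLUPS metric used here can be expressed as the following small sketch; the variable names are ours.

// MLUPS (million lattice updates per second): number of lattice nodes updated,
// times the number of time steps, divided by the elapsed time in seconds and 1e6.
double mlups(long lattice_nodes, long time_steps, double elapsed_seconds)
{
    return (double)lattice_nodes * (double)time_steps / (elapsed_seconds * 1.0e6);
}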

FIGURE 7. Comparison before and after optimization of the indirect addressing scheme.
FIGURE 8. Comparison before and after optimization of the semidirect addressing scheme.

The speed comparison between the cyclic pointer addressing scheme proposed in this paper and the two typical schemes is shown in Table 5. As mentioned in Section I, lattice Boltzmann simulation is an input–output intensive application, so both the memory layout and the addressing scheme are dedicated to improving the efficiency of memory access. The cyclic pointer addressing scheme carefully arranges the data layout and stores nodes of the same type together, which improves the spatial locality of data access and better meets the requirement of coalesced access to global memory; it therefore shows a significant speed advantage. The two most important performance indicators of Nvidia GPUs are memory throughput and SM throughput, so the main means of GPU acceleration is to increase both. Since the performance bottleneck of the lattice Boltzmann method is memory throughput, improving SM throughput does not bring significant performance improvement. The SM throughput comparison of the three addressing schemes accelerated on the Tesla V100 GPU is shown in Fig. 9, and the memory throughput comparison is shown in Fig. 10. The memory space consumption of the three addressing schemes is shown in Table 6: the neighbor matrix maintained by the indirect addressing scheme requires NF*DQ*sizeof(int) bytes, the fluid index array maintained by semidirect addressing requires DX*DY*DZ*sizeof(int) bytes (where DX, DY, and DZ are the dimensions of the flow field), and the cyclic pointer addressing scheme requires DX*DY*DZ*sizeof(void *)+NF*sizeof(void *) bytes.
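These auxiliary-memory formulas can be restated as a short sketch; it is only a restatement of the expressions above, with the counts left symbolic.

// Auxiliary memory of the three addressing schemes (restated from the text).
#include <cstddef>

size_t mem_indirect(size_t NF, size_t DQ)               // neighbour matrix
{
    return NF * DQ * sizeof(int);
}

size_t mem_semidirect(size_t DX, size_t DY, size_t DZ)  // fluid index array
{
    return DX * DY * DZ * sizeof(int);
}

size_t mem_cyclic(size_t DX, size_t DY, size_t DZ, size_t NF)  // pointer arrays
{
    return DX * DY * DZ * sizeof(void *) + NF * sizeof(void *);
}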

TABLE 5. Comparison of Time Consumption (in Million Lattice Updates Per Second) of the Three Addressing Schemes
TABLE 6. Comparison of Memory Consumption of the Three Addressing Schemes
FIGURE 9. Comparison of stream multiprocessor throughput of the three addressing schemes.
FIGURE 10. Comparison of memory throughput of the three addressing schemes.

SECTION V.

Conclusion

This paper proposes an efficient GPU scheme for complex geometric flow field simulation. Based on CUDA unified memory, we eliminated the coupling between lattice nodes in the auxiliary array. To avoid searching, we use reverse pointers that point to the cells of the pointer array, forming a cyclic pointer structure, which improves computational efficiency. For lattice Boltzmann simulations with multiple lattice types, such as the simulation of natural convection of the aqueous humor in the three-dimensional human eye, we carefully arranged the data distribution of each type of lattice to better meet the requirement of coalesced access to global memory. In addition, following the general principles of using CUDA to accelerate the lattice Boltzmann method, such as making reasonable use of registers, minimizing access to global memory, and using an SoA layout for the lattice attributes, we improved the simulation program; as verified on the two typical addressing schemes, such small changes can improve performance considerably.

Considering the well-known no-free-lunch theorem, the performance ranking of the three addressing schemes is not the same for all problems. As the algorithms of the three addressing schemes in Section III show, the significant difference is the number of accesses to the auxiliary array. For simple geometric flow fields with only fluid and solid nodes, or those whose boundary conditions do not involve access to multiple nodes, the indirect addressing algorithm may be more suitable. The cyclic pointer addressing scheme proposed in this paper is suitable for complex geometric flow fields that store multiple types of nodes or have complex boundary conditions.

The main contribution of this paper is a new addressing scheme, but because of the nature of its data structure it also provides a new approach to simulating dynamic complex geometric flow fields. Rapid eye movement is a physiological phenomenon that occurs during active sleep, and studying its effect on aqueous humor flow is of great significance. This is one of the future directions planned by our research group.
