Performance Analysis for Heterogeneous & Reconfigurable Computing Based on Scheduling

Yiming Tan, Guosun Zeng, Shuixia Hao

Department of Computer Science and Technology, Tongji University, Shanghai, China

Abstract

Right now, heterogeneous & reconfigurable computing is a research hot in the area of high performance computing. Due to the heterogeneity of application tasks and reconfigurability of system architecture, performance analysis for heterogeneous & reconfigurable computing becomes rather difficult. Unfortunately, the existing techniques and methods are no longer suitable for use. This paper presents a performance analysis method based on task scheduling. It builds on system architecture model and task model of heterogeneous & reconfigurable computing. By making use of heterogeneity matching matrix and reconfigurability coupling matrix we achieve optimal selection and matching between computational tasks and processing units. Through task scheduling algorithm, the completion time of application task run on heterogeneous & reconfigurable computing system can be calculated. Finally, we carry out case study.

Index Terms: component; heterogeneous & reconfigurable computing, performance analysis, heterogeneity matching matrix, reconfigurability coupling matrix

© 2011 Published by MECS Publisher. Selection and/or peer review under responsibility of the International Conference on E-Business System and Education Technology

1. Introduction

Driven by demand of scientific research and engineering computing, the exploring of high performance computing has never stopped. In the 35th Top500 supercomputer list which was announced on May 31, 2010, the first is "Jaguar" which was developed by Cray Inc. This supercomputer posted a 1.75 petaflop/s performance speed running the Linpack benchmark and the theoretical peak capability reached 2.3 petaflop/s. Homogeneous computing, which was the main computing paradigm of high-performance computing traditionally, had already turned into heterogeneous computing (HC) \[1\]. For example, the supercomputer “Roadrunner”\[2\] was the representative of heterogeneous computing system. Obviously, HC has become one of the trends of high performance computing. On the other hand, reconfigurable computing (RC)\[3\] has been introduced into high performance computing, because it can offers users with high performance and flexibility for a wide range of computationally demanding problems. Thus, heterogeneous & reconfigurable computing...
Performance Analysis for Heterogeneous & Reconfigurable Computing Based on Scheduling

(HRC)\(^4\), which employs various hardware accelerators and reconfigurable units, will be the new trend of high performance computing. HRC has all the advantages of HC and RC, such as, efficiency, flexibility, cost-effective, low power and stability etc. As a new computing paradigm, performance analysis for heterogeneous & reconfigurable Computing becomes very difficult, because of the heterogeneity of tasks and reconfigurability of architecture. However, performance analysis is the key step of selecting proper type and quantity processing unit, reconfiguring a appropriate architecture and predicting the performance bottleneck. Thus, this paper presents a performance analysis method for HRC.

2. The basic concept of Heterogeneous & Reconfigurable Computing

Heterogeneous & reconfigurable computing (HRC) is developed on the basis of heterogeneous computing (HC) and reconfigurable computing (RC), that is, HRC is a combination of HC and RC. The definitions of HRC, HC and RC are the following.

Definition 1: Heterogeneous Computing (HC) is a computing environment, which employs various performance and functionality processing units (such as PC, vector, SIMD, MIMD, FPGA, GPU, DSP and CELL etc.) and these processing units are connected by high-speed interconnection. This computing environment makes good use of the heterogeneity of tasks and processing units to execute an application collaboratively. The goal is to minimize the execution time. Figuratively speaking, HC seems like a symphony orchestra uses various musical instruments to play a wonderful music.

Definition 2: Reconfigurable Computing (RC) is a variable computing platform, which can rewrite the reconfigurable units at a proper timing according to the characteristics of application tasks under the control of software. The RC system can reach or approach the performance of specific processing unit. The reuse of processing unit makes RC system high performance, flexibility, stability and low power.

Definition 3: Heterogeneous & Reconfigurable Computing (HRC) is a hybrid computing paradigm, which combines all these characteristics of HC and RC. HRC makes use of these advantages of HC and RC to achieve high performance computing. The main idea of HRC is to select the CPU, FPGA, GPU, Cell, DSP and other types of processing unit flexibly according to the characteristic of tasks and rewrite the reconfigurable units to create a new computing platform. The new computing platform can be created through rewriting these reconfigurable logics and rebuilding interconnection.

2.1 Heterogeneous & Reconfigurable Computing System Model

Heterogeneous & reconfigurable computing systems (HRCS) is a new high performance computing system, which may contain general-purpose processor (CPU), application-specific integrated circuit (ASIC) and reconfigurable unit (FPGA). In order to carry out performance analysis, it needs modeling HRCS system.

Definition 4: Heterogeneous & Reconfigurable Computing Systems (HRCS) can be defined as a two-tuples HRCS = (VP, EP). VP = {p\(_1\), p\(_2\), … , p\(_M\)} represents a set of processing units, p\(_i\) represents the i-th computing units. EP = {e\(_1\), e\(_2\), … , e\(_L\)} represents a set of interconnection structure.

2.2 Heterogeneous & Reconfigurable DAG Model

In order to carry out performance analysis, besides the HRCS system model, it also need establish task model. Here, we use heterogeneous & reconfigurable DAG (HR-DAG) to depict the application.

Definition 5: Heterogeneous & reconfigurable DAG (HR-DAG) can be defined as a six-tuples HR-DAG = (VT, ET, W, D, H, R). VT represents a set of tasks. ET represents a set of edges and it also depicts a partially ordered set, that is, ET \(\in\) VT \(\times\) VT. W represents a set of computation workload of task. D represents a set of communication workload between two tasks. H represents a set of heterogeneous characteristics of task. R represents a set of reconfigurable characteristics between two tasks.
The heterogeneous characteristic $h_i \in H$ is a description of the task type, computing needs or execution model of a task. For example, according to the parallelism, various tasks can be divided into SIMD, MIMD, vector, superscalar, pipeline, dataflow and so on [1].

The reconfigurable characteristic $r_i \in R$ is a description of the requirement about communication mode (such as point to point, point to many, many to many) or topology (such as linear Array, Mesh-Connected, Tree-Connected and Cross-Bar) between tasks. These reconfigurable characteristics are intelligent decision-making of reconfigurable logics.

![HR-DAG Diagram](image)

**Fig 1. HR-DAG**

Every program can be described by HR-DAG. Fig. 1 is a HR-DAG, these parameters on nodes are task number/computation workload/heterogeneous characteristic and these parameters on edges are partially ordered/communication workload/reconfigurable characteristic.

### 3. Heterogeneous & Reconfigurable Computing Performance Measurement Matrix

#### 3.1 Heterogeneity Matching Matrix

In the heterogeneous & reconfigurable computing (HRC), whether it can achieve high performance, not only depend on execution speed of processing unit and the quality of parallel algorithms, but also depend on the matching degree between task and processing unit. The latter is particularly important sometimes. For example, control codes can not perform effectively on FPGA and vector tasks may perform poor on non-vector machines.

In order to describe the optimal matching between tasks and processing units, we propose the parameter about matching degree [6].

**Definition 6:** Heterogeneity Matching Matrix (Ma) depicts the matching degree between various tasks and different processing units. Here, $Ma = (v_{ix})N \times M$, $v_{ix}$ represents the execution speed of task $t_i$ running on processing unit $p_x$, $1 \leq i \leq N$, $1 \leq x \leq M$. 
Fig 2. Heterogeneity Matching Matrix (Ma)

Fig. 2 is heterogeneity matching matrix (Ma). By means of physical measurements, the execution speed $v_{ix}$ is determined.

According to heterogeneity matching matrix (Ma), it can achieve optimal matching between tasks and processing units. For example, when task $t_i$ need to be mapped or scheduled, the processing unit $p_x$ which has the maximum execution speed $v_{ix}$ will be the target component. Next, the actual execution time of $t_i$ can be calculated according to the (1).

$$T_{comp}(t_i) = \frac{w_i}{v_{ix}}.$$  \hspace{1cm} (1)

Here, Let $w_i$ denote the computation workload of task $t_i$, $1 \leq i \leq N$, $1 \leq x \leq M$. The optimal matching between tasks and processing units reflects the efficiency of heterogeneous & reconfigurable computing (HRC).

3.2 Reconfigurability Coupling Matrix

Interconnection is another important factor which affects the performance of heterogeneous & reconfigurable computing (HRC) performance. Different interconnections have different communication capability and execution mode, which can be denoted by coupling degree. Generally speaking, the greater coupling degree the smaller communication overhead [3].

In heterogeneous & reconfigurable computing system (HRCS), the interconnection may be reconfigured according to the task characteristic. It can make the interconnect resources used effectively and reduce the communication overhead.

After interconnection reconfiguration, communication ability will change. In order to describe the impact of interconnection on the communication capability, we propose the parameter about coupling degree.

Definition 6: Reconfigurability Coupling Matrix $C_o$ depicts the coupling degree between communication characteristic and interconnection. Here, $C_o = (c_{i,j})_{K \times L}$, $1 \leq i \leq K$, $1 \leq j \leq L$. $c_{ij}$ denotes the coupling degree between communication characteristic and interconnection, $0 \leq c_{ij} \leq 1$.

$$C_o = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1L} \\ c_{21} & c_{22} & \cdots & c_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ c_{K1} & c_{K2} & \cdots & c_{KL} \end{bmatrix}$$

Fig 3. Reconfigurability Coupling Matrix $C_o$
Fig. 3 is reconfigurability coupling matrix Co. The coupling degree \( c_{ij} \) can be determined by measuring the performance of task execution on different interconnection.

Let \( d_{ab} \) denote the communication workload between task \( t_a \) and \( t_b \). \( B \) the bandwidth, the actual communication time between task \( t_a \) and \( t_b \) can be calculated according to the (2).

\[
T_{\text{comm}}(t_a, t_b) = \frac{d_{ab}}{Bc_{i,j}}.
\]  

(2)

Here, \( 1 \leq a, b \leq N \). The reconfiguration of interconnection reflects the flexibility and adaptability of heterogeneous & reconfigurable computing (HRC).

4. Performance Analysis Method Based on Scheduling

Completion time is the key metric to evaluate the performance of heterogeneous & reconfigurable computing system (HRCS). However, because of the heterogeneity of tasks and reconfigurability of architecture, it is difficult to create a mathematical formula about completion time. So, we try to calculate the completion time by a scheduling algorithm.

4.1 Parameters Assumption

Performance analysis for heterogeneous & reconfigurable tasks running on the HRCS is the most common and difficult condition. The main idea of this performance analysis method is to match the application tasks and processing units optimally based on heterogeneity matching matrix \( M_a \) and calculate the execution time of task according to (1). Next, the interconnection will be reconfigured according to reconfigurability coupling matrix \( C_o \) and the communication time between tasks can be calculated according to (2). Last, the total completion time can be calculated. Table 1 lists the parameters and definitions used in the scheduling algorithm.

<table>
<thead>
<tr>
<th>parameter</th>
<th>definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>( M )</td>
<td>number of processing units</td>
</tr>
<tr>
<td>( N )</td>
<td>number of tasks</td>
</tr>
<tr>
<td>( M_a )</td>
<td>heterogeneity matching matrix</td>
</tr>
<tr>
<td>( C_o )</td>
<td>reconfigurability coupling matrix</td>
</tr>
<tr>
<td>( w_a ), ( 1 \leq a \leq N )</td>
<td>computation workload of task ( a )</td>
</tr>
<tr>
<td>( d_{ab} ), ( 1 \leq a, b \leq N )</td>
<td>communication workload between task ( a ) and task ( b )</td>
</tr>
<tr>
<td>( B )</td>
<td>bandwidth</td>
</tr>
<tr>
<td>parents( ( a ) ), ( 1 \leq a \leq N )</td>
<td>list of direct predecessors for each task ( a )</td>
</tr>
</tbody>
</table>

4.2 MaCo-based Heterogeneous & Reconfigurable Performance Analysis (MaCo-HRPA)

Algorithm: MaCo-HRPA
Input: All inputs listed in table 1
Output: Total completion time

\{
L1: \( t_s(a)=0, \ t_e(a)=0 \)
L2: \( \rho \leftarrow \{1, 2, \ldots, M\} \)
L3: \( \varepsilon \leftarrow \{a | \ \text{indegree}(a)=0, \ 1 \leq a \leq N \} \)
\}
L4: Do until \(\varepsilon\) empty {
L5: (1) For each executable task \(a \in \varepsilon\)
L6: \(x = \text{scheduling} (a, \rho, M_a)\)
L7: if \(x > 0\)
L8: \(\varepsilon \leftarrow \varepsilon - \{a\}\)
L9: \(\rho \leftarrow \rho + \{x\}\)
L10: \(v_{ax} = \text{select}_v(a, x, M_a)\)
L11: \(T_{\text{comp}}(a) = \frac{w_a}{v_{ax}}\)
L12: \(t_e(a) \leftarrow t_e(a) + T_{\text{comp}}(a)\)
L13: \(\rho \leftarrow \rho + \{x\}\)
L14: (2) For each immediate successor \(b\) of task \(a\)
L15: \(r_i \leftarrow \text{HR-DAG} (a, b)\)
L16: \(t_j \leftarrow \text{select}_c(r_i, C_o)\)
L17: reconfiguring \((t_j)\)
L18: \(T_{\text{comm}}(a, b) = \frac{d_{ab}}{Bc_{i,j}}\)
L19: \(t_c(b) \leftarrow t_c(a) + T_{\text{comm}}(a)\)
L20: indegree\((b) \leftarrow \text{indegree}(b) - 1\)
L21: if \(\text{indegree}(b) = 0\)
L22: \(\varepsilon \leftarrow \varepsilon + \{b\}\)
L23: 
L24: Total_completion_time \(\leftarrow \max\{t_e(b), 1 \leq b \leq N\}\)

5. Case Study

In order to validate the accuracy of the performance analysis method, we create an experiment platform. This platform has one dual-core 2.4 GHz AMD Opteron CPU, two NVIDIA Quadro FX 5600 GPU cards (each has 1.5 GB of on-board memory), two Nallatech H101-PCIX FPGA accelerator card (with 16 MB of SRAM and 512 MB of SDRAM), and InfiniBand and Ethernet PCIe network cards. We design a set of microbenchmarks, it consists of matrix multiply (MM), DES, N-body (PP algorithm), SVM, 2-DFFT. Fig. 4 compares the predicted completion time with the measured completion time for these microbenchmarks.

![Fig 4. Comparision between predicted value and measured value](image-url)
According to the results, the geometric mean of the error in the execution time prediction for microbenchmarks is 4.1%. Because of this performance analysis method neglect the tasks' start-up time, reconfiguration time and other system overhead, the measured times are all bigger than the predicted times. We find that DES achieves the optimal performance on FPGA, even the execution time is 4 orders of magnitude smaller than GPU. It shows that the optimal matching is important to performance.

6. Related work

Traditionally, performance analysis techniques for high performance computing can be classified into three categories: analytical modeling, measurement and simulation [7-9]. Adve, etc [10] developed a conceptually simple modeling technique called deterministic task graph analysis that provides detailed performance prediction for shared-memory programs with arbitrary task graphs, a wide variety of task scheduling policies, and significant communication and resource contention. This model assumed deterministic task execution times (while retaining the use of stochastic models for communication and resource contention). Smith, etc [11] developed an analytic modeling methodology for characterizing the performance of applications running on high performance reconfigurable computing (HPRC) resources. This methodology included the impact of RC hardware, communication, application load imbalance, background load, and heterogeneous resources. The model differed from earlier pure software models in that it accounts for the hardware execution time and related hardware overhead. Javadi, etc [12] presented a model based on queuing networks and probabilistic analysis to predict the average message latency of super-cluster systems with network heterogeneity. The analysis was based on a parametric family of fat-trees, the m-port n-tree, and a deterministic routing algorithm was proposed.

7. Conclusion

This paper presented the definitions about heterogeneous computing (HC), reconfigurable computing (RC) and heterogeneous & reconfigurable computing (HRC) and discussed the relationship among them. We proposed the heterogeneous & reconfigurable computing system (HRCS) model and heterogeneous & reconfigurable DAG (HR-DAG) model. HR-DAG extended the traditional DAG model, it was able to reflect the heterogeneity of tasks and reconfigurability of architecture. Heterogeneity matching matrix Ma was proposed to depict the matching degree between various tasks and different processing units, reconfigurability coupling matrix Co depicted the coupling degree when these processing units execute different tasks. We designed a heterogeneous & reconfigurable performance analysis method based on scheduling algorithm--MaCo-HRPA, which took into account the heterogeneity and reconfigurability simultaneously and more applicable than these previous techniques. In the future, we plan to research the generation of reconfigurability coupling matrix Co, the interaction between communication characteristic and interconnection and the quantitative study about coupling degree.

References


