ParaLearn: A Massively Parallel, Scalable System for Learning Interaction Networks on FPGAs

Narges Bani Asadi†
Electrical Engineering Department
Stanford University
Stanford, CA, USA 94305
nargesb@stanford.edu

Christopher W. Fletcher†
Electrical Engineering and Computer Sciences Department
University of California, Berkeley, CA, USA 94720
cwfletcher@berkeley.edu

Greg Gibeling
Electrical Engineering and Computer Sciences Department
University of California, Berkeley, CA, USA 94720
gdgib@berkeley.edu

John Wawrzynek
Electrical Engineering and Computer Sciences Department
University of California, Berkeley, CA, USA 94720
johnw@eecs.berkeley.edu

Wing H. Wong
Statistics and Health Research and Policy Departments
Stanford University
Stanford, CA, USA 94305
whwong@stanford.edu

Garry P. Nolan
Microbiology and Immunology Department
Stanford University
Stanford, CA, USA 94305
gnolan@stanford.edu

† Lead authors.

ABSTRACT
ParaLearn is a scalable, parallel FPGA-based system for learning interaction networks using Bayesian statistics. ParaLearn includes problem specific parallel/scalable algorithms, system software and hardware architecture to address this complex problem.

Learning interaction networks from data uncovers causal relationships and allows scientists to predict and explain a system’s behavior. Interaction networks have applications in many fields, though we will discuss them particularly in the field of personalized medicine where state of the art high throughput experiments generate extensive data on gene expression, DNA sequencing and protein abundance. In this paper we demonstrate how ParaLearn models Signaling Networks in human T-Cells.

We show greater than 2000 fold speedup on a Field Programmable Gate Array when compared to a baseline conventional implementation on a General Purpose Processor (GPP), a 4.8 fold speedup compared to a heavily optimized parallel GPP implementation, and between 2.74 and 6.15 fold power savings over the optimized GPP. Compared to software approaches, ParaLearn is faster, more power efficient, and can support novel learning algorithms that substantially improve the precision and robustness of the results.

Categories and Subject Descriptors
C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems

General Terms
Design, Performance, Algorithms

Keywords
FPGA, Bayesian Networks, Markov Chain Monte Carlo, Reconfigurable Computing, Signal Transduction Networks

1. INTRODUCTION
ParaLearn is a scalable, parallel FPGA-based system for learning interaction networks using Bayesian statistics. Interaction networks and graphical models have various applications in bioinformatics, finance, signal processing, and computer vision. They have found use particularly in systems biology and personalized medicine in recent years, as improvements in biological high throughput experiments provide scientists with massive amounts of data.

ParaLearn algorithms are based on a Bayesian network (BN) [19] statistical framework. BNs’ probabilistic nature allows them to model uncertainty in real life systems as well as the noise that is inherent in many sources of data.

Unlike other graphical models such as Markov Random Fields and undirected graphs, BNs are easily capable of learning sparse and causal structures that are interpretable by scientists [9, 14, 19].

Learning BNs from experimental data helps scientists predict and explain systems’ outputs and learn causal relations that lead to new discoveries. Discovering unknown BNs from data, however, has been a computationally challenging problem and despite the significant recent improvements in algorithms is still computationally infeasible, except for networks with few variables [8].
ParaLearn accelerates BN discovery through exploiting parallelism in BN learning algorithms. Furthermore, ParaLearn uses Markov Chain Monte Carlo (MCMC) optimization methods to search for the BNs that can best explain experimental data. As it was designed for high-performance, ParaLearn is able to use novel algorithms to address robustness and precision issues, which have traditionally been sacrificed to decrease runtime on general purpose processors (GPPs).

BN modeling has motivated previous studies targeting single chip FPGA implementations [5, 20]. Unlike these approaches, ParaLearn’s focus is to explore different avenues for system scalability. In order to scale to larger problems and networks, ParaLearn supports a general mesh of interconnected FPGAs. To increase robustness and find the true underlying model behind the data, ParaLearn coordinates an ensemble of scoring methods in what we call the MetaSearch approach.

We implemented ParaLearn on multi-core GPUs, GPUs [17] and FPGAs. The FPGA implementation achieves the highest performance and power savings through best exploiting fine-grained parallelism inherent in the algorithm, but required the most development effort. Moreover, FPGA multi-chip systems are interconnected through a topology customized to ParaLearn algorithms. This allows ParaLearn to scale, while maintaining performance, using additional FPGAs.

We demonstrate ParaLearn’s performance, power and scalability in modeling the Signal Transduction Networks (STN) in human T-Cells. STNs include hundreds of interacting proteins and small molecules in the cell and regulate numerous cellular functions in response to changes in the cell’s chemical or physical environment. STNs are important subjects of study in systems biology and alterations in their structures are profoundly linked to increased risks of human diseases like cancer [15].

Single cell high-throughput measurement techniques such as flow cytometry have made it possible to measure and monitor the activity of small molecules and proteins involved in STNs. BNs have been successfully applied to reconstruct possible STN pathways from this data [21]. BN inference helps determine the causal structure of STNs as well as linking different structures to clinical outcomes. This can lead to new diagnostics, drug target and biomarker discoveries and novel prognosis tools [15]. While the previous computational studies on reconstructing STNs from data involved few variables (about 10 proteins), a new high-throughput measurement technology, "CyToF" [4], has been developed based on mass spectrometry that enables scientists to measure up to a hundred molecules simultaneously. CyToF creates the opportunity and the need for more sophisticated analysis of STNs in cancer cells. We use the data that has been produced using CyToF technology [4] in our analysis.

2. ALGORITHMS & MAPPING TO FPGAS

BNs [19] are a class of probabilistic graphical models that are useful in modeling and learning complex stochastic relationships between interacting variables. A BN is a directed acyclic graph $\mathcal{G}$, the nodes of which represent multivariate random variables $V = \{V_1, \ldots, V_n\}$. The structure of the graph encodes conditional and marginal independence of the variables and their causal relations by the local Markov property: that every node is independent of its non-descendants given its parents. The dependence of nodes on their parents is encoded in local Conditional Probability Distributions (CPDs) at each node [14, 19]. The joint probability distribution is decomposed to the product of the probability of each variable $V_i$ conditioned on its parents $\Pi_i$: $P(V_1, \ldots, V_n) = \prod_{i=1}^{n} P(V_i|\Pi_i)$.

The CPDs that model $P(V_i|\Pi_i)$ can have different forms based on the data characteristics. The Multinomial distribution, Linear Gaussian or Mixture of Gaussians are a few examples of possible CPD formulations. Figure 1.a depicts an example of a BN with binomial CPDs.

The algorithms that have been developed for learning BNs usually assign a score to candidate graph structures (based on how well each graph structure explains the data) and then search for the best scoring graph structures. The score of a graph structure $\mathcal{G}$ given the data $\mathcal{D}$ can be computed in many different ways. The most popular scoring metrics are based on Maximum Likelihood (ML), Bayesian Information Criterion (BIC) or Bayesian methods [14, 16]. The graph scores also depend on the CPD formulation that is assumed for the BN. The important observation here is that the graph score can always be decomposed into a local score at each node. As we explain later, this enables a general purpose parallel framework that can accommodate all the different BN learning algorithms.

While learning CPD parameters for a given graph structure is a relatively easy task, the computational inference of the graph structures is an NP-hard problem [6]. Despite significant recent progress in algorithm development, this is currently an open challenge in computational statistics that has remained infeasible except for cases with a small number of variables [8].

MCMC optimization methods have been used to perform the search algorithm [5, 8] in this super-exponentially growing space. MCMC methods are capable of skipping the local optima in the search space and are preferred over heuristic search methods in multi-modal distributions. Following [5, 8, 22] we use MCMC in the order space instead of directly searching the graph space. The order space is smaller than the graph space ($2^{O(n \log(n))}$ vs. $2^{\Omega(n^2)}$, where $n$ is the number of variables or nodes of the BN). Moreover, the order space enables powerful search operations (such as swapping the position of two nodes) leading to better exploration of the search space.

For any BN there exists at least one total ordering of the vertices, denoted as $\prec$, such that $V_i \prec V_j$ if $V_j \in \Pi_i$. An example of a BN and a compatible order is depicted in Figure 1. To employ MCMC in the order space, one performs a random walk in the space of possible orders and at each step accepts or rejects the proposed order based on the current and proposed orders’ scores (according to the Metropolis-Hastings rule [13]). The score of a given order is decomposed into the scores of the graphs consistent with the order: $\text{Score}(\prec, \mathcal{D}) = \sum_{\mathcal{G} \in \prec} \text{Score}(\mathcal{G}; \mathcal{D})$, which can be efficiently calculated as shown in [10]:

$$\text{Score}(\prec, \mathcal{D}) = \sum_{\mathcal{G} \in \prec} \prod_{i=1}^{n} \text{LocalScore}(V_i, \Pi_i; \mathcal{D}, \mathcal{G}) = \prod_{i=1}^{n} \sum_{\Pi_i \in \Pi} \text{LocalScore}(V_i, \Pi_i; \mathcal{D}, \mathcal{G}) \quad (1)$$
After MCMC converges, each order is sampled with a frequency proportional to its posterior probability. The high scoring orders include the high scoring graphs and these graph structures can be extracted from the orders in parallel.

Local score generation must finish before the MCMC Order Sampler can start walking the order space. The computation of local scores in Equation 1 depends on the scoring metric as well as CPD formulations. Local scores are calculated from the raw data for the possible parent sets of that node and only once for each node. The MCMC Order Sampler then scores randomly generated orders by traversing the list of possible parent sets for each node and accumulating the local score of each parent set that is compatible with the current order. Each node’s score is then multiplied together to form the order score, according to Equation 1. The MCMC Kernel usually needs to perform hundreds of thousands or even millions of iterations before convergence and this is where more than 90% of computation takes place [5].

Therefore BN inference is a three step computation problem:

1. **Preprocessing**: local scores are generated according to the CPD formulation and ordering for possible parent sets.

2. **Order and Graph Sampling**: the high scoring order structures are learned from data and the high scoring graph structures are sampled from these orders. This step usually takes 90% of execution time on general purpose processors.

3. **Postprocessing**: higher level analysis processes such as visualization and other tools for comparing and using the learned graph structures.

To achieve high throughput as well as flexibility, ParaLearn uses general purpose computing tools (GPPs) to perform the first and third steps and optimizes the algorithm kernel (second step) by using a customized hardware implementation. The key observation that leads to this efficient and flexible design is that while the different BN scoring and search methods differ in the first and third steps, all feature the same computation kernel (second step). In the next section we will explain how the computations needed for the MCMC Kernel are mapped and optimized to FPGA platforms.

### 3. IMPLEMENTATION

In this section, we discuss how the MCMC Kernel is implemented on a single FPGA and scaled to multiple FPGAs.

#### 3.1 Mapping the MCMC Kernel to an FPGA

The MCMC Kernel’s computation is performed by an MCMC Controller, Scoring Unit and Graph Sampler Unit, shown in Figure 2.

##### 3.1.1 MCMC Controller

The MCMC Controller Unit (MCU) performs the sample (order) generation as well as deciding whether to keep or discard a sample based on its score. The MCU uses the two dimensional encoding to represent orders as in [5]. In the two dimensional encoding of an order, each row represents a “local order” and encodes the possible parents of the node corresponding to that row, as is shown in Figure 1. The MCU performs the random walk by swapping the position of two nodes (corresponding rows and columns in the two dimensional encoding) to create new samples, and accepts the new sample with probability A based on the Metropolis-Hastings rule:

\[
A(\prec \rightarrow \prec') = \min \left( 1, \frac{\text{Score}(\prec') | D)}{\text{Score}(\prec | D)} \right)
\]

Since the scores represent very small probabilities, all computations are done in logarithmic (log) space. This also allows the reduction step on the FPGA to be implemented as simple additions rather than multiplications. Therefore, the MCU decides whether to keep the new sample by subtracting the old score from the new score and compares this number to the log of a random number. A hardware LFSR and a log look-up table are designed to generate the required random numbers.

##### 3.1.2 MCMC Scoring Unit

When the MCU generates an order, the MCMC scoring unit (MSU) is responsible for calculating and returning the resulting score to the MCU. As introduced in Section 2, the scoring process for each local order involves traversing over a set of parent sets (stored in memory) and accumulating the local score of each compatible parent set. The accumulation process is associative and commutative and can be decomposed into as many smaller, parallel accumulations as the implementation platform can accommodate.

On Virtex-5 FPGAs, each accumulation circuit (called a scoring core or ‘SC’) requires Block RAM (BRAM) for parent set and local score lookups, regular FPGA logic to implement the accumulator itself, and either BRAM or distributed/LUT RAM (LRAM) to implement a log lookup operation. Score look-ups are made with BRAMs because of their high bandwidth and single cycle access. Using LRAM for log look-ups saves BRAM resources for parent sets and scores but constrains routing high-utilization/frequency designs.

At the MSU’s top level, the logic responsible for scoring one node (or ‘SN’) is replicated for each node that the system must support. Nodes are then divided into SCs, which are
assigned $BRAMperSC$ 18Kbit BRAM, given by:

$$BRAMperSC = \left\lceil \frac{\text{Network Size}}{32 \times \text{Ports}} \right\rceil + \left\lceil \frac{\text{FP Precision}}{32 \times \text{Ports}} \right\rceil$$

$Ports$ is either 1 or 2 for Virtex-5 FPGA BRAMs, meaning that each BRAM can be dual-ported to increase performance. Within each SC, the score accumulation datapath is pipelined and time-multiplexed between hardware scoring threads in order to maintain a one-score-per-cycle accumulation in the steady state, regardless of clock frequency. Each SN is assigned as many SCs as is necessary to support the problem specified parent-sets-per-node (PPN) constraint. This architecture strives to minimize BRAM depth and maximize throughput per SC, which in turn maximizes performance.

When the MCU broadcasts an order, the order is split into local orders and, through a pipeline that fans out across the FPGA, sent to each node’s SCs in parallel. Each SC starts and finishes scoring its local order at the same time since the scoring process is a data independent memory traversal. Once scoring is complete, each SC’s result is accumulated as shown in Figure 3. Linear accumulation is used to accumulate in nearest-neighbor fashion across SCs and SNs because the Virtex-5 FPGA’s BRAMs are arranged in columns.

To avoid rerunning FPGA CAD tools for different networks, each node and SC in the MSU can be enabled or disabled at runtime. Each SC that does not receive any parent sets from software is considered disabled and is by-
passed during the score accumulation process. If all SCs within an SN are bypassed in this way, the entire SN is bypassed. This means that an MCMC Kernel that is designed to accommodate an N node and P PPN system can accommodate any number of nodes \( \leq N \) and any number of PPN \( \leq P \).

When the system is synthesized to support more PPN than it needs to support a given network, the system can achieve a greater speedup. This is because every parent set BRAM is designed to behave like a restartable FIFO. Thus, each SC only scores as many parent sets as have been loaded at initialization. By spreading out parent sets across the SCs as evenly as possible, the kernel can optimize at runtime for arbitrary problems.

### 3.1.3 MCMC Graph Sampler Unit

Determining which graph will produce the highest score from each order is a process typically undertaken by software after order sampling is complete [17]. ParaLearn’s FPGA Kernel determines graphs and their scores in parallel with the order scoring process.

The highest scoring graph consists of the highest scoring parent set for each node. Thus, each SC must keep track of the highest scoring parent set it has seen throughout each iteration. When order scoring is complete, the graph score is accumulated separately across SCs and the graph itself is assembled and sent to software (see the “Graph-Samplers” in Figure 2). Assembling the final graph takes more clock cycles than accumulating the order score, so the \( i^{th} \) order’s graph is assembled while the \( (i + 1)^{th} \) order is issued by the MCU. Through integrating the graph sampling step into the order sampler, graph sampling costs zero time overhead.

### 3.1.4 GateLib & RCBIOS

GateLib [12] is a standard library of hardware and software code with an integrated build and test framework. GateLib was developed at U.C. Berkeley and includes everything from standard registers, to DRAM controllers, to build and test tools which ensure that both the library and designs such as ParaLearn are operating correctly.

For ParaLearn the most significant component of GateLib is the sub-library called RCBIOS, which provides a Reconfigurable Cluster Basic Input/Output System for the kind of FPGA computing platforms used in this work. RCBIOS provides three primary interfaces between hardware and software: remote memory access (RDMA), control and status registers and data streams. RCBIOS is built on top of a flexible Network-on-a-Chip interconnect and XLink, another sub-library of GateLib which provides simple hardware/software communication. Implemented directly in RTL Verilog (i.e., without any service processors), the RCBIOS modules provide high-performance, low cost and easy-to-use communications between the FPGAs and the front-end system.

ParaLearn uses RCBIOS to initialize system state and to collect resulting graphs and their scores as shown in Figure 4. After the pre-processing step, ParaLearn software uses RCBIOS to send parent sets, local scores, a node count, an iteration count, and an initial order to the reconfigurable cluster (referred to as Load Time). As each order is scored, its highest scoring graph and that graph’s score is streamed back to software and post-processed.

### 4. SCALABILITY

When the problem becomes too large for a single FPGA, the MCMC Kernel must be spread across multiple FPGAs. ParaLearn leverages the BEE3 [2] platform’s mesh network (see Figure 5), composed of both interchip links between the four FPGAs, and CX4 links between BEE3’s. A multi-FPGA MCMC Kernel is composed of a master FPGA and one or more slave FPGAs. The master FPGA contains MCU and MSU logic. The slave FPGAs are used for their MSUs only and score any local order which arrives, returning the partial score to the master FPGA. In all cases, since the logic which differentiates master and slave is relatively small compared to total FPGA area, a single FPGA bit-file is used for both master and slave, and each is configured at runtime through software. This simplifies the process of reconfiguring and managing the system as more FPGAs need to be introduced to support larger problems.

ParaLearn augments the BEE3 mesh network with a general cross-chip router to support additional FPGAs without having to modify the pre-existing system. The router, a ded-
Figure 5: Different views of a multi-FPGA ParaLearn Kernel.

ParaLearn’s scalability, coupled with the dearth of industry standards, makes it important for us to address the tradeoffs between different multi-FPGA or reconfigurable cluster platforms. Reconfigurable cluster platforms differ in the number and type of FPGAs per board, the connection topology, as well as supported peripherals and interfaces. Platforms range from the Dini Group DN9000K10 which consists of a mesh of FPGAs, to the Xilinx ACP that uses an Intel front-side-bus (FSB) to connect a smaller number of FPGAs to a CPU.

ParaLearn targets the BEE3 reconfigurable cluster platform from BEEcube [2]. The BEE3 consists of four Virtex-5 LX155T FPGAs, two channels of DDR2 SDRAM per FPGA, and a point-to-point interconnect. The interconnect for each FPGA consists of two “interchip” links, which form a ring on the board, and two 10GBase-CX4 Ethernet interfaces for interboard connections. ParaLearn as discussed in section 4, uses these connections to assemble a mesh network of MCMC cores.

Unlike platforms that embed a full network of FPGAs in a single PCB, such as Dini Group boards, the CX4 connections on the BEE3 are cable-based, which allows the system to be reconfigured to meet the application’s needs. This allows different FPGAs in the system to be connected directly together in order to reduce inter-FPGA hop count, or to increase the size of the overall system. These CX4 connections come at a price, having latency on the order of several dozen cycles, as opposed to the 5 cycles between FPGAs on one board.

When comparing the BEE3 against bus-connected platforms like the Xilinx ACP, there is a tradeoff between cluster size and front-end communication. Bus-connected platforms like the ACP are limited by bus sharing and score accumulation will scale linearly in time with the number of FPGAs. By contrast the BEE3 can take advantage of score reductions at each hop, accumulating multiple FPGA’s score at a single time. Furthermore, the cost of the ACP systems must include a processor which is of no particular use to ParaLearn and adds to purchase, complexity and maintenance costs.

While ParaLearn is currently implemented on a BEE3 system due to the characteristics of the algorithm, this is not the only compatible implementation platform, and the system can be efficiently implemented on other platforms. In particular the system was developed in part on the Xilinx ML505 demonstration boards, which are comparable to a quarter of a BEE3, and provide a low-cost entry level FPGA platform.

5. RESULTS AND ANALYSIS

Our primary study analyzes 22 proteins in human cancer T-Cells with data from CyTof technology. We use 10,000 single cell measurements of the 22 proteins and limit the search indegree for the graph search to 4. Therefore, we have 7547 possible parent sets for each node and that we need at least two Virtex-5 LX155T FPGAs to implement one MCMC Kernel.

5.1 Quality of Results

To assess the correctness of the FPGA design, we tested the software and FPGA versions of the algorithms on synthetic data simulated from known BN structures such as the ALARM [1] network and we were able to reconstruct the network on both implementations. The only approximation in the FPGA implementation compared to the software version is the fixed point conversion of the local scores and lookup.
5.2 Design Space Exploration

Sections 5.2.2 through 5.2.5 present different studies used to evaluate ParaLearn in terms of scalability, performance, area and power. In our analysis, we use an “orders per second” (OPS) metric to determine performance. Unless otherwise noted, all tests are run on the following hardware:

**FPGA:** Xilinx Virtex-5 XC5VLX155T (-2 speed) FPGAs

**GPP:** Intel(R) Core(TM) i7 CPU QuadCore running at 3.07 GHz, with 12 GB of Memory and an advertised TDP of 130 W.

For perspective, the Virtex-5 is a generation-old FPGA family and the XC5VLX155T is a mid-sized chip in the family.

5.2.1 Software Generator

To help conduct performance studies, we use software to generate MCMC Kernels that are optimized for different networks. The generator takes as input the desired network size, PPN, and parameters describing the implementation platform (such as BRAM dimensions and the number of FPGAs available to solve the problem). From this information, the generator sets parameters used by the FPGA CAD tools that describe how many SCs should be instantiated in each node, and also how each of the SCs should be implemented.

To build the optimal configuration, the software generator first determines the theoretical maximum performance through fixing the network size, PPN, and hardware resources, while scaling the number of SCs per node. Generally, performance will increase with the number of SCs to the point where the result accumulation step overwhelms the SC’s execution time. Figures 7 and 9 exemplify the pare-to optimum curves as a function of SCs per node.

The generator varies the number of SCs per node by adding more single-ported SCs or dual-ported existing SCs. Adding single-ported SCs costs the most BRAM and FPGA logic, but can be done incrementally until the system runs out of FPGA fabric. Using dual-ported SCs is “all or nothing”—if one SC is dual-ported, the rest have to be as well or the performance benefit is masked by the slower SCs. Adding single-ported SCs also increases the maximum PPN that the system can support, while dual-portion does not. ParaLearn dual-ports SCs when the required PPN is low or the system is BRAM constrained.

5.2.2 Current Scalability

In this section we compare the 22 parameter² CyTof data across currently attainable³ FPGA configurations in order to show how speedup and power consumption is affected by the hardware configuration used to support the problem.

The OPS column in Table 1 shows the performance impact of spreading the 22 parameter problem across different arrangements of FPGAs. For each number of FPGAs, trials are run for \{100, 150, 200\} MHz clock frequencies and for both single and dual-ported SCs. To isolate the effect of scaling across FPGAs, all configurations support exactly 8 single or dual-ported SCs per node. If a configuration in this permutation isn’t listed in Table 1, it was not attainable. As more FPGAs are used to support the problem, configurations using the same number of SC ports and the same clock frequency tend to degrade in performance due to an increased hop latency in the FPGA mesh. With additional hardware, however, more aggressive configurations (in terms of SC porting and clock frequency) become “attainable.”

The Power columns in Table 1 show power utilization for different configurations. All power results are gathered through the Xilinx XPower Analyzer after simulating traces through the system running the 22 parameter CyTof data set. As we have fixed the number of SCs per node in each experiment, the power consumption per FPGA drops in sparser FPGA configurations however total system power

²One parameter corresponds to one node.
³An “attainable” configuration can fit into the allotted hardware and meet timing at the specified frequency.
Table 1: MCMC Kernel study on the 22 parameter CyTof data (PPN = 7547) on one and two BEE3 boards. The highest performance “attainable” configuration for each system composed of $N$ FPGAs is shown in bold.

(taking into account the number of FPGAs) tends to increase. We found that this is due to fixed cross-FPGA communication overhead, namely the interchip links in each sample (which produced $\approx 3.8$ W) and the GTP connections (responsible for $\approx 1$ W) in the 8 FPGA experiment. Furthermore, Table 1 shows that in general, 150 Mhz configurations require more power than 200 Mhz configurations. We attribute this to there being an extra clock in 150 Mhz configurations (interchip requires a 200 Mhz clock and RCBIOS requires a 100 Mhz clock-- thus the MCMC Kernel can use those existing clocks when running at 100 or 200 Mhz).

Figure 6 shows the FPGA resource utilization for all attainable 2-4 FPGA configurations in Table 1. In general, using dual-ported SCs increases performance improvements by $\approx 1.6 \times$ and more evenly uses FPGA resources. In practice, we observed that designs that utilize a majority of FPGA logic resources could not route at higher frequencies when LRAM was also heavily utilized. To get around this problem for 200 MHz configurations, we more heavily relied on BRAM rather than LRAM to implement $\log$ tables, relative to 150 MHz configurations.

5.2.3 Projected Scalability

In this section, we explore how larger FPGAs can be used to increase system speedup. For this experiment, we used the software generator (Section 5.2.1) to project performance over a range of SC counts per node (shown in Figure 7). We verified these trends through:

1. Direct comparison with hardware performance for “attainable” configurations.
2. Gate-level simulation at fixed intervals on the curve, and at the projected upper-bound, for configurations requiring larger FPGAs.

Extrapolated from Figure 7, larger FPGAs will provide between $1.5 \times$ and $1.7 \times$ speedup over current configurations assuming a comparison across the same number of FPGAs. If the problem is able to fit onto a single FPGA, power will be minimized and speedup can increase by up to 2.61 ×.

It is also important to consider how performance degrades as the number of FPGAs in the system increases. To isolate this effect, we constrain the software generator to 8 single-ported SCs per node while increasing the number of FPGAs, as shown in Figure 8. In order to maximize performance, each FPGA mesh is arranged so that the hop latency from the master FPGA to any given slave FPGA is minimized. The occasional kinks in the graph are due to adding an FPGA with a new greatest hop latency. As can be seen, the greatest performance hit is moving from one to two FPGAs, with the performance decreasing linearly as the number of FPGAs increases.

5.2.4 Current Flexibility

This section studies how ParaLearn can scale to handle different networks.

As the FPGA CAD tools take an appreciable amount of time to run, it is important to study how a general FPGA bitfile, configured for up to $N$ nodes and $P$ PPN performs for different problems. Table 2 shows different configurations for problems spanning over four FPGAs (a single BEE3 board). All configurations maximize PPN and place performance as a second-order constraint in order to tolerate denser networks. Thus, this study synthesizes only single-ported BRAMs and shows results for the highest attainable clock frequency (150 MHz for every experiment).

All configurations are first shown assuming that all available parent set memory is used. The 22 parameter CyTof
data’s performance is shown in rows whose PPN column is marked with a ⋆. The closest configuration to the CyToF data (\(C\)) achieves \(0.8 \times\) performance relative to the most aggressive, attainable bitfile (marked \(\oplus\) in Table 1) while the next closest (\(B\)) achieves \(0.6 \times\) the performance, and the last (\(A\)) maintains \(0.4 \times\) performance. The drop-off is due to there being less SCs per FPGA, which is due to larger networks requiring wider parent set logic.

Table 2 also shows the point at which the MCMC Kernel becomes BRAM limited in handling different networks. For a network of \(N\) nodes, there are a possible \(2^{N-1}\) different parent sets. Theoretically, the 8 node network across four FPGAs can tolerate 59,392 parent sets but a network of that size will only ever have \(2^8 - 1 = 128\) parents (for this reason, an 8 node network would never be spread across four FPGAs in practice). The 16 node network, on the other hand could have 32,768 different parent sets, yet the MCMC Kernel can only tolerate 28,672 parents.

ParaLearn tolerates BRAM constrained networks through limiting the number of parent sets. This can be done in two ways:

1. Using smarter parent set filtering algorithms based on supervised learning techniques that can learn the subset of variables that are correlated with a node value and then include those as possible parent sets.

2. Limiting the maximum indegree of the graphs. With this approach, ParaLearn includes parent sets up to size \(K\), where usually \(2 < K < 6\). This decreases the parent set count from \(2^{N-1}\) to \(\sum_{i=0}^{K} \binom{N-1}{i}\).

Method 1 is more flexible as candidate parent sets for each node will be optimized and learned separately. For the 22 parameter CyToF data we used method 2 (\(K = 4\)) as it does not impose additional runtime and is a reasonable assumption for 22 node networks.

### 5.2.5 Projected Flexibility

Using Method 2, Figure 9 shows peak performance curves for each network, again as a function of SCs per node. The 8, 16, 22, and 24 node networks use indegree = 4 while the 32 and 40 node networks use indegree = 3. X notches in the graph represent where 22 parameter CyToF configurations sit on each curve that supports \(\geq 22\) nodes and \(\geq 7547\) PPN. As in Table 2, bitfiles designed for less similar networks, relative to the 22 parameter data, show less performance. In general, networks will lower PPN requirements reach peak performance with less SCs per node because the ratio between SC kernel time and result accumulation decreases as problem size decreases.
5.3 End-to-End Computation

In this section we compare 22 parameter CyTof end-to-end (“time to graphs”) performance between an optimized GPP implementation [17] and an MCMC Kernel using different numbers of FPGAs. This study runs 100,000 MCMC iterations over 50 random restarts, where each random restart resets the system with a different initial order (starting point). After the MCMC Kernel completes, the final graph results are sorted by their scores and probabilities are normalized against a weighted average to produce a final graph.

Table 3 shows how the Order Sampler’s speedup scales with the number of available FPGA chips. The pre-processing time starts to dominate as the core is accelerated. Preprocessing currently takes 185 seconds for multinomial CPDs and 50 seconds for Linear Gaussian CPDs.

6. SOURCES OF SPEEDUP

In this section we compare different MCMC algorithm implementations, using different optimizations, to show where the speedup in our approaches comes from. Table 4 shows the ‘best effort’ end-to-end run times for a baseline GPP implementation, optimized GPP implementation (derived from [17]), and attainable FPGA configurations. For this particular problem and the network’s size, [17] has shown that the GPU does not perform well relative to the GPP. Thus, only GPP numbers are shown here. As can be seen, FPGA load time weighs down the FPGA’s performance while the graph sampler weighs down the GPP.

The baseline GPP implementation was written in C and applies no software optimizations. The optimized GPP implementation performs bitwise operations where possible, uses software multi-threading, and caches results to improve performance. The FPGA implementation exploits the fine-grained parallelism afforded by the BRAMs and uses bitwise manipulation, however, does not cache. In the remainder of the section, we will distill the different optimizations used in each approach to try to explain performance differences.

First, we compare the benefits of threading the GPP and FPGA implementations. We implicitly implemented multi-threading on the FPGA through exploiting BRAM parallelism. Through adding pthreads to the baseline C implementation, we were able to achieve a 2x speedup with 10 threads running on a quad core/8 thread machine. The cross-thread communication in the result accumulation step seems to weigh down the pthreads speedup. The FPGA can best take advantage of parallelism because of fine-grained BRAM access and because it can optimize the result accumulation step at the gate level.

Second, we implemented a fixed point implementation of the baseline GPP model so that it can more closely be compared to the FPGA, which is also fixed point. Fixed point on the GPP does not improve performance. This may be due to the fact that the floating point units are nearly as optimized and fast as the integer units on modern processors.

Third, we compare the effect of caching on the GPP to the FPGA. Caching works by pairing orders with their scores, which is valid because a given order will always have the same score. We did not implement caching on the FPGA because it requires dedicated logic and makes execution time problem dependant (unlike this work’s “orders per second” metric). Work done by [17] (which provided data for the optimized GPP in Tables 3 and 4) shows how different problem configurations have different cache performance. The benefit achieved through caching is based on how long it takes the problem to converge.

Caching is naturally suited for the GPP/GPU because these platforms have built-in caches that don’t come at extra design cost. After paying the logic penalty, however, caching can be implemented on FPGAs. In order to construct an analytic model of cache performance, we consider the following types of caches:

**Order:** A cache that sits next to the MCU and caches whole orders.

**Node:** A cache at each node that stores only local orders and their partial scores. A GPP benefits from every hit in every node cache. The FPGA only benefits if every node’s cache hits in the same iteration.

Both caches’ performance will depend on their capacity and hash function. On an FPGA, we have two options for cache storage: off-chip memory (we assume DRAM) and on-chip BRAM. DRAM requires a DRAM controller and has gigabytes of capacity, however carries an access latency that

---

Figure 9: The effect of scaling the number of SCs per node, fixed over 4 FPGAs, for various networks. All experiments are carried out with single-ported SCs, a 150 Mhz clock, and assuming that an FPGA chip can support any number of SCs.
varies with address and locality. BRAM requires only the minimum support logic as is necessary and is single-cycle access, however takes up BRAMs that would otherwise be used for SCs. In order to maximize cache capacity, we will assume DRAM-based caches. Since DRAM is centralized, logic dedicated to hash functions and cache control are a one-time cost.

Work done by [17] has shown us that when running the 22 node CyToF data with 100,000 iterations and 50 restarts, there are 22,978 hits in the order cache and 103,757,230 hits in node caches (~95% hit rate) out of which 3,274,755 are hits when all the nodes hit in a single iteration (~65% hit rate). Given this, we construct an analytical model of cache benefit on the FPGA:

$$CPO = \frac{\text{Raw} \times (\text{Total} - O_h - N_h) + O_a \times O_h + N_a \times N_h}{\text{Total}}$$

where CPO is “cycles per order,” the $O_{(b,a)}$ are the order cache hit count and access time (in cycles), $N_{(b,a)}$ are the same figures for the node cache, $\text{Raw}$ is cycles per order without caching and $\text{Total}$ is the number of orders proposed to the system. Table 5 shows a theoretical study of how the FPGA implementation would fare with the caching optimization. We assume a pipelined hash function initiates back-to-back DRAM reads for node cache lookups and a single read for an order lookup. Thus, cache lookup latency is given by:

$$O_a = T_{MCU} + T_h + DRAM_a = 5 + 10 + 25 = 40$$
$$N_a = T_{MCU} + T_h + DRAM_a + NS = 5 + 10 + 25 + 22 = 62$$

where $T_{MCU}$ is the mandatory overhead in the MCU, $T_h$ is the cycle latency of the pipelined hash function, $DRAM_a$ is a conservative DRAM access time (in cycles), and $NS$ is the network size (in nodes). This study assumes that all FPGA caches are non-blocking and proceed with order scoring during the cache look-up. Furthermore, we assume that partial scores retrieved from the node cache accumulate as each new score arrives, hiding the result accumulation step’s latency. For this data set, we observe $\approx 2.59\times$ improvement against the baseline FPGA implementation when using both caches. Furthermore, the caching FPGA attains a $\approx 4.86\times$ improvement over the optimized GPP.

### Table 4: MCMC Kernel end-to-end performance comparison against an optimized GPP solution. Times are broken into: Load Time + Order Sampler + Graph Sampler.

<table>
<thead>
<tr>
<th>Architecture</th>
<th>100,000 iterations, 0 restarts</th>
<th>100,000 iterations, 50 restarts</th>
<th>100,000 iterations, 100 restarts</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPP (baseline)</td>
<td>$0 + 970 + 20 = 990$</td>
<td>$0 + 48,500 + 60 = 48,560$</td>
<td>$0 + 97,000 + 60 = 97,060$</td>
</tr>
<tr>
<td>GPP (optimized)</td>
<td>$0 + 5,2 + 0.85 = 6.05$</td>
<td>$0 + 44.6 + 18.8 = 63.4$</td>
<td>$0 + 78.8 + 31.3 = 110.1$</td>
</tr>
<tr>
<td>FPGA (2x)</td>
<td>$4.5 + 0.81 + 0 = 5.31$</td>
<td>$4.5 + 40.6 + 0 = 45.1$</td>
<td>$4.5 + 81.3 + 0 = 85.8$</td>
</tr>
<tr>
<td>FPGA (3x)</td>
<td>$4.5 + 0.51 + 0 = 5.01$</td>
<td>$4.5 + 25.5 + 0 = 30.0$</td>
<td>$4.5 + 51.1 + 0 = 55.6$</td>
</tr>
<tr>
<td>FPGA (4x)</td>
<td>$4.5 + 0.44 + 0 = 4.94$</td>
<td>$4.5 + 22.1 + 0 = 26.6$</td>
<td>$4.5 + 44.1 + 0 = 48.65$</td>
</tr>
</tbody>
</table>

### Table 5: FPGA speedup when using both order and node caches. This test compares 22 node CyToF data over 100,000 iterations and 50 restarts.

<table>
<thead>
<tr>
<th># FPGAs</th>
<th>Baseline</th>
<th>Order</th>
<th>Node</th>
<th>Both</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>25.39</td>
<td>25.41</td>
<td>10.12</td>
<td>10.04</td>
<td>2.53×</td>
</tr>
<tr>
<td>4</td>
<td>22.1</td>
<td>21.98</td>
<td>8.63</td>
<td>8.53</td>
<td>2.59×</td>
</tr>
</tbody>
</table>

7. **DISCUSSION AND CONCLUSIONS**

Graphical models and in particular Bayesian Networks are of great importance in our data intensive era. They can help scientists learn causal interactions and are useful tools in personalized medicine where there are substantial high-throughput experiments to measure gene expressions, DNA sequence and protein abundance. In this paper we demonstrated ParaLearn’s usability in learning network structure and used the STN in human T-cells as a motivating example. ParaLearn’s accelerated and integrated solution enables scientists and physicians to use computationally intensive BN inference algorithms in real-time settings.

As discussed in Section 5, the learned graphs can be quite different for different CPD formulations. While conventional BN structure learning techniques limit their search to one or few CPD models because of computation complexities, ParaLearn can execute many of these possible CPDPs in parallel. We would like to further explore this MetaSearch approach in our future work. We will study how the different learned graphs can be validated and compared or integrated to produce a final model. We believe that the novel MetaSearch approach increases the robustness of the algorithms and makes BN modeling a more useful and reliable tool for scientists.

In Section 5.2, we provided power results for currently attainable FPGA configurations. Comparing our power results to the GPP shown in Section 5.2, we show a 2.74× power improvement (considering only the FPGA and GPP die) using our most aggressive configuration over 4 FPGAs. Furthermore, our lowest performance 2 FPGA configuration from Table 1 shows a 6.15× power advantage and maintains 1.02× the performance of the optimized GPP in Table 3. We used GPP TDP to perform power analysis meaning that our power advantage is an upper bound. In a complete system, the infrastructure around both the FPGA and GPP costs additional power. Unlike a GPP, however, ParaLearn does not require any external devices except for an Ethernet PHY, reducing off-FPGA power overhead.

In Section 5.2, we also projected how performance will increase with larger and more FPGAs. To help meet projected speedups, the new generation of Xilinx FPGAs, the Virtex-6 family, which retains in 2010, provides up to 5× the BRAMs (our limiting factor) and 3× the logic of the FPGAs used in this work. Architecturally, these new FPGAs are the same as our current FPGAs. Thus, ParaLearn can be ported to this new platform to realize the projected speedup.

As we presented in Section 5.3 the pre-processing step that currently runs on a single GPP has become the bottleneck relative to the FPGA kernel. We are currently exploring a parallel implementation of the pre-processing step for different CPD methods. The pre-processing step for discrete
data maps best to an FPGA as it involves bit operations and *popcount* functions. The continuous data pre-processing would probably benefit most from a GPU implementation because of the heavy usage of floating point matrix operations. On the FPGA, however, pre-processing would have the added benefit of eliminating the *Load Time* bottleneck revealed in Section 5.3 and 6.

In Section 6, we observed how caching, and other optimizations, have benefitted GPP implementations by several orders of magnitude over a baseline. We plan to explore different caching implementations on the FPGA in order to learn about the performance/area tradeoff in that space.

ParaLearn is both a framework and design suite for conducting interaction networks research. We have shown different approaches and performed a design space exploration at the pre-processing, order, and graph sampling steps. Through this work, our goal is to make interaction network studies tractable as technology, such as CyTof, produces data sets at orders of magnitude greater size and complexity than ever seen before.

8. ACKNOWLEDGEMENTS

This project was supported by NIH grant 130826-02. The authors would like to thank Sean Bendall for carrying out the Cytof experiments for this paper and sharing the data. We would also like to express our gratitude to Michael Linderman et. al for sharing their data and results on the optimzed GPP implementation [17].

9. ADDITIONAL AUTHORS

Eric N. Glass (eglass@stanford.edu)
Electrical Engineering Department, Stanford University, Stanford, CA, USA 94305,
Karen Sachs (karensachs@stanford.edu)
Microbiology and Immunology Department, Stanford University, Stanford, CA, USA 94305
Daniel Burke (drburke@berkeley.edu)
Electrical Engineering and Computer Sciences Department, University of California, Berkeley, CA, USA 94720
Zoey Zhou (cuizy@stanford.edu)
Electrical Engineering Department, Stanford University, Stanford, CA, USA 94305

10. REFERENCES