Research on Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML
Abstract
Many large-scale machine learning (ML) systems allow users to specify custom ML algorithms as linear algebra programs and then automatically generate efficient execution plans. In this context, opportunities to fuse chains of basic operators into composite operators are ubiquitous. These opportunities include
(1) fewer materialized intermediates,
(2) fewer scans of the input data, and
(3) the exploitation of sparsity across chains of operators.
Automatic operator fusion eliminates the need for hand-written fused operators and significantly improves performance for complex or previously unseen chains of operators. However, existing fusion heuristics struggle to find good fusion plans for complex DAGs or hybrid plans of local and distributed operators.
This paper presents an optimization framework for systematically reasoning about fusion plans that considers materialization points in DAGs, sparsity exploitation, different fusion template types, and local as well as distributed operators. It introduces
(1) candidate exploration algorithms for valid fusion plans,
(2) cost-based candidate selection, and
(3) code generation of local and distributed operators over dense, sparse, and compressed data.
Experiments in SystemML show that optimized fusion plans improve end-to-end performance by up to 21x compared to hand-written fused operators, with negligible optimization and code generation overhead.
1. Introduction
Large-scale machine learning (ML) aims to build predictive models from large data collections and commonly relies on data-parallel frameworks such as MapReduce or Spark for cost-effective parallelization on commodity hardware. Large-scale ML applications range from data-intensive, traditional classification, regression, and clustering use cases to compute-intensive matrix factorization and deep learning architectures. In this context, state-of-the-art ML systems allow data scientists to express their ML algorithms in terms of linear algebra and statistical functions, and automatically compile efficient execution plans. This high-level specification simplifies the development of custom ML algorithms and allows execution plans to adapt to different deployments as well as different data, hardware, and cluster characteristics.
Fusion Opportunities: Execution plans offer many opportunities where fused operators, i.e., composite operators for chains of basic operators, can improve performance. Figure 1 shows the major categories of fusion potential: first, fusion can eliminate materialized intermediates; second, it can reduce the number of scans over the input data; and third, it can exploit sparsity across chains of operators.
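To make these opportunity categories concrete, here is a small, hypothetical sketch (not SystemML code) contrasting an unfused evaluation of sum(X * Y + s), which materializes two intermediates and scans the data three times, with a fused single-pass evaluation:

```java
// Hypothetical illustration of fusion benefits on vectors.
public class FusionSketch {
    // Unfused: materialize T = X*Y, then T2 = T+s, then aggregate (3 passes).
    static double unfused(double[] x, double[] y, double s) {
        double[] t = new double[x.length];
        for (int i = 0; i < x.length; i++) t[i] = x[i] * y[i];  // intermediate 1
        double[] t2 = new double[x.length];
        for (int i = 0; i < x.length; i++) t2[i] = t[i] + s;    // intermediate 2
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += t2[i];
        return sum;
    }
    // Fused: a single pass, no materialized intermediates.
    static double fused(double[] x, double[] y, double s) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += x[i] * y[i] + s;
        return sum;
    }
}
```

On large inputs the fused variant also avoids allocation pressure, which is the main effect behind the first two opportunity categories above.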
Existing Work on Operator Fusion: Given the ubiquitous opportunities and their high performance impact, operator fusion and code generation have received a lot of attention in the database and high-performance computing literature. For example, SystemML uses hand-coded fused operators to eliminate intermediates and unnecessary scans, and to exploit sparsity across operators. Similarly, Cumulon and FAQ address related problems via masked operators and semi-join reductions (worst-case optimal joins), respectively. However, such approaches require custom operators and are typically limited to fixed patterns of a few operators over dense or sparse inputs. Automatic operator fusion addresses these problems via access-pattern-aware fusion and subsequent code generation.
Most existing work, however, ignores sparse and compressed inputs. Moreover, apart from node-local optimizations in BTO and micro-optimizations such as predication and loop tiling in Tupleware, existing work does not consider the optimization of fusion plans.
A Case for Optimizing Fusion Plans: As code generators become more sophisticated (e.g., by covering more operations), a principled approach to optimizing fusion plans becomes increasingly important. The key challenges are operators with overlapping access patterns and the goal of sparsity exploitation, which create a search space that requires automatic optimization over:
- Materialization points (e.g., for multiple consumers),
- Sparsity exploitation and ordering of sparse inputs,
- Decisions on fusion patterns (e.g., template types), and
- Constraints (e.g., memory budget and block sizes) and costs for local and/or distributed operations.
例如,關于物化點的決策,考慮了冗余計算與物化和需求,比較指數級的計劃。基本解決方案是啟發式的,例如fuse all或fuse no redundancy,這些啟發式算法很難為復雜的問題找到好本地和分布式操作的DAG或混合計劃的解決方案。
Contributions: We introduce a cost-based optimization framework for operator fusion plans over DAGs of linear algebra operations and describe its integration into SystemML. We formulate the optimization problem, decompose it into three phases (candidate exploration, candidate selection, and code generation), and devise novel, efficient algorithms for each phase. Our detailed technical contributions, which also reflect the structure of the paper, are:
- System Architecture: We describe the integration into SystemML in Section 2. This overview includes the compiler integration, our optimization framework, code generation plans, and their runtime integration.
- Candidate Exploration: In Section 3, we introduce a novel bottom-up algorithm for the efficient exploration of valid partial fusion plans. We also discuss our memoization data structure and basic pruning rules.
- Candidate Selection: In Section 4, we then present strategies for selecting the optimal candidate plans. We formulate the problem, describe heuristics, and introduce our novel cost-based plan selection, including its search space, cost model, and enumeration algorithm.
- Experiments: In Section 5, we then report on extensive experiments in SystemML. These cover micro benchmarks for code generation, end-to-end performance in local and distributed environments, as well as comparisons with Julia, TensorFlow, and fusion heuristics.
2. System Architecture
In this section, we describe the architecture of the code generator (codegen) and its integration into SystemML's compiler. Our cost-based optimization framework extends the SPOOF framework, which relied on ad-hoc candidate exploration and heuristic fusion. We also give an overview of code generation plans, generated operators, and their runtime integration.
2.1 Compiler Integration
SystemML provides a high-level scripting language with R-like syntax, including linear algebra, element-wise, and statistical operations for implementing ML algorithms. As shown in Figure 2, these scripts are parsed into a hierarchy of statement blocks, where blocks are delineated by control flow. For each block, we compile DAGs of high-level operators (HOPs). These DAGs are modified via static rewrites, i.e., size-independent rewrites, and inter-procedural analysis (IPA) propagates matrix dimensions and sparsity from the inputs. Based on this size information, we apply dynamic rewrites, i.e., size-dependent rewrites, and compute memory estimates for each operator. These estimates are in turn used to select local or distributed execution types and physical operators, which form the execution plan. Similar to adaptive query processing, SystemML recompiles HOP DAGs at runtime (starting from the dynamic rewrites) to adjust plans for initially unknown or changing dimensions.
Codegen Compiler Integration: As shown in Figure 2, the code generator takes a HOP DAG after dynamic rewrites as input and produces a potentially modified HOP DAG that may include both basic and fused operators. Fused operators are represented by a generic SpoofOp, which has a template type and references the generated, compiled class. These operators are still valid HOPs; hence, the remaining compilation steps (memory estimates, operator selection such as local/distributed, and runtime plan generation) apply seamlessly to fused operators as well. The code generator can also be invoked during dynamic recompilation, which is very important for many algorithms because the optimizer relies on known size information for costing and validity constraints. The actual compiler integration during initial compilation is slightly more involved: we invoke the code generator after runtime program generation, but with access to the HOP DAGs, to generate modified runtime instructions while retaining the original DAGs. This approach helps avoid the problem that incomplete fusion loses the semantics of the original operators, which would limit the fusion potential during dynamic recompilation.
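The key point that fused operators remain valid HOPs can be sketched with a toy class hierarchy (assumed names and interfaces for illustration; SystemML's actual SpoofOp and HOP classes differ):

```java
// Toy sketch: a fused operator participates in the same compiler
// interfaces (e.g., memory estimation) as any basic operator.
public class SpoofOpSketch {
    interface Hop {
        long memEstimate(long rows, long cols);
    }
    static class BasicHop implements Hop {
        public long memEstimate(long rows, long cols) {
            return rows * cols * 8L; // dense output estimate
        }
    }
    // A fused operator holds its template type and generated class,
    // but to the rest of the compiler it is just another Hop.
    static class FusedHop implements Hop {
        final String templateType;     // e.g., "Cell" or "Row"
        final Class<?> generatedClass; // compiled operator class
        FusedHop(String t, Class<?> c) { templateType = t; generatedClass = c; }
        public long memEstimate(long rows, long cols) {
            return rows * cols * 8L; // same estimate logic applies
        }
    }
}
```

This polymorphism is what lets memory estimation, operator selection, and plan generation run unchanged over fused operators.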
Codegen Architecture: At a high level, the codegen compiler consists of five well-defined compilation steps. First, during candidate exploration (Section 3), we make a bottom-up
pass over the HOP DAG to explore all valid partial fusion plans and store these plans in a memoization table, organized by HOPs. Second, during candidate selection (Section 4), we choose the optimal subset of fusion plans using a time-based cost model. Third, we construct code generation plans (CPlans, Section 2.2)—which are a backend-independent basis for code generation—for all selected fusion plans. Fourth, we then recursively expand templates for these given CPlans to generate java source code for each fused operator, compile the classes and load them into the JVM. By default, we use the fast janino compiler [89] but also support the standard javac compiler. Generated fused operators are maintained in a plan cache—which identifies equivalent CPlans via hashing—to avoid redundant code generation and compilation for existing operators. Finally, we replace covered parts of the HOP DAG by the fused operators. These separate compilation steps are very valuable for debugging without compromising on fusion potential.

2.2 Code Generation Plans

Code generation plans (CPlans) [27] are a backend-independent representation of fused operators and allow for recursive code generation. We generate code via a depth-first template expansion to ensure valid ordering. Such plans consist of CNodes, which are either template or basic operation nodes. Template nodes represent generic fused operator skeletons that have a specific data binding and contain a DAG of basic operations that encodes the data flow.

Example Expressions: We illustrate CPlans for two typical expressions with high performance impact of fusion. The first expression is part of an inner-loop update rule of ALS-CG (alternating least squares via conjugate gradient):
```java
public final class TMP10 extends SpoofCellwise {
  public TMP10() {super(CellType.NO_AGG, null, false);}
  protected double genexec(double a, SideInput[] b,
      double[] scalars, …, int rix, int cix) {
    double TMP5 = getValue(b[0], n, rix, cix);
    double TMP6 = a * 1.0E-6;
    double TMP7 = getValue(b[1], rix);
    double TMP8 = TMP6 * TMP7;
    double TMP9 = TMP5 + TMP8;
    return TMP9;
  }
}
```

Finally, Figure 3 shows the row-wise CPlan of Expression (2). This fused operator requires only a single pass over X by exploiting temporal locality and it avoids six large intermediates. The memory for row intermediates is managed via a preallocated ring buffer per thread (here of size 5).

```java
public final class TMP25 extends SpoofRowwise {
  public TMP25() {super(RowType.COL_AGG_B1_T, true, 5);}
  protected void genexecDense(double[] a, int ai,
      SideInput[] b, double[] c, …, int len) {
    double[] TMP11 = getVector(b[1].vals(rix), …);
    double[] TMP12 = vectMatMult(a, b[0].vals(rix), …);
    double[] TMP13 = vectMult(TMP11, TMP12, 0, 0, …);
    double TMP14 = vectSum(TMP13, 0, TMP13.length);
    double[] TMP15 = vectMult(TMP11, TMP14, 0, …);
    double[] TMP16 = vectMinus(TMP13, TMP15, 0, 0, …);
    vectOuterMultAdd(a, TMP16, c, ai, 0, 0, …);
  }
  protected void genexecSparse(double[] avals, int[] aix,
      int ai, SideInput[] b, …, int len) {…}
}
```
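The generated operators above override only a genexec method, while a hand-coded skeleton drives the iteration. A stripped-down sketch of this split, with assumed, simplified names (not the actual SpoofCellwise API):

```java
// Toy sketch of the skeleton/genexec split for a cell-wise template.
public class CellSkeletonSketch {
    static abstract class CellOp {
        // Hand-coded skeleton: owns the data access and iteration,
        // calling the generated genexec for each cell value.
        double[][] execute(double[][] in) {
            double[][] out = new double[in.length][in[0].length];
            for (int i = 0; i < in.length; i++)
                for (int j = 0; j < in[i].length; j++)
                    out[i][j] = genexec(in[i][j], i, j);
            return out;
        }
        protected abstract double genexec(double a, int rix, int cix);
    }
    // A "generated" operator only overrides genexec, here a*2+1.
    static class TmpOp extends CellOp {
        protected double genexec(double a, int rix, int cix) {
            return a * 2 + 1;
        }
    }
}
```

Keeping the iteration in the shared skeleton is what lets one implementation handle dense, sparse, and compressed inputs, as the runtime integration below explains.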
Runtime Integration: Templates refer to generic skeletons of fused operators, which are inspired by algorithmic skeletons [21]. Figure 4 exemplifies this runtime integration using the Cell template. Unlike existing work [9, 22, 48], we made the conscious design decision not to generate the data access into the fused operators. Instead, the hand-coded skeleton implements the data access—depending on its sparse-safeness over cells or non-zero values—of dense, sparse, or compressed [28] matrices and calls an abstract (virtual) genexec method for each value. Generated operators inherit this skeleton and only override the specific genexec, which yields very lean yet efficient operators. The skeleton also handles multi-threading, cache blocking, memory management, and pseudo-sparse-safe aggregations. Sharing common skeletons and vector primitives among fused operators can also reduce the instruction footprint and thus, L1 instruction cache misses, which is a known bottleneck in OLTP [85] and scale-out workloads [30].

3. CANDIDATE EXPLORATION

The exploration of candidate fusion plans aims to identify all valid partial fusion plans to provide a common input for different plan selection policies and simplify optimization. However, the exponential search space prohibits the enumeration of all possible plans. Instead, we enumerate partial fusion plans per operator, which represent local fusion decisions. We describe (1) the representations of partial fusion plans in our central memoization table, and (2) an efficient algorithm for populating this memo table in a single pass over the HOP DAG, including pruning techniques.

3.1 Memoization Table

Our memoization (memo) table consists of a set of groups, where each group represents the output of an operator in the HOP DAG, i.e., a logical subexpression. Each group is identified by the operator ID, has access to its operator meta data, and contains a set of valid partial fusion plans for this operator.
A partial fusion plan is called a memo table entry, and can reference other groups to represent fusion decisions. This structure is similar to groups and group expressions in the Cascades Optimization Framework [16, 34, 83], but we use it merely as a compact representation of fusion plans, which only includes operators that are amenable to fusion.

Memo Table Entries: A memo table entry is a tuple (type, {i1, …, ik}, closed), consisting of a template type (as introduced in Table 1), a list of input references, and a closed type. The list of inputs corresponds to HOP inputs (i.e., data dependencies) by position, and each input is either a group reference or -1, which indicates fusion or a materialized intermediate, respectively. A reference from an entry to a group implies that the group contains at least one compatible fusion plan. Finally, the close status can be open valid, open invalid (i.e., an invalid entry point), closed valid, or closed invalid.

Example: We use Expression (2) from Section 2.2 to illustrate the structure of our memo table. Figure 5 shows the HOP DAG and the related memo table after candidate exploration and pruning (described in Section 3.2). All eight operators are represented by groups in the memo table. Group 11 refers to the final matrix multiplication (binary aggregate ba(+*)), and consists of three memo table entries of type Row. These entries encode fusion alternatives: (1) fuse right R(-1,9), (2) fuse left R(10,-1), and (3) fuse both R(10,9). Instead of encoding all alternative subplans along inputs, we only reference the input groups. This memo table then allows for simple costing and fusion by traversing the HOP DAG top down, probing for fusion plans, traversing group references, and determining the input HOPs, from where this process repeats until we reach the leaf HOPs.

3.2 Open-Fuse-Merge-Close Exploration

Given a HOP DAG and an empty memo table, we aim to efficiently discover all valid partial fusion plans.
We introduce a bottom-up algorithm that is template-oblivious and populates the memo table in a single pass over the DAG.

OFMC Template Abstraction: As the basis of our candidate exploration algorithm, we define the open-fuse-merge-close (OFMC) template abstraction:

- open(Hop h): Indicates if a new fused operator of this template can be started at HOP h, covering its operation and reading materialized inputs. For example, the condition of an Outer template is an outer-product-like matrix multiplication with size constraints.
- fuse(Hop h, Hop in): Indicates if an open fused operator (of this template) at the input HOP in can be expanded to its consumer HOP h. For example, a Cell template can fuse valid unary, binary, or ternary operations, valid aggregations, and inner products.
- merge(Hop h, Hop in): Indicates if an open fused operator (of this template) at the consumer HOP h can be expanded to its input HOP in, i.e., if it can merge with fused operators at the input. An example is the merge of Cell templates into Row templates.
- close(Hop h): Indicates the close status of the template after the HOP h and its validity. For example, any aggregation closes a Cell template (as valid or invalid), whereas only column-wise or full aggregations close a Row template. Outer templates are also validated for the existence of sparsity exploiting operators.

The major benefit of this OFMC abstraction is the separation of template-specific conditions from the HOP DAG traversal and the population of the memo table.

The OFMC Algorithm: Based on the memo table and OFMC abstraction, we introduce the OFMC exploration algorithm (Algorithm 1). This algorithm is called recursively, in a depth-first manner, to populate the memo table bottom-up. First, we check for already processed operators—indicated by existing groups or marked operators—(lines 1-3) to avoid redundant exploration if nodes are reachable over multiple paths.
Second, we recursively explore all input operators (lines 4-6) because these input data dependencies constitute potential fusion references. Third, we explore all templates for valid opening conditions at the current operator (lines 7-10). In case of a valid opening condition, we add this memo entry and enumerate merge plans with createPlans. This merging is important to cover scenarios such as X⊤(y ⊙ z), where the matrix-vector multiplication with X opens a Row template, which can also merge Cell templates over y ⊙ z. Fourth, we fuse and merge existing partial fusion plans from the operator inputs to the current operator (lines 11-15). This step entails iterating over all distinct template types of all inputs and probing pair-wise fusion conditions. In case of a valid condition, we again call createPlans, which constructs a memo table entry for the fused operator, and enumerates all local plan combinations for inputs that satisfy the pair-wise merge condition. This entire plan set is then added to the group of the current operator. Fifth, we check all group entries for closing conditions (lines 16-20). Entries which satisfy the closing condition of their templates are either removed (invalid) or marked as closed (valid), while all other entries remain open.

Pruning Techniques: Finally, we prune duplicates and valid closed entries without group references (line 22). For example, the group 7 ua(R+) in Figure 5 does not contain C(-1) because a rowSums closes the Cell template, which would cover only a single operator. In addition, there are advanced techniques that exploit characteristics of candidate selection policies. For instance, a policy that only considers materialization points with multiple consumers allows pruning dominated plans. A memo entry is dominated if all its references point to operators that are consumed once, and there is another entry (of the same type) whose reference list is a strict superset. For example, in Figure 5, R(10,9) dominates R(10,-1) but R(6,8) does not dominate R(-1,8) because group 6 has multiple consumers. However, we prune dominated plans only for selection heuristics.

Algorithm Analysis: Overall, our algorithm has linear time and space complexity in the number of operators. Memoization ensures that we visit each operator exactly once and the OFMC conditions apply only locally to an operator and its inputs.
These conditions still have access to the HOPs and thus the entire DAG, but this flexibility is only exploited in rare exceptions such as recognizing t(cumsum(t(X))) as a row operation.
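The linear-time claim hinges on the memoized, depth-first traversal. A stripped-down sketch (hypothetical Hop type, with the actual open/fuse/merge/close logic elided) shows that in a diamond-shaped DAG, an operator reachable over multiple paths is still processed only once:

```java
import java.util.*;

// Toy sketch of the memoized, bottom-up DAG traversal.
public class OfmcTraversalSketch {
    // Simplified DAG node: an id plus its input dependencies.
    record Hop(int id, List<Hop> inputs) {}

    static int visits = 0;

    // Depth-first exploration; the memo set ensures each operator
    // is processed exactly once even with multiple consumers.
    static void explore(Hop h, Set<Integer> memo) {
        if (!memo.add(h.id())) return;   // already explored
        for (Hop in : h.inputs())
            explore(in, memo);           // inputs first (bottom-up)
        visits++;                        // open/fuse/merge/close would go here
    }
}
```

Because each operator is visited once and the template conditions are local, the overall work stays linear in the number of operators.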
4. CANDIDATE SELECTION

Given a memo table of partial fusion plans, candidate selection aims to choose the optimal subset of non-conflicting partial fusion plans. We describe the problem and cost model, and introduce our cost-based enumeration algorithm MPSkipEnum. The basic ideas are to (1) partition the set of partial fusion plans into independent groups, (2) restrict the search per group to interesting materialization points, (3) linearize the resulting exponential search space, and (4) enumerate and cost plans while skipping search space areas that can be pruned based on cost or structure.

4.1 Problem Formulation and Heuristics

Overall, we aim to find the cost-optimal set of fusion plans, with the optimization scope of a single HOP DAG at a time, for hybrid runtime plans that might include single-node and distributed operations. We define this problem as follows:
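Ideas (3) and (4) can be sketched with a toy enumerator: encode one boolean per interesting materialization point, linearize the assignments, and cost each one. The cost model here is hypothetical, and the real MPSkipEnum additionally skips prunable areas of the linearized space rather than costing every assignment:

```java
// Toy sketch of linearized enumeration over materialization points.
public class EnumSketch {
    // Hypothetical plan cost: a materialized point pays its IO cost,
    // a fused point pays its (possibly redundant) recompute cost.
    static double cost(boolean[] materialize, double[] io, double[] recompute) {
        double c = 0;
        for (int i = 0; i < materialize.length; i++)
            c += materialize[i] ? io[i] : recompute[i];
        return c;
    }

    // Linearized search space: assignment j encodes one boolean per point.
    static long bestAssignment(double[] io, double[] recompute) {
        int n = io.length;
        double best = Double.MAX_VALUE;
        long bestJ = -1;
        for (long j = 0; j < (1L << n); j++) {
            boolean[] m = new boolean[n];
            for (int i = 0; i < n; i++) m[i] = ((j >> i) & 1) == 1;
            double c = cost(m, io, recompute);
            if (c < best) { best = c; bestJ = j; }
        }
        return bestJ;
    }
}
```

Restricting the points per group (idea 2) keeps n small enough that such enumeration, combined with skipping, remains tractable.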
References:
On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML. https://arxiv.org/pdf/1801.00829.pdf