CS252 Graduate Computer Architecture
Lecture 11: Vector Processing
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
http://www-inst.eecs.berkeley.edu/~cs252
2/28/2007 cs252-S07, Lecture 11

Review: Simultaneous Multi-threading ...
[Figure: issue slots used in cycles 1 to 9 across 8 units (M, M, FX, FX, FP, FP, BR, CC), shown for "One thread, 8 units" and "Two threads, 8 units"]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Review: Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Thread 1 to Thread 5, Idle slot]

Design Challenges in SMT
- Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  - Does a preferred-thread approach sacrifice neither throughput nor single-thread performance?
  - Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
- Larger register file needed to hold multiple contexts
- Clock cycle time, especially in:
  - Instruction issue: more candidate instructions need to be considered
  - Instruction completion: choosing which instructions to commit may be challenging
- Ensuring that cache and TLB conflicts generated by SMT do not degrade performance

Power 4
- Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

Power 4 and Power 5
[Figure: Power 4 and Power 5 pipelines; annotations: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets)]

Power 5 data flow ...
- Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Power 5 thread performance ...
- Relative priority of each thread controllable in hardware.
- For balanced operation, both threads run slower than if they owned the machine.

Changes in Power 5 to support SMT
- Increased associativity of L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Initial Performance of SMT
- Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26^2 runs): speed-ups from 0.90 to 1.58; average was 1.20
- Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
- Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  - Most gained some
  - Fl.Pt. apps had most cache conflicts and least gains

Head to Head ILP competition
Processor | Micro architecture | Fetch / Issue / Execute | FU | Clock Rate (GHz) | Transistors | Die size | Power
Intel Pentium 4 Extreme | Speculative dynamically scheduled; deeply pipelined; SMT | 3/3/4 | 7 int. 1 FP | 3.8 | 125 M | 122 mm^2 | 115 W
AMD Athlon 64 FX-57 | Speculative dynamically scheduled | 3/3/4 | 6 int. 3 FP | 2.8 | 114 M | 115 mm^2 | 104 W
IBM Power5 (1 CPU only) | Speculative dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8 | 6 int. 2 FP | 1.9 | 200 M | 300 mm^2 (est.) | 80 W (est.)
Intel Itanium 2 | Statically scheduled VLIW-style | 6/5/11 | 9 int. 2 FP | 1.6 | 592 M | 423 mm^2 | 130 W

Performance on SPECint2000
[Figure: bar chart of SPEC Ratio per SPECint2000 benchmark (bzip2, twolf, and others) for Itanium 2, Pentium 4, AMD Athlon 64, Power 5]

Performance on SPECfp2000
[Figure: bar chart of SPEC Ratio (0 to 14000) per benchmark (wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi) for Itanium 2, Pentium 4, AMD Athlon 64, Power 5]

Normalized Performance: Efficiency
[Figure: bar chart (0 to 35) of SPECInt / M Transistors, SPECFP / M Transistors, SPECInt / mm^2, SPECFP / mm^2, SPECInt / Watt, SPECFP / Watt for Itanium 2, Pentium 4, AMD Athlon 64, POWER 5]
Rank      | Itanium 2 | Pentium 4 | Athlon | Power 5
Int/Trans | 4 | 2 | 1 | 3
FP/Trans  | 4 | 2 | 1 | 3
Int/area  | 4 | 2 | 1 | 3
FP/area   | 4 | 2 | 1 | 3
Int/Watt  | 4 | 3 | 1 | 2
FP/Watt   | 2 | 4 | 3 | 1

No Silver Bullet for ILP
- No obvious overall leader in performance
- The AMD Athlon leads on SPECInt performance, followed by the Pentium 4, Itanium 2, and Power5
- Itanium 2 and Power5, which perform similarly on SPECFP, clearly dominate the Athlon and Pentium 4 on SPECFP
- Itanium 2 is the most inefficient processor both for Fl. Pt. and integer code for all but one efficiency measure (SPECFP/Watt)
- Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency
- IBM Power5 is the most effective user of energy on SPECFP and essentially tied on SPECINT

Limits to ILP
- Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to
  - issue 3 or 4 data memory accesses per cycle,
  - resolve 2 or 3 branches per cycle,
  - rename and access more than 20 registers per cycle, and
  - fetch 12 to 24 instructions per cycle.
- The complexity of implementing these capabilities is likely to mean sacrifices in the maximum clock rate
  - E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

Limits to ILP
- Most techniques for increasing performance increase power consumption
- The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
- Multiple-issue processor techniques are all energy inefficient:
  1. Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
  2. Growing gap between peak issue rates and sustained performance
- Number of transistors switching = f(peak issue rate), and performance = f(sustained rate); the growing gap between peak and sustained performance means increasing energy per unit of performance

Administrivia
- Exam: Wednesday 3/14. Location: TBA. Time: 5:30 - 8:30
  - This info is on the Lecture page (has been)
  - Meet at LaVal's afterwards for Pizza and Beverages
- CS252 Project proposal due by Monday 3/5
  - Need two people/project (although can justify three for right project)
  - Complete Research project in 8 weeks
    - Typically investigate hypothesis by building an artifact and measuring it against a base case
    - Generate conference-length paper/give oral presentation
    - Often, can lead to an actual publication.

Supercomputers
Definition of a supercomputer:
- Fastest machine in world at given task
- A device to turn a compute-bound problem into an I/O bound problem
- Any machine costing $30M+
- Any machine designed by Seymour Cray
CDC6600 (Cray, 1964) regarded as first supercomputer

Supercomputer Applications
Typical application areas
- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)
All involve huge computations on large data sets
In 70s-80s, Supercomputer ≡ Vector Machine

Vector Supercomputers
Epitomized by Cray-1, 1976:
Scalar Unit + Vector Extensions
- Load/Store Architecture
- Vector Registers
- Vector Instructions
- Hardwired Control
- Highly Pipelined Functional Units
- Interleaved Memory System
- No Data Caches
- No Virtual Memory

Cray-1 (1976)
[Figure: Cray-1 block diagram]
- Single Port Memory: 16 banks of 64-bit words + 8-bit SECDED
  - 80MW/sec data load/store
  - 320MW/sec instruction buffer refill
- 64 Element Vector Registers V0 to V7, with V. Mask and V. Length
- Scalar registers S0 to S7, backed by 64 T Regs
- Address registers A0 to A7, backed by 64 B Regs
- 4 Instruction Buffers (64-bitx16), with NIP, CIP, LIP
- Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul
- memory bank cycle 50 ns, processor cycle 12.5 ns (80MHz)

Vector Programming Model
[Figure: Scalar Registers (..., r15) and Vector Registers (..., v15)]

    DADDIU R2, 8
    DADDIU R3, 8
    DSUBIU R4, 1
    BNEZ R4, loop

Vector Instruction Set Advantages
- Compact
  - one short instruction encodes N operations
- Expressive, tells hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in the same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
- Scalable
  - can run same object code on more parallel pipelines or lanes

Vector Arithmetic Execution
- Use deep pipeline (=> fast clock) to execute element operations
- Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: six stage multiply pipeline computing V3 <- V1 * V2]

Vector Memory Subsystem
- Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency
- Bank busy time: Cycles between accesses to same bank
[Figure: address generator adds Base and Stride to select among memory banks 0 to F, which feed the vector registers]

Vector Instruction Execution
    ADDV C,A,B
[Figure: execution using one pipelined functional unit compared with execution using four pipelined functional units]
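
The bank busy time on the Vector Memory Subsystem slide decides which strides stall. The following is a minimal C sketch under a deliberately simple model that is an assumption of this example, not a description of the Cray-1 hardware: element i maps to bank (base + i*stride) mod 16, one address issues per cycle, and an access waits until its bank has been idle for 4 cycles. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_BANKS = 16, BANK_BUSY = 4 };  /* bank count and busy time from the slide */

/* Cycles needed to issue n strided word accesses under the simplified model:
   one access per cycle unless the target bank is still busy. */
long strided_access_cycles(size_t base, size_t stride, size_t n)
{
    long ready[NUM_BANKS] = {0};  /* cycle at which each bank is free again */
    long cycle = 0;

    for (size_t i = 0; i < n; i++) {
        size_t bank = (base + i * stride) % NUM_BANKS;
        if (cycle < ready[bank])
            cycle = ready[bank];          /* stall until the bank recovers */
        ready[bank] = cycle + BANK_BUSY;  /* bank stays busy for BANK_BUSY cycles */
        cycle++;                          /* next address issues one cycle later */
    }
    return cycle;
}
```

Under this model, 64 unit-stride accesses issue in 64 cycles, a stride of 8 alternates between two banks and takes 126 cycles, and a stride of 16 sends every access to one bank and takes 253 cycles.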

Vector Unit Structure
[Figure: four lanes, each containing a functional unit and a slice of the vector registers, all connected to the memory subsystem]
- Vector register elements are divided among the lanes:
  - Elements 0, 4, 8, ...
  - Elements 1, 5, 9, ...
  - Elements 2, 6, 10, ...
  - Elements 3, 7, 11, ...

T0 Vector Microprocessor (1995)
- Vector register elements striped over lanes
[Figure: T0 with a lane marked]
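
The striping of elements over lanes can be written down directly. This is a small C sketch with hypothetical function names; it assumes round-robin striping as in the four-lane figure and ignores pipeline latency and dead time.

```c
#include <assert.h>
#include <stddef.h>

/* Lane that holds element i when elements are striped round-robin. */
size_t lane_of(size_t i, size_t lanes)
{
    assert(lanes > 0);
    return i % lanes;
}

/* Cycles needed to issue all elements of one vector instruction of length vl. */
size_t issue_cycles(size_t vl, size_t lanes)
{
    assert(lanes > 0);
    return (vl + lanes - 1) / lanes;  /* ceiling of vl / lanes */
}

/* C = A + B, visiting elements in the order the lanes would process them:
   on each cycle, every lane handles the one element it owns. */
void addv_by_lane(double *c, const double *a, const double *b,
                  size_t vl, size_t lanes)
{
    for (size_t cycle = 0; cycle < issue_cycles(vl, lanes); cycle++) {
        for (size_t lane = 0; lane < lanes; lane++) {
            size_t i = cycle * lanes + lane;  /* element owned by this lane */
            if (i < vl)
                c[i] = a[i] + b[i];
        }
    }
}
```

With four lanes, element 9 lives in lane 1 and a 64-element ADDV issues in 16 cycles instead of 64.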

Vector Memory-Memory versus Vector Register Machines
- Vector memory-memory instructions hold all vector operands in main memory
- The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
- Cray-1 ('76) was first vector register machine

Vector Memory-Memory Code Example
Source Code
    for (i=0; i
Vector Register Code
    LV V1, A
    LV V2, B
    ADDV V3, V1, V2
    SV V3, C
    SUBV V4, V1, V2
    SV V4, D

Vector Memory-Memory vs. Vector Register Machines
- Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
  - All operands must be read in and out of memory
- VMMAs make it difficult to overlap execution of multiple vector operations, why?
  - Must check dependencies on memory addresses
- VMMAs incur greater startup latency
  - Scalar code was faster on CDC Star-100 for vectors < 100 elements
  - For Cray-1, vector/scalar breakeven point was around 2 elements
- Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures
(we ignore vector memory-memory from now on)
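
The bandwidth argument can be made concrete by counting memory words. The C sketch below is illustrative only: the scalar loop is inferred from the vector register code (C = A + B, D = A - B), the loop bound n is an assumption, and the counting functions assume that a memory-memory machine issues one vector add and one vector subtract that each read both sources from memory.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the computation performed by the vector register code. */
void add_sub(double *c, double *d, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        d[i] = a[i] - b[i];
    }
}

/* Memory-memory style: two vector instructions, each reading A and B
   (2n words) and writing its result (n words). */
size_t memmem_words(size_t n) { return 2 * (2 * n + n); }

/* Vector register style: LV A and LV B once (2n words),
   then SV C and SV D (2n words). */
size_t vreg_words(size_t n) { return 2 * n + 2 * n; }
```

For n = 64 the counts are 384 words against 256: the register machine reuses V1 and V2 for the subtract instead of reading A and B again.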

Automatic Code Vectorization
    for (i=0; i < N; i++)
        C[i] = A[i] + B[i];
[Figure: timeline comparing Scalar Sequential Code (Iter. 1: load, load, add, store; then Iter. 2: load, load, add, store) with Vectorized Code (each vector instruction performs the load, load, add, or store for Iter. 1, Iter. 2, ... together)]
- Vectorization is a massive compile-time reordering of operation sequencing
  - requires extensive loop dependence analysis

Vector Stripmining
- Problem: Vector registers have finite length
- Solution: Break loops into pieces that fit into vector registers, "Stripmining"
    for (i=0; i<N; i++)
        C[i] = A[i]+B[i];
[Figure: arrays split into a remainder piece followed by pieces of 64 elements]
          ANDI R1, N, 63     # N mod 64
          MTC1 VLR, R1       # Do remainder
    loop:
          LV V1, RA
          DSLL R2, R1, 3     # Multiply by 8
          DADDU RA, RA, R2   # Bump pointer
          LV V2, RB
          DADDU RB, RB, R2
          ADDV.D V3, V1, V2
          SV V3, RC
          DADDU RC, RC, R2
          DSUBU N, N, R1     # Subtract elements
          LI R1, 64
          MTC1 VLR, R1       # Reset full length
          BGTZ N, loop       # Any more to do?

Vector Instruction Parallelism
- Can overlap execution of multiple vector instructions
  - example machine has 32 elements per vector register and 8 lanes
[Figure: Load Unit, Multiply Unit, and Add Unit timelines with overlapping load, mul, and add instructions, one issued per cycle]
- Complete 24 operations/cycle while issuing 1 short instruction/cycle

Vector Chaining
- Vector version of register bypassing
  - introduced with Cray-1
    LV   v1
    MULV v3,v1,v2
    ADDV v5, v3, v4
[Figure: Memory feeds the Load Unit, which writes V1; V1 chains into the Mult. unit, which writes V3; V3 chains into the Add unit, which writes V5]

Vector Chaining Advantage
- Without chaining, must wait for last element of result to be written before starting dependent instruction
  [Figure: Load, Mul, and Add run one after another in time]
- With chaining, can start dependent instruction as soon as first result appears
  [Figure: Load, Mul, and Add overlap in time]
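
The control flow of the Vector Stripmining assembly can be mirrored in C. This is a sketch, not generated compiler output: `vadd` is a hypothetical stand-in for one LV/LV/ADDV.D/SV group at the current vector length, and `MVL` is the 64-element register length from the slide.

```c
#include <assert.h>
#include <stddef.h>

enum { MVL = 64 };  /* maximum vector length (64-element vector registers) */

/* Stand-in for one vector add executed with vector length vl (vl <= MVL). */
static void vadd(double *c, const double *a, const double *b, size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        c[i] = a[i] + b[i];
}

/* Stripmined C = A + B: the remainder piece (n mod 64) goes first, then
   full 64-element pieces. Returns the number of strips executed. */
size_t stripmined_add(double *c, const double *a, const double *b, size_t n)
{
    size_t strips = 0;
    size_t vl = n % MVL;   /* ANDI R1, N, 63: length of the remainder strip */

    while (n > 0) {
        vadd(c, a, b, vl);
        a += vl;           /* bump pointers past the elements just done */
        b += vl;
        c += vl;
        n -= vl;           /* subtract elements */
        vl = MVL;          /* reset to full length for remaining strips */
        strips++;
    }
    return strips;
}
```

As in the assembly, when n is a nonzero multiple of 64 the first strip has length zero and does no work; unlike the assembly's bottom-tested loop, this sketch skips the loop entirely when n is 0.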

Vector Startup
- Two components of vector startup penalty
  - functional unit latency (time through pipeline)
  - dead time or recovery time (time before another vector instruction can start down pipeline)
[Figure: pipeline diagram in which successive elements pass through stages R X X X W; it marks the Functional Unit Latency, the First Vector Instruction, the Dead Time, and the Second Vector Instruction]
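
The cost of dead time is easy to put in numbers. The C sketch below assumes a simple model chosen for illustration: a vector instruction keeps a functional unit active for ceil(VL / lanes) cycles, and each instruction is followed by a fixed number of dead cycles. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of cycles spent doing useful work when every vector instruction
   of length vl is followed by dead_cycles of dead time. */
double vector_efficiency(size_t vl, size_t lanes, size_t dead_cycles)
{
    assert(lanes > 0 && vl > 0);
    size_t active = (vl + lanes - 1) / lanes;  /* cycles spent issuing elements */
    return (double)active / (double)(active + dead_cycles);
}
```

For example, 128-element vectors on two lanes with 4 cycles of dead time give 64 / 68, about 94%, while any machine with no dead time reaches 100% regardless of vector length.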

Dead Time and Short Vectors
[Figure: timelines showing "No dead time" and "4 cycles dead time" followed by "64 cycles active"]
- T0, Eight lanes
  - No dead time
  - 100% efficiency with 8 element vectors
- Cray C90, Two lanes
  - 4 cycle dead time
  - Maximum efficiency 94% with 128 element vectors

Vector Scatter/Gather
Want to vectorize loops with indirect accesses:
    for (i=0; i
                         # Load B vector
    ADDV.D vA, vB, vC    # Do add
    SV vA, rA            # Store result

Vector Scatter/Gather
Scatter example:
    for (i=0; i
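
Because the loops on the scatter/gather slides are truncated in this copy, the following C sketch does not try to reproduce them. It only illustrates, with hypothetical function names, what an indexed load (gather) and an indexed store (scatter) do with an index vector.

```c
#include <assert.h>
#include <stddef.h>

/* Gather (indexed load): dst[i] = src[idx[i]] for each of n elements. */
void gather(double *dst, const double *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

/* Scatter (indexed store): dst[idx[i]] = src[i] for each of n elements.
   If idx contains duplicates, the last write to a location wins. */
void scatter(double *dst, const double *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}
```

A loop with an indirect read can then be expressed as a gather of the indexed operand followed by ordinary unit-stride vector operations, which matches the add and store steps that survive in the assembly above.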

Problem: Want to vectorize loops with conditional code:
    for (i=0; i<N; i++)
        if (A[i]>0) then A[i] = B[i];
Solution: Add vector mask (or flag) registers
- vector version of predicate registers, 1 bit per element
and maskable vector instructions
- vector operation becomes NOP at elements where mask bit is clear
Code example:
    CVM               # Turn on all elements
    LV vA, rA         # Load entire A vector
    SGTVS.D vA, F0    # Set bits in mask register where A>0
    LV vA, rB         # Load B vector into A under mask
    SV vA, rA         # Store A back to memory under mask

Masked Vector Instructions
- Simple Implementation
  - execute all N operations, turn off result writeback according to mask
- Density-Time Implementation
  - scan mask vector and only execute elements with non-zero masks
[Figure: element pipelines for both implementations, showing mask bits M, operands A and B, results C, the Write Enable, and the Write data port]

Compress/Expand Operations
- Compress packs non-masked elements from one vector register contiguously at start of destination vector register
  - population count of mask vector gives packed vector length
- Expand performs inverse operation
[Figure: Compress and Expand applied to a vector with mask bits M=1 and M=0]
- Used for density-time conditionals and also for general selection operations

Vector Reductions
Problem: Loop-carried dependence on reduction variables
    sum = 0;
    for (i=0; i
Surviving annotations from the rearranged code:
- Vector of VL partial sums
- Stripmine VL-sized chunks
- Vector sum
- vector register
- Halve vector length
- Halve no. of partials

Novel Matrix Multiply Solution
Consider the following:
    /* Multiply a[m][k] * b[k][n] to get c[m][n] */
    for (i=1; i
    for (t=1; t
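
The annotations that survive on the Vector Reductions slide (a vector of VL partial sums, stripmined chunks, then repeated halving) suggest the following C sketch. It is a reconstruction under assumptions, not the slide's code: `VL` is an illustrative vector length that must be a power of two, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 8 };  /* illustrative vector length; must be a power of two */

/* Sum n elements by keeping VL partial sums, then combining them pairwise. */
double vector_style_sum(const double *a, size_t n)
{
    double partial[VL] = {0};            /* vector of VL partial sums */
    size_t i = 0;

    for (; i + VL <= n; i += VL)         /* stripmine VL-sized chunks */
        for (size_t j = 0; j < VL; j++)
            partial[j] += a[i + j];      /* one vector add per chunk */

    for (size_t j = 0; i < n; i++, j++)  /* leftover elements, fewer than VL */
        partial[j] += a[i];

    for (size_t vl = VL / 2; vl >= 1; vl /= 2)  /* halve vector length */
        for (size_t j = 0; j < vl; j++)
            partial[j] += partial[j + vl];      /* halve no. of partials */

    return partial[0];
}
```

The loop-carried dependence on a single sum is replaced by VL independent accumulations. Because this reorders the additions, floating-point results can differ slightly from the sequential loop, which is why such reassociation is only done when it is permitted.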

Vector for Multimedia?
- Intel MMX: 57 additional 80x86 instructions (1st since 386)
  - similar to Intel 860, Mot. 88110, HP PA-7100LC, UltraSPARC
- 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
  - reuse 8 FP registers (FP and MMX cannot mix)
- short vector: load, add, store 8 8-bit operands
  [Figure: packed add of two 64-bit operands]
- Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  - use in drivers or added to library routines; no compiler

MMX Instructions
- Move 32b, 64b
- Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  - opt. signed/unsigned saturate (set to max) if overflow
- Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
- Multiply, Multiply-Add in parallel: 4 16b
- Compare = , > in parallel: 8 8b, 4 16b, 2 32b
  - sets field to 0s (false) or 1s (true); removes branches
- Pack/Unpack
  - Convert 32b <-> 16b, 16b <-> 8b
  - Pack saturates (set to max) if number is too large
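
Two of the operations listed above, saturating add and compare, can be emulated in portable C to show what happens in each 8-bit field. This sketch does not use MMX intrinsics; the function names are hypothetical and the 64-bit integer simply stands in for a packed register.

```c
#include <assert.h>
#include <stdint.h>

/* Add eight unsigned 8-bit fields packed in 64 bits, saturating each at 255. */
uint64_t packed_add_u8_sat(uint64_t x, uint64_t y)
{
    uint64_t result = 0;
    for (int field = 0; field < 8; field++) {
        unsigned a = (unsigned)((x >> (8 * field)) & 0xFF);
        unsigned b = (unsigned)((y >> (8 * field)) & 0xFF);
        unsigned s = a + b;
        if (s > 0xFF)
            s = 0xFF;                          /* saturate (set to max) on overflow */
        result |= (uint64_t)s << (8 * field);
    }
    return result;
}

/* Compare eight 8-bit fields for equality: a field becomes all 1s (true)
   or all 0s (false), so the result can be used as a mask without branches. */
uint64_t packed_cmpeq_u8(uint64_t x, uint64_t y)
{
    uint64_t result = 0;
    for (int field = 0; field < 8; field++) {
        uint64_t a = (x >> (8 * field)) & 0xFF;
        uint64_t b = (y >> (8 * field)) & 0xFF;
        if (a == b)
            result |= (uint64_t)0xFF << (8 * field);
    }
    return result;
}
```

A real packed instruction performs all eight field operations at once; the loop here only spells out the per-field behavior.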

Vector Summary
- Vector is alternative model for exploiting ILP
- If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than Out-of-order machines
- Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
- Fundamental design issue is memory bandwidth
  - With virtual address translation and caching
- Will multimedia popularity revive vector architectures?