Table of contents for Petascale computing : algorithms and applications / editor, David A. Bader.

Note: Contents data are machine generated from pre-publication information provided by the publisher. Contents may differ from the printed book, be incomplete, or contain coding errors.


Contents
1 Performance Characteristics of Potential Petascale Scientific Applications 1
Leonid Oliker, John Shalf, Jonathan Carter, Andrew Canning, Shoaib
Kamil, Michael Lijewski and Stephane Ethier
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Target Architectures . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Scientific Application Overview . . . . . . . . . . . . . . . . . 6
1.4 GTC: Particle-in-Cell Magnetic Fusion . . . . . . . . . . . . 7
1.4.1 Experimental results . . . . . . . . . . . . . . . . . . . 9
1.5 ELBM3D: Lattice Boltzmann Fluid Dynamics . . . . . . . . . 11
1.5.1 Experimental results . . . . . . . . . . . . . . . . . . . 13
1.6 Cactus: General Relativity Astrophysics . . . . . . . . . . . 14
1.6.1 Experimental results . . . . . . . . . . . . . . . . . . . 14
1.7 PARATEC: First Principles Materials Science . . . . . . . . 16
1.7.1 Experimental results . . . . . . . . . . . . . . . . . . . 18
1.8 HyperCLaw: Hyperbolic AMR Gas Dynamics . . . . . . . . 19
1.8.1 Experimental results . . . . . . . . . . . . . . . . . . . 21
1.9 Summary and Conclusions . . . . . . . . . . . . . . . . . . . 23
2 Petascale Computing: Impact on Future NASA Missions 29
Rupak Biswas, Michael Aftosmis, Cetin Kiris, and Bo-Wen Shen
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 The Columbia Supercomputer . . . . . . . . . . . . . . . . . 30
2.3 Aerospace Analysis and Design . . . . . . . . . . . . . . . . . 31
2.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.3 Benefits of Petascale Computing to NASA . . . . . . . 34
2.4 Propulsion Subsystem Analysis . . . . . . . . . . . . . . . . . 35
2.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.3 Benefits of Petascale Computing to NASA . . . . . . . 39
2.5 Hurricane Prediction . . . . . . . . . . . . . . . . . . . . . . 39
2.5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.3 Benefits of Petascale Computing to NASA . . . . . . . 42
2.6 Bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 Multiphysics Simulations and Petascale Computing 47
Steven F. Ashby and John M. May
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 The next generation of supercomputers . . . . . . . . . . . . 48
3.3 Programming models for massively parallel machines . . . . . 50
3.3.1 New parallel languages . . . . . . . . . . . . . . . . . . 51
3.3.2 MPI-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.3 Cooperative parallelism . . . . . . . . . . . . . . . . . 51
3.3.4 Example uses of cooperative parallelism . . . . . . . . 52
3.4 Multiscale algorithms . . . . . . . . . . . . . . . . . . . . . . 54
3.4.1 Parallel multigrid methods . . . . . . . . . . . . . . . 54
3.4.2 ALE-AMR discretization . . . . . . . . . . . . . . . . 56
3.4.3 Hybrid atomistic-continuum algorithms . . . . . . . . 57
3.5 Applications present and future . . . . . . . . . . . . . . . . 58
3.5.1 State of the art in terascale simulation . . . . . . . . . 59
3.5.2 Multiphysics simulation via cooperative parallelism . . 62
3.6 Looking ahead . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 Scalable Parallel AMR for the Uintah Multi-Physics Code 67
Justin Luitjens, Bryan Worthen, Martin Berzins, and Thomas C. Henderson
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Adaptive Mesh Refinement . . . . . . . . . . . . . . . . . . . 69
4.3 Uintah Framework Background . . . . . . . . . . . . . . . . . 72
4.3.1 Simulation Components . . . . . . . . . . . . . . . . . 72
4.3.2 Load Balancer . . . . . . . . . . . . . . . . . . . . . . 73
4.3.3 Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 Regridder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.1 Extending Uintah's Components to Enable AMR . . . 78
4.5 Performance Improvements . . . . . . . . . . . . . . . . . . . 78
4.6 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5 Simulating Cosmological Evolution with Enzo 83
Michael L. Norman, James Bordner, Daniel Reynolds, Rick Wagner,
Greg L. Bryan, and Brian O'Shea
5.1 Cosmological structure formation . . . . . . . . . . . . . . . 83
5.2 The Enzo code . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.1 Physical model and numerical algorithms . . . . . . . 84
5.2.2 Adaptive mesh refinement . . . . . . . . . . . . . . . . 87
5.2.3 Implementation . . . . . . . . . . . . . . . . . . . . . . 88
5.2.4 Parallelization . . . . . . . . . . . . . . . . . . . . . . 89
5.2.5 Fast sibling grid search . . . . . . . . . . . . . . . . . 90
5.2.6 Enzo I/O . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Performance and scaling on terascale platforms . . . . . . . . 92
5.3.1 Unigrid application . . . . . . . . . . . . . . . . . . . . 92
5.3.2 AMR application . . . . . . . . . . . . . . . . . . . . . 93
5.3.3 Parallel scaling . . . . . . . . . . . . . . . . . . . . . . 93
5.4 Toward petascale Enzo . . . . . . . . . . . . . . . . . . . . . 96
5.4.1 New AMR data structures . . . . . . . . . . . . . . . . 96
5.4.2 Hybrid parallelism . . . . . . . . . . . . . . . . . . . . 98
5.4.3 Implicitly-coupled radiation hydrodynamics . . . . . . 98
5.4.4 Inline analysis tools . . . . . . . . . . . . . . . . . . . 99
6 Numerical Prediction of High-Impact Local Weather: A Driver
for Petascale Computing 103
Ming Xue, Kelvin K. Droegemeier, and Daniel Weber
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Computational Methodology and Tools . . . . . . . . . . . . 106
6.2.1 Community Weather Prediction Models . . . . . . . . 106
6.2.2 Memory and Performance Issues Associated with Petascale Systems . . . . 107
6.2.3 Distributed-memory Parallelization and Message Passing . . . . 108
6.2.4 Load Balancing . . . . . . . . . . . . . . . . . . . . . . 112
6.2.5 Timing and Scalability . . . . . . . . . . . . . . . . . . 112
6.2.6 Other Essential Components of NWP Systems . . . . 114
6.2.7 Additional Issues . . . . . . . . . . . . . . . . . . . . . 115
6.3 Example NWP Results . . . . . . . . . . . . . . . . . . . . . 116
6.3.1 Storm-Scale Weather Prediction . . . . . . . . . . . . 116
6.3.2 Very High-Resolution Tornado Simulation . . . . . . . 117
6.3.3 The Prediction of an Observed Supercell Tornado . . 118
6.4 Numerical Weather Prediction Challenges and Requirements 120
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7 Software Design for Petascale Climate Science 125
John B. Drake, Philip W. Jones, Mariana Vertenstein, James B. White
III, and Philip H. Worley
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.2 Climate Science . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.3 Petaflop Architectures . . . . . . . . . . . . . . . . . . . . . . 128
7.4 Community Climate System Model . . . . . . . . . . . . . . 131
7.4.1 Overview of the Current CCSM . . . . . . . . . . . . . 131
7.4.2 Community Atmosphere Model . . . . . . . . . . . . . 131
7.4.3 Parallel Ocean Program . . . . . . . . . . . . . . . . . 135
7.4.4 Community Land Model . . . . . . . . . . . . . . . . . 137
7.4.5 Community Sea Ice Model . . . . . . . . . . . . . . . . 139
7.4.6 Model Coupling . . . . . . . . . . . . . . . . . . . . . 140
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8 Towards Distributed Petascale Computing 147
Alfons G. Hoekstra, Simon Portegies Zwart, Marian Bubak, and Peter
M.A. Sloot
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.2 Grid Computing . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.3 Petascale Computing on the Grid . . . . . . . . . . . . . . . 150
8.4 The Virtual Galaxy . . . . . . . . . . . . . . . . . . . . . . . 152
8.4.1 A Multi-Physics model of the Galaxy . . . . . . . . . 152
8.4.2 A performance model for simulating the Galaxy . . . 155
8.4.3 Petascale simulation of a Virtual Galaxy . . . . . . . . 157
9 Biomolecular Modeling in the Era of Petascale Computing 165
Klaus Schulten, James C. Phillips, Laxmikant V. Kale, and Abhinav Bhatele
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.2 NAMD Design . . . . . . . . . . . . . . . . . . . . . . . . . . 166
9.2.1 Hybrid Decomposition . . . . . . . . . . . . . . . . . . 166
9.2.2 Dynamic Load Balancing . . . . . . . . . . . . . . . . 168
9.3 Petascale Challenges and Modifications . . . . . . . . . . . . 169
9.3.1 Current Performance . . . . . . . . . . . . . . . . . . . 170
9.3.2 Performance on Future Petascale Machines . . . . . . 172
9.3.3 Acceleration Co-Processors . . . . . . . . . . . . . . . 173
9.4 Biomolecular Applications . . . . . . . . . . . . . . . . . . . 173
9.4.1 Aquaporins . . . . . . . . . . . . . . . . . . . . . . . . 173
9.4.2 Potassium Channels . . . . . . . . . . . . . . . . . . . 175
9.4.3 Viruses . . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.4.4 Ribosome . . . . . . . . . . . . . . . . . . . . . . . . . 176
9.4.5 Chromatophore . . . . . . . . . . . . . . . . . . . . . . 177
9.4.6 BAR Domain Vesicle . . . . . . . . . . . . . . . . . . . 177
9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
10 Petascale Special-purpose Computer for Molecular Dynamics
Simulations 183
Makoto Taiji, Tetsu Narumi, and Yousuke Ohno
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
10.2 Hardware of MDGRAPE-3 . . . . . . . . . . . . . . . . . . . 185
10.3 The calculations performed by MDGRAPE-3 . . . . . . . . . 186
10.4 MDGRAPE-3 Chip . . . . . . . . . . . . . . . . . . . . . . . 188
10.4.1 Force Calculation Pipeline . . . . . . . . . . . . . . . . 189
10.4.2 j-Particle Memory and Control Units . . . . . . . . . 191
10.4.3 Chip specifications . . . . . . . . . . . . . . . . . . . . 193
10.5 System Architecture . . . . . . . . . . . . . . . . . . . . . . . 194
10.6 Software for MDGRAPE-3 . . . . . . . . . . . . . . . . . . . 196
10.7 Performance of MDGRAPE-3 . . . . . . . . . . . . . . . . . 200
10.8 Summary and Future Directions . . . . . . . . . . . . . . . . 204
11 Simulating Biomolecules on the Petascale Supercomputers 211
Pratul K. Agarwal, Sadaf R. Alam, and Al Geist
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
11.2 Opportunities . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.2.1 Ability to investigate bigger biomolecular systems . . 214
11.2.2 Ability to investigate longer time-scales . . . . . . . . 217
11.2.3 Hybrid quantum and classical (QM/MM) simulations 219
11.2.4 More accurate simulations . . . . . . . . . . . . . . . . 221
11.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
11.3.1 Scaling the biomolecular simulations code on >100K processors . . . . 221
11.3.2 Adapting to the changes in the hardware . . . . . . . 224
11.3.3 Fault-tolerance . . . . . . . . . . . . . . . . . . . . . . 227
11.3.4 Multi-paradigm hardware including Reconfigurable Computing . . . . 229
11.3.5 New simulation methodologies enabled by petascale . 229
11.4 Summary and Outlook . . . . . . . . . . . . . . . . . . . . . 230
12 Multithreaded Algorithms for Processing Massive Graphs 237
Kamesh Madduri, David A. Bader, Jonathan W. Berry, Joseph R. Crobak,
and Bruce A. Hendrickson
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
12.1.1 The trouble with graphs . . . . . . . . . . . . . . . . . 239
12.1.2 Limits on the Scalability of Distributed-Memory Graph Computations . . . . 239
12.2 The Cray MTA-2 . . . . . . . . . . . . . . . . . . . . . . . . 240
12.2.1 Expressing Parallelism . . . . . . . . . . . . . . . . . . 241
12.2.2 Support for fine-grained synchronization . . . . . . . . 242
12.3 Case Study: Shortest Paths . . . . . . . . . . . . . . . . . . . 242
12.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . 244
12.3.2 Δ-stepping algorithm . . . . . . . . . . . . . . . . . . 244
12.3.3 Thorup's algorithm . . . . . . . . . . . . . . . . . . . . 247
12.3.4 Experimental Results . . . . . . . . . . . . . . . . . . 250
12.4 Case Study: Connected Components . . . . . . . . . . . . . 254
12.4.1 Traditional PRAM algorithms . . . . . . . . . . . . . 255
12.4.2 Kahan's multilevel algorithm . . . . . . . . . . . . . . 255
12.4.3 Performance Comparisons . . . . . . . . . . . . . . . . 257
12.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
13 Disaster Survival Guide in Petascale Computing: An Algorithmic
Approach 263
Jack J. Dongarra, Zizhong Chen, George Bosilca, and Julien Langou
13.1 FT-MPI: A fault tolerant MPI implementation . . . . . . . . 265
13.1.1 FT-MPI Overview . . . . . . . . . . . . . . . . . . . . 265
13.1.2 FT-MPI: A Fault Tolerant MPI Implementation . . . 266
13.1.3 FT-MPI Usage . . . . . . . . . . . . . . . . . . . . . . 266
13.2 Application Level Diskless Checkpointing . . . . . . . . . . . 267
13.2.1 Neighbor-Based Checkpointing . . . . . . . . . . . . . 269
13.2.2 Checksum-Based Checkpointing . . . . . . . . . . . . . 271
13.2.3 Weighted-Checksum-Based Checkpointing . . . . . . . 272
13.3 A Fault Survivable Iterative Equation Solver . . . . . . . . . 275
13.3.1 Preconditioned Conjugate Gradient Algorithm . . . . 276
13.3.2 Incorporating Fault Tolerance into PCG . . . . . . . . 276
13.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . 279
13.4.1 Performance of PCG with Different MPI Implementations . . . . 279
13.4.2 Performance Overhead of Taking Checkpoint . . . . . 280
13.4.3 Performance Overhead of Performing Recovery . . . . 283
13.4.4 Numerical Impact of Round-Off Errors in Recovery . . 284
13.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
14 The Road to TSUBAME and Beyond 289
Satoshi Matsuoka
14.1 Introduction - the road to TSUBAME . . . . . . . . . . . . 289
14.2 Architectural Requirements of TSUBAME . . . . . . . . . . 291
14.3 The Hatching of TSUBAME . . . . . . . . . . . . . . . . . . 295
14.4 The Flight of TSUBAME - Performance, and Its Operations so that Everybody Supercomputes . . . . 301
15 Petaflops Basics - Performance from SMP Building Blocks 311
Christian Bischof, Dieter an Mey, Christian Terboven, and Samuel Sarholz
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
15.2 Architectures for OpenMP Programming . . . . . . . . . . . 314
15.3 Loop Level Parallelization with OpenMP . . . . . . . . . . . 315
15.4 C++ and OpenMP . . . . . . . . . . . . . . . . . . . . . . . 316
15.4.1 Iterator Loops . . . . . . . . . . . . . . . . . . . . . . 316
15.4.2 ccNUMA Issues . . . . . . . . . . . . . . . . . . . . . . 317
15.4.3 Parallelizing OO-codes . . . . . . . . . . . . . . . . . . 317
15.4.4 Thread-Safety . . . . . . . . . . . . . . . . . . . . . . . 318
15.5 Nested Parallelization with OpenMP . . . . . . . . . . . . . 319
15.5.1 Nested Parallelization in the Current OpenMP Specification . . . . 319
15.5.2 Content-based Image Retrieval with FIRE . . . . . . . 320
15.5.3 Computation of 3D Critical Points in Multi-Block CFD
Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 321
15.5.4 The TFS Flow Solver . . . . . . . . . . . . . . . . . . 324
16 Performance and its Complexity on Petascale Systems 333
Erich Strohmaier
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
16.2 Architectural Trends and Concurrency Levels for Petascale Systems . . . . 334
16.3 Current Situation in Performance Characterization and Benchmarking . . . . 335
16.3.1 Benchmarking Initiatives . . . . . . . . . . . . . . . . 336
16.3.2 Application Performance Characterization . . . . . . . 338
16.3.3 Complexity and Productivity Measures of Performance 338
16.4 APEX-Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
16.4.1 Design Principles of APEX-Map . . . . . . . . . . . . 339
16.4.2 Comparison of Parallel Programming Paradigms with APEX-Map . . . . 340
16.5 How to characterize Performance Complexity . . . . . . . . . 342
16.5.1 Definition of Performance Complexity . . . . . . . . . 343
16.5.2 Performance Model Selection . . . . . . . . . . . . . . 344
16.5.3 PC Analysis of some Parallel Systems . . . . . . . . . 347
17 Highly Scalable Performance Analysis Tools 355
Michael Gerndt and Karl Fürlinger
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
17.2 Performance analysis concepts revisited . . . . . . . . . . . . 356
17.3 Paradyn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
17.4 SCALASCA . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
17.5 Vampir Next Generation . . . . . . . . . . . . . . . . . . . . 359
17.6 Periscope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
17.6.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . 360
17.6.2 Specification of performance properties with ASL . . . 361
17.6.3 The Periscope Node Agents . . . . . . . . . . . . . . . 361
17.6.4 Search for Performance Properties . . . . . . . . . . . 363
17.6.5 The Periscope High-Level Agents . . . . . . . . . . . . 364
17.6.6 Agent Communication Infrastructure . . . . . . . . . . 365
17.6.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 367
18 Towards Petascale Multilevel Finite Element Solvers 375
Christoph Freundl, Tobias Gradl, Ulrich Rüde, and Benjamin Bergen
18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
18.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 375
18.1.2 Exemplary Petascale Architectures . . . . . . . . . . . 376
18.2 Design Paradigms . . . . . . . . . . . . . . . . . . . . . . . . 377
18.2.1 Hierarchical Hybrid Grids . . . . . . . . . . . . . . . . 377
18.2.2 ParExPDE . . . . . . . . . . . . . . . . . . . . . . . . 380
18.3 Evaluation and Comparison . . . . . . . . . . . . . . . . . . 383
19 A Hybrid Approach to Efficient Finite Element Code Development 391
Anders Logg, Kent-Andre Mardal, Martin Sandve Alnæs, Hans Petter
Langtangen, and Ola Skavhaug
19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
19.2 High-Level Application Codes . . . . . . . . . . . . . . . . . 393
19.3 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . 401
19.3.1 Meta-programming . . . . . . . . . . . . . . . . . . . . 401
19.3.2 Just-in-time compilation of variational problems . . . 404
19.3.3 FFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
19.3.4 SyFi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
19.4 A Unified Framework for Finite Element Assembly . . . . . . 410
19.4.1 Finite element assembly . . . . . . . . . . . . . . . . . 411
19.4.2 The UFC interface . . . . . . . . . . . . . . . . . . . . 412
19.4.3 Implementing the UFC interface . . . . . . . . . . . . 414
19.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
20 Programming Petascale Applications with Charm++ 421
Laxmikant V. Kale, Eric Bohm, Celso L. Mendes, Terry Wilmarth, and Gengbin Zheng
20.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
20.2 Charm++ and AMPI: Programming Model . . . . . . . . . . 423
20.2.1 Dynamic Load Balancing . . . . . . . . . . . . . . . . 424
20.2.2 Projections . . . . . . . . . . . . . . . . . . . . . . . . 425
20.2.3 Summary of Other Features . . . . . . . . . . . . . . . 426
20.3 Charm++ Applications . . . . . . . . . . . . . . . . . . . . . 428
20.3.1 NAMD . . . . . . . . . . . . . . . . . . . . . . . . . . 428
20.3.2 LeanCP . . . . . . . . . . . . . . . . . . . . . . . . . . 428
20.3.3 Cosmology . . . . . . . . . . . . . . . . . . . . . . . . 431
20.3.4 Other Applications . . . . . . . . . . . . . . . . . . . . 433
20.4 Simulation of Large Systems . . . . . . . . . . . . . . . . . . 434
20.5 Language Extensions . . . . . . . . . . . . . . . . . . . . . . 436
20.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
21 Annotations for Productivity and Performance Portability 443
Boyana Norris, Albert Hartono, and William Gropp
21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
21.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 444
21.2.1 Overall Design . . . . . . . . . . . . . . . . . . . . . . 444
21.2.2 Annotation Language Syntax . . . . . . . . . . . . . . 445
21.2.3 System Extensibility . . . . . . . . . . . . . . . . . . . 446
21.2.4 Code-Generation Module . . . . . . . . . . . . . . . . 446
21.3 Performance Studies . . . . . . . . . . . . . . . . . . . . . . . 453
21.3.1 STREAM benchmark . . . . . . . . . . . . . . . . . . 453
21.3.2 AXPY Operations . . . . . . . . . . . . . . . . . . . . 456
21.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 456
21.4.1 Self-Tuning Libraries and Code . . . . . . . . . . . . . 457
21.4.2 Compiler Approaches . . . . . . . . . . . . . . . . . . 457
21.4.3 Performance-Related User Annotations . . . . . . . . 458
21.5 Summary and Future Directions . . . . . . . . . . . . . . . . 459
22 Locality Awareness in a High-Productivity Programming Language 463
Roxana E. Diaconescu and Hans P. Zima
22.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
22.2 Basic Chapel Concepts Related to Data Parallelism . . . . . 465
22.2.1 Domains . . . . . . . . . . . . . . . . . . . . . . . . . . 465
22.2.2 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
22.3 Data Distributions . . . . . . . . . . . . . . . . . . . . . . . . 468
22.3.1 Basic Approach . . . . . . . . . . . . . . . . . . . . . . 468
22.3.2 The Distribution Interface . . . . . . . . . . . . . . . . 470
22.3.3 The On-Locale Allocation Policy . . . . . . . . . . . . 471
22.4 Examples and Discussion . . . . . . . . . . . . . . . . . . . . 472
22.4.1 A Load-Balanced Block Distribution . . . . . . . . . . 472
22.4.2 A Sparse Data Distribution . . . . . . . . . . . . . . . 473
22.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 476
22.5.1 Compiler Implementation Status . . . . . . . . . . . . 478
22.5.2 Distribution Implementation Strategy . . . . . . . . . 478
22.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 480
22.7 Conclusion and Future Work . . . . . . . . . . . . . . . . . . 481
23 Architectural and Programming Issues for Sustained Petaflop Performance 485
Uwe Küster and Michael Resch
23.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
23.2 A short history of numerical computing and computers . . . 486
23.2.1 The Sixties . . . . . . . . . . . . . . . . . . . . . . . . 486
23.2.2 The Seventies . . . . . . . . . . . . . . . . . . . . . . . 486
23.2.3 The Eighties . . . . . . . . . . . . . . . . . . . . . . . 487
23.2.4 The Nineties . . . . . . . . . . . . . . . . . . . . . . . 487
23.2.5 2000 and Beyond . . . . . . . . . . . . . . . . . . . . . 488
23.3 Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
23.3.1 Processor development . . . . . . . . . . . . . . . . . . 489
23.3.2 Cell Broadband Engine processor . . . . . . . . . . . . 491
23.3.3 Clearspeed card . . . . . . . . . . . . . . . . . . . . . . 491
23.3.4 Vector like architectures . . . . . . . . . . . . . . . . . 492
23.3.5 Power consumption and cost aspects . . . . . . . . . . 493
23.3.6 Communication network . . . . . . . . . . . . . . . . . 493
23.3.7 Communication protocols and parallel paradigms . . . 494
23.4 Algorithms for very large computers . . . . . . . . . . . . . . 494
23.4.1 Large scale machines . . . . . . . . . . . . . . . . . . . 494
23.4.2 Linpack will show limits . . . . . . . . . . . . . . . . . 495
23.4.3 Narrowing the path to algorithms . . . . . . . . . . . 498
24 Cactus Framework: Black Holes to Gamma Ray Bursts 505
Erik Schnetter, Christian D. Ott, Gabrielle Allen, Peter Diener, Tom
Goodale, Thomas Radke, Edward Seidel, and John Shalf
24.1 Current challenges in relativistic astrophysics and the Gamma-Ray Burst problem . . . . 506
24.1.1 GRBs and petascale computing . . . . . . . . . . . . . 508
24.2 The Cactus framework . . . . . . . . . . . . . . . . . . . . . 509
24.3 Spacetime and hydrodynamics codes . . . . . . . . . . . . . . 511
24.3.1 Ccatie: Spacetime evolution . . . . . . . . . . . . . . 511
24.3.2 Whisky: General relativistic hydrodynamics . . . . . . 511
24.4 Parallel implementation and mesh refinement . . . . . . . . 512
24.4.1 PUGH . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
24.4.2 Adaptive Mesh Refinement with Carpet . . . . . . . . 513
24.4.3 I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
24.5 Scaling on current machines . . . . . . . . . . . . . . . . . . 515
24.5.1 Floating point performance . . . . . . . . . . . . . . . 516
24.5.2 I/O performance . . . . . . . . . . . . . . . . . . . . . 518
24.6 Developing for petascale . . . . . . . . . . . . . . . . . . . . 519
24.6.1 Physics: radiation transport . . . . . . . . . . . . . . . 520
24.6.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 521
24.6.3 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . 522

Library of Congress Subject Headings for this publication:

High performance computing.
Petaflops computers.
Parallel processing (Electronic computers).