Concurrency and Computation: Practice and Experience

Published Papers for 2004

Journal Vision

WILEY Journal Home Page

Papers under review through 2004

Editorial Board


2004 Volume 16 Articles

Article ID: CPE745

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title A comparison of task pools for dynamic load balancing of irregular algorithms
Volume ID 16
Issue ID 1
Date Jan 1 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.745
Article ID CPE745
Author Name(s) Matthias Korch1Thomas Rauber2
Author Email(s) matthias.korch@uni-bayreuth.de1
Affiliation(s) Faculty of Mathematics, Physics and Computer Science, University of Bayreuth, Bayreuth, Germany 1 2
Keyword(s) task pools, dynamic task scheduling, irregular algorithms, hierarchical radiosity, volume rendering, performance evaluation, threads,
Abstract
Since a static work distribution does not allow for satisfactory speed-ups of parallel irregular algorithms, there is a need for a dynamic distribution of work and data that can be adapted to the runtime behavior of the algorithm. Task pools are data structures which can distribute tasks dynamically to different processors where each task specifies computations to be performed and provides the data for these computations. This paper discusses the characteristics of task-based algorithms and describes the implementation of selected types of task pools for shared-memory multiprocessors. Several task pools have been implemented in C with POSIX threads and in Java. The task pools differ in the data structures to store the tasks, the mechanism to achieve load balance, and the memory manager used to store the tasks. Runtime experiments have been performed on three different shared-memory systems using a synthetic algorithm, the hierarchical radiosity method, and a volume rendering algorithm. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE746

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title Security Maintenance Mediation: a technology for preventing unintended security breaches
Volume ID 16
Issue ID 1
Date Jan 1 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.746
Article ID CPE746
Author Name(s) Roger (Buzz) King1
Author Email(s) roger@cs.colorado.edu1
Affiliation(s) Department of Computer Science, University of Colorado, Boulder, CO 80309, U.S.A. 1
Keyword(s) data security, security policy, data mediation, collaborative systems, Web portals,
Abstract
Web-resident information is becoming ‘smarter’, in the sense that emerging technology will support the annotation of it with ontological terms, which will be used to locate and reuse information. This will pose a great security risk in the form of unintended breaches (as distinct from deliberate invasions). Web-resident information will be far more readily available and relevant, thus causing inadvertent releases of secure information to potentially cause it to be diffusely spread across the Internet. Then as this information is iteratively transformed and integrated with other information, it will become irretrievable and potentially used in a myriad of unpredictable ways. The problem is that ontological annotations, while making information more understandable in its original form, do not provide a means for easily capturing the complex semantics of information that has been transformed via abstraction, aggregation, and integration. This demands the development of a semantically rich way of specifying ‘views’ of Web information, to which security controls can be attached. Also needed is a way for users of secure information to easily and voluntarily blend-and thereby propagate-security controls as information is transformed. Information mediators designed by collaborative teams of experts are proposed as the vehicle for wrapping information, so that at each step of reuse, high-level views and their corresponding integrity controls can be made easily accessible to trusted users who will then be able to ensure their proper maintenance. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE747

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title Parallel operation of CartaBlanca on shared and distributed memory computers
Volume ID 16
Issue ID 1
Date Jan 1 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.747
Article ID CPE747
Author Name(s) N. T. Padial-Collins1W. B. VanderHeyden2D. Z. Zhang3E. D. Dendy4D. Livescu5
Author Email(s) nelylanl@lanl.gov1
Affiliation(s) Los Alamos National Laboratory Theoretical Division and Los Alamos Computer Science Institute, Los Alamos, NM 87545, U.S.A. 1 2 3 4 5
Keyword(s) Java, scientific computing, parallel computation,
Abstract
We describe the parallel performance of the pure Java CartaBlanca code on heat transfer and multiphase fluid flow problems. CartaBlanca is designed for parallel computations on partitioned unstructured meshes. It uses Java"s thread facility to manage computations on each of the mesh partitions. Inter-partition communications are handled by two compact objects for node-by-node communication along partition boundaries and for global reduction calculations across the entire mesh. For distributed calculations, the JavaParty package from the University of Karlsruhe is demonstrated to work with CartaBlanca. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE749

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Performance
Article Title Performance and scalability of MPI on PC clusters
Volume ID 16
Issue ID 1
Date Jan 1 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.749
Article ID CPE749
Author Name(s) Glenn R. Luecke1Marina Kraeva2Jing Yuan3Silvia Spanoyannis4
Author Email(s) grl@iastate.edu1
Affiliation(s) 291 Durham Center, Iowa State University, Ames, IA 50011, U.S.A. 1 2 3 4
Keyword(s) PC clusters, MPI, MPI performance,
Abstract
The purpose of this paper is to compare the communication performance and scalability of MPI communication routines on a Windows Cluster, a Linux Cluster, a Cray T3E-600, and an SGI Origin 2000. All tests in this paper were run using various numbers of processors and two message sizes. In spite of the fact that the Cray T3E-600 is about 7 years old, it performed best of all machines for most of the tests. The Linux Cluster with the Myrinet interconnect and Myricom"s MPI performed and scaled quite well and, in most cases, performed better than the Origin 2000, and in some cases better than the T3E. The Windows Cluster using the Giganet Full Interconnect and MPI/Pro"s MPI performed and scaled poorly for small messages compared with all of the other machines. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE770

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title Group-SPMD programming with orthogonal processor groups
Volume ID 16
Issue ID 2-3
Date 02 00 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.770
Article ID CPE770
Author Name(s) Thomas Rauber1Robert Reilein2Gudula Rünger3
Author Email(s) rauber@uni-bayreuth.de1
Affiliation(s) Department of Mathematics and Physics, University of Bayreuth, 95440 Bayreuth, Germany 1Department of Computer Science, Chemnitz University of Technology, 09111 Chemnitz, Germany 2 3
Keyword(s) communication library, group-SPMD, orthogonal processor groups, distributed memory,
Abstract
Many programs for message-passing machines can benefit from an implementation in a group-SPMD programming model due to the potential to reduce communication overhead and to increase scalability. In this paper, we consider group-SPMD programs exploiting different orthogonal processor partitions in one program. For each program this is a fixed set of predefined processor partitions given by the parallel hyperplanes of a two- or multi-dimensional virtual processor organization. We introduce a library built on top of MPI to support the programming with those orthogonal processor groups. The parallel programming model is appropriate for applications with a multi-dimensional task grid and task dependencies mainly aligned in the dimensions of the task grid. The library can be used to specify the appropriate processor partitions, which are then created by the library, and to define the mapping of tasks to the processor hyperplanes. Examples from numerical analysis illustrate the programming style and show that the runtime on distributed memory machines can be considerably reduced by using the library. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE768

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title Managing distributed shared arrays in a bulk-synchronous parallel programming environment
Volume ID 16
Issue ID 2-3
Date 02 00 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.768
Article ID CPE768
Author Name(s) Christoph W. Kessler1
Author Email(s) chrke@ida.liu.se1
Affiliation(s) Programming Environments Laboratory (PELAB), Department of Computer Science, Linkμping University, S-581 83 Linkμping, Sweden 1
Keyword(s) NestStep, BSP model, bulk-synchronous parallelism, parallel programming language, distributed shared array, runtime scheduling of communication,
Abstract
NestStep is a parallel programming language for the BSP (bulk-hronous parallel) programming model. In this article we describe the concept of distributed shared arrays in NestStep and its implementation on top of MPI. In particular, we present a novel method for runtime scheduling of irregular, direct remote accesses to sections of distributed shared arrays. Our method, which is fully parallelized, uses conventional two-sided message passing and thus avoids the overhead of a standard implementation of direct remote memory access based on one-sided communication. The main prerequisite is that the given program is structured in a BSP-compliant way. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE767

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title Compiling data-parallel programs for clusters of SMPs
Volume ID 16
Issue ID 2-3
Date 02 00 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.767
Article ID CPE767
Author Name(s) Siegfried Benkner1Thomas Brandes2
Author Email(s) sigi@ieee.org1
Affiliation(s) Institute for Software Science, University of Vienna, Liechtensteinstrasse 22, A-1090 Vienna, Austria 1SCAI-Fraunhofer Institute for Algorithms and Scientific Computing, Schloß Birlinghoven, D-53754 St. Augustin, Germany 2
Keyword(s) SMP clusters, parallel programming, HPF, OpenMP, MPI, hybrid parallelization,
Abstract
Clusters of shared-memory multiprocessors (SMPs) have become the most promising parallel computing platforms for scientific computing. However, SMP clusters significantly increase the complexity of user application development when using the low-level application programming interfaces MPI and OpenMP, forcing users to deal with both distributed-memory and shared-memory parallelization details. In this paper we present extensions of High Performance Fortran (HPF) for SMP clusters which enable the compiler to adopt a hybrid parallelization strategy, efficiently combining distributed-memory with shared-memory parallelism. By means of a small set of new language features, the hierarchical structure of SMP clusters may be specified. This information is utilized by the compiler to derive inter-node data mappings for controlling distributed-memory parallelization across the nodes of a cluster and intra-node data mappings for extracting shared-memory parallelism within nodes. Additional mechanisms are proposed for specifying inter- and intra-node data mappings explicitly, for controlling specific shared-memory parallelization issues and for integrating OpenMP routines in HPF applications. The proposed features have been realized within the ADAPTOR and VFC compilers. The parallelization strategy for clusters of SMPs adopted by these compilers is discussed as well as a hybrid-parallel execution model based on a combination of MPI and OpenMP. Experimental results indicate the effectiveness of the proposed features. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE771

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title A compiler for multiple memory models
Volume ID 16
Issue ID 2-3
Date 02 00 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.771
Article ID CPE771
Author Name(s) S. P. Midkiff1J. Lee2D. A. Padua3
Author Email(s) smidkiff@purdue.edu1
Affiliation(s) School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47905-2305, U.S.A. 1School of Computer Science and Engineering, Seoul National University, Seoul, Korea 2Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL 61801, U.S.A. 3
Keyword(s) memory models, Java, explicit parallelism, compilers, program optimization, program analysis,
Abstract
The design of consistency models for both hardware and software is a difficult task. For a programming language, it is particularly difficult because the target audience for a high-level programming language is much wider than the target audience for a machine language, making usability a more important criterion. Exacerbating this problem is the reality that the programming language community has little experience designing programming language consistency models, and therefore each new attempt is very much a voyage into uncharted territory. A concrete example of the difficulties of the task is the current Java Memory Model. Although designed to be easy to use by Java programmers, it is poorly understood and at least one common idiom (the ‘double check idiom’) to exploit the model is unsafe. In this paper, we describe the design of an optimizing Java compiler that will accept either as input or as an interface implementation a consistency model for the code to be compiled. The compiler will use Shasha and Snir's delay set analysis, and our CSSA program representation to provide a canonical representation for the effects of different consistency models on optimizations and analysis. The compiler will serve as a testbed to prototype new memory models, and to measure the effects of different memory models on program performance. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE772

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title Space-time mapping and tiling: a helpful combination
Volume ID 16
Issue ID 2-3
Date 02 00 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.772
Article ID CPE772
Author Name(s) Martin Griebl1Peter Faber2Christian Lengauer3
Author Email(s) griebl@fmi.uni-passau.de1
Affiliation(s) Department of Mathematics and Informatics, University of Passau, D-94030 Passau, Germany 1 2 3
Keyword(s) automatic parallelization, loop transformation, tiling, space-time mapping, polyhedron model, loop optimization,
Abstract
Tiling is a well-known technique for sequential compiler optimization, as well as for automatic program parallelization. However, in the context of parallelization, tiling should not be considered as a stand-alone technique,but should be applied after a dedicated parallelization phase, in our case after space-time mapping.We show how tiling can benefit from space-time mapping, and we derive an algorithm for computing tiles which can minimize the number of communication startups, taking the number of physically available processors into account. We also present how the use of a simple cost model reduces real execution time. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE773

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title The effect of cache models on iterative compilation for combined tiling and unrolling
Volume ID 16
Issue ID 2-3
Date 02 00 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.773
Article ID CPE773
Author Name(s) P. M. W. Knijnenburg1T. Kisuki2K. Gallivan3M. F. P. O'Boyle4
Author Email(s) peterk@liacs.nl1
Affiliation(s) LIACS, Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands 1 2 Department of Computer Science, Florida State University, U.S.A. 3Institute for Computing Systems Architecture, Edinburgh University, U.K. 4
Keyword(s) compiler optimization, phase ordering problem, cache models,
Abstract
Iterative compilation, where we search for the best program transformation based on profile feedback information, has been highly successful in determining good program optimizations, typically outperforming all static approaches. However, this benefit has come at a price, namely, a large number of executions of the target program which in many cases is unacceptable. This paper is concerned with reducing the number of program executions needed by iterative compilation by incorporating static models. We examine how static models may be incorporated into a purely iterative approach by developing several characterized heuristics. We empirically explore the tradeoff between reducing the number of executions required and the ultimate performance achieved for differing parameters or slack factors. We focus on tiling and unrolling as transformations and cache models as a test case for the feasibility of this approach. We show that a highly accurate model, used as a filter to profiling and appropriately parameterized, can reduce the number of executions by 50%. We also show that using a simple model to rank transformations and profiling, only those with the highest ranking can reduce the number of executions even further up to 70%, in cases where there is only a limited number of profiles available. We conclude that a production compiler might perform best using the last approach, gaining the benefit of iterative compilation at a much reduced cost. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE774

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title A fast and accurate method for determining a lower bound on execution time
Volume ID 16
Issue ID 2-3
Date 02 00 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.774
Article ID CPE774
Author Name(s) G. Fursin1M. F. P. O'Boyle2O. Temam3G. Watts4
Author Email(s) g.fursin@ed.ac.uk1
Affiliation(s) ICSA, School of Informatics, University of Edinburgh, Edinburgh EH9 3JZ, U.K. 1 2 Laboratoire de Recherche en Informatique, Bâtiment 490, Université Paris Sud, 91405 Orsay Cedex, France 3 4
Keyword(s) memory performance, optimization tool, memory latency analysis,
Abstract
In performance critical applications, memory latency is frequently the dominant overhead. In many cases, automatic compiler-based optimizations to improve memory performance are limited and programmers frequently resort to manual optimization techniques. However, this process is tedious and time-consuming. Furthermore, as the potential benefit from optimization is unknown there is no way to judge the amount of effort worth expending, nor when the process can stop, i.e. when optimal memory performance has been achieved or sufficiently approached. Architecture simulators can provide such information but designing an accurate model of an existing architecture is difficult and simulation times are excessively long. In this article, we propose and implement a technique that is both fast and reasonably accurate for estimating a lower bound on execution time for scientific applications. This technique has been tested on a wide range of programs from the SPEC benchmark suite and two commercial applications, where it has been used to guide a manual optimization process and iterative compilation. We compare our technique with that of a simulator with an ideal memory behaviour and demonstrate that our technique provides comparable information on memory performance and yet is over two orders of magnitude faster. We further show that our technique is considerably more accurate than hardware counters. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE775

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title Parallel object-oriented framework optimization
Volume ID 16
Issue ID 2-3
Date 02 00 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.775
Article ID CPE775
Author Name(s) Daniel J. Quinlan1Markus Schordan2Brian Miller3Markus Kowarschik4
Author Email(s) dquinlan@llnl.gov1
Affiliation(s) Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA, U.S.A. 1 2 3 System Simulation Group, Department of Computer Science, University of Erlangen-Nuremberg, Germany 4
Keyword(s) telescoping languages, AST restructuring tools, semantics-based transformations, object-oriented optimizations,
Abstract
Sophisticated parallel languages are difficult to develop; most parallel distributed memory scientific applications are developed using a serial language, expressing parallelism through third party libraries (e.g. MPI). As a result, frameworks and libraries are often used to encapsulate significant complexities. We define a novel approach to optimize the use of libraries within applications. The resulting tool, named ROSE, leverages the additional semantics provided by library-defined abstractions enabling library specific optimization of application codes. It is a common perception that performance is inversely proportional to the level of abstraction. Our work shows that this is not the case if the additional semantics can be leveraged. We show how ROSE can be used to leverage the semantics within the compile-time optimization. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE766

Publisher John Wiley Sons, Ltd. Chichester, UK
Category Editorials
Article Title Special Issue: Compilers for Parallel Computers (CPC 2001)
Volume ID 16
Issue ID 2-3
Date Feb 1 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.766
Article ID CPE766
Author Name(s) . .1
Author Email(s)
Affiliation(s) 1
Keyword(s) .,
Abstract
No abstract

Article ID: CPE769

Publisher John Wiley Sons, Ltd. Chichester, UK
Category Research Article
Article Title Data partitioning‐based parallel irregular reductions
Volume ID 16
Issue ID 2-3
Date Feb 1 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.769
Article ID CPE769
Author Name(s) Eladio Guti?rrez1Oscar Plata2Emilio L. Zapata3
Author Email(s) eladio@ac.uma.es1
Affiliation(s) Department of Computer Architecture, University of Malaga, E-29071 Malaga, Spain 1 2 3
Keyword(s) irregular reductions, data locality, workload balancing, shared-memory multiprocessor, NUMA machines,
Abstract
Different parallelization methods for irregular reductions on shared memory multiprocessors have been proposed in the literature in recent years. We have classified all these methods and analyzed them in terms of a set of properties: data locality, memory overhead, exploited parallelism, and workload balancing. In this paper we propose several techniques to increase the amount of exploited parallelism and to introduce load balancing into an important class of these methods. Regarding parallelism, the proposed solution is based on the partial expansion of the reduction array. Load balancing is discussed in terms of two techniques. The first technique is a generic one, as it deals with any kind of load imbalance present in the problem domain. The second technique handles a special case of load imbalance which occurs whenever a large number of write operations are concentrated on small regions of the reduction arrays. Efficient implementations of the proposed optimizing solutions for a particular method are presented, experimentally tested on static and dynamic kernel codes, and compared with other parallel reduction methods. Copyright @ 2004 John Wiley & Sons, Ltd.

Article ID: CPE776

Publisher John Wiley Sons, Ltd. Chichester, UK
Category Research Article
Article Title <SC>SWARP</SC>: a retargetable preprocessor for multimedia instructions
Volume ID 16
Issue ID 2-3
Date Feb 1 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.776
Article ID CPE776
Author Name(s) Gilles Pokam1St?phane Bihan2Julien Simonnet3Fran?ois Bodin4
Author Email(s) gilles.pokam@irisa.fr1
Affiliation(s) IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France 1 2 3 4
Keyword(s) multimedia instructions, compiler optimizations, embedded processors,
Abstract
In this paper, we propose SWARP, a retargetable preprocessor for exploiting multimedia instructions. The system mixes loop distribution, unrolling and pattern matching to exploit complex multimedia instructions. Contrary to all available systems, it can be extended at the user level. Using with a TriMedia processor we show that our system achieves important good code quality with a set of frequently used loop kernels for multimedia applications. Copyright @ 2004 John Wiley & Sons, Ltd.

Article ID: CPE751

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title A comparison of concurrent programming and cooperative multithreading under load balancing applications
Volume ID 16
Issue ID 4
Date 04 10 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.751
Article ID CPE751
Author Name(s) Justin T. Maris1Aaron W. Keen2Takashi Ishihara3Ronald A. Olsson4
Author Email(s) olsson@cs.ucdavis.edu4
Affiliation(s) Department of Computer Science, University of California, Davis, CA 95616, U.S.A. 1 2 3 4
Keyword(s) cooperative multithreading, concurrent programming, load balancing, parallel and distributed programming languages, synchronization mechanisms,
Abstract
Two models of thread execution are the general concurrent programming execution model (CP) and the cooperative multithreading execution model (CM). CP provides nondeterministic thread execution where context switches occur arbitrarily. CM provides threads that execute one at a time until they explicitly choose to yield the processor. This paper focuses on a classic application to reveal the advantages and disadvantages of load balancing during thread execution under CP and CM styles; results from a second classic application were similar. These applications are programmed in two different languages (SR and Dynamic C) on different hardware (standard PCs and embedded system controllers). An SR-like run-time system, DesCaRTeS, was developed to provide interprocess communication for the Dynamic C implementations. This paper compares load balancing and non-load balancing implementations; it also compares CP and CM style implementations. The results show that in cases of very high or very low workloads, load balancing slightly hindered performance; and in cases of moderate workload, both SR and Dynamic C implementations of load balancing generally performed well. Further, for these applications, CM style programs outperform CP style programs in some cases, but the opposite occurs in some other cases. This paper also discusses qualitative tradeoffs between CM style programming and CP style programming for these applications. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE752

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title OpenMP-oriented applications for distributed shared memory architectures
Volume ID 16
Issue ID 4
Date 04 10 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.752
Article ID CPE752
Author Name(s) Ami Marowka1Zhenying Liu2Barbara Chapman3
Author Email(s) amimar2@yahoo.com1
Affiliation(s) Department of Computer Science, The University of Houston, Houston, TX 77204-3010, U.S.A. 1 2 3
Keyword(s) OpenMP, data locality, NAS Parallel Benchmarks, programming model,
Abstract
The rapid rise of OpenMP as the preferred parallel programming paradigm for small-to-medium scale parallelism could slow unless OpenMP can show capabilities for becoming the model-of-choice for large scale high-performance parallel computing in the coming decade. The main stumbling block for the adaptation of OpenMP to distributed shared memory (DSM) machines, which are based on architectures like cc-NUMA, stems from the lack of capabilities for data placement among processors and threads for achieving data locality. The absence of such a mechanism causes remote memory accesses and inefficient cache memory use, both of which lead to poor performance. This paper presents a simple software programming approach called copy-inside-copy-back (CC) that exploits the data privatization mechanism of OpenMP for data placement and replacement. This technique enables one to distribute data manually without taking away control and flexibility from the programmer and is thus an alternative to the automat and implicit approaches. Moreover, the CC approach improves on the OpenMP-SPMD style of programming that makes the development process of an OpenMP application more structured and simpler. The CC technique was tested and analyzed using the NAS Parallel Benchmarks on SGI Origin 2000 multiprocessor machines. This study shows that OpenMP improves performance of coarse-grained parallelism, although a fast copy mechanism is essential. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE763

Publisher John Wiley & Sons, Ltd. Chichester, UK
Category Research Article
Article Title Proceedings of the 10th International Symposium on High Performance Distributed Computing (HPDC-10)
Volume ID 16
Issue ID 4
Date 04 10 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.763
Article ID CPE763
Author Name(s) Anand Natrajan1Michael Crowley2Nancy Wilkins-Diehr3Marty A. Humphrey4Anthony D. Fox5Andrew S. Grimshaw6Charles L. Brooks7
Author Email(s) anand@virginia.edu1
Affiliation(s) Department of Computer Science, University of Virginia, 151 Engineer's Way, Charlottesville, VA 22904, U.S.A. 1Department of Molecular Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, U.S.A. 2San Diego Supercomputing Center, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093, U.S.A. 3 4 5 6 7
Keyword(s) Legion, computational Grid, CHARMM, protein folding,
Abstract
One benefit of a computational Grid is the ability to run high-performance applications over distributed resources simply and securely. We demonstrated this benefit with an experiment in which we studied the protein-folding process with the CHARMM molecular simulation package over a Grid managed by Legion, a Grid operating system. High-performance applications can take advantage of Grid resources if the Grid operating system provides both low-level functionality as well as high-level services. We describe the nature of services provided by Legion for high-performance applications. Our experiences indicate that human factors continue to play a crucial role in the configuration of Grid resources, underlying resources can be problematic, Grid services must tolerate underlying problems or inform the user, and high-level services must continue to evolve to meet user requirements. Our experiment not only helped a scientist perform an important study, but also showed the viability of an integrated approach such as Legion's for managing a Grid. Copyright © 2004 John Wiley & Sons, Ltd.

Article ID: CPE748

Publisher John Wiley Sons, Ltd. Chichester, UK
Category Research Article
Article Title Parallel LU factorization of sparse matrices on FPGA-based configurable computing engines
Volume ID 16
Issue ID 4
Date March 10 2004
DOI(URI) http://dx.doi.org/10.1002/cpe.748
Article ID CPE748