BSP Worldwide

An Initial Proposal for the BSP Worldwide Standard Library

by Jonathan Hill and Bill McColl

Oxford University

Abstract

An initial proposal for the BSP Worldwide Standard Library is described. The library supports both bulk synchronous remote memory access and bulk synchronous message passing.

0 The essence of BSP programming
1 BSP programming languages and libraries
2 The proposal
2.1 Creating BSP Processes
2.2 Superstep synchronisation
2.3 Bulk synchronous remote memory access
2.3.1 Putting data into a remote location
2.3.2 Getting data from a remote location
2.4 Bulk synchronous message passing
2.4.1 Sending data
2.4.2 Matching, counting and moving buffered data
3 Collective communications
4 Current status
5 List of operations

0. The essence of BSP programming =================================

A bulk synchronous parallel (BSP) computer [5,7] consists of a set of processor-memory pairs, a global communications network, and a mechanism for the efficient barrier synchronisation of the processors. There are no specialised broadcasting or combining facilities, although these can be efficiently realised in software where required. The model also does not deal directly with issues such as input-output or the use of vector units, although it can be easily extended to do so.

A BSP computer operates in the following way. A computation consists of a sequence of parallel supersteps. During a superstep, each processor-memory pair can perform a number of computation steps on values held locally at the start of the superstep, send and receive a number of messages, and handle various remote get and put requests. The model does not prescribe any particular style of communication, although at the end of a superstep there is a barrier synchronisation at which point any pending communications must be completed.

Although we have described the BSP computer as an architectural model, one can also view bulk synchrony as a programming discipline. The essence of the BSP approach to parallel programming is the notion of the superstep, in which communication and synchronisation are completely decoupled. A "BSP program" is simply one which proceeds in phases, with the necessary global communications taking place between the phases. This approach to parallel programming is applicable to all kinds of parallel architecture: distributed memory architectures, shared memory multiprocessors, and networks of workstations. It provides a consistent, and very general, framework within which to develop portable software for scalable computing.

1. BSP programming languages and libraries ==========================================

The communication facilities described in this proposal are drawn from other communication libraries that provide either direct remote memory access or message passing facilities. In this section we briefly summarise the work that has influenced this proposal.

The PVM message passing library [2] is widely implemented and widely used. The MPI message passing interface [4] is more elaborate. It supports blocking and non-blocking point-to-point communication and a number of collective communications (broadcast, scatter, gather, reduction etc.). Although neither of these libraries is directly aimed at supporting BSP programming, they can be used for that purpose.

The Cray SHMEM library provides primitives for direct remote memory access (or one-sided communications.)

The Oxford BSP Library [6] provides a set of programming primitives for bulk synchronous remote memory access. The core of the Library consists of six routines - two for process management, two for superstep synchronisation, and two for remote memory access. Higher level operations such as broadcast and reduction are also available.

The Green BSP Library [3] provides a set of message passing primitives for exchanging fixed sized packets between processes. In contrast to using point-to-point message exchange, the sending and reception of messages is decoupled by ensuring that messages are delivered by the end of a superstep---bulk synchronous message passing.

Split-C [1] is a parallel extension of C which supports efficient access to a global address space on current distributed memory architectures. It aims to support careful engineering and optimisation of portable parallel programs by providing a cost model for performance prediction. The language extension is based on a small set of global access primitives and simple memory management declarations which support both bulk synchronous and message driven styles of parallel programming.

There are several other BSP programming languages under development as part of research projects, such as GPL and Opal at Oxford and BSP-L at Harvard. We will not discuss those here.

2. The proposal ===============

Our proposal can be viewed as an attempt to achieve a synthesis of the various approaches to low level BSP programming which are currently being pursued, including those mentioned above. Our main concern when defining the semantics for each of the library operations was to provide as general a semantics as possible, whilst ensuring that the implementor has the greatest possible scope to provide an efficient implementation of the library on the various forms of parallel architecture. For simplicity, we will refer to the library as BSPlib.

2.1. Creating BSP Processes ===========================

Processes are created in BSPlib by the operations "begin_bsp" and "end_bsp" which bracket a piece of code to be run in a bulk synchronous parallel manner. There can only be one instance of such a pair within a program.

C and Fortran90 procedure specifications ----------------------------------------

void begin_bsp(int maxprocs);
void end_bsp()
SUBROUTINE begin_bsp(maxprocs)
INTEGER, intent(IN)::maxprocs
SUBROUTINE end_bsp()

where maxprocs is the number of processes requested by the user. A

value of 0 requests the system to allocate as many processes

as possible.

The function "bsp_nprocs" returns n, the actual number of processes allocated to the program, where 1<=n<=maxprocs if maxprocs>0.

Each of the n processes created by begin_bsp has a unique value m in the range 0<=m<=n-1. The function "bsp_pid" returns the value of the process executing the function call.

C and Fortran90 procedure specifications ----------------------------------------

int bsp_nprocs();
int bsp_pid();
INTEGER FUNCTION bsp_nprocs()
INTEGER FUNCTION bsp_pid()

A single thread of control is initiated at the start of a BSPlib program. When the parallel construct "begin_bsp" is encountered, "bsp_nprocs" processes are created and start executing the code following the begin_bsp statement. Execution continues in an SPMD manner until "end_bsp" is encountered. Its effect is to perform a barrier synchronisation of all the processes, followed by the clean termination of all but process 0. After "end_bsp", process 0 proceeds sequentially until the end of the program.

Notes -----

a) There can only be a single begin_bsp end_bsp pair within a BSPlib program. This excludes the possibility of starting, stopping, and then restarting parallel tasks within a program, or any form of nested parallelism.

b) The process with bsp_pid()=0 is a continuation of the thread of control that initiated begin_bsp. This has the effect that all the values of the local and global variables prior to "begin_bsp" are available to that process.

c) The environment is not inherited by any of the other processes, i.e. those with bsp_pid()>0. If any of them require part of the state, then the data must be transferred from process 0.

d) If "begin_bsp" is called within a function or subroutine other than the main program, then the programmer must ensure that "end_bsp" is executed within the lifetime of that subroutine invocation.

The function "bsp_error" can be used to print an error message followed by a halt of the entire BSPlib program. The routine is designed _not to_ require a barrier synchronisation of all processes, so a single process in a potentially unique thread of control can halt the entire BSPlib program.

C and Fortran90 procedure specifications ----------------------------------------

void bsp_error(char* format,...);
SUBROUTINE bsp_error(err_string)
CHARACTER(*), intent(IN)::err_string

where format is a C-style format string as used by printf. Any other arguments are interpreted in the same way as the variable number of arguments to printf.

err_string is single error string that is printed when the Fortran routine is executed. All computation ceases after a call to bsp_error

2.2 Superstep synchronisation =============================

A BSPlib program consists of a series of supersteps, where a superstep is represented by a sequence of statements delimited by library calls.

The library distinguishes between two kinds of superstep: communication supersteps and computation supersteps. All communications (see section 2.3 and 2.4) are performed within a communication superstep, which is delimited by the operations "begin_superstep" and "end_superstep". The end of a superstep marks an explicit barrier where all processes in a BSPlib program must synchronise. A computation superstep is implicitly defined as the sequence of operations between a call to end_superstep and the next begin_superstep encountered during the execution of a program.

C and Fortran90 procedure specifications ----------------------------------------

void begin_superstep();
void end_superstep();
SUBROUTINE begin_superstep()
SUBROUTINE end_superstep()

Notes -----

a) In the execution of a correct BSPlib program, if any process enters a superstep then all of the bsp_nprocs processes must enter it. Similarly, the barrier synchronisation at the end of the superstep must involve all of the processes. There is therefore no form of subset synchronisation.

b) To ensure that communication has a simple and intuitive semantics, supersteps may not be nested, i.e. "begin_superstep" and "end_superstep" operations must alternate.

c) "begin_superstep" and "end_superstep" are intended to textually delimit a block of code that constitutes a superstep. This is analogous to the use of curly-braces (i.e., { and }) to delimit a block in a block-structured programming language. The calls to these routines must obey textual nesting rules in a similar manner to the user of curly-braces. As supersteps cannot be nested, this means that if a superstep is started within a procedure, it must be ended in the same procedure.

The semantics of end_superstep could be modified to allow a process that has reached end_bsp to implicitly meet all superstep barriers with its siblings. We have decided not to include this in the current proposal. If it were included, one might want to add a stronger synchronisation primitive which had the same requirement, i.e. full participation, as we have defined for end_superstep here.

2.3. Bulk synchronous remote memory access ==========================================

One way of performing data-communication in the BSP model is to use one-sided Direct Remote Memory Access (DRMA) communication facilities. In this mode of operation, the local address space of each process can be manipulated by other processes by using either "bsp_get" or "bsp_put". The operation "bsp_put" stores locally held data into the local memory of a target process, without the active participation of the target process. The operation "bsp_get" reaches into the local memory of another process to copy data values held there into a data structure in its own local memory. Like bsp_put, the bsp_get operation is one-sided as the process being "reached" into does not participate in the operation.

In the BSP model, the semantic interpretation of a group of puts or gets is that there is no imposed order on the communications, although they are all guaranteed to occur by the end of the superstep. The semantics we adopt is that, conceptually, any "put" or "get" operations are delayed so that they all occur concurrently at the end of the superstep. The delaying nature of such a communication mode permits the following implementation techniques:

a) communication and computation may be overlapped.

b) communication may be buffered so that the number of transactions with other processes can be minimised.

c) the order of a sequence of put and gets can be altered to maximise communication throughput.

2.3.1. Putting data into a remote location ------------------------------------------

The aim of "bsp_put" is to provide an operation akin to memcpy(3C) available in the Unix <string.h> library. The operation copies a specified number of bytes, from a byte addressed data structure in the local memory of one process into contiguous memory locations in the local memory of another process.

Unlike the DRMA operations available in the Cray SHMEM and Oxford BSP libraries, which require that the communicated data structures are held in statically allocated memory locations, the objective here is to provide a routine that is well defined for stack and heap allocated data structures as well. This is achieved by calculating the addresses of the data structures that are the source and target of a DRMA operation in the respective address space of the process that contains the data structure.

The semantics we require for "bsp_put" is that the values referenced by the source and destination addresses of a put may not be changed during a superstep. This ensures that an implementation may perform the put asynchronously with any computation within a superstep, _without_ buffering the outgoing data structure.

C and Fortran90 procedure specifications ----------------------------------------

void bsp_put(int pid,void *src,void *dst,int offset,int nbytes);
SUBROUTINE bsp_put(pid,src,dst,offset,nbytes)
INTEGER, intent(IN) ::pid, offset, nbytes
<TYPE>, intent(IN) ::src, dst

where pid is the identifier of the process where data is to be stored. src is the location of the first byte to be transferred by the put operation. The calculation of src is performed on the process that initiates the put. dst is the location of the first byte where data is to be stored. The calculation of dst is performed by process pid. offset is the displacement in bytes from dst where src will start copying into. The calculation of offset is performed by the process that initiates the put. nbytes is the number of bytes to be transferred from src into dst. It is assumed that src and dst are addresses of data structures that are at least nbytes in size.

Notes -----

a) the destination address of a put, and the data structure referenced by that address may not be changed during the lifetime of a superstep.
b) the data structure referenced by the source address of a put may not be changed during the lifetime of a superstep.
c) bsp_put has to be textually present between begin_superstep and end_superstep.
d) if more than one DRMA operation transfers data into overlapping memory locations, then the result at the overlapping region will be nondeterministically chosen from one of the operations.

2.3.2. Getting data from a remote location ------------------------------------------

The "bsp_get" operation reaches into the local memory of another process and copies remote data held there into a data structure in the local memory of the process that initiated the get.

C and Fortran90 procedure specifications ----------------------------------------

void bsp_get(int pid,void *src,int offset,void *dst,int nbytes);

SUBROUTINE bsp_get(pid,src,offset,dst,nbytes)
INTEGER, intent(IN) :: pid, nbytes, offset
<TYPE> intent(IN) :: src, dst

where pid is the identifier of the process where data is to be obtained from. src is the location of the first byte from where data will be obtained from. The calculation of src is performed on process pid. offset is an offset from src where the data will be taken from. The calculation of offset is performed by the process that initiates the get. dst is the location of the first byte where the data obtained is to be placed. The calculation of dst is performed by the process that initiates the get. nbytes is the number of bytes to be transferred from src into dst. It is assumed that src and dst are addresses of data structures that are at least nbytes in size.

Notes -----

a) the source address of a get, and the data structure referenced by that address may not be changed during the lifetime of a superstep.
b) the data structure referenced by the destination address of a get may not be changed during the lifetime of a superstep.
c) bsp_get has to be textually present between begin_superstep and end_superstep.
d) if more than one DRMA operation transfers data into overlapping memory locations, then the result at the overlapping region will be nondeterministically chosen from one of the operations.

2.4. Bulk synchronous message passing =====================================

Bulk synchronous remote memory access is a very convenient style of programming for BSP computations which can be statically analysed in a straightforward way. It is less convenient for computations where the volumes of data being communicated between supersteps is irregular and data dependent, and where the computation to be performed in a superstep depends on the quantity and form of data received at the start of that superstep. A more appropriate style of programming in such cases is bulk synchronous message passing.

A non-blocking "bsp_send" operation is provided that routes messages to a special buffer in the destination process. The message is guaranteed to arrive by the end of the superstep that the send was performed within. The messages sent to a process, and stored in its buffer, can be matched, counted and moved by the receiving process in the next superstep. In keeping with BSP superstep semantics, a collection of messages sent to the same process has no implied ordering at the receiving end. However, since each message must be tagged with an integer value, the programmer can identify messages by their tag.

2.4.1. Sending data -------------------

A message consists of an integer tag and an arbitrary amount of data to be communicated. The operation "bsp_send" provides a mechanism of sending data to a destination process. The function does not wait for the receiving process to accept the data, but returns immediately. The data object that is sent may not be changed during the lifetime of the communication superstep that contained the send. The message sent is guaranteed to reach the buffer of the destination process by the end of the superstep.

C and Fortran90 procedure specifications ----------------------------------------

void bsp_send(int to,int tag, void* src, int nbytes);
SUBROUTINE bsp_send(to,tag,src,nbytes)
INTEGER, intent(IN) :: to,tag, nbytes
<TYPE>, intent(IN) :: src

where to is the identifier of the process where data is to be sent. tag is a _positive_ integer that can be used to identify messages. src is the location of the first byte of data to be sent. nbytes is the number of bytes to be transferred.

Notes -----

a) the data structure referenced by the source address of a send may not change during the lifetime of the superstep.
b) bsp_send has to be textually present between begin_superstep and end_superstep.

2.4.2 Matching, counting and moving buffered data -------------------------------------------------

A process can use the operation "bsp_match" to test if a message with certain properties was sent to it during the last superstep, and is still in the buffer. The process from where data was sent, the tag of the sent message, and the size of the sent message can be used in specifying the match operation. Wildcards may be used to match messages with any tag, or from any process, or of any size. If a successful match is found, then the process identifier, tag, and the size of the matched message will be returned by the function.

C and Fortran90 procedure specifications ----------------------------------------

int bsp_match(int* from, int* tag, int* nbytes);
LOGICAL FUNCTION bsp_match(from,tag,nbytes)
INTEGER, intent(INOUT) :: from,tag,nbytes

where from is an identifier that specifies that only messages from process "from" are to be matched. The special identifier BSP_WILDCARD can be used to specify that a message from any process may be matched. On exit from the procedure, "from" contains the process number of the message actually matched. tag is a _positive_ integer that specifies that only messages with the identified tag are to be matched. The special identifier BSP_WILDCARD can be used to specify that a message with any tag may be matched. On exit from the procedure, tag contains the identity-tag of the message actually matched. nbytes is an integer which specifies that only messages of size<=nbytes are to be matched. The special identifier BSP_WILDCARD can be used to specify that a message of any size may be matched. On exit from the procedure, nbytes contains the number of bytes of the message matched.

Notes -----

a) if more than one message arrives from a process with the same tag, then there is no implied ordering of those messages within the buffer.
b) if more than one message arrives from a process with the same tag, but of different sizes, then multiple calls of match may only match a single arbitrary instance of the multiple messages.
c) since the buffer is flushed by the next begin_superstep, a match has to be performed in the computation superstep that directly follows the communication superstep in which the message was sent.
d) the semantics of match are deterministic so that if a match is called with the same arguments twice, one after the other, then the same message will be matched.

The operation "bsp_count" is very similar to bsp_match. The only difference is that it returns the number of messages currently in the buffer which match the criteria. Unlike bsp_match, it does not return any values for from, tag, or nbytes.

C and Fortran90 procedure specifications ----------------------------------------

int bsp_count(int* from, int* tag, int* nbytes);
INTEGER FUNCTION bsp_count(from,tag,nbytes)
INTEGER, intent(IN) :: from,tag,nbytes

where from is an identifier that specifies that only messages from process "from" are to be matched. The special identifier BSP_WILDCARD can be used to specify that a message from any process may be matched. tag is a _positive_ integer that specifies that only messages with the identified tag are to be matched. The special identifier BSP_WILDCARD can be used to specify that a message with any tag may be matched. nbytes is an integer which specifies that only messages of size<=nbytes are to be matched. The special identifier BSP_WILDCARD can be used to specify that a message of any size may be matched.

The role of the "bsp_move" operation is to find a message that arrived in the previous communication superstep and move it out of the buffer and into a locally held data structure. As in bsp_match, the process where the data was sent from, the tag of the sent message and the size of the sent message can be used in specifying the match criteria for the move operation. If there is no message currently in the buffer which matches a move request, a runtime error is raised. Match and move are designed to work together. If a match is carried out, and a move is then immediately performed with the same set of arguments, then the message that was matched will be moved.

C and Fortran90 procedure specifications ----------------------------------------

void bsp_move(int* from, int* tag, void* dst,int* nbytes);
SUBROUTINE bsp_move(from,tag,dst,nbytes)
<TYPE>, intent(OUT) :: dst
INTEGER, intent(INOUT) :: from,tag,nbytes

where from is an identifier that specifies that a message from process "from" should be moved. The special identifier BSP_WILDCARD can be used to specify that a message from any process may be matched. On exit from the procedure, "from" contains the process number of the message actually moved. tag is a _positive_ integer that specifies that only messages with the identified tag are to be matched. The special identifier BSP_WILDCARD can be used to specify that a message with any tag may be matched. On exit from the procedure, tag contains the identity-tag of the message actually moved. dst is the location of the first byte where the data is to be moved into. nbytes is the size in bytes of dst. On exit from the procedure, nbytes contains the number of bytes actually moved. Notes -----

a) if no message is matched by a move request, then a runtime error is raised.
b) if the buffered data is larger than the "nbytes" set aside for it, then a runtime error is raised.
c) if more than one message arrives from a process with the same tag, then there is no implied ordering on which messages will be matched by the move operation.
d) since the buffer is flushed by the next begin_superstep, a move has to be performed in the computation superstep that directly follows the communication superstep in which the message was sent.
e) once a message has been successfully moved into a local data structure it is removed from the buffer.

3. Collective communications ============================

Some message passing systems, such as MPI [4], provide primitives for various specialised communication patterns which arise frequently in message passing programs. These include broadcast, scatter, gather, total exchange, reduction, prefix sums (scan) etc. These standard communication patterns also arise frequently in the design of BSP algorithms. It is important that such structured patterns can be conveniently expressed and efficiently implemented in a BSP programming system, in addition to the more primitive operations such as put and get which generate arbitrary and unstructured communication patterns. The library we have described can easily be extended to support such structured communications by adding bsp_broadcast, bsp_combine, bsp_scatter, bsp_gather, bsp_scan, bsp_exchange etc. as higher level operations. These could be implemented in terms of the core operations, or directly on the architecture if that was more efficient.

4. Current status =================

A prototype implementation of a significant part of this library is currently running on various parallel machines at Oxford and elsewhere. The authors are working with Jifeng He of Oxford to produce an algebraic semantics for this style of BSP programming.

5. List of operations =====================

Core operations ---------------

begin_bsp end_bsp
bsp_nprocs bsp_pid bsp_error
begin_superstep end_superstep
bsp_put bsp_get
bsp_send bsp_match bsp_count bsp_move

Optional higher level operations --------------------------------

bsp_broadcast bsp_combine
bsp_scatter bsp_gather
bsp_scan
bsp_exchange

Acknowledgements ================

This work was supported in part by the EPSRC Portable Software Tools for Parallel Architectures Initiative, as Research Grant GR/K40765 "A BSP Programming Environment", October 1995-September 1998.

The authors would like to thank Rob Bisseling and David Skillicorn for various discussions on BSP libraries.

References ==========

[1] D E Culler, A Dusseau, S C Goldstein, A Krishnamurthy, S Lumetta, T von Eicken and K Yelick. Parallel programming in Split-C. In Proc. Supercomputing '93, pages 262-273, November 1993.

[2] A Geist, A Beguelin, J Dongarra, W Jiang, R Manchek, and V Sunderam. PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA, 1994.

[3] M W Goudreau, K Lang, S B Rao and T Tsantilas. The Green BSP Library. University of Central Florida Technical Report 95-11, August 1995.

[4] W Gropp, E Lusk and A Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1994.

[5] W F McColl. Scalable Computing. In Computer Science Today: Recent Trends and Developments. LNCS Volume 1000, J van Leeuwen (editor), pages 46-61, Springer-Verlag, 1995.

[6] R Miller. A library for bulk-synchronous parallel programming. In Proc. British Computer Society Parallel Processing Specialist Group workshop on General Purpose Parallel Computing, December 1993.

[7] L G Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8), pages 103-111, 1990.

Go to BSP Worldwide News Page

An Initial Proposal for the BSP Worldwide Standard Library

by Jonathan Hill and Bill McColl

Oxford University

Abstract

Contents

0. The essence of BSP programming =================================

1. BSP programming languages and libraries ==========================================

2. The proposal ===============

2.1. Creating BSP Processes ===========================

2.2 Superstep synchronisation =============================

2.3. Bulk synchronous remote memory access ==========================================

2.3.1. Putting data into a remote location ------------------------------------------

2.3.2. Getting data from a remote location ------------------------------------------

2.4. Bulk synchronous message passing =====================================

2.4.1. Sending data -------------------

2.4.2 Matching, counting and moving buffered data -------------------------------------------------

3. Collective communications ============================

4. Current status =================

5. List of operations =====================

Acknowledgements ================

References ==========