|BSP cluster||Eight 400Mhz Pentium IIs with 128Mbytes of memory, connected by a Cisco 2916XL 100Mbps Ethernet switch, with a dedicated BSP device driver for a 3Com 3C905B Ethernet card.|
|Silicon Graphics||Power Challenge||Four 75Mhz R8000 processors with 4Mbytes unified secondary cache, 0.5Gbytes main memory.|
|Origin 2000||Eight 195Mhz R10000 with 4Mbytes unified secondary cache, 2Gbytes main memory.|
|Cray Research||T3D||Five hundred and twelve 150Mhz DECchip 21064 with 64Mbytes per processor|
|T3E||Twenty four 300Mhz DECchip 21164 with 128Mbytes of main memory per processor.|
|IBM||SP2 (+switch)||Eight 66.7Mhz P2SC Thin1 nodes with 1Mbytes secondary cache and 128Mbytes of main memory per processor.|
|SP2 (+Ethernet)||Eight 66.7Mhz P2SC Thin1 nodes with 1Mbytes secondary cache and 128Mbytes of main memory per processor.|
|Intel||Pentium Pro NOW||Seventeen 266Mhz Pentium Pros with 512K cache connected by 10Mbit Ethernet|
|Digital||Digital 8400||Six 300Mhz Alpha EV5 with 2Gbytes main memory.|
|Sun Microsystems||Sparcstation-20 SMP|
|Sparcstation-5 NOW||Sixteen Sparcstation-5s connected by 10Mbit Ethernet|
The benchmark program that was used to calculate the BSP machine parameters can be found here. A more detailed list of the BSP machine parameters, including communication rates in usecs (which are useful for comparing the absolute performance of communication across architectures) can be found here.
|MACHINE||s (Mflop/s)||p (no. procs)||l (flops)||g (flops/word)|
|IBM SP2 (Switch)||26||1||244||1.3|
|Pentium Pro NOW||61||1||85||1.0|
|Sun Sparc-5 NOW||7.7||1||82||2.4|
|Multiprocessor Sun Sparc-20||10||1||24||0.4|
|Digital Alpha farm||10.1||1||29||0.3|
|IBM SP2 (Ethernet)||26||1||241||1.3|
The following description of the BSP parameters s, l, and g is taken from the paper:
The values of s given in the table above is an average of the upper and lower bound values for s.
The cost of a barrier synchronisation comes in two parts:
The parameter l captures the latter of these costs. The diameter of the communication network, or at least the length of the longest path that allows state to be moved from one processor to another clearly imposes a lower bound on l. However, it is also affected by many other factors, so that, in practice, an accurate value of l for each parallel architecture is obtained empirically.
The ability of a communication network to deliver data is captured by a BSP parameter, g, that measures the permeability of the network to continuous traffic addressed to uniformly-random destinations. As we have seen, BSP programs approximate such traffic. The parameter g is defined such that an h-relation will be delivered in time hg. Subject to some small provisos, hg is an accurate measure of communication performance over a large range of architectures. The value of g is normalised with respect to the clock rate of each architecture so that it is in the same units as the time for executing sequences of instructions.
Sending a message of length m clearly takes longer than sending a message of size 1. BSP does not distinguish between a message of length m and m messages of length 1 --- the cost in either case is mhg (refer to the Q&A paper to see why this isn't a problem). So messages of varying lengths may either be costed using the form mhg where h is the number of messages, or the message lengths can be folded into h, so that it becomes the number of units of data to be transferred.
The parameter g is related to the bisection bandwidth of the communication network but they are not equivalent --- g also depends on factors such as:
So g is bounded below by the ratio of p to the bisection bandwidth, suitablly normalised, but may be much larger because of these other factors. Only a very unusual network would have a bisection bandwidth that grew faster than p, so g is a monotonically increasing function of p. The precise value of g is, in practice, determined empirically for each parallel computer, by running suitable benchmarks.
Note that g is not the single-word delivery time, but the single-word delivery time under continuous traffic conditions. This difference is subtle but crucial.
An example of the BSP parameters being used in the analysis of a broadcasting algorithm can be found here.