BSP machine parameters (table of contents)

Summary of machines profiled
Table of the BSP machine parameters
What is the BSP parameter s?
What is the BSP parameter l?
What is the BSP parameter g?

Summary of machines profiled

MANUFACTURER	MACHINE	CONFIGURATION
	BSP cluster	Eight 400MHz Pentium IIs with 128Mbytes of memory, connected by a Cisco 2916XL 100Mbps Ethernet switch, with a dedicated BSP device driver for a 3Com 3C905B Ethernet card.
Silicon Graphics	Power Challenge	Four 75MHz R8000 processors with 4Mbytes unified secondary cache, 0.5Gbytes main memory.
	Origin 2000	Eight 195MHz R10000 with 4Mbytes unified secondary cache, 2Gbytes main memory.
Cray Research	T3D	Five hundred and twelve 150MHz DECchip 21064 with 64Mbytes per processor.
	T3E	Twenty four 300MHz DECchip 21164 with 128Mbytes of main memory per processor.
IBM	SP2 (+switch)	Eight 66.7MHz P2SC Thin1 nodes with 1Mbytes secondary cache and 128Mbytes of main memory per processor.
	SP2 (+Ethernet)	Eight 66.7MHz P2SC Thin1 nodes with 1Mbytes secondary cache and 128Mbytes of main memory per processor.
Intel	Pentium Pro NOW	Seventeen 266MHz Pentium Pros with 512K cache connected by 10Mbit Ethernet.
Digital	Digital 8400	Six 300MHz Alpha EV5 with 2Gbytes main memory.
	Alpha farm
Sun Microsystems	Sparcstation-20 SMP
	Sparcstation-5 NOW	Sixteen Sparcstation-5s connected by 10Mbit Ethernet.
Hitachi	SR2001
Convex	Exemplar
Parsytec	GC

Table of the BSP machine parameters

The benchmark program that was used to calculate the BSP machine parameters can be found here. A more detailed list of the BSP machine parameters, including communication rates in microseconds (useful for comparing the absolute performance of communication across architectures), can be found here.

MACHINE	s (Mflop/s)	p (no. procs)	l (flops)	g (flops/word)
BSP cluster	88	1	128	1.3
		2	5654	33.5
		4	11759	31.5
		8	18347	30.9
SGI PowerChallenge	74	1	226	0.50
		2	1132	10.2
		3	1496	9.5
		4	1902	9.3
Origin 2000	100.7	1	286	1.36
		2	804	8.26
		3	1313	8.36
		4	1789	10.24
		5	2474	11.06
		6	2963	12.25
		7	3867	14.28
Cray T3E	47	1	86	2.14
		2	269	2.61
		3	296	2.11
		4	357	1.77
		8	506	1.64
		9	552	1.57
		16	751	1.66
		20	880	1.63
		24	1013	1.70
Cray T3D	12	1	68	0.3
		2	164	1.0
		4	168	0.8
		8	175	0.8
		9	383	1.2
		16	181	1.0
		25	486	1.5
		32	201	1.4
		64	148	1.7
		128	301	1.8
		256	387	2.4
IBM SP2 (Switch)	26	1	244	1.3
		2	1903	7.8
		4	3583	8.0
		8	5412	11.4
Pentium Pro NOW	61	1	85	1.0
		2	52745	484.5
		4	139981	1128.5
		8	539159	1994.1
		10	826054	2436.3
		16	2884273	3614.6
Digital 8400	24	1	28	0.3
		2	286	2.5
		3	362	5.3
Sun Sparc-5 NOW	7.7	1	82	2.4
		2	72966	128.8
		4	369771	266.2
		8	835281	322.3
Multiprocessor Sun Sparc-20	10	1	24	0.4
		2	54	3.4
		3	74	4.1
		4	118	4.1
Hitachi SR2001	5.4	1	31	0.2
		2	1165	3.0
		4	2299	3.0
		8	3844	3.1
		16	4638	3.0
		32	6906	4.9
Convex Exemplar	10.5	1	60	0.16
		2	21373	8.3
		4	64457	9.2
		8	194476	11.3
Digital Alpha farm	10.1	1	29	0.3
		2	17202	81.1
		3	34356	83.0
		4	47109	81.3
Parsytec GC	19.3	1	98	1.0
		2	6309	113
		4	23538	143
		8	29080	254
		16	224977	342
		32	130527	658
IBM SP2 (Ethernet)	26	1	241	1.3
		2	18759	183.6
		4	39025	628.2
		8	88795	1224.1

The following description of the BSP parameters s, l, and g is taken from the paper:

``Questions and answers about BSP'', David B. Skillicorn, Jonathan M. D. Hill, and W. F. McColl. Compressed Postscript, 193K.

What is the BSP parameter s?

The values of the g and l parameters are normalised by the instruction rate, s, of each processor. Because this instruction rate depends heavily upon the kind of computations being done, the average of two different measured values is used:

Lower-bound for s: measures the cost of an inner product, where O(n) operations are performed on a data structure of size n. The value of n is chosen to be far greater than the cache size on each processor. This benchmark therefore gives a lower-bound megaflop rate for the processor as each arithmetic operation induces a cache miss.
Upper-bound for s: measures the cost of a dense matrix multiplication, where O(n^3) operations are performed on a data structure of size n^2. Because a large percentage of the computation can be kept in cache, this benchmark gives an upper-bound megaflop rate for the processor.

The value of s given for each machine in the table above is the average of the upper-bound and lower-bound values for s.
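The two measurements can be sketched in a few lines of Python. This is an illustrative sketch only, not the benchmark program used for the table: the function names, the problem sizes, and the flop counts (2n for the inner product, 2n^3 for the matrix multiplication) are assumptions made for the example. In an interpreted language the interpreter overhead swamps the cache effects described above, so the two rates will not necessarily bracket the true range in the way a compiled benchmark would.

```python
import time

def inner_product_rate(n):
    """Mflop/s for an inner product of two length-n vectors (counted as 2n flops)."""
    x = [1.0] * n
    y = [2.0] * n
    start = time.perf_counter()
    total = 0.0
    for i in range(n):
        total += x[i] * y[i]
    elapsed = time.perf_counter() - start
    return 2 * n / elapsed / 1e6

def matmul_rate(n):
    """Mflop/s for a dense n x n matrix multiplication (counted as 2n^3 flops)."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    c = [[0.0] * n for _ in range(n)]
    start = time.perf_counter()
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    elapsed = time.perf_counter() - start
    return 2 * n ** 3 / elapsed / 1e6

def estimate_s(vector_len=200_000, matrix_dim=40):
    """Return (inner-product rate, matmul rate, their average) in Mflop/s."""
    lower = inner_product_rate(vector_len)   # intended lower bound
    upper = matmul_rate(matrix_dim)          # intended upper bound
    return lower, upper, (lower + upper) / 2

if __name__ == "__main__":
    lower, upper, s = estimate_s()
    print(f"inner product: {lower:.1f} Mflop/s, matmul: {upper:.1f} Mflop/s, s = {s:.1f}")
```

On a real machine the vector length would be chosen far larger than the cache, as the text describes; the default sizes here are kept small only so the sketch runs quickly.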

What is the BSP parameter l?

The cost of a barrier synchronisation comes in two parts:

The cost caused by the variation in the completion times of the computation steps that participate. There is not much that an implementation can do about this, but it does suggest that balance in the computation parts of a superstep is a good thing.
The cost of reaching a globally-consistent state in all of the processors. This depends, of course, on the communication network, but also on whether or not special-purpose hardware is available for synchronising, and on the way in which interrupts are handled by processors.

The parameter l captures the latter of these costs. The diameter of the communication network, or at least the length of the longest path that allows state to be moved from one processor to another, clearly imposes a lower bound on l. However, it is also affected by many other factors, so that, in practice, an accurate value of l for each parallel architecture is obtained empirically.
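Because l in the table is expressed in flops, that is, normalised by s, it can be converted back to wall-clock time by dividing by s: a rate of s Mflop/s is s flops per microsecond. The short sketch below does this for two rows of the table; the helper name is an assumption made for the example.

```python
def flops_to_usec(flops, s_mflops):
    """Convert a cost in flop units to microseconds (s Mflop/s = s flops per microsecond)."""
    return flops / s_mflops

# Values taken from the table above.
barrier_cluster = flops_to_usec(18347, 88)  # BSP cluster, p = 8
barrier_t3e = flops_to_usec(506, 47)        # Cray T3E, p = 8

print(f"BSP cluster barrier: {barrier_cluster:.2f} usec")  # 208.49 usec
print(f"Cray T3E barrier:    {barrier_t3e:.2f} usec")      # 10.77 usec
```

The same conversion applies to g, giving microseconds per word instead of flops per word.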

What is the BSP parameter g?

The ability of a communication network to deliver data is captured by a BSP parameter, g, that measures the permeability of the network to continuous traffic addressed to uniformly-random destinations. As we have seen, BSP programs approximate such traffic. The parameter g is defined such that an h-relation will be delivered in time hg. Subject to some small provisos, hg is an accurate measure of communication performance over a large range of architectures. The value of g is normalised with respect to the clock rate of each architecture so that it is in the same units as the time for executing sequences of instructions.

Sending a message of length m clearly takes longer than sending a message of size 1. BSP does not distinguish between a message of length m and m messages of length 1 --- the cost in either case is mhg (refer to the Q&A paper to see why this isn't a problem). So messages of varying lengths may either be costed using the form mhg where h is the number of messages, or the message lengths can be folded into h, so that it becomes the number of units of data to be transferred.
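A small sketch of this costing follows. It assumes the usual BSP definition of an h-relation, in which h is the largest amount of data sent or received by any single processor, with message lengths folded into h as described above. The function names and the traffic pattern are made up for the illustration; the value of g is the IBM SP2 (Switch) entry for p = 4 from the table.

```python
def h_relation(sent, received):
    """h = maximum over processors of words sent or words received.

    sent[i] and received[i] are lists of message lengths (in words)
    for processor i, so message lengths are folded into h.
    """
    return max(max(sum(s), sum(r)) for s, r in zip(sent, received))

def comm_cost(h, g):
    """Cost of delivering an h-relation, in flops."""
    return h * g

g = 8.0  # IBM SP2 (Switch), p = 4, flops/word

# Processor 0 sends one 100-word message to each of the other three processors.
sent = [[100, 100, 100], [], [], []]
received = [[], [100], [100], [100]]
h = h_relation(sent, received)
print(h, comm_cost(h, g))  # 300 2400.0

# One message of 1000 words costs the same as 1000 messages of one word.
long_msg = comm_cost(h_relation([[1000], []], [[], [1000]]), g)
short_msgs = comm_cost(h_relation([[1] * 1000, []], [[], [1] * 1000]), g)
print(long_msg == short_msgs)  # True
```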

The parameter g is related to the bisection bandwidth of the communication network but they are not equivalent --- g also depends on factors such as:

the protocols used to interface with and within the communication network,
buffer management by both the processors and the communication network,
the routing strategy used in the communication network, and
the BSP runtime system.

So g is bounded below by the ratio of p to the bisection bandwidth, suitably normalised, but may be much larger because of these other factors. Only a very unusual network would have a bisection bandwidth that grew faster than p, so g is a monotonically increasing function of p. The precise value of g is, in practice, determined empirically for each parallel computer, by running suitable benchmarks.
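Since g is measured empirically, the values in the table need not follow the expected trend exactly. The sketch below checks a few machines from the table for a non-decreasing g; the dictionary and helper names are assumptions made for the example, and the numbers are copied from the table above.

```python
# (p, g) pairs copied from the table of BSP machine parameters.
G_BY_P = {
    "SGI PowerChallenge": [(1, 0.50), (2, 10.2), (3, 9.5), (4, 9.3)],
    "Origin 2000": [(1, 1.36), (2, 8.26), (3, 8.36), (4, 10.24),
                    (5, 11.06), (6, 12.25), (7, 14.28)],
    "Cray T3E": [(1, 2.14), (2, 2.61), (3, 2.11), (4, 1.77), (8, 1.64),
                 (9, 1.57), (16, 1.66), (20, 1.63), (24, 1.70)],
    "Pentium Pro NOW": [(1, 1.0), (2, 484.5), (4, 1128.5), (8, 1994.1),
                        (10, 2436.3), (16, 3614.6)],
}

def is_nondecreasing(pairs):
    """True if g never decreases as p grows."""
    values = [g for _, g in sorted(pairs)]
    return all(a <= b for a, b in zip(values, values[1:]))

for machine, pairs in G_BY_P.items():
    print(f"{machine}: g non-decreasing in p? {is_nondecreasing(pairs)}")
```

For these rows the check succeeds for the Origin 2000 and the Pentium Pro NOW and fails for the SGI PowerChallenge and the Cray T3E, where the measured g dips slightly as p grows. The table itself does not say why.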

Note that g is not the single-word delivery time, but the single-word delivery time under continuous traffic conditions. This difference is subtle but crucial.

An example of the BSP parameters being used in the analysis of a broadcasting algorithm can be found here.
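As a rough illustration of how l and g enter such an analysis, the sketch below compares two common ways of broadcasting n words to p processors. It is not a reproduction of the linked example. The cost formulas are the usual ones under the assumption that a superstep costs hg + l when local computation is ignored: a one-stage broadcast has the source send all n words to each of the other p - 1 processors, while a two-stage broadcast first scatters n/p words to each processor and then has every processor send its piece to all the others. The function names are assumptions; l and g are the BSP cluster values for p = 8 from the table.

```python
def one_stage_broadcast(n, p, g, l):
    """Source sends n words to each of the p - 1 others: one superstep."""
    return (p - 1) * n * g + l

def two_stage_broadcast(n, p, g, l):
    """Scatter n/p words to each processor, then exchange all pieces: two supersteps."""
    h = (p - 1) * n / p  # words sent (and received) per processor in each stage
    return 2 * h * g + 2 * l

p, g, l = 8, 30.9, 18347  # BSP cluster, p = 8

for n in (10, 10_000):
    one = one_stage_broadcast(n, p, g, l)
    two = two_stage_broadcast(n, p, g, l)
    cheaper = "one-stage" if one < two else "two-stage"
    print(f"n = {n}: one-stage {one:.0f} flops, two-stage {two:.0f} flops -> {cheaper}")
```

With these parameters the extra barrier makes the two-stage scheme more expensive for the 10-word broadcast, while for 10,000 words the reduced communication volume outweighs the second l.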

Jonathan Hill

Last updated: September 24th 1997