Multiplus Architecture

Multiplus is a distributed shared-memory high-performance computer with a modular architecture that supports up to 1024 processing elements and 32 Gbytes of global memory address space. The figure shows the basic Multiplus architecture. Within Multiplus, up to eight processing elements can be interconnected through a 64-bit double-bus system to form a cluster. Each bus follows a protocol similar to the one defined for the SPARC MBUS, but is implemented as an asynchronous bus.

The Multiplus architecture supports up to 128 clusters interconnected through an inverted n-cube multistage network. By adding processing elements and clusters, the architecture can cover a broad spectrum of computing power, ranging from workstations to powerful parallel computers. With the adopted structure, the cost and delay introduced by the interconnection network are small or even non-existent in parallel computers with up to 64 processing elements. On the other hand, very large parallel computers can be built without an extremely expensive or slow interconnection network.
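The sizing figures quoted above can be checked with a few lines of arithmetic. The following is only a sketch: the cluster and processing-element limits come from the text, while the interpretation of a Gbyte as 2^30 bytes and the `switch_ports` parameter (the size of the switches in the multistage network) are assumptions made for illustration, since the text does not specify them.

```python
# Limits stated in the text.
PES_PER_CLUSTER = 8
MAX_CLUSTERS = 128
# ASSUMPTION: 1 Gbyte is taken as 2**30 bytes.
GLOBAL_ADDRESS_SPACE_BYTES = 32 * 2**30


def total_processing_elements(clusters: int) -> int:
    """Number of processing elements in a machine with fully populated clusters."""
    if not 1 <= clusters <= MAX_CLUSTERS:
        raise ValueError(f"clusters must be between 1 and {MAX_CLUSTERS}")
    return clusters * PES_PER_CLUSTER


def address_bits(space_bytes: int = GLOBAL_ADDRESS_SPACE_BYTES) -> int:
    """Bits needed to address every byte of the global address space."""
    return (space_bytes - 1).bit_length()


def network_stages(clusters: int, switch_ports: int = 2) -> int:
    """Stages needed to connect `clusters` ports with k x k switches.

    ASSUMPTION: `switch_ports` (k) is hypothetical; the text does not give
    the switch size used in the Multiplus network. A single cluster needs
    no network at all, so the result is 0 in that case.
    """
    stages, reach = 0, 1
    while reach < clusters:
        reach *= switch_ports
        stages += 1
    return stages


if __name__ == "__main__":
    print(total_processing_elements(MAX_CLUSTERS))
    print(address_bits())
    print(network_stages(MAX_CLUSTERS))
```

With the maximum of 128 clusters the first function gives 1024 processing elements, which matches the limit stated in the first paragraph.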

The Multiplus architecture can be classified as a Non-Uniform Memory Access (NUMA) architecture, since a processing element can access memory in four different ways. The fastest memory access is a direct read from the local caches, which completes within one processor cycle. The second fastest is any read or write operation on the local memory bank since, in principle, it does not require the cluster bus system. The third fastest is a write, or a read that misses in the cache, to a memory location belonging to another memory bank within the same cluster. In this case the bus system must be used, and the bus arbitration time is added to the access time.

Lastly, there are the accesses generated by a processing element requesting information that is not in its local caches but is stored in a memory bank located in another cluster. In this case, the bus system of the source cluster, the multistage interconnection network and the bus system of the destination cluster must all be used to perform the access. Therefore, the arbitration times of both bus systems and the delay of the multistage interconnection network are added to the access time.
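The four access classes described above can be summarized in a small classifier. This is an illustrative sketch rather than a description of the actual hardware: the names `Location`, `AccessClass`, `classify` and `delay_components` are hypothetical, no latency values are given because the text states none, and treating writes to a remote cluster as the fourth class is an assumption (the text only mentions requests for information).

```python
from dataclasses import dataclass
from enum import Enum


class AccessClass(Enum):
    CACHE_READ = 1       # read satisfied by the local caches
    LOCAL_BANK = 2       # read/write on the local memory bank
    SAME_CLUSTER = 3     # other bank in the same cluster, via the cluster bus
    REMOTE_CLUSTER = 4   # bank in another cluster, via both buses and the network


@dataclass(frozen=True)
class Location:
    cluster: int   # cluster index
    element: int   # processing element (and its memory bank) within the cluster


def classify(requester: Location, target: Location,
             is_read: bool, cache_hit: bool) -> AccessClass:
    """Map one memory access to one of the four classes in the text."""
    if is_read and cache_hit:
        return AccessClass.CACHE_READ
    if target == requester:
        return AccessClass.LOCAL_BANK
    if target.cluster == requester.cluster:
        return AccessClass.SAME_CLUSTER
    # ASSUMPTION: remote writes are classified like remote read misses.
    return AccessClass.REMOTE_CLUSTER


def delay_components(access: AccessClass) -> tuple:
    """Extra delays, beyond the memory bank itself, added to the access time."""
    if access is AccessClass.SAME_CLUSTER:
        return ("cluster bus arbitration",)
    if access is AccessClass.REMOTE_CLUSTER:
        return ("source cluster bus arbitration",
                "multistage network delay",
                "destination cluster bus arbitration")
    return ()


if __name__ == "__main__":
    me = Location(cluster=0, element=1)
    far = Location(cluster=5, element=3)
    access = classify(me, far, is_read=True, cache_hit=False)
    print(access.name, delay_components(access))
```

Returning the names of the delay components rather than numbers keeps the sketch faithful to the text, which orders the four classes by speed without quantifying them.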

As shown in the figure above, Multiplus uses a distributed I/O system architecture. All processing elements within a cluster can be assigned to a single I/O processor, which is responsible for handling all I/O requests to or from mass storage devices issued by these processing elements.

Several design decisions have been taken to simplify the problem of maintaining consistency among the private caches of the processing elements within the Multiplus architecture. The first is to dedicate, in every cluster, one bus to instruction and data access operations and the other to block transfer operations, which occur in I/O and in memory page migration or copy operations. Only the instruction/data bus needs to be "snooped" by the cache controller and, as a result, the cache consistency problem can be solved within a cluster with the methods usually adopted in bus-based systems. In addition, a software approach has been adopted to keep caches consistent between clusters. Following a memory model based on the lazy release consistency approach, any access to shared regions of memory must be preceded by a "lock" operation. This ensures that only a single processor is accessing a particular critical region at any moment. Cache consistency is achieved with the help of the memory management hardware.
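The lock-based discipline described above can be illustrated with a toy model. This is a deliberately simplified sketch in the spirit of lazy release consistency, not the Multiplus mechanism itself, which the text says relies on the memory management hardware. All names (`RegionLock`, `Processor`, the page/version bookkeeping) are hypothetical. In this model a processor writes its modified pages back when it releases the lock, and stale cached copies held by other processors are invalidated only when they next acquire the same lock.

```python
class RegionLock:
    """Toy lock guarding a shared region; records page versions at release."""

    def __init__(self):
        self.holder = None
        self.released_versions = {}   # page -> version last released under this lock


class Processor:
    """Toy processor with a private cache in front of a shared memory dict."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = memory   # shared: page -> (version, value)
        self.cache = {}        # private: page -> (version, value)
        self.dirty = set()     # pages written since the last release

    def acquire(self, lock):
        if lock.holder is not None:
            raise RuntimeError("region already locked by " + lock.holder)
        lock.holder = self.name
        # Lazy step: stale copies are discarded only now, at acquire time.
        for page, version in lock.released_versions.items():
            cached = self.cache.get(page)
            if cached is not None and cached[0] < version:
                del self.cache[page]

    def read(self, page):
        if page not in self.cache:            # cache miss: fetch from memory
            self.cache[page] = self.memory[page]
        return self.cache[page][1]

    def write(self, page, value):
        version = self.cache.get(page, self.memory[page])[0]
        self.cache[page] = (version + 1, value)
        self.dirty.add(page)

    def release(self, lock):
        if lock.holder != self.name:
            raise RuntimeError("lock not held by " + self.name)
        for page in self.dirty:               # publish writes made under the lock
            self.memory[page] = self.cache[page]
            lock.released_versions[page] = self.cache[page][0]
        self.dirty.clear()
        lock.holder = None


if __name__ == "__main__":
    memory = {"page0": (0, 10)}
    lock = RegionLock()
    a, b = Processor("A", memory), Processor("B", memory)
    b.acquire(lock); print(b.read("page0")); b.release(lock)
    a.acquire(lock); a.write("page0", 20); a.release(lock)
    b.acquire(lock); print(b.read("page0")); b.release(lock)
```

Between A's release and B's next acquire, B's cache still holds the old copy of the page. That is acceptable under this discipline because B is not allowed to touch the shared region without first taking the lock.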


Núcleo de Computação Eletrônica/UFRJ