Scalable Parallel And Distributed Military Simulations Using The Speedes Framework

Dr. Jeffrey S. Steinman (jss@pebbles.jpl.nasa.gov)

California Institute of Technology

Jet Propulsion Laboratory

Abstract

A number of critical software design issues must be understood and correctly evaluated in order to produce fully scalable distributed military simulations. These issues can be quite complex. Correct decisions require much experience in software engineering and distributed simulation technologies. The consequences of incorrect design decisions are often not well known, leading to many systems which fail in a number of attributes including scalability, response time, and accuracy. This paper starts from fundamental principles and lays out a completely scalable, accurate solution to the construction of large distributed simulations.

Specifically, this paper discusses scalability issues concerning distributed synchronization strategies, computational capability, message traffic, incremental extensibility, and software engineering practices.

Introduction

Since this paper focuses on scalability in distributed military simulations, the first question that one might ask is, "What is scalability?"

Someone with a background in scientific parallel computing might answer this question by saying that scalability refers to how a parallel algorithm performs as a function of the size of the problem and the number of computational nodes. For example, when doubling the size of the problem and the number of nodes, does the application complete in the same amount of time?

Another person with a background in networking might focus more on the number of messages required to do the computation and ignore the scalability of the algorithm itself. Of course, when message sending is relatively inexpensive (let's say for a high granularity simulation), it may be safe to build an algorithm that does not scale well in terms of message sending, but does scale in terms of its computational performance on a particular hardware platform. On the other hand, when porting the same application to a network with higher message passing overheads, it might be more practical to focus on scalability issues concerning the message traffic generated by the application.

A different person, with a background in distributed optimistic simulation techniques, might be concerned with how many rollbacks or antimessages are generated as a function of the size of the problem and as a function of the number of available computing nodes. This type of scalability analysis is much more difficult to perform than the previous examples mentioned above, but it is much more focused on the critical performance issues of distributed simulations. We will discuss this in much more detail later in this paper.

Still another person, let's say with a background in software engineering, would be worried about the maintainability of the distributed simulation as it grows in size (in terms of the number of lines of code). Does the software design scale in terms of development and maintenance or does it fall apart due to its own internal complexity? People focusing on software engineering issues should definitely be concerned with the extensibility of the code as new functionality is incrementally added. For example, would it be possible to add a new type of object into a military simulation without destroying its entire infrastructure or ruining its parallel performance?

Together, these different viewpoints give a big picture of scalability with all of the relevant issues on the table. The goal of this paper is to provide this big picture of the important issues of scalability in terms of supporting: distributed synchronization techniques, unbounded scalable computational capability through distributed processing, minimal and scalable message sending techniques, unbounded extensible functional capability developed in an incremental fashion, and scalable software engineering practices that don't break down as the simulation grows in complexity and program size.

Because of the large number of subjects discussed in this paper, and because of the wide disparity in reader backgrounds, this paper is quite long. We have tried to provide enough background information on each subject to not only motivate, but to support each of our recommended design decisions. Readers who are familiar with some of the topics should certainly feel free to skim over the subject matter that is well understood. We have tried to start each section by motivating the problem and then going into a more detailed discussion concerning the solution. These details are often not essential to the big picture and can be skimmed over in a first reading of the paper.

Real-Time Synchronization

As our first critical design issue, we must decide how to synchronize the distributed simulation. Should we use the wall clock for synchronization (i.e., use a synchronized wall clock to provide the global value of the current simulation time), or should we provide a logically correct strategy that determines the current simulation time in a manner that is independent of the wall clock?

Some might argue that in a real time interactive simulation, it is better to let simulation errors occur (i.e., allow events to be processed out of their correct time order), than to allow the simulation time to fall behind the wall clock. Others would argue that it is meaningless to run a simulation that gets wrong answers and that if we must compromise, we should let the simulation occasionally lag behind the wall clock while processing events in their correct time order. Others might argue for a compromise between the two extremes (i.e., correctness and lagging behind the wall clock, vs. keeping up with the wall clock but getting incorrect results).

While we might want the simulation to support interactive uses (which does require synchronization to the wall clock), we may want the simulation to support analytical studies as well (which require no synchronization at all to the wall clock). Furthermore, it may be desirable to use the simulation as a "test-bed" where real world algorithms and/or hardware could be "bathed" in a well controlled virtual environment that is indistinguishable from the real world (this may or may not require wall clock synchronization).

What we are really addressing here is the issue of scalability in terms of the effort required to build separate simulators for each potential use vs. building one simulator that does it all. Building multiple simulators would, of course, be wasteful since most likely, there would be little code reuse for models built in vastly different simulation frameworks. On the other hand, building one simulation that does it all could be a software engineering nightmare if care is not taken. Do software engineering techniques exist that would allow one simulation to do it all (more on this later on)?

It is absolutely critical here to point out that if only the wall clock is used for synchronization, the simulation will never be able to efficiently support analytical studies. Of course, you could scale the wall clock to more efficiently support simulations that run slower or faster than real time. However, in order to support a correctly processed simulation (i.e., no events are processed out of time order), you would have to scale the wall clock in an overly conservative manner which would result in large wasteful idle times and would not make efficient use of the available computational resources.

On the other hand, if the distributed simulation is synchronized in a logically correct manner (and therefore, the simulation has the capability of running as fast as it can for supporting analytic studies), it is always possible to further synchronize the global simulation time to the wall clock. If the simulation runs faster than real time, then the global simulation time can be throttled to the wall clock in order to provide real-time interactive support. If the simulation can't keep up with the wall clock, then the simulation can either reduce its fidelity (possibly in an adaptive manner), or it could just run slower than real time. Either way, there are options. Also, as we will see later in this paper, optimistic techniques (with rollback support) allow future computations to be performed ahead of time, which could greatly help a real-time simulation keep up with the wall clock, even during phases when it normally would not be able to do so.

If only the wall clock is used for synchronization, we lose the ability to efficiently do analytical studies. If the simulation is logically synchronized, we very efficiently support analytic studies while still being able to efficiently support real-time interactive virtual uses. Therefore, the first important decision should be to provide logically correct synchronization to the simulation since nothing is lost by doing this, while we gain the ability to provide multiple uses for the same simulator.

Design decision 1: The distributed simulation should be synchronized in a logically correct manner while providing additional synchronization to the wall clock for real-time interactive uses.

Synchronization Strategies

There are numerous distributed synchronization strategies that have been researched and studied over the last twenty years. In this section, we describe these major strategies while pointing out their strengths and weaknesses. Specifically, we will show how these algorithms scale in terms of performance and as a function of different types of object interactions.

Lock-Step Time-Driven Synchronization

The first (and simplest) approach for synchronization is the lock-step, time-driven synchronization method. This method forces all computations to occur at regular intervals with a fixed time increment. For example, processing might occur at times 0.0, 7.5, 15.0, 22.5, ... until the end time of the simulation is reached. Each node does its processing at the current global simulation time and then waits for all of the other nodes to likewise complete. When all of the nodes complete their processing at the current global time value, they exchange messages and then increment their clocks in unison by the same amount in a lock-step fashion. Processing then resumes once again for the next time value.
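The lock-step cycle described above can be sketched as follows. This is a minimal single-process illustration, not SPEEDES code; the `Node` class and its `process` method are invented for the example.

```python
class Node:
    """Minimal stand-in for a processing node (illustrative only)."""
    def __init__(self):
        self.times = []

    def process(self, t):
        # Record the global time at which this node did its work.
        self.times.append(t)

def lock_step_run(nodes, dt, end_time):
    """Advance every node in unison by a fixed time increment dt."""
    t = 0.0
    while t < end_time:
        for node in nodes:
            node.process(t)   # all work occurs at the same global time
        # (a real system would exchange messages here, then barrier)
        t += dt               # all clocks advance together, in lock-step
    return nodes

nodes = lock_step_run([Node(), Node()], dt=7.5, end_time=30.0)
# every node processed at times 0.0, 7.5, 15.0, 22.5
```

Note that every node synchronizes at every increment whether or not it has useful work to do at that time, which is the source of the wasted updates discussed above.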

There are a number of problems with this approach. First of all, real world events don't really happen at fixed time intervals. Events usually occur at times that are completely arbitrary and are not necessarily synchronized in any way. One might respond to this problem by simply decreasing the time step interval in order to better approximate asynchronous real world events. The problem with this quick fix is that it usually forces too many synchronizations, thereby grossly decreasing the simulation's performance. Furthermore, many lock-step time driven simulators are very inefficient in that they update the entire simulation at every time step. This is extremely wasteful when most of the computations do not require such frequent updates.

Even when computations are done in a more efficient manner, this approach is still unable to escape from the problem of too many synchronizations, especially when time scales need to be arbitrarily tight (i.e., objects need to interact with other objects using very small time delays).

Because lock-step time-driven synchronization methods do not provide scalable computations and synchronization as a function of interaction time scales, this approach must be rejected because it does not scale.

Fixed Time-Bucket Synchronization

A much more powerful variant of the lock-step time-driven method is the fixed time-bucket approach. Here, events may be generated with arbitrary time values (i.e., discrete-event) with one constraint. Events are never allowed to generate new events for objects on different nodes tighter in time than a specified global lookahead value (let's call this value T). Like the lock-step time-driven approach, each node knows the global simulation time (let's call this value t), but instead of processing everything at one time value, events are processed asynchronously in time in their correct time order up to time t + T. Because of the global lookahead constraint, each node is assured that it will never receive an event message with a time-tag value less than t + T. This guarantees that the simulation is processed in a logically correct manner (i.e., events are guaranteed to be processed in their correct time order). When each node gets to time t + T, it stops to synchronize and exchange messages with the other nodes. Processing then resumes once again in the next cycle.
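A single node's processing cycle under the global lookahead constraint might be sketched as follows. The function name and data layout are illustrative assumptions; events are represented simply by their time tags in a priority queue.

```python
import heapq

def time_bucket_cycle(event_queue, t, lookahead):
    """Process all events with time tags in [t, t + lookahead) in time order.
    This is safe because the global lookahead constraint guarantees that no
    remote event can arrive with a time tag below t + lookahead."""
    processed = []
    while event_queue and event_queue[0] < t + lookahead:
        processed.append(heapq.heappop(event_queue))
    # At t + lookahead the node stops, synchronizes, and exchanges messages.
    return processed, t + lookahead

events = [3.2, 9.9, 0.4, 7.1, 12.5]
heapq.heapify(events)
done, t = time_bucket_cycle(events, t=0.0, lookahead=10.0)
# done == [0.4, 3.2, 7.1, 9.9]; the event at 12.5 waits for the next cycle
```

Events occur at their natural times within each bucket, but shrinking the lookahead T to support tighter interactions directly multiplies the number of synchronization barriers.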

The main benefit gained in the Fixed Time-Bucket approach (over the lock-step time-driven approach) is a much more efficient mechanism for supporting discrete-event processing. Events can occur at their natural time values and time scales instead of being forced to fit into locked time values. therefore, processing is more efficient as well as more accurate. The one drawback is that objects on different nodes are never allowed to interact tighter in time than T units into the future. While it is true that T could be made arbitrarily small, at some point there would be too many synchronizations and as before, we trade performance for fidelity. In other words, the Fixed Time-Bucket approach does not scale in terms of performance as the fidelity of the simulation increases. Therefore, Fixed Time Buckets must be rejected as an acceptable solution because it does not scale.

Chandy Misra Synchronization

The basic Chandy Misra synchronization approach (Chandy and Misra 79) offers an alternative to the Fixed Time Bucket approach in that it doesn't necessarily require a global lookahead value for synchronization (although known lookahead in general does help considerably). Instead, each object in the simulation is forced to know which other objects interact with it. One further constraint is that object pairs always interact in non-decreasing time. In other words, if object A interacts with object B, messages from A to B are always generated (and arrive) with increasing time values. This is sometimes referred to as a FIFO constraint.

While there are variants of the basic Chandy Misra strategy, in its basic form, each object keeps track of the messages that it has received in a separate input queue; one for each of its corresponding sending objects. If each input queue for an object has at least one unprocessed message, then the object knows that it is safe to process the message with the smallest time tag. When an object has one or more empty input queues, it cannot safely process any of its events because it does not know if it will receive a message in its past. Therefore, the object blocks until this situation is later resolved. When there are many local objects on each processing node, it is very likely that one or more local objects can process events. However, it is possible for a deadlock situation to arise, especially in simulations that have lots of feedback with sparse numbers of messages compared to the total number of object input queues. The Chandy Misra algorithm resolves blocking (and possible deadlocking) situations by sending null messages (i.e., synchronization messages, not event messages) which are used to provide time updates on empty input queues. Sometimes, deadlock detection and recovery techniques are also used.
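The basic safe-to-process rule can be sketched as below. This is an illustrative fragment, not a full Chandy Misra implementation (null messages and lookahead are omitted); queues map each sender's name to a FIFO list of message time tags.

```python
def safe_to_process(input_queues):
    """Basic Chandy Misra rule: an object may process its earliest message
    only when every input queue is non-empty. Returns the time tag of the
    message processed, or None if the object must block."""
    if any(len(q) == 0 for q in input_queues.values()):
        return None  # an empty queue might still deliver a message in the past
    # Every queue has a message, so the globally smallest time tag is safe.
    sender = min(input_queues, key=lambda s: input_queues[s][0])
    return input_queues[sender].pop(0)

queues = {"A": [5.0, 8.0], "B": [3.0], "C": [7.5]}
assert safe_to_process(queues) == 3.0   # earliest message across full queues
assert queues["B"] == []                # B's queue is now empty...
assert safe_to_process(queues) is None  # ...so the object must block
```

The memory problem for military simulations is visible even in this sketch: with N objects that can all interact, each object would need N input queues, and the odds of at least one being empty grow with N.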

Exploiting known lookahead can greatly enhance the performance of the Chandy Misra algorithm. If an object has an empty input queue, but it knows that it will not receive another message from the sending object until some future time, it may be safe for the receiving object to still process one or more of its events. In practice, exploiting known lookahead is almost always required to obtain good performance in the Chandy Misra algorithm.

While the Chandy Misra approach has had some success in applications where the topology of the simulation is fixed (i.e., each object knows which other objects interact with it), this is certainly not the case for military simulations where each object can potentially interact with virtually every other object. Furthermore, the non-FIFO nature of typical object interactions required in military simulations does not fit the Chandy Misra paradigm.

In the case of a military simulation each object would have to maintain an input queue for each other object (this obviously does not scale in terms of memory usage). Also, as the number of input queues for an object increases, its chances of having at least one empty input queue also increase. Therefore, the Chandy Misra synchronization strategy and its variants do not fit, nor do they scale well for military simulations.

Time Warp

Up to this point, we have looked exclusively at conservative techniques for providing synchronization. Events were only processed when it was safe to do so. Although we have not discussed all of the variations of conservative techniques, they all basically use the same principles (fixed time steps, exploitation of global and/or local lookahead, fixed topologies, null messages, etc.). We discovered that none of the conservative techniques met our requirements for full scalability. We now look at optimistic strategies; the most popular being the Time Warp algorithm (Jefferson 85, Fujimoto 90).

In Time Warp, events are processed optimistically with the hope that they are processed in their correct time order. A good Time Warp simulation framework automatically provides event management and a rollback infrastructure for each simulation object. The simulation time of each object is defined as the time tag of its last event processed. When an object receives a message in its past, it rolls back to the last correctly processed event before processing the new event. Events that were rolled back are either reprocessed, or rolled forward if possible (a technique called lazy cancellation). An example of a node sending a message in the past of another node is shown in Figure 1.

Figure 1: Optimistic event processing. An event on node 0 sends an event-scheduling message to node 1 in the past of node 1. Rollbacks may be required in order to correctly process events in their correct time order.

In order to support rollbacks, a state saving mechanism must be provided to restore the state of a simulation object (i.e., the set of variables that describe its state) to its original set of values that were present just before the event was processed. One technique that is commonly used is to save a copy of the entire state of the simulation object just before the event is processed. This approach is called full state saving. Rollback due to the arrival of a straggler message is accomplished by returning the state of the simulation object back to its original state just before the time tag of the straggler. This approach, however, does not scale well when the state of a simulation object is large (which may very well be the case for military simulations). Furthermore, it can become very complicated when dynamic data structures (like linked lists) are used, or when dynamic memory allocations are required. The full state saving method also has one other fatal drawback. It can very quickly consume all of the available memory on a processor and thereby force either page faulting, or even program crashes as disk swap space becomes filled. This can seriously impact the performance of optimistic military simulations and should therefore not be used.

A more efficient method is to use incremental state saving techniques (Steinman 93c). With incremental state saving, events must be rolled back in the reverse order that they were processed. Also, incremental state saving mechanisms must be directly used by the software developer which can make building a simulation slightly more difficult if the support is not made as transparent as possible. However, because full state saving does not provide scalable support for large object states, incremental state saving is by far the best choice for supporting large scalable military simulations. Therefore, optimistic military simulations should use incremental state saving techniques.
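The core idea of incremental state saving, logging each old value before it is overwritten and undoing the log in reverse, can be sketched as below. The class and attribute names are invented for illustration and are not the SPEEDES interface.

```python
class IncrementalState:
    """Incremental state saving sketch: log each attribute's old value just
    before it is overwritten; rollback replays the log in reverse order."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)
        self._log = []

    def assign(self, name, value):
        # Save the old value before modifying the state variable.
        self._log.append((name, getattr(self, name)))
        setattr(self, name, value)

    def rollback(self, n_changes):
        # Events must be rolled back in the reverse order they were processed.
        for _ in range(n_changes):
            name, old = self._log.pop()
            setattr(self, name, old)

tank = IncrementalState(x=0.0, fuel=100.0)
tank.assign("x", 5.0)      # event modifies position...
tank.assign("fuel", 97.0)  # ...and fuel
tank.rollback(2)           # straggler arrived: undo both changes
# tank.x == 0.0 and tank.fuel == 100.0 again
```

Note that memory consumed per event is proportional to the number of variables actually changed, not to the total size of the object's state, which is why this approach scales for objects with large states.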

One very important property of optimistic simulations is the notion of Global Virtual Time (GVT). GVT is defined as the time tag of the earliest unprocessed event (or message still in transit) in the simulation. Because events are normally never allowed to schedule new events in their past, processed events with time tags less than GVT can be committed. In other words, those events with time tags less than GVT have been correctly processed and will never be rolled back. Therefore, it is safe to reclaim memory, etc., for all events with time tags less than GVT. The trick then is to determine GVT as often as possible without bogging down the processing of the simulation.
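The definition of GVT can be made concrete with the following sketch. Real GVT algorithms must determine this minimum without a global snapshot, carefully accounting for messages in flight; here, for illustration only, each node's pending event times and in-transit message times are simply assumed to be visible.

```python
def compute_gvt(nodes):
    """GVT is the minimum time tag over every node's earliest unprocessed
    event and every message still in transit (simplified illustration)."""
    candidates = []
    for node in nodes:
        candidates.extend(node["pending"])     # unprocessed event time tags
        candidates.extend(node["in_transit"])  # messages not yet received
    return min(candidates)

nodes = [
    {"pending": [14.0, 20.5], "in_transit": [11.2]},
    {"pending": [17.3],       "in_transit": []},
]
gvt = compute_gvt(nodes)
# gvt == 11.2: every processed event with a time tag below 11.2 can be
# committed, its memory reclaimed, and its external messages released
```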

Another reason for frequent GVT updates is to support the tightest interactive simulations. Because humans in the loop, hardware in the loop, external modules, etc., cannot (and should not) participate in rollbacks, messages sent outside the core of the parallel simulation to the external world must be valid. Valid messages from an event can only be safely released when the event is committed (or if the time tag of the event is GVT; more on this later in the discussion on queries). If GVT is computed very frequently (let's say every 50 ms) in a real time interactive simulation, then interactive response times can be supported very tightly. On the other hand, if GVT is only updated every 5 seconds, then interactive response times will be sluggish. It is therefore very important for optimistic simulations to use an efficient and scalable GVT algorithm (see Steinman 95 for further details and references on scalable GVT algorithms).

When an event is processed, it might generate new events. These new events are scheduled by sending messages (usually, a more efficient, but transparent, approach supports the generation of local events for objects on the same node). When an event is rolled back, the simulation object must be restored to its original state and any messages that were generated must be canceled by sending corresponding antimessages (this means that Time Warp must maintain information concerning the messages that were generated by each event). The steps in rolling back an event are shown in Figure 2.

Figure 2: Rolling back an event in Time Warp. Here, a straggler message arrives for an object in the past of several erroneously processed events. Time Warp rolls back each erroneously processed event and then processes the straggler. As each event is rolled back, antimessages may be generated which can cause further rollbacks.

Figure 3 shows the steps involved when an antimessage is received for a message that has already been processed. The simulation object is rolled back to the event that should have never been scheduled. Then that erroneously scheduled event is removed from the event list of the simulation object. Of course, as events are rolled back due to the arrival of antimessages, they too might have incorrectly generated messages that must be canceled by releasing yet further antimessages. This leads to possible instabilities sometimes observed in Time Warp, namely cascading antimessage explosions. For example, in one simulation study, Time Warp generated about 250,000 messages that were followed by 230,000 antimessages (in other words, almost a total of 500,000 unnecessary messages were transmitted) while the simulation only required about 30,000 events to be processed (Steinman 93a).

Figure 3: Handling antimessages in Time Warp. Here, an antimessage cancels an event that has already been processed by the simulation object. Several events might have to be rolled back in order to cancel this event which can cause further antimessages to be released.

While this gross behavior may not always be exhibited, Time Warp can become unstable when the load balancing is not perfect, when the lookahead is poor, or when fan-in and/or fan-out event scheduling is required. Without providing a flow control mechanism, Time Warp does not meet our needs either. It can become very unstable when excessive rollbacks and antimessages are generated and therefore may not scale well under adverse conditions.

Breathing Time Buckets

An algorithm that completely solves the instability problems sometimes exhibited in Time Warp is the Breathing Time Buckets algorithm. Breathing Time Buckets is a cross between the Fixed Time Buckets algorithm and Time Warp. Like Time Warp, Breathing Time Buckets processes events optimistically. However, unlike Time Warp, messages generated by events are never actually released until it is known that they are valid. This means that bad messages are never released. This also means that antimessages (and all of the overhead required for their support) are not needed. While messages sent by Time Warp have maximal risk of not being valid, messages released by Breathing Time Buckets are sent in a risk-free manner.

Note, the term "risk" is commonly used by the parallel simulation research community to describe message sending risk and should not be confused with optimism. Breathing Time Buckets is fully optimistic, yet it is also risk-free.

Breathing Time Buckets is also like Fixed Time Buckets in the sense that events are processed in cycles. However, unlike Fixed Time Buckets, Breathing Time Buckets does not require a priori knowledge of a global lookahead value. Instead, it determines the optimal cycle width in an adaptive manner as events are optimistically processed. Thus, cycles are not fixed, but instead fluctuate (or "breathe").

The fundamental concept behind Breathing Time Buckets is the event horizon. The event horizon is a concept that has meaning even for sequential simulations (i.e., simulations that run on a single processor). Consider, for a moment, the sequential simulation shown in Figure 4. As events are processed, they may generate new events. If these new events are collected in a separate list, the simulation will eventually get to the point where its next event to be processed is in the temporary list. The time tag of this next event is called the event horizon. It is the point in time where events generated by the simulation fold back into the simulation (similar to the notion of the event horizon for a black hole, Hawking 88). At this point, those new events could be sorted and merged back into the main event list. This is actually the basis of the SPEEDES Queue data structure for event list management; a newer and more general data structure which also uses the event horizon principle is the SPEEDES Qheap, which exhibits logarithmic worst-case behavior with a very small coefficient.

Figure 4: The event horizon. Events are processed in cycles, with the next cycle boundary defined as the earliest time tag of a newly generated event in a current cycle.

The key point to observe here is that the events processed in each cycle could have been processed in parallel since, by definition, no straggler messages arrive with earlier time tags than events in each cycle. The trick, then, is to adaptively determine the global event horizon on the fly as events are processed. This is accomplished using optimistic event processing along with a set of non-blocking and blocking scalable communication reduction operations. The net result is the Breathing Time Buckets algorithm (Steinman 92).
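The sequential event-horizon cycle described above can be sketched as follows. For illustration, events are just time tags, and each processed event is assumed to schedule one new event a fixed delay into the future; both assumptions are simplifications of a real simulation.

```python
import heapq

def process_one_cycle(event_list, delay=4.0):
    """Process events until the earliest newly generated event would come
    next; that time tag is the event horizon (sequential sketch)."""
    new_events = []
    horizon = float("inf")
    while event_list and event_list[0] < horizon:
        t = heapq.heappop(event_list)
        # Processing an event at time t schedules a new event at t + delay
        # (an assumed fixed delay, purely for illustration).
        child = t + delay
        new_events.append(child)
        horizon = min(horizon, child)  # earliest generated event so far
    # Merge the new events back in; they begin the next cycle.
    for e in new_events:
        heapq.heappush(event_list, e)
    return horizon

events = [1.0, 2.0, 6.0]
heapq.heapify(events)
h = process_one_cycle(events)
# Events 1.0 and 2.0 are processed; the horizon is 5.0, so the event at 6.0
# must wait. The two processed events could safely have run in parallel.
```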

The one drawback to Breathing Time Buckets is the possibility that not enough events will be processed on the average in each cycle to remain efficient. Keep in mind, however, that Breathing Time Buckets will always process at least as many events per cycle as the Fixed Time Buckets algorithm. Like Time Warp, Breathing Time Buckets places no event scheduling constraints on the programmer (e.g., global lookahead or limited object interactions), but its performance will not be very good if very few events are processed on the average per cycle. This problem is exacerbated by simulations that frequently interact with objects on other nodes at very tight time scales.

Because of this problem, Breathing Time Buckets does not scale well for simulations that frequently require tight interactions among the simulation objects. In other words, for simulations that don't process enough events per cycle, too many synchronizations would be required and we would lose scalability. Fortunately, an analytic model has been developed (Steinman 94b) that predicts how many events can be processed on the average per cycle for steady state behavior. This model is briefly discussed later in this paper since it provides much insight concerning the intrinsic parallelism of a simulation.

Breathing Time Warp

The final synchronization algorithm (Breathing Time Warp) discussed in this paper does, in fact, meet all of our scalability requirements. It merges the best of both Time Warp and Breathing Time Buckets into a single package and gives direct control over the risk factor and flow control in a distributed simulation (Steinman 93a, Steinman 95). It should be mentioned here that if a global lookahead value is known, the Fixed Time Buckets algorithm could also be used to process events with time tags less than t + T in a low overhead, more conservative manner.

It is important to notice that the main problem exhibited by Time Warp arose from the fact that messages were released with very high risk. Cascading antimessage explosions are very likely when runaway nodes get way out ahead of the rest of the simulation. A runaway node, processing events with time tags far ahead of the rest of the simulation, releases many bad messages that must later be canceled as straggler messages from the slower nodes arrive. Events processed close to GVT, on the other hand, will probably not be affected by messages arriving from slower nodes.

As a general rule, events processed way out ahead of the rest of the simulation tend to have a larger chance of being rolled back than events close to the current GVT. For example, suppose each node tries to update GVT after locally processing 100 events beyond the current GVT, but a runaway node has processed 10,000 events beyond GVT while waiting for the other nodes to finish their 100 events. Very likely, many of the events processed by the runaway node (i.e., the ones that are way out in the future) were processed incorrectly. Because such events have a high probability of being rolled back due to straggler messages sent by slower nodes, it might not be a good idea for those events to immediately release their messages. Instead, it would be better to process those events in a risk-free manner so that if they are rolled back, the network does not get flooded with large numbers of antimessages.

Furthermore, we should note that as another general rule, rollbacks on runaway nodes usually do not affect the critical path of the simulation since the runaway node is still way out ahead of the rest of the simulation and can afford the luxury of extra rollback overheads. It is the bad messages which are later canceled by antimessages that do affect the critical path of the simulation since messages and antimessages must be processed by the slower nodes and since unnecessary messages waste precious bandwidth in the communication network.

As a second flow control mechanism, when a runaway node gets too far out ahead of the rest of the simulation (in terms of numbers of events processed beyond GVT), it is a good idea to stop event processing altogether on the runaway node until the rest of the simulation catches up a bit. Without going into a lot of detail, given below are the main ideas behind the Breathing Time Warp algorithm.

The first Nrisk (a user selected run-time parameter) events processed locally on each node beyond GVT have their messages released right away (as in the Time Warp mechanism). After that, messages are held back and the Breathing Time Buckets algorithm basically kicks in. When Ngvt events are processed, or when the event horizon is determined, each node requests a GVT update (note, GVT is updated when all of the nodes have made that request). If a node ever processes Nopt events beyond GVT, it stops event processing until the next GVT value is determined. An example of a typical processing cycle for a node is given in Figure 5.
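The per-node phase logic just described can be sketched as a simple decision function. The function name is invented for illustration; the parameter names follow the paper, and the thresholds shown in the example are arbitrary.

```python
def btw_phase(events_beyond_gvt, n_risk, n_gvt, n_opt):
    """Return the phase a node is in, given how many events it has
    processed locally beyond GVT (Breathing Time Warp sketch)."""
    assert n_opt > n_risk and n_opt > n_gvt  # Nopt must exceed both
    if events_beyond_gvt < n_risk:
        return "time-warp"    # release messages immediately (with risk)
    if events_beyond_gvt < n_gvt:
        return "risk-free"    # hold messages, Breathing Time Buckets style
    if events_beyond_gvt < n_opt:
        return "request-gvt"  # keep processing, but request a GVT update
    return "blocked"          # runaway node: wait for the next GVT cycle

assert btw_phase(10,   n_risk=100, n_gvt=300, n_opt=1000) == "time-warp"
assert btw_phase(150,  n_risk=100, n_gvt=300, n_opt=1000) == "risk-free"
assert btw_phase(500,  n_risk=100, n_gvt=300, n_opt=1000) == "request-gvt"
assert btw_phase(1000, n_risk=100, n_gvt=300, n_opt=1000) == "blocked"
```

In effect, Nrisk controls how much message-sending risk is tolerated, while Nopt throttles runaway optimism; tuning these two run-time parameters provides the flow control that plain Time Warp lacks.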

Figure 5: A typical processing cycle (not drawn to scale) in the Breathing Time Warp algorithm. Each node processes events with risk until Nrisk events are locally processed beyond GVT. Then events are processed in a risk-free manner until either the event horizon is crossed or Ngvt events have been locally processed beyond GVT. At this point, GVT is updated and events are committed. Note that, while in this example, Ngvt is greater than Nrisk, this is not a requirement. However, Nopt must always be larger than both Nrisk and Ngvt. If a runaway node processes Nopt events beyond GVT, it stops processing events until the next GVT cycle.
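The per-node decision logic just described can be sketched as a simple pair of functions. This is an illustrative sketch only; the function names and integer counters are assumptions, not the actual SPEEDES implementation.

```cpp
#include <cassert>

// Illustrative decision logic for one node in Breathing Time Warp; the
// function names and integer counters are assumptions, not SPEEDES code.
enum class Mode { Risky, RiskFree, Blocked };

// How should the next event beyond GVT be processed?
Mode modeFor(int eventsBeyondGVT, int Nrisk, int Nopt) {
    if (eventsBeyondGVT >= Nopt)  return Mode::Blocked;   // runaway: wait for GVT
    if (eventsBeyondGVT >= Nrisk) return Mode::RiskFree;  // hold messages back
    return Mode::Risky;                                   // release immediately
}

// A node asks for a GVT update at the event horizon or after Ngvt events.
bool shouldRequestGVT(int eventsBeyondGVT, int Ngvt, bool eventHorizonCrossed) {
    return eventHorizonCrossed || eventsBeyondGVT >= Ngvt;
}
```

Risky processing resumes only after GVT advances, which resets each node's count of events processed beyond GVT.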

Breathing Time Warp brings in the necessary flow control for both controlling risk and optimism. Performance is only limited by the amount of parallelism in the simulation itself, not by the synchronization algorithm.

Design decision 2: Conservative synchronization strategies do not scale for distributed military simulations. Optimistic strategies offer complete general simulation support but Time Warp and Breathing Time Buckets potentially do not scale in terms of performance. By providing a general handle on the risk factor, Breathing Time Warp provides a fully scalable synchronization mechanism that meets all of the scalability goals of military simulations.

Scalability And Parallelism

To this point, we have discussed various techniques for synchronizing parallel and distributed simulations. We found that the Breathing Time Warp approach completely scales in terms of being able to handle arbitrary object interactions (i.e., any object can interact with any other object at any future time), as well as keeping unnecessary synchronization messages down to a minimum (i.e., low numbers of antimessages and a very efficient GVT algorithm).

Now, we must ask what it takes to get good parallel performance out of a discrete-event simulation in general. While we have chosen a synchronization algorithm that completely scales, all of the careful arguments we have made concerning synchronization are of little value if the simulation itself has little or no intrinsic parallelism in its construction. As a general rule, a simulation with lots of parallelism typically has few rollbacks while a simulation with little parallelism normally exhibits large numbers of rollbacks.

In particular, we must understand how distributed simulations scale in terms of size (i.e., the number of simulation objects and their number of interactions), the number of processing nodes, and in how the events are scheduled.

Scalability and the Number of Simulation Objects

First of all, we must understand that (to first order) straggler messages that arrive on a node only roll back their corresponding simulation object, not the entire node. All too often, simulationists (even those in the parallel simulation research community) misunderstand this basic principle and, sometimes unaware of their error, study rollback mechanisms that are grossly inefficient. Figure 6 shows a more detailed version of Figure 1, where now each node actually contains ten simulation objects. The thing to notice about this example is that even though Node 1 receives a message in its past, still no incorrect events have been processed for the object receiving the message, and so no rollbacks are required (unless the new message, when processed, schedules other events that cause rollbacks, but this is a second-order effect).

Figure 6: An example showing how a message from node 0 received by node 1 in its past may not cause rollbacks since no incorrect events were processed by the message's simulation object. Note that this message does not roll back the entire set of events on node 1 that were processed with time tags greater than the straggler.

As a general rule of thumb, the more simulation objects there are per node, the better (in terms of reducing the number of rollbacks). A good way to understand this is to imagine that all of the events were mapped into a single event list. Each node, however, is responsible for processing the events for its local set of simulation objects. In other words, each node will be processing different events in the imaginary global event list. Some nodes may be way out in front of the rest of the other nodes, while other nodes may be way behind. We can characterize runaway nodes and slow nodes by thinking about how many events in this imaginary global event list separate them.

With this in mind let's revisit Figure 6. If there are ten simulation objects per node, then if node 1 is ten events ahead of node 0, it would only have to roll back one event on the average when node 0 sends a message with a zero time delay. If there were 1,000 simulation objects per node, then node 1 could get 1,000 events ahead of node 0 and still only have 1 rollback on the average for the same situation. This means that the more objects per node, the further apart the nodes can become in their event processing without causing more rollbacks.

Therefore, in order to obtain more parallelism in our simulation (and thus better scalability), it is far better to decompose the simulated objects into the smallest constituents possible instead of lumping a large number of objects into a single entity. For example, it would not make sense to take a military model that manages 100 tanks and plug it into an optimistic simulation as a single simulation object. It would be far better to model each of the 100 tanks individually. It might also be wise to decompose each tank further into components that are somewhat self-contained as separate simulation objects themselves.

Design decision 3: It is best to decompose the simulation into many small objects (where it makes sense) in order to reduce the number of rollbacks. Simulations with large numbers of objects scale better than those with small numbers of objects.

Scalability and Event Scheduling Strategies

Now, let's examine what techniques for event scheduling give rise to the most parallelism, and which techniques do not.

Earlier in this paper, we introduced the concept of the event horizon. We noted that events processed in each event horizon cycle could have been processed in parallel (thus leading to the Breathing Time Buckets algorithm). One way to characterize the amount of parallelism in an application is to measure (or predict) how many events on the average are processed in each event horizon cycle. If this number is large (let's say, much larger than the number of computing nodes), then we should observe a high degree of parallelism in the simulation (assuming that the load balance is good).

A theoretical model that very accurately predicts the average number of events processed per cycle has been developed (Steinman 94b) for simulations in steady state (i.e., where each event, when processed, generates a single new event according to a time-invariant probability distribution; this is sometimes called the hold model). The results of this model are briefly presented here. First, some definitions:

n	=	Total number of events at the start of a cycle.
t	=	Simulation time (t = 0 is the start of the cycle).
f(t)	=	Random time distribution for event generation.

The cumulative probability distribution is given as,

(See Equation #1 in GIF file)

and G(t) is defined as 1 - F(t).

Then, the pending events are distributed in time according to an event density function given by:

(See Equation #2 in GIF file)

Where,

(See Equation #3 in GIF file)

The probability of an event at time t being processed in the current event horizon (remember, we are assuming that t = 0 is the start of the cycle) is given by,

(See Equation #4 in GIF file)

The average number of events processed in an event horizon cycle m can be computed as,

(See Equation #5 in GIF file)
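The average cycle size m can also be estimated empirically. Below is a hedged Monte Carlo sketch of the hold model (an illustration, not code from the paper): each processed event schedules exactly one new event, generated events are held until the cycle ends, and a cycle ends at the event horizon. The function name and the uniform seeding of the initial pending events are assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <queue>
#include <random>
#include <vector>

// Monte Carlo sketch of the hold model: estimate the average number of
// events processed per event horizon cycle for n pending events.
double averageEventsPerCycle(int n, int cycles,
                             const std::function<double(std::mt19937&)>& delay) {
    std::mt19937 rng(12345);
    std::priority_queue<double, std::vector<double>,
                        std::greater<double>> pending;  // min-heap of time tags
    std::uniform_real_distribution<double> u(0.0, 1.0);
    for (int i = 0; i < n; ++i) pending.push(u(rng));   // rough initial spread
    long long processed = 0;
    for (int c = 0; c < cycles; ++c) {
        double horizon = std::numeric_limits<double>::infinity();
        std::vector<double> generated;                  // held until cycle ends
        while (!pending.empty() && pending.top() < horizon) {
            double t = pending.top();
            pending.pop();
            double newTime = t + delay(rng);            // hold model: one new event
            generated.push_back(newTime);
            if (newTime < horizon) horizon = newTime;   // event horizon shrinks
            ++processed;
        }
        for (double g : generated) pending.push(g);     // release at end of cycle
    }
    return static_cast<double>(processed) / cycles;
}
```

Running this with a fixed delay (global lookahead) yields cycles of roughly n events, while an exponential delay yields cycles that grow only as the square root of n, matching the analytic results discussed below.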

Global Lookahead:

One very important class of event-generation distributions arises when there is a minimum time delay T between an event and its generated event (i.e., a parallel simulation that might lend itself to the standard Fixed Time Bucket approach). It is possible to make a conservative estimate of the number of events m processed in a cycle.

(See Equation #6 in GIF file)

Since m is proportional to the number of pending events in the simulation, this equation tells us that as we double the size of the simulation, the average number of events processed from the start of each cycle up to T time units later also doubles. However, beyond this boundary, the number of events may not scale so nicely.

Exponential Distributions:

Another distribution, commonly used in queuing models, is the exponential function f(t)=exp(-t). For exponential distributions, m can be calculated analytically. The results are,

(See Equation #7 in GIF file)

As we see, simulations that generate events with a fixed exponential distribution do not scale very well: the average number of events per cycle grows only as the square root of the number of events in the simulation, whereas linear growth would mean perfect scalability.

Near and Far Future Event Scheduling:

This model has also been evaluated using other distributions (flat, triangle up, triangle down, symmetric and asymmetric bell shaped, two hump, near future, far future, etc.). The result is that it is better to schedule events into the far future rather than the near future. The worst kind of event generating strategies (in terms of the average number of events per event horizon) are those that sometimes schedule events in the near future and sometimes in the far future. The amount of time an event is scheduled into the future is often called lookahead by the parallel simulation community. Lookahead and large numbers of objects per node are the keys to obtaining parallelism.

While sometimes it is not possible to schedule events with lots of lookahead, creative algorithms can be developed that exploit as much lookahead as possible. These algorithms should perform much more efficiently than simpler algorithms that contain very little lookahead. In order to preserve scalability in terms of the simulation's performance, it is worth the effort to produce algorithms that have built-in lookahead. Otherwise all of the care taken to produce a scalable simulation will be wasted.

Design decision 4: It is important to exploit lookahead whenever possible when developing a distributed simulation. Simulations with poor lookahead do not scale.

Scalable Communications

One of the most critical issues, when focusing on the subject of scalability in distributed simulations, is in specifying a set of useful communication protocols that are fully scalable. In this section, we do not address particular message types used by distributed simulations. Rather, we address the subject of communications at a lower level that pertains to the basic infrastructure needed to support distributed simulations in general. Later, we discuss application level messages in the critical algorithm section. Discussed below are some of the basic operations required for scalable synchronization.

Non-Blocking Synchronizations

One very important operation that should be supported as efficiently as possible is for a node to be able to request a global synchronization without blocking (Gupta 89). One common use of this capability is when a node requests a GVT update. This request should be provided without stopping event processing. When the last node makes its GVT update request, all of the nodes (roughly) simultaneously break out of their event processing in order to compute a new value for GVT. The new value for GVT is then cooperatively updated by all the nodes in an efficient and scalable synchronous manner. Let's call this first operation nb_sync. As its partner, a second operation, called nb_check, returns a flag indicating when all of the nodes have issued their calls to nb_sync.

One might ask how these operations could be supported in a scalable manner over a wide variety of networks.

One technique (originally used on the old Mark III Hypercubes developed by JPL and Caltech) was to use a global hardware line that was connected to each node. The default voltage on the global line was set to a nominal value by each node. As each node issued its call to nb_sync, its applied voltage was turned off. Thus, when the last node issued the nb_sync call, the voltage on the line was effectively zero. With hardware support, an interrupt was then fired on all of the nodes. In the interrupt handler, a flag (used by the nb_check operation) was set in order to indicate that all of the nodes have completed the fuzzy barrier.

On networks of workstations, a server may be used to support the nb_sync function. Each node sends a message to the server as it issues its nb_sync. When the server receives this message from all of the nodes, it sends back a special message to each node indicating that the fuzzy barrier has completed. The nb_check operation simply polls for this message in order to determine when all of the nodes have completed the fuzzy barrier. The server approach does not scale well since all of the nodes send their messages to a single server which then broadcasts back to the nodes (there are a total of 2 x n messages required here, assuming reliable message sending protocols are used). However, most networks are connected through a single link (like Ethernet) anyway, and don't scale, so this approach does no worse than what could be expected from the network. If multiple communication paths are set up in the network, then a hierarchy of servers could be used to offer this service in a more scalable manner.

On shared memory workstations, this operation can be provided very efficiently without using potentially expensive semaphore system support by using binary tree structures that keep readers and writers consistent (note, we observed a speedup in supporting the nb_sync operation of about 2,000 when not using UNIX semaphores on shared memory Sun Workstations). Networks of shared memory workstations can be hooked up to a server in order to provide non-blocking synchronizations in a very efficient manner where each machine (not each processor of each machine) exchanges messages with the server.
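On shared memory, the nb_sync/nb_check pair can be sketched with a single atomic counter. This is an illustrative fuzzy barrier, not the binary-tree implementation described above; the generation counter is an added assumption so the barrier can be reused across GVT cycles (assuming no node races a full generation ahead).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Illustrative fuzzy barrier (nb_sync / nb_check) using one atomic counter.
class FuzzyBarrier {
    const int nNodes;
    std::atomic<int> arrived{0};
    std::atomic<int> generation{0};
public:
    explicit FuzzyBarrier(int n) : nNodes(n) {}

    // Request a global synchronization without blocking event processing.
    void nb_sync() {
        if (arrived.fetch_add(1) + 1 == nNodes) {
            arrived.store(0);            // last node resets for the next cycle
            generation.fetch_add(1);     // ...and completes the fuzzy barrier
        }
    }

    // Poll: have all nodes issued nb_sync for generation g?
    bool nb_check(int g) const { return generation.load() > g; }
};
```

Each node calls nb_sync once per GVT cycle, keeps processing events, and periodically polls nb_check to learn when the barrier has completed.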

Global reduction networks (constructed as a high-speed binary tree network of fast registers) have been explored that connect workstations together through standard VME bus interfaces (Pancerella and Reynolds 93). These global reduction networks are extremely fast and can provide non-blocking synchronizations very easily by simply counting up the number of nodes that have made the request. When the number is equal to the number of nodes, each node can easily detect that the fuzzy barrier has been completed. This type of global reduction support is also provided in hardware on Thinking Machine's CM5 and could be used in a similar manner to provide non-blocking synchronizations.

On hypercube, mesh, or other types of parallel computers, the nb_sync operation could be supported in a number of scalable ways. One way would be to pass background messages in a tree structure much like the global reduction network and shared memory approaches. The top node of the tree would then issue a broadcast (possibly using the same tree structure in reverse) back to the rest of the nodes indicating that the non-blocking synchronization has completed. Another way to support this would be to have each node communicate with its neighbors that it has issued an nb_sync. When the node receives these same messages from each of its neighbors, it repeats this operation, over and over until all of the nodes have effectively communicated with each other. For hypercubes, this operation takes log2(n) iterations (where n is the number of nodes). For two dimensional meshes, this operation may take sqrt(n) iterations.

Global Reductions

A second type of operation involves computing globally reduced results of integer or double precision values. An example of this might be to determine the Global Virtual Time as the minimum of each node's local virtual time. A second example of this might be to determine how many messages are still in transit by having each node keep track of the number of messages it has sent and the number of messages it has received. The number of messages still in transit would then be the global sum of the number of messages sent by each node minus the global sum of the number of messages received by each node (this can be done in one step). This operation is used by the GVT algorithm in order to flush out messages that are still in transit in order to obtain the true value for GVT.
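The in-transit message count described above reduces to a single global sum. A minimal sketch, with globalSum standing in for the network's blocking reduction operation (the function names are assumptions for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// globalSum stands in for the network's blocking global reduction.
long long globalSum(const std::vector<long long>& perNode) {
    return std::accumulate(perNode.begin(), perNode.end(), 0LL);
}

// Messages still in transit: each node reduces (sent_i - received_i)
// locally first, so a single global sum does the work of two.
long long messagesInTransit(const std::vector<long long>& sent,
                            const std::vector<long long>& received) {
    std::vector<long long> diff(sent.size());
    for (std::size_t i = 0; i < sent.size(); ++i)
        diff[i] = sent[i] - received[i];
    return globalSum(diff);
}
```

A nonzero result tells the GVT algorithm that messages must still be flushed before the true GVT value can be determined.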

Other than the global hardware line approach, global reductions can be supported through mechanisms very similar to the nb_sync and nb_check operations except that they are done synchronously in a blocking manner. Non-blocking versions could also be supported but they are typically not necessary so we will assume that global reduction operations are always done in a blocking manner. If we really need to perform a non-blocking reduction operation, we could use an nb_sync first, followed by the global reduction (after the fuzzy barrier has been completed) to achieve nearly the same effect.

Multirouter

One very important type of operation used in synchronous parallel applications (and used extensively by the Fixed Time Buckets, Breathing Time Buckets, and Breathing Time Warp algorithms) is the multirouter. Multiple messages with arbitrary destinations from each node are given to the multirouter which then routes these messages in a synchronous manner. After the route command is given, each node is assured that all of its incoming messages (sent by other nodes) have safely arrived. The messages can be processed with assurance that no other messages will arrive in this cycle from the multirouter.

On hypercubes, the Crystal Router algorithm can be used to support the multirouter in a scalable manner. Each node determines which of its messages need to first go through channel 0 in order to get to its destination. A large meta-message (actually containing many messages packed into a single buffer) is constructed by each node and sent to its neighboring node through channel 0. Each node then repeats this step using channels 1, 2, etc. By the time log2(n) iterations have completed, each node will have its complete set of messages. This approach could be supported for other topologies such as meshes in a similar manner, with performance determined by the topology.
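The Crystal Router exchange pattern can be sketched as follows. This is a single-process simulation of the routing logic (the Msg struct and function names are illustrative assumptions); on a real hypercube, each channel-c step would be an actual meta-message exchange with neighbor node ^ (1 << c).

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Msg { int dest; int payload; };

// Simulated Crystal Router on n = 2^d nodes: held[i] is node i's mailbox.
// On channel c, every message whose destination differs from its current
// node in bit c is packed up and sent to node ^ (1 << c).
// After d steps, each message sits in its destination node's mailbox.
std::vector<std::vector<Msg>> crystalRoute(int d,
        std::vector<std::vector<Msg>> held) {
    const int n = 1 << d;
    for (int c = 0; c < d; ++c) {
        std::vector<std::vector<Msg>> outgoing(n);   // one meta-message per node
        for (int node = 0; node < n; ++node) {
            std::vector<Msg> keep;
            for (const Msg& m : held[node]) {
                if (((m.dest ^ node) >> c) & 1)      // must cross channel c
                    outgoing[node].push_back(m);
                else
                    keep.push_back(m);
            }
            held[node] = std::move(keep);
        }
        for (int node = 0; node < n; ++node)         // exchange with neighbor
            for (const Msg& m : outgoing[node])
                held[node ^ (1 << c)].push_back(m);
    }
    return held;
}
```

Messages whose destination bit already matches the current node on channel c are simply kept, which is why log2(n) iterations suffice regardless of the traffic pattern.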

On networks of workstations, a server could again be used to receive and transmit messages to the various nodes. Each node packs a large meta-message (containing many messages) and sends it to the server. When the server receives a meta-message from each of the nodes, it untangles the meta-messages and generates new meta-messages for each node with their appropriate set of packed messages. Thus, the total number of messages would be 2 x n (where n is the number of nodes). Since these meta-messages are typically large, network bandwidth is utilized efficiently. As discussed earlier, networks of workstations are normally interconnected through serial links (like Ethernet) which don't scale anyway, so having a single server does not hurt the performance. If a multipath network were provided, we could use multiple servers to support the multirouter in a more scalable manner (in terms of utilizing bandwidth, not numbers of messages).

The multirouter has been efficiently supported on shared memory workstations without using potentially costly semaphores by defining a separate piece of shared memory for each node (i.e., there are n separate shared memory segments shared by each node on a machine). Each node copies its outgoing messages into its buffer along with a header that links messages for the same destination node into a linked list. Then, in another carefully arranged shared memory segment, each node receives a pointer to the first message from each node. Each node simply goes through its set of linked messages (one list created by each other node) in order to extract all of its received messages. The shared memory multirouter can be connected to a server in a straightforward manner to efficiently support distributed shared memory workstations.

Asynchronous Message Sending

General asynchronous message passing is normally provided as the foundation of every parallel and distributed network. The only extra requirements that we add are:

The SPEEDES Communications Library

These communication services should be codified into a single interface that could be provided in a seamless manner on a number of different networks and computer architectures. Such a library has been built and is called the SPEEDES Communications Library (Steinman 94c). While this library has been specifically designed to very efficiently support the fundamental needs of parallel and distributed simulations, it actually provides a very useful set of general distributed programming operations. The SPEEDES Communications Library could be used to support applications other than parallel discrete-event simulations.

Design decision 5: The communications library for supporting scalable distributed and parallel simulations should be encapsulated in a manner that is architecture independent in order to provide scalability to a wide variety of platforms. The interface to this library should not assume: message passing as its only means for communication, shared memory or thread support, or special hardware that provides global synchronizations and/or reduction operations. However, the internal support for these operations should use whatever special hardware is available. Asynchronous message passing should be reliable and non-blocking.

Scalable Interactive Simulations

We have briefly discussed interactive simulations earlier in this paper. Some have seriously questioned whether optimistic synchronization techniques can actually provide interactive support. After all, you can't roll back hardware, nor can you roll back humans who are interacting with the simulation in the loop. Unfortunately, very little research (other than what has been provided by SPEEDES) has been published by the parallel simulation community on this extremely important subject. However, it turns out that optimistic techniques with rollback support are actually superior to conservative techniques when it comes to supporting interactive simulations; even when running sequentially on a single node!

Motivation

One problem faced by conservative interactive real-time simulations is in keeping up with the wall clock. Conservative simulations must be able to keep up with the wall clock at all times. Optimistic simulations, however, can process future events ahead of time instead of blocking (waiting for the wall clock to catch up to the current simulation time) and thereby average costly computations over time. When there are lots of simulation objects on each node, rollbacks due to external interactions will be minimized. Lazy cancellation with tolerances also limits the number of events reprocessed after rollbacks when interactions don't effectively change the results of processed events.

As a second potential interactive use of optimistic analytic simulations (again, this does not require multiple nodes to be effective), imagine what could be done if garbage collection (i.e., reclaiming the state information stored for supporting rollbacks) is never performed. In other words, imagine a simulation that could be rolled back (or forward) anywhere in time through user interactions. This capability could be extremely powerful for providing mechanisms to support what-if studies, probing deeply into simulated object behavior, and so on.

In order to support this use, state saving memory overheads must be extremely low. Incremental state-saving techniques, such as the delta exchange mechanism provided by SPEEDES, are essential so that the entire state history of the simulation can be saved in memory (Steinman 93c). Furthermore, it is imperative for events to not only be able to roll back, but also roll forward as users move forward and backwards in time through their interactions. Highly efficient (in terms of memory and cpu requirements) support for lazy cancellation with tolerances must be provided in order to allow slight changes (for example, due to external interactions) that don't really affect anything, to not cause events to be reprocessed (this capability is fully supported in SPEEDES through its lazy cancellation mechanisms).

A third way optimistic support could be used for interactive real-time simulations is to imagine a situation where live data is constantly being received in real time. Based on this data, imagine that we need to predict the future (through simulation). For example, suppose that our job is to monitor and control a complex spacecraft while telemetry data is being received in real time. A conservative approach would be to use the telemetry data to set (or really calibrate) the current state values of the simulated spacecraft, and then run the simulation every so often (let's say every half hour) to determine future states of the spacecraft. However, a much more efficient (and accurate) way to provide this capability would be to let the live telemetry data feed into an optimistic simulation, rolling back only things that have changed.

By choosing the optimistic approach, many computations could be saved, thus making more efficient use of the computing resources. For example, there is an excellent chance that many (maybe most) of the computations performed in each iteration, using the conservative simulation approach, would be repeated from run to run. Optimistic techniques (especially when using lazy cancellation with tolerances) would only recompute things that change significantly. As an important note, for these types of simulations, GVT should be roughly throttled by the wall clock to allow live data to roll the simulation back to the current time.

Of course, there are many other types of applications that could benefit from this kind of real-time interactive simulation support, especially those involving military intelligence and battle planning.

A fourth way optimistic interactive techniques could be applied to large-scale military (or virtual) simulators is to model the large numbers of objects (e.g., computer generated forces) on a centralized high-speed parallel supercomputer (or possibly a local area network of shared memory workstations) while providing the ability for external modules (which could be physically located hundreds or even thousands of miles from the centralized parallel simulation) to connect into the central parallel simulation. These external modules could create or attach themselves to ghosted objects in the simulation and thereby interact with other objects in the simulation (which, of course, could be also controlled by external modules).

By using this approach, many of the problems faced by the interactive military simulation community would disappear. For example, support for parallel proximity detection and distributed environmental effects could be provided in an efficient manner on the parallel computer because of its scalable bandwidth and low latency for supporting communications. Logically, external modules only interact with their ghosted counterpart inside the parallel simulation. This approach not only provides a more scalable communications platform (since bandwidths tend to scale better on parallel machines than on distributed networks), but also provides rigorously correct synchronization. When external modules are free to interact with each other directly, rigorous synchronization (other than by use of the wall clock) becomes almost impossible.

With this philosophy for hooking up external modules, and providing computer generated forces in a scalable fashion, the topology of the parallel and distributed simulation looks like a smart and very fast central switch (where the central switch is actually the parallel simulation and the external modules are users interacting with the simulation from potentially thousands of miles away). This approach places the high-speed, tight-interaction communication needs locally in the parallel simulation, while keeping the less demanding interactions between external modules and their corresponding ghosted objects on a manageable time scale (whatever the distributed network can manage). Instead of each external module communicating with every other external module, they only interact with their ghosted object in the parallel simulation. The parallel simulation (with its higher bandwidth) delivers appropriate information to the other objects inside the parallel simulation in a scalable manner.

Design decision 6: Parallel and distributed military simulations should be decomposed into two parts. First, a parallel component that models large numbers of potentially tightly interacting objects which exploits the bandwidth where it is most needed. Second, external modules running distributed over large distances can connect to this parallel simulation by making a logical connection to a replicated ghost object inside the parallel simulation. This approach scales because it does not have large numbers of external objects interacting with each other directly using network topologies that typically don't scale or that use unreliable protocols (such as UDP/IP). Also, this approach provides rigorously correct synchronization apart from simply using the wall clock.

Synchronization Issues

Various optimistic simulators claim to be interactive. This claim is frequently misleading. For example, it is not unheard of for someone to say, "Our system is interactive. You run it first, then look at the output, change some parameters, and then run it again." This is not what interactive simulations are all about! Below is a list of requirements for interactive distributed simulators:

The first requirement rules out the interact-in-between-run strategy mentioned above. Interactive simulations must allow the user to participate with the simulation while it is running live.

The second requirement really motivates the need for very frequent, and highly efficient, GVT updates. Data is only released from the simulation as events are committed (right after GVT is updated). This data corresponds to past events that were processed from the last GVT to the new GVT (the new GVT defines the current simulation time). The user can then inject events with time tags at GVT but no earlier.

The third requirement actually works very well in optimistic simulations if the query event is time-tagged at the current GVT. Normally, optimistic simulations must wait until GVT is updated before releasing data to the outside world. However, if the query event is time-tagged at GVT, then (by definition) it cannot be rolled back (note, this implies that a processed event is not rolled back when a straggler message arrives with the same time tag). Therefore, as users inject query events into the simulation, those query events should be time-tagged to GVT and then immediately processed. The query event releases its output back to the user immediately. If you think about it, it really does make sense to time-tag query events at GVT because users normally want to know the state of an object "now" (and "now" is defined as GVT).

The fourth requirement rejects methods that define special GVT synchronizations to occur in the simulation in order to send data to the outside world (Agre 91). It is possible to funnel data from processed events to a special object in the simulation and then to have that object schedule a special synchronization event that causes a GVT update so that it can release its data to the outside world, but this is not a very desirable mechanism. It is also not clear how to schedule such an event appropriately (it mixes up simulation times with interaction times). For example, what if we could do 10 GVT updates between each specially scheduled synchronized output event? Clearly, this approach is less efficient than the SPEEDES approach that allows events to release their data when GVT is updated. From a practical standpoint, it is also more reasonable to let events send their data directly to the outside world. In SPEEDES, events are actually C++ objects themselves, which means that they can store buffers that are released to the outside world when they are committed. A special virtual function, called commit(), is called for each C++ event object, allowing events to release their data in any form they want as they are committed (Steinman 94c).
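The commit() idea can be sketched in a few lines of C++. This is a hedged illustration only; the class names and members are assumptions, not the actual SPEEDES event hierarchy.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of commit(): events buffer their external output and
// release it only after GVT passes their time tag.
class Event {
public:
    virtual ~Event() = default;
    virtual void process() = 0;   // optimistic: may be rolled back
    virtual void commit() {}      // called once GVT passes the event's time
                                  // tag; the only safe point for external I/O
};

class StatusEvent : public Event {
    std::string buffer;           // output held until the event commits
    bool released = false;
public:
    void process() override { buffer = "object state at GVT"; }
    void commit() override { released = true; /* release buffer externally */ }
    bool outputReleased() const { return released; }
    const std::string& output() const { return buffer; }
};
```

Because output lives in the event object until commit() is called, a rollback can simply discard the event (and its buffer) without any antimessage ever reaching the outside world.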

The fifth requirement forces hybrid methods into the synchronization strategy. It makes no sense to force an interactive user (possibly thousands of miles away) to participate in the high-speed GVT algorithm, especially if GVT needs to be updated every 100 ms. Furthermore, imagine what would happen if a remote user's machine goes down. We cannot allow this situation to bring down the entire simulation. Therefore, we must allow for dynamic connections and disconnections into the simulation through the use of hybrid synchronization. In SPEEDES, when data is sent from an event to an external module, a barrier may be established to hold back GVT until the external module replies with another message. If the external module crashes for some reason, the barrier is easily removed as the TCP/IP communication socket is closed. An example of this is shown in Figure 7 where the Real-time Situation Display (RSD) graphics package obtains time updates from the simulation.

Figure 7: Hybrid synchronization. The Real-time Situation Display (RSD) sends a message to the parallel simulation. The message is time-tagged by SPEEDES to the current GVT value. A barrier is set up inside of SPEEDES to hold back GVT if for any reason the RSD does not respond quickly enough (for example, GVT3 has been held back to GVT1 + Barrier because the RSD took too long to process its third cycle).

The sixth requirement is very tricky to meet. Imagine that an external module is receiving messages from various different nodes in the simulation. How does it know what time it can safely process or display the data? The external module must have its own estimate of GVT in order to safely process the data that it has received (Tung and Steinman 93). This topic is of extreme importance and will be discussed in the remaining portion of this section.

Synchronizing External Modules

The SPEEDES approach for solving the problem of determining GVT for external modules is to use a special gateway TCP/IP communications server called the Host Router. SPEEDES automatically opens a socket connection from each of its processing nodes to the Host Router as the simulation initializes. In a similar manner, external modules also connect to the Host Router by opening a socket. All messages from events on a given node use the same socket for their communications to the Host Router. The Host Router then funnels messages from each node into a single stream that is delivered to the external module. In a reverse manner, external modules send messages for any particular simulation object through their socket to the Host Router which then routes the messages to the appropriate node that contains the target simulation object. SPEEDES reads the message when it arrives on the node and turns it into an event for the target simulation object.
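The funneling and routing behavior of the Host Router might be sketched as follows. This is an illustrative toy model, not the actual SPEEDES Host Router code; in particular, the placement rule used here (object id modulo node count) is an invented assumption standing in for whatever object decomposition the simulation uses.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Toy sketch of the Host Router: messages from any node socket are merged,
// in arrival order, into a single stream for the external module; messages
// from the external module are routed to the node hosting the target object.
struct Msg {
    int targetObject;      // local id of the target simulation object
    std::string payload;
};

class HostRouter {
    std::queue<Msg> toExternal_;              // single funneled stream
    std::vector<std::queue<Msg>> toNode_;     // one queue per node socket
public:
    explicit HostRouter(int nodes) : toNode_(nodes) {}

    // A node's committed event sends a message destined for the external module.
    void fromNode(const Msg& m) { toExternal_.push(m); }

    // The external module sends a message for a particular simulation object;
    // the router forwards it to the node that owns that object (modulo rule
    // is an assumption for this sketch).
    void fromExternal(const Msg& m) {
        toNode_[m.targetObject % static_cast<int>(toNode_.size())].push(m);
    }

    std::size_t externalStreamSize() const { return toExternal_.size(); }
    std::size_t nodeQueueSize(int n) const { return toNode_[n].size(); }
};
```

The single funneled stream is what makes it possible, later in the section, to reason about the order in which the external module sees time-tagged messages.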

Now, let's assume that starting at simulation time, t = 0, events in the simulation need to send messages over to the external module. These messages could be coming from events on any or all of the nodes. Each message is given a time tag that is identical to the time of its sending event. As these messages arrive at the Host Router, they are queued up for delivery to the external module in the same order that they are received by the Host Router (in other words, somewhat randomly). As the external module reads these time-tagged messages, it needs to know a time value (the external module's estimate of GVT) that it is safe to process, or in the case of graphical tools, display information to the user.

This is accomplished by having a special simulation object (let's call it SyncObject) on one of the nodes (say node 0) that provides simulation time information to the external module. This object also provides flow control which keeps the simulation from flooding the external module with more messages than it can handle. How does this all work?

Since real event processing starts at simulation time t = 0, a barrier is set up at time -d (where d is a very small quantity). GVT is not permitted to advance beyond this time until the external module connects to the Host Router (which opens a new socket, creating a new socket id) and then sends its first message over to the SyncObject that is on node 0 of the parallel simulation. This first message removes the barrier so the simulation is free to advance its global virtual time. The first message received by SyncObject schedules an event for each of the other nodes to make this socket id globally known. These events are time tagged at -d/2 so that the external module's socket id is known before time t = 0. When GVT is able to advance beyond 0, each node has the correct socket id for the external module so that as events are committed, they can correctly identify and send their messages to the external module.

Now, at the same time that the SyncObject sends the socket id messages over to the other nodes, it also sends back a message to the external module telling it that it is at time -d. Furthermore, a barrier is set up T simulation time units into the future to prevent the simulation from getting too far out ahead of the external module (just in case the external module falls behind). T is a run-time (or calculated) parameter that prevents the external module from ever falling more than T units of simulation time behind the parallel simulation. The basic flow of messages at the start of the simulation is given in Figure 8.

Figure 8: Sequence of initial messages related to the RSD connecting to the simulation. First, the RSD connects to the Host Router which relays the first message to the SyncObject on node 0 at time -d. Then the RSD receives a message back telling it that the time is -d. At the same time, messages with the socket id for the RSD are sent to nodes 1, 2, and 3. At time 0, the first real simulation events are processed. Later, when GVT is computed for the first time, messages from events with time tags between 0 and GVT1 are sent to the RSD. Messages will be sent after every GVT update as events are committed. While this occurs, the RSD will continue to request time updates from the SyncObject on node 0.

From then on, the external module simply receives messages from various events as they are committed on their nodes while also ping-ponging special synchronization messages back-and-forth with the SyncObject. Every time SPEEDES receives a message from the external module, it removes the barrier that was previously established. When running in the aggressive mode, this message is automatically time tagged to GVT by SPEEDES in order to allow the external module to interact as frequently as possible with the simulation (a very desirable thing when the external module is displaying graphics). The non-aggressive mode lets the external module assign a time tag to the message for more strict synchronization. As the message is processed, it creates a new barrier T simulation time units into the future and then relays its event time back to the external module. This is shown in Figure 9.
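The barrier and flow-control window just described might be sketched as follows. The class and method names are invented for illustration (this is not the SPEEDES API); the initial barrier at -d and the window of T simulation time units follow the text.

```cpp
#include <cassert>

// Toy sketch of the SyncObject flow-control barrier. GVT may never advance
// past the barrier; every reply from the external module re-arms the
// barrier T simulation time units into the future.
class SyncBarrier {
public:
    explicit SyncBarrier(double T) : T_(T), barrier_(0.0 - 1e-9) {} // barrier at -d

    // Called when a SyncObject message arrives from the external module,
    // time-tagged to the current GVT (aggressive mode). Removes the old
    // barrier and arms a new one T units ahead; the time value is echoed
    // back to the external module.
    double onExternalMessage(double gvt) {
        barrier_ = gvt + T_;
        return gvt;
    }

    // The GVT algorithm clamps its estimate to the barrier, so the
    // simulation can never run more than T ahead of the external module.
    double clampGVT(double candidate) const {
        return candidate < barrier_ ? candidate : barrier_;
    }

private:
    double T_;
    double barrier_;
};
```

Note how the initial barrier at -d holds the whole simulation until the external module has checked in, exactly as in the startup sequence of Figure 8.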

Figure 9: The Real-time Situation Display (RSD) graphics external module sends messages through the Host Router to the SyncObject object, which is inside the parallel simulation. These messages are automatically time tagged to the simulation's current GVT value, and then an external GVT message is sent back to the RSD, which updates its external time. Meanwhile, time tagged messages are sent to the RSD from possibly all of the nodes at the end of each GVT cycle as events are committed. This mechanism ensures that the RSD always receives all of its messages with time tags less than the current external GVT value, so that it is safe for the RSD to update its display in time up to the external GVT value.

Since the SyncObject's message (assuming the aggressive mode is used) is time-tagged to GVT, its corresponding event could send time information back to the external module right away since the event will never be rolled back. However, care must be taken to avoid a race condition if the external module sends more SyncObject messages during the same GVT cycle. If this ever happens (i.e., if a second SyncObject message arrives before GVT is updated), the time information should not be released right away, but instead sent when the event is committed. This eliminates the potential race condition.

Is it safe now for the external module to process the messages it has received with time tags less than the time value of the latest SyncObject message that has been received? The answer is, "not yet!"

Consider what would happen if a node sends an event message with time tag 10 over to the Host Router during the commit phase, but for some reason, the Host Router is not able to read the message right away. The sender (at the application layer, not the transport layer) thinks that the message has been delivered, but in reality, it has not actually been read yet by the Host Router. Meanwhile, assume that the new GVT value is now 11 and a time message is sent from the SyncObject to the Host Router. It is very possible that the time message (with time tag 11) is read by the Host Router before the event message with time tag 10 is read. When the messages are queued up for transmission to the external module, a time error will occur because the event message with time tag 10 arrives after the time update message with time tag 11. How is this problem solved?

This problem is solved by adding one more simple mechanism to the Host Router protocol. Whenever a node sends a message to the Host Router during a GVT cycle, it asks for an acknowledgment after the last message has been sent. Because of the FIFO nature of TCP/IP, when the response comes back from the Host Router, each node can be guaranteed that its messages have been read by the Host Router and are queued up for transmission to the external module. A global sync after each node receives its acknowledgment guarantees that the Host Router has actually read all of the messages that were sent in a given GVT cycle. Because the Host Router normally runs on a machine that is local to the SPEEDES parallel simulation, long distance acknowledgments across wide area networks are not required.
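The reason the acknowledgment closes the race can be demonstrated with a few lines of code. This is an illustrative sketch (the message strings and class are invented, not the SPEEDES protocol): because TCP/IP delivers in FIFO order, the ack request is read only after every event message that was sent before it, so a reply to the ack proves the router has read them all.

```cpp
#include <cassert>
#include <queue>
#include <string>

// Toy model of one node-to-Host-Router socket. The node pushes its event
// messages followed by an ack request; the router drains the socket in
// FIFO order, so reaching the ack request implies every earlier message
// has already been read and queued for the external module.
struct HostRouterSocket {
    std::queue<std::string> fifo;   // TCP/IP delivers in FIFO order
    int messagesRead = 0;

    void send(const std::string& msg) { fifo.push(msg); }

    // Router side: read until the ack request; returns true if it was found
    // (the router would then reply to the node).
    bool routerReadUntilAck() {
        while (!fifo.empty()) {
            std::string m = fifo.front();
            fifo.pop();
            if (m == "ACK_REQUEST") return true;
            ++messagesRead;          // event message now safely queued
        }
        return false;
    }
};
```

Once every node has received its reply, the global sync guarantees the SyncObject's time message cannot overtake any event message from the finished GVT cycle.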

With this logical structure in place, an external module can safely process its messages up to the time tag of its last received time-synchronization message.

Scalable Software Engineering

Up to this point, we have focused on the scalability issues that relate primarily to synchronization and performance. Now we switch gears and discuss the extremely important subject of scalable software engineering practices, more specifically, scalable object-oriented practices. By scalable software engineering, we mean the software paradigm for creating a distributed simulation that not only performs well, but also does not break down in terms of its complexity as the number of lines of code increases. Scalable software engineering practices should not only produce understandable, modifiable, and reusable code, but should also be able to more efficiently use larger work forces as the simulation grows in size.

Normally, programs start out as prototypes that become useful enough to warrant further development. Patches are made to the program as new functionality is required. Of course, it is impossible to predict all of the changes that a piece of software will be forced to embrace over its entire life cycle. It is precisely these changes that end up destroying the beauty and design of the original program. At some point, the program must be thrown away and rebuilt from scratch in order to meet the new requirements. Unfortunately, most managers/sponsors/developers do not recognize when a new paradigm is required to meet their needs. This not only delays the introduction of superior technology, but also results in wasted resources spent to maintain software that is past its time.

To make this problem even more difficult, military simulations are often asked to confederate with other military simulations to model larger domains than either one was originally designed to accommodate. Nobody wants to throw away code, especially code that has been validated; doing so is not politically (and possibly, in the short term, economically) acceptable. While frameworks such as ALSP (Weatherly 93) and standardized protocols such as DIS (Loral 92) were designed to make these integrations as painless as possible, there really is no magical solution at hand. It takes an enormous effort to transform already existing programs to be compliant with the newer standards.

Simulations that were developed as standalone systems usually have their own way of managing events and simulation time. It becomes very difficult to efficiently integrate multiple simulations that have their own time management schemes. Furthermore, remember our discussion earlier in this paper when we discovered how conservative synchronization strategies do not scale. Yet, the military simulation community has been asked to integrate multiple conservative simulations using the same conservative techniques that we have already shown do not scale! In addition, by confederating separate simulations through some standardized magic "glue", we still have one major problem. Load balancing will only be as good as the slowest module in the confederation. Although CORBA is taking steps in the right direction, we normally don't have the option of integrating separate applications into one meta-simulation where simulated objects can migrate from one machine to another in order to produce optimal load balancing. This would require a logically correct global time management scheme (other than simple use of the wall clock) that has yet to be officially embraced by the distributed military simulation community.

The net result is that simulators are linked together that have no hope of ever providing anything close to optimal performance. Nor do these techniques offer scalable software engineering. Instead, simulations, long past their useful life cycles, get more and more complex as they are asked to do things that they were never intended to do.

Object Management

The first specific topic we discuss concerning scalable software engineering is object management. How do we integrate different types of objects, very likely developed by different organizations, into a common integrated simulation environment? There are probably many ways to solve this problem, but the SPEEDES solution is to provide an object manager interface for each type of simulation object. Object managers inherit from a base class provided by SPEEDES (note, SPEEDES uses the C++ object-oriented programming language). This provides (through the use of virtual functions) a generic mechanism for SPEEDES to invoke object manager methods without knowing any specific details about the individual object managers. Object managers are responsible for creating, destroying, initializing, and decomposing their set of simulation objects. They also provide information to the rest of the simulation concerning generic information or algorithms that are appropriate for their object types. An object manager for each type of object is instantiated on each node in the simulation to locally provide object information to the rest of the simulation. Object managers are generally ignorant of other object types and object managers.

By separating the management of simulation objects into separate object managers, we have introduced a mechanism where software developers can manage their objects independently from each other, thereby creating a parallel development environment that scales. In other words, there are no critical dependencies to slow down software development. For example, one developer (responsible for providing objects of a certain type) might read input files in one format, while another developer could use a different format and strategy for creating and initializing its objects. Since the object managers are encapsulated from each other, it is not necessary to force all object managers to do things exactly the same way (although it might be nice to have some consistency). This is one of the huge benefits derived from the object-oriented strategy of encapsulation.

Because object managers are responsible for creating and destroying objects of their types, and because once the simulation is started, the only mechanism for processing is through events, it is imperative for object managers to be simulation objects themselves. The appropriate place to create and initialize an object of a specific type is in its object manager since only object managers really know how to handle the specific initialization characteristics of their objects. If an event schedules another event to create a simulation object at a certain simulation time, that event should be applied to the appropriate object manager on the appropriate node. If the first event is rolled back, then there needs to be a mechanism to cancel (and possibly roll back) the second event for the object manager that was supposed to create the new simulation object. All of the rollback machinery is provided to object managers if they also inherit from the base class simulation object (note, all simulation objects must inherit from a common base class simulation object in order for various synchronization mechanisms to be transparent).

Design decision 7: Separate object managers, inheriting from a standard base-class object manager, should manage objects of different types. This provides encapsulation for the creation, destruction, initialization, and decomposition of different types of objects, while providing object-specific methods to the rest of the simulation. Furthermore, object managers should also inherit from the base class simulation object to provide rollback support as creation events are rolled back. By providing encapsulated object managers, it is possible for different classes of simulation objects to be linked together in one framework.
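The object-manager pattern can be sketched in a few lines of C++. All of the class and method names below are invented for illustration (the actual SPEEDES base classes differ in detail); the point is that the framework invokes only the generic virtual interface and never needs to know the concrete object type.

```cpp
#include <cassert>
#include <map>

// Minimal sketch of the object-manager pattern (illustrative names only).
class SimObj {
public:
    virtual ~SimObj() {}
};

// Generic interface the framework sees; one concrete manager exists per
// simulation object type, instantiated on every node.
class ObjectManager {
public:
    virtual ~ObjectManager() {}
    virtual SimObj* create(int localId) = 0;   // create/initialize an object
    virtual void destroy(int localId) = 0;     // destroy an object
    virtual int count() const = 0;             // local object count
};

class Tank : public SimObj {};

// A concrete manager: it alone knows how Tank objects are built, read in,
// initialized, and decomposed across nodes.
class TankManager : public ObjectManager {
    std::map<int, SimObj*> tanks_;
public:
    ~TankManager() { for (auto& p : tanks_) delete p.second; }
    SimObj* create(int id) override { return tanks_[id] = new Tank; }
    void destroy(int id) override { delete tanks_[id]; tanks_.erase(id); }
    int count() const override { return static_cast<int>(tanks_.size()); }
};
```

Because each manager is encapsulated behind the same virtual interface, two teams can implement TankManager and, say, a SensorManager with entirely different input formats without ever coordinating.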

Events as Objects

Object-oriented programming has been recognized, especially by the simulation community, as a superior paradigm for building complex applications. Yet, how do we characterize the common traits of good object-oriented software designs?

"Object-oriented is well on its way to becoming the buzzword of the 1980s. Suddenly everybody is using it, but with such a range of radically different meanings that no one seems to know exactly what the other is saying." (Cox 1987)

Recently, the comp.simulation newsgroup was polled to find out what experienced simulationists thought were the necessary ingredients for an application to be truly object oriented. The responses were very interesting.

"I'm not certain what you are getting at but in my view, simulations written in SIMULA (the 1st OOPL) are the epitome of object-orientation."

"To be truly object oriented, I think you need to accurately model objects in the real system you intend to simulate."

"I've been very satisfied with SES/Workbench, a simulation tool from Scientific and Engineering Software, a company in Austin Texas. It is object based (no inheritance), but sufficiently powerful."

The first response almost seems to say that no matter how you design your application, if you use the SIMULA programming language, you will have a beautifully structured object-oriented simulation. There must be more to object-oriented simulations than just choosing the right language! It has been said that, "A good FORTRAN programmer can write FORTRAN in any language!" Probably, most of us have seen the truth of this statement, especially when criticizing the programming practices of others. However, what about our own programming practices? After all, we in the simulation community were responsible for inventing and popularizing object-oriented programming. Are we really doing object-oriented programming or are we just saying that we are because we use object-oriented programming languages?

The second response was very interesting in that it suggested that the modeled objects are indistinguishable from the real objects. While this might be a very desirable property in test-bed emulations, it doesn't always apply to simulations in general, where statistical properties and random number generators are frequently used. These real-world objects would fit better in the context of external modules, not as simulation objects which can roll back due to the arrival of straggler messages.

The third response gives a very practical view of object-oriented systems, but it is questionable whether an environment, no matter how powerful and object-based it is, could pass the test of being object oriented without providing one of the most powerful properties of object-oriented design, namely inheritance with support for virtual functions. In other words, without language support, it is doubtful whether object-oriented programming can really be supported in a reasonable manner.

With disagreement coming from virtually everyone involved in object-oriented programming, it is no wonder that there are as many failures as there are successes in building object-oriented applications. For example, let's look at what the text books have to say concerning how to choose the objects in applications.

"Just use as your first software objects representations of the obvious external objects." (Meyer 1988)

"Read through the requirements specification carefully again, this time looking for noun phrases... You can carefully glean candidate classes." (Wirfs-Brock 1990)

These strategies tell novice programmers to simply map real-world objects to software objects as a starting point. This is not necessarily bad advice, but unfortunately most programmers then proceed to define how all of the objects interact by defining methods within these objects. The objects are then forced to know about each other because they directly interact with each other. This can result in spaghetti object dependency graphs. In this case, objects are no longer independent from each other and thus they are not reusable.

On the other hand, consider what would happen if all of the interactions between passive "noun" objects were mediated by a new set of active objects called "verb" objects. This is precisely what happens when events are objects. This is possibly much closer to what Booch had in mind.

"The tangible things in the problem domain, the roles they play, and the events that may occur form the candidate classes and objects of our design, at its highest level of abstraction." (Booch 1991)

By separating event code into active event objects that act from the outside on passive simulation objects, the simulation objects stay ignorant of the application and therefore are reusable. In this paradigm, simulation objects don't even have to know that they are being used in a simulation. This means that they can actually be reused in real applications after they have been validated by the simulation.

As a general rule, only passive objects are reusable. When an object becomes active, it immediately ties itself to the specific software application. Active objects can also be reusable, but only in the context (or application) in which they were developed. Thus, when building a simulation, one of the most important steps should be to consider the tradeoffs of building active simulation objects vs. building passive simulation objects with active event objects mediating the interactions.
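The noun/verb split above can be made concrete with a small sketch. The Radar and ScanEvent classes are invented for illustration (they are not part of SPEEDES): the passive "noun" object knows nothing about events or simulation, while the active "verb" object mediates the interaction from the outside.

```cpp
#include <cassert>

// Passive "noun" object: contains only domain knowledge, has no idea it
// is inside a simulation, and is therefore reusable in any application.
class Radar {
    double range_;
public:
    explicit Radar(double range) : range_(range) {}
    bool detects(double distance) const { return distance <= range_; }
};

// Active "verb" object: tied to this simulation, it mediates the
// interaction by acting on the passive object from the outside.
class ScanEvent {
public:
    bool process(const Radar& radar, double targetDistance) {
        return radar.detects(targetDistance);
    }
};
```

Note that Radar could be lifted straight into a real (non-simulation) application after validation; only ScanEvent would need to be rewritten.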

Scalability of Active Simulation Objects:

Let us evaluate the scalability of these two choices. First, consider what happens if simulation objects are designed as active objects (i.e., event code is either provided through event handling methods, or the process model is used where simulation objects are treated as interacting processes). This is almost always how simulations are built. Now, imagine that our simulation has 10 different types of simulation objects. Suppose that the simulation grows in complexity to the point where hundreds (maybe thousands) of different types of interactions are modeled. Suppose that the simulation grows to 1,000,000 lines of code. This would mean that (all things being equal) each simulation object would contain about 100,000 lines of code.

Notice, we have violated the first rule of object oriented programming. We have completely lost the notion of encapsulation. With 100,000 lines of code having free access to the variables contained within an object, too many lines of code are dependent on the structure of the object's internal variables. For example, changing how internal variables are stored could seriously impact the 100,000 lines of code that use those variables. Therefore, having large active simulation objects does not scale in terms of providing encapsulation. In order for object-oriented applications to scale, there must be some mechanism that creates new objects as the size of the application increases in order to keep the size of objects small. Otherwise, we lose scalability. This is not provided when simulation objects are active.

Now, consider what happens as a new method needs to be defined for the object (possibly as a new event handler). Because the class definition needs to be modified, the entire 100,000 lines of code for the simulation object (which depend on the class definition) need to be recompiled. Furthermore, because other objects directly interact with this simulation object, they will also need to be recompiled. In other words, as new methods are added to simulation objects, it is very likely that the entire simulation will have to be recompiled. Again, this does not scale. Compiling 1,000,000 lines of code every time a new method is added would bring the development and maintenance of any application to its knees.

Now consider what happens when trying to use a large work force to further develop this simulation. Because the objects themselves have become so large, and because there are so few objects, it will be very difficult for large teams of software developers to efficiently work together without stepping on each other's work. This problem is further amplified when the software team is actually distributed over multiple companies. Building active simulation objects does not scale in terms of effectively utilizing large work forces for software development and maintenance.

As a final thought, consider what happens when trying to reuse an active simulation object in another, completely different, application (possibly not even a simulation related application). For example, it might be desirable to perform standalone testing of an object for verification of some of its methods. Since we have built the simulation objects as active objects that directly interact with other objects in the simulation, it would be nearly impossible to pull these simulation objects out of the simulation and reuse them in other applications. Therefore, developing active simulation objects does not scale in terms of reusability.

Scalability of Passive Simulation Objects:

Now consider the same simulation (1,000,000 lines of code with 10 simulation objects), except that instead of event handling methods built directly in the simulation objects, events are separately encapsulated objects that act from the outside on passive simulation objects.

Notice that the simulation objects remain small. As new events are integrated into the simulation, new event objects are created, thus satisfying our first rule of scalable object-oriented programming: namely, that there is a mechanism for creating new objects as the number of lines of code in the application increases. This keeps objects small, and the notion of encapsulation is preserved.

Furthermore, when a new event is created, the only code that needs to be compiled is the event object itself (and the part of the code where the event object is defined for the simulation framework). The bottom line is that new events can be created without forcing the entire simulation to be recompiled.

Large work forces can also effectively work together without stepping on each other's code because not only are events separately encapsulated from the simulation objects, but they also are separately encapsulated from each other. This means that software developers can safely work on different objects (either event objects or simulation objects) without directly affecting each other. This is one of the most powerful aspects of object-oriented programming.

As a further benefit of separating events as active objects that act on passive simulation objects, consider two types of software developers. First of all, there are the domain experts who know the details of how to do things like: aim and fly scud missiles, form tracks from multiple detections, model various sensors in high fidelity, emulate communications, supply tactical maneuvers for aircraft dogfights, provide combat strategies, etc. Domain experts might be engineers or scientists who don't know a lot about distributed simulation, but they are true experts when it comes to knowing how their "black box" simulated entities are supposed to perform. These people are well suited to build the simulation objects. Often, their code can be tested in a standalone manner.

On the other hand, there are distributed simulationists whose responsibility is to develop the complex interactions between the various simulation objects using event objects. These programmers may have no idea how the actual simulated "black box" entities work, but that's OK because they don't need to know those details anyway. Their responsibility is to construct event interactions in a coherent fashion using their skill in distributed simulation to develop event scheduling strategies that scale (i.e., strategies that have enough lookahead to get good parallel performance, use minimal numbers of messages to perform various interactions, use the right levels of abstraction to allow events to act on different types of objects through the use of virtual functions, etc.). These people tend to have backgrounds in computer science (in contrast to the domain experts who are typically engineers and scientists).

Distributed simulationists should work with domain experts in defining appropriate simulation object hierarchies so that events can operate on different types of objects generically through virtual functions defined in base classes. For example, consider a simulation that has a dozen types of sensors. Each of these types of sensors could inherit from a common base-class sensor object (which ultimately inherits from the base-class simulation object). A single generic scan event object could simply pass the positions, velocities, etc., to methods provided by the different sensor objects through a virtual function defined in the sensor base class. This allows the scan event to be ignorant of the different types of sensors while still providing a clean way to invoke their individual methods.
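The sensor-hierarchy example can be sketched directly. The concrete class names and detection rules below are invented for illustration; what matters is that the generic scan event calls only the virtual function in the sensor base class and remains ignorant of the concrete sensor types.

```cpp
#include <cassert>

// Base-class sensor defining the virtual hook that generic events use.
// (In SPEEDES this would ultimately inherit from the base-class
// simulation object; that detail is omitted here.)
class Sensor {
public:
    virtual ~Sensor() {}
    virtual bool scan(double x, double y) const = 0;
};

// Two of the dozen hypothetical sensor types, each with its own model.
class OpticalSensor : public Sensor {
public:
    bool scan(double x, double y) const override {
        return x * x + y * y <= 100.0;     // short-range detection
    }
};

class RadarSensor : public Sensor {
public:
    bool scan(double x, double y) const override {
        return x * x + y * y <= 2500.0;    // long-range detection
    }
};

// The single generic scan event: it works with every sensor type through
// the base-class virtual function, never naming a concrete type.
bool genericScanEvent(const Sensor& s, double x, double y) {
    return s.scan(x, y);
}
```

Adding a thirteenth sensor type requires no change to the scan event at all, which is precisely the encapsulation property the text argues for.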

Events as Objects in SPEEDES:

We have discussed many software engineering benefits that arise when events are defined as objects. Now we discuss some of their interesting and practical properties in the context of the SPEEDES parallel simulation framework.

Each user defined event object inherits from a base-class event object provided by SPEEDES. This base class object defines several virtual functions that are invoked by SPEEDES at their appropriate times. Event processing is therefore broken down into several stages (See Figure 10).

Figure 10: A SPEEDES event object

To avoid confusion, we need to first distinguish between messages and events (they are not the same although there is a one-to-one correspondence between messages and events). When an event schedules a new event, a message is created with header information that defines the type of event, simulation time, type of simulation object, and its local id. Following the header is any user defined data that is required to initialize the event object. When the message is received by the destination node, SPEEDES uses the header information to create the appropriate event object (free lists are used to speed up the event object creation process). The message is then passed into a virtual function called init() that is provided by the user. Here, data contained in the message is extracted and stored in the event object itself. The event is then automatically placed in the event list by SPEEDES where it will be executed at its appropriate simulation time.

When it is time to process the event, SPEEDES invokes a second virtual function defined in the event object, called process(). Here is where the basic computations for the event (which may be rolled back) are done. Each event object is given a pointer to the simulation object that it operates on. The process() method normally extracts information from its simulation object (this could involve complicated computations provided by the simulation object), which can be stored in the event object (this will be motivated later in the discussion of lazy cancellation). Then, the state of the simulation object can be modified using the incremental state saving tools provided by SPEEDES, and new events may be generated.

Immediately following the process() method, SPEEDES calls the event's exchange() method. Here, a super-efficient incremental state saving technique can optionally be used (in addition to the state-saving tools provided for the process() method) to swap values stored in the event object with corresponding values in the simulation object. A rollback then simply requires calling the exchange() method once again. The exchange() method is considerably faster and carries much less overhead than the other incremental state saving techniques, but it is only practical for simple assignments. As a reminder, this super-efficient exchange mechanism for incremental state saving only works when events are defined as objects.
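A minimal sketch of the swap semantics (the object and field names here are made up): the event holds the new value, exchange() swaps it with the simulation object's value, and because the event now holds the old value, a rollback is just a second call to exchange().

```cpp
#include <algorithm>

// Illustrative only: a passive simulation object with one state value.
struct FuelTank { double fuel; };

// The event stores the new value; exchange() swaps it into the object.
// After the swap the event holds the old value, so a rollback is simply
// a second call to exchange().
class BurnFuelEvent {
 public:
  BurnFuelEvent(FuelTank* tank, double new_fuel)
      : tank_(tank), value_(new_fuel) {}
  void exchange() { std::swap(tank_->fuel, value_); }
 private:
  FuelTank* tank_;
  double value_;
};
```

One std::swap per saved value is why this mechanism is so cheap, and also why it only applies to simple assignments.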

Much later, after a new GVT value has been determined, events with time tags less than GVT are safely committed. As each event is committed, one more opportunity for user code to be executed is provided. SPEEDES calls the commit() method for each event as it is committed to support a general mechanism for releasing messages to the outside world in any format or communication protocol desired. Again, this capability really only makes sense when events are defined as objects.

If the init() method needs to allocate memory (for example, if it needs to make a copy of the entire message for later processing), then there must be a mechanism to later delete that memory; otherwise, the simulation will leak memory. We need to be careful here because simply deleting the memory in the commit() method will not work. Remember that it is possible for the event to be canceled by an antimessage (or directly through a user cancellation message), in which case commit() is never called. Therefore, SPEEDES provides a method called cleanup() that is always called for event objects when they are no longer needed, even those that have been canceled.
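The pattern can be sketched as follows (class and member names are made up): the message copy allocated in init() is released in cleanup(), which runs whether the event commits or is canceled.

```cpp
#include <cstring>

// Illustrative event that copies its message in init().  Freeing the
// copy in commit() would leak whenever the event is canceled, so the
// delete belongs in cleanup(), which is always called.
class CopyingEvent {
 public:
  CopyingEvent() : copy_(0), size_(0) {}
  void init(const char* msg, int n) {
    copy_ = new char[n];          // allocation that must be paired...
    std::memcpy(copy_, msg, n);
    size_ = n;
  }
  void cleanup() {                // ...with this, not with commit()
    delete[] copy_;
    copy_ = 0;
    size_ = 0;
  }
  int size() const { return size_; }
 private:
  char* copy_;
  int size_;
};
```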

One last method, called lazy(), provides support for lazy cancellation with tolerances. Remember, we mentioned that as an event is processed, it might store values (provided by the simulation object) inside the event object. When an event is rolled back (due to a straggler), SPEEDES allows it to quickly check, before reprocessing, whether reprocessing would produce the same results. This will always be true if the critical state values in the simulation object have not changed significantly. The lazy() method provides a way for events to quickly check whether these critical values are still the same. If so, the lazy() method returns the value 1, and the event is simply rolled forward. A return value of 0 means that the event must be reprocessed.
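A sketch of a lazy() test with a tolerance follows; the explicit tolerance parameter and all the names are illustrative assumptions, not the actual SPEEDES signature. The event remembers the critical value it read from the simulation object when first processed, and lazy() compares the current value against the saved one.

```cpp
#include <cmath>

// Illustrative only: a sensor's critical state value.
struct SensorState { double range; };

class ScanEvent {
 public:
  explicit ScanEvent(SensorState* s) : state_(s), saved_range_(0.0) {}
  // process() saves the value it used so lazy() can check it later.
  void process() { saved_range_ = state_->range; /* ...do the scan... */ }
  // Within tolerance: return 1 (roll forward); otherwise 0 (reprocess).
  int lazy(double tolerance) const {
    return std::fabs(state_->range - saved_range_) <= tolerance ? 1 : 0;
  }
 private:
  SensorState* state_;
  double saved_range_;
};
```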

Lazy cancellation should be used with care. Only the event types that have the best chances of passing the lazy() test should participate in lazy cancellation. The techniques used here to support lazy cancellation are much more efficient than traditional approaches because we don't have to compare full states (i.e., the full state saving approach), nor are we required to compare generated messages with old messages to determine if they are the same. Super-efficient lazy cancellation with tolerances is provided in SPEEDES because events are objects. Otherwise, this approach would not work.

Design decision 8: In order to provide scalable software engineering practices, events should be supported as active objects that are separate from passive simulation objects. By doing this, we also allow events to be processed in multiple phases which supports very efficient strategies for incremental state saving and lazy cancellation with tolerances.

Scalable Fundamental Algorithms

In this section, we very briefly discuss some of the functions that are critical for supporting distributed military simulations. This is not the same as coming up with standards for PDUs, since we are interested in lookahead, message traffic, correctness, etc. In other words, we are focusing on scalable solutions to some of the hardest fundamental problems in distributed military simulations, not simply defining standards for plugging external entities into a distributed military simulation.

The first critical function that must be provided is parallel proximity detection (which is approximated by dead reckoning algorithms in DIS simulations) (Lin 93). Objects with sensing capabilities must be able to determine the exact position, velocity, and other pertinent information for all of the moving objects that are within their detection range at the current simulation time. This must be done in a manner that exploits lookahead whenever possible, minimizes the number of messages, and allows moving objects to arbitrarily change their motion. A solution to this problem, called The Distribution List Algorithm, has been developed in SPEEDES that meets all of these goals (for more details, see Steinman and Wieland 94).
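For contrast, the brute-force baseline that a scalable algorithm must improve upon can be written in a few lines (the names below are made up; this is NOT the Distribution List Algorithm). Testing every sensor against every mover on every update is O(n*m) in work and messages, which is exactly what fails to scale.

```cpp
#include <cstddef>
#include <vector>

// Brute-force baseline only -- not the Distribution List Algorithm.
struct Mover { double x, y; };

// Count the movers inside one sensor's detection range by testing every
// mover; with n sensors and m movers this costs O(n*m) per update step.
int in_range_count(double sx, double sy, double range,
                   const std::vector<Mover>& movers) {
  int count = 0;
  for (std::size_t i = 0; i < movers.size(); ++i) {
    double dx = movers[i].x - sx;
    double dy = movers[i].y - sy;
    if (dx * dx + dy * dy <= range * range) ++count;
  }
  return count;
}
```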

A second, and possibly even more challenging problem facing distributed military simulations is to provide terrain, scene generation and environmental effects in a general and scalable manner. We have not studied this problem in detail yet, so rather than speculating on possible methods, we simply remind ourselves of the difficulties involved.

First of all, environmental databases can be extremely large, especially if the resolution is sufficiently small. In order to provide a scalable mechanism for supporting large sets of data, this database should probably be decomposed across multiple nodes. A very powerful way to do this would be to include environment simulation objects in a SPEEDES simulation. Simulated entities could be put on distribution lists to receive updates for the physical space that is in their field of view (this is similar to the parallel proximity detection methods developed in SPEEDES). However, this approach does not scale well when objects on each node are randomly distributed across the entire battlefield, thus essentially requiring each node to locally include the entire environmental database. One solution to this problem would be to decompose the environment in large patches across the simulation nodes, and then migrate the simulation objects to the nodes containing the terrain corresponding to their current position. However, this again does not scale when all of the simulated entities move to the same battlefield area (which would cause all of the simulation objects to migrate to the same node). As we mentioned before, this is a tough problem that needs much research before definite recommendations can be made!

Design decision 9: Critical fundamental algorithms, such as proximity detection and environmental effects, must be identified and well thought-out in order for a parallel and distributed military simulation to achieve scalability. Critical algorithms that don't scale can cripple the performance of the simulation. The parallel proximity detection methods developed in SPEEDES do scale.

Summary And Conclusions

We have discussed a wide variety of issues concerning scalability in parallel and distributed military simulations.

First, we pointed out that in order to develop simulations that efficiently support real-time virtual interactive uses and constructive analytical uses, it is imperative for logically correct synchronization to be provided. Simply using the wall clock for synchronization does not support efficient analytical simulation studies.

Second, we discussed various approaches for synchronizing distributed simulations. We found that conservative strategies do not scale, while the optimistic strategies that provide flow control over risk and optimism are the only known scalable solutions. Breathing Time Warp, supported in the SPEEDES parallel simulation framework, provides this functionality.

Third, we showed that parallel simulations with large numbers of fine grained objects have more intrinsic parallelism (i.e., fewer rollbacks) than simulations with small numbers of large grained objects. We then presented an analytic model that predicts the average number of events processed in an event horizon cycle for steady state simulations. The results of this model give a good indication of the intrinsic parallelism of a simulation for different event generating time statistics. We found that scheduling events into the far future (i.e., with lots of lookahead) gives more parallelism than scheduling events in the near future.

Fourth, we discussed the basic scalable communications operations that are required to support parallel and distributed simulations. These operations include non-blocking synchronizations, global reduction operations, synchronous message routing, and asynchronous message sending. By encapsulating these operations with a standard interface, SPEEDES has been able to build a set of communication libraries that support these operations on a wide number of machines, networks, and parallel computers in a transparent and scalable manner, taking advantage of any special hardware support that might be provided.

Fifth, we discussed why optimistic simulations actually provide a superior platform for supporting interactive simulations. Interactive optimistic simulations can support real-time monitor and control applications by only rolling back those objects that change as live data is fed into the simulation. Optimistic simulations also permit expensive calculations to be performed ahead of time which amortizes computations over larger periods of time instead of requiring the simulation to always be able to keep up with the clock (which would be required if computations were done conservatively). We then laid out the rules for interactive simulation support and gave a very specific example showing how an external module can be correctly and efficiently synchronized to the simulation.

Sixth, we shifted gears and brought into focus scalable software engineering practices; specifically, techniques to provide scalable object-oriented software. First, we showed how the object manager concept provides the means to keep the management of different types of objects totally encapsulated from each other while still providing a mechanism to integrate different types of objects into a single simulation. We then pointed out much of the confusion in the world of object-oriented software design. Then, we contrasted two very different approaches for defining events in simulations and found that active simulation objects do not scale well, while active event objects that act on passive simulation objects scale nicely. We showed other benefits gained when defining events as objects (incremental state saving, lazy cancellation, processing as events are committed).

Finally, we very briefly discussed some of the fundamental algorithms that are required by distributed military simulations. Parallel proximity detection (i.e., providing positions, velocities, etc. of all moving objects within each sensing object's field of view) has been studied in SPEEDES and a new scalable algorithm called The Distribution List Algorithm has been developed that solves this problem. Another very difficult problem is to provide distributed support for terrain, environment, and scene generation in a scalable manner. Some strategies were discussed, but because we haven't fully studied this subject, we could not present a conclusive strategy that is fully scalable.

SPEEDES is a parallel and distributed simulation framework (patent pending) that was designed to meet all of the scalability requirements presented in this paper. SPEEDES was developed at the Jet Propulsion Laboratory under government contracts and can be made freely available for agencies doing government work through a NASA license agreement. For more information on SPEEDES, contact Jeff Steinman:

email: jss@pebbles.jpl.nasa.gov
Phone: (818) 354-7180

Also, postscript SPEEDES papers are available through anonymous ftp:

ftp://pebbles.jpl.nasa.gov/pub/SPEEDES_Papers/

Acknowledgements

I would like to thank Fred Wieland for his helpful input concerning the organization of this paper.

The research described in this paper was carried out by the Jet Propulsion Laboratory, California Institute of Technology, and was sponsored by BMDO through the National Test Facility at Falcon Air Force Base, Colorado Springs, and through Innovative Science & Technology at the Pentagon by agreement with the National Aeronautics and Space Administration.

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

References

Agre J. and Tinker P. 1991. "Useful Extensions to a Time Warp Simulation System." In Proceedings of the Advances in Parallel and Distributed Simulation Conference (PADS91). Vol. 23, No. 1, January 1991, Pages 78-85.

Booch G. 1991. "Object Oriented Design With Applications." The Benjamin/Cummings Publishing Company Inc., 390 Bridge Parkway, Redwood City, California 94065.

Chandy, K., and Misra, J. 1979. Distributed Simulation: A Case Study in Design and Verification of Distributed Programs. IEEE Transactions on Software Engineering. Vol. SE-5, No. 5, pages 440-452.

Cox, Brad J. 1986. Object Oriented Programming: An Evolutionary Approach. Addison-Wesley Publishing Company, Reading Massachusetts.

Fujimoto R. 1990. Parallel Discrete Event Simulation. Communications of the ACM. Vol. 33, No. 10, pages 30-53.

Geist, et al. 1993. "PVM 3.0 User's Guide and Reference Manual." Oak Ridge National Laboratory, ORNL/TM-12187, Feb. 1993.

Gupta, R. 1989. "The Fuzzy Barrier: A Mechanism for High Speed Synchronization of Processors." Third Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-III), April 3-6, 1989, Pages 54-63.

Hawking S. 1988. "A Brief History of Time." Bantam Books, New York, New York, 1988.

Jefferson D. 1985. "Virtual Time." ACM Transactions on Programming Languages and Systems. Vol. 7, No. 3, Pages 404-425.

Lin K. 1993. Development of Improved Dead Reckoning Algorithms for Distributed Interactive Simulation, Second Draft, Institute for Simulation and Training, University of Central Florida, May 1993.

Loral Systems Company 1992. Strawman Distributed Interactive Simulation Architecture Description Document. Prepared for Program Manager - Training Devices Naval Training Systems Center. Orlando, Florida, ADST/WDL/TR-92003010, Vols. 1 and 2.

Meyer B. 1988. "Object-Oriented Software Construction." Prentice Hall International (UK) Ltd., 66 Wood Lane End, Hemel Hempstead, Hertfordshire, HP2 4RG.

Pancerella C. and Reynolds P. 1993, "Disseminating Critical Target-Specific Synchronization Information in Parallel Discrete-Event Simulations." In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS93). Vol. 23, No. 1, July 1993, Pages 52-59.

Steinman J. 1992. "SPEEDES: A Multiple-Synchronization Environment for Parallel Discrete-Event Simulation." International Journal in Computer Simulation. Vol. 2, Pages 251-286.

Steinman, Jeff 1993a. "Breathing Time Warp." In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS93). Vol. 23, No. 1, Pages 109-118.

Steinman, Jeff 1993b. "Synchronization of Parallel and Distributed Interactive Military Simulations Using SPEEDES." In Proceedings of the 1993 Summer Computer Simulation Conference. Pages 701-710.

Steinman, Jeff 1993c. "Incremental State Saving in SPEEDES Using C++." In Proceedings of the 1993 Winter Simulation Conference. Pages 687-696.

Steinman, Jeff and Wieland, Fred 1994. "Parallel Proximity Detection and the Distribution List Algorithm." In Proceedings of the 1994 Parallel And Distributed Simulation Conference. Pages 3-11.

Steinman, Jeff 1994b. "Discrete-Event Simulation and the Event Horizon." In Proceedings of the 1994 Parallel And Distributed Simulation Conference. Pages 39-49.

Steinman, Jeff 1994c. "SPEEDES User's Guide Beta 2.0." The MITRE Corporation and The Jet Propulsion Laboratory.

Steinman, Jeff 1995. "Global Virtual Time and Distributed Synchronization." To appear in Proceedings of the 1995 Parallel And Distributed Simulation Conference.

Tung, Yu-Wen and Steinman, Jeff 1993. "Interactive Graphics for the Parallel and Distributed Computing Simulation." In Proceedings of the 1993 Summer Computer Simulation Conference. Pages 695-700.

Weatherly R., Wilson A., Griffin S. 1993. "ALSP - Theory, Experience, and Future Directions." In Proceedings of the 1993 Winter Simulation Conference. Pages 1068-1072.

Wirfs-Brock R., Wilkerson B., Wiener L. 1990. "Designing Object-Oriented Software." P T R Prentice Hall, Englewood Cliffs, New Jersey 07632. Page 38.