From Traditional Fault Tolerance to Blockchain. Wenbing Zhao
Чтение книги онлайн.

Читать онлайн книгу From Traditional Fault Tolerance to Blockchain - Wenbing Zhao страница 22

СКАЧАТЬ (from which the participant receives the first CHECKPOINT message), and waits for a RESUME message.

       ◾ Upon receiving a SAVED message (from one of its downstream neighbors), it relays the message to its upstream neighbor.

       ◾ When it receives a RESUME message, it propagates the message along all its outgoing channels except the one that connects to the process that sends it the message. The participant then resumes normal execution.

      EXAMPLE 2.3

Schematic illustration of normal operation of the Tamir and Sequin checkpointing protocol in an example three-process distributed system.

      Upon receiving the CHECKPOINT message from P0, P1 stops normal execution and sends a CHECKPOINT message along each of its outgoing channel to P0 and P2, respectively. Similarly, P2 sends the CHECKPOINT message to P0 and P1, respectively, once it receives the first CHECKPOINT message.

      Due to the FIFO property of the connections, P0 receives m0 before it collects all the CHECKPOINT messages from all its incoming channels, and P1 receives m1 before it receives the CHECKPOINT messages from P2. According to the protocol rule, such regular messages are logged instead of delivered because normal execution must be stopped once the global checkpointing is initiated. These logged messages will be appended to the local checkpoint once it is taken. In fact, such messages reflect the channel states of the distributed system. These messages won’t be delivered for execution until a process resumes normal execution.

      When P0 receives the CHECKPOINT messages from P1 and P2, it takes a local checkpoint, C0,0 and append the message log to the checkpoint. Similarly, P1 takes a local checkpoint when it receives the CHECKPOINT messages from P0 and P2, and P2 takes a local checkpoint when it receives the CHECKPOINT messages from P0 and P1.

      Subsequently, P1 and P2 send their SAVED messages to P0, i.e., the global checkpointing coordinator. P0 then informs P1 and P2 to resume normal execution with a RESUME message to each of them.

      A more complicated distributed system in which some processes do not have direct connection with the coordinator will require some of the coordinator’s neighbors to relay the SAVED notification to the coordinator.

       2.2.2.2 Correctness of the Protocol.

      1 All messages sent by one process prior to its taking a local checkpoint have been received and executed before the other process takes its local checkpoint.

      2 Some messages sent by one process prior to its taking a local checkpoint might arrive after the other process has checkpointed its state, however, these messages are logged at stable storage for replay.

      In the Tamir and Sequin protocol, if neither the coordinator nor any of the participants receives any regular message once the global checkpointing is initiated, then the scenario 1 holds. On the other hand, if a process receives one or more regular messages, it logs them and append them to the local checkpoint, ensuring their replayability. Hence, the scenario 2 holds. Because the protocol prohibits any process from continuing normal execution (including the sending of a message) as soon as it initiates (if it is the coordinator) or receives the very first CHECKPOINT message (for a participant), no process would receive a message prior to its checkpointing that has been sent by another process after that process has taken its local checkpoint in the same round. That is, the inconsistent global state scenario shown in Figure 2.2(a) does not occur.

      2.2.3 Chandy and Lamport Distributed Snapshot Protocol

      The Tamir and Sequin global checkpointing protocol is very elegant. However, it is a blocking protocol in that normal execution is suspended during each round of global checkpointing. For applications that do not wish to suspend the normal execution for potentially extensive period of time, the Chandy and Lamport distributed snapshot protocol [5] might be more desirable.

Schematic illustration of finite state machine specification for the Chandy and Lamport distributed snapshot protocol.

       2.2.3.1 Protocol Description.

       ◾ The global checkpointing can be initiated by any of the processes in the distributed system. Once a process decides to initiate a global checkpointing round, it takes a local checkpoint and sends a Marker message to each of its outgoing channels. The state of the process changes from Normal to Checkpointing as a result.

       ◾ СКАЧАТЬ