From Traditional Fault Tolerance to Blockchain. Wenbing Zhao
Чтение книги онлайн.

Читать онлайн книгу From Traditional Fault Tolerance to Blockchain - Wenbing Zhao страница 28

СКАЧАТЬ checkpoint includes its message log, and the regular messages logged and the corresponding ACK messages might not reach their the destination processes due to the process failure, the recovering process retransmit the regular messages or the ack messages based on the following rule:

       ◾ If the entry in the log for a message contains no rsn value, then a REGULAR message is retransmitted because the intended receiving process might not have received this message.

       ◾ If the entry in the log for a message contains a valid rsn value, then an ACK message is sent so that the receiving process can send regular messages.

      When a process receives a regular message, it always sends a corresponding ORDER message in response. There are three scenarios:

       ◾ The message is not a duplicate, in which case, the current rsn counter value is assigned to the message as its receiving order, and the corresponding ORDER message is sent. The process must then wait for the ACK message before it sends any regular message.

       ◾ The message is a duplicate, and the corresponding rsn value is found in its history list, in which case, an ORDER is message is sent and the duplicate message itself is discarded. The process must then wait for the ACK message before it sends any regular message. Note that it is impossible for the process to have received the corresponding ACK message before because otherwise the recovering process must have logged the rsn value for the regular message.

       ◾ The message is a duplicate, and there is no corresponding entry in the history list. In this case, the process must have checkpointed its state after receiving the message and it is no longer needed for recovery. As a result, the process sends an ORDER message with a special constant indicating that the message is no longer needed and the sending processing can safely purge the entry from its message log.

      After replaying these messages, the process is recovered to a state that is visible to, and consistent with, other processes prior to the failure. For regular messages without rsn values, the recovering process can replay them in an arbitrary order because the process must not have sent any regular message since the receipt of such messages prior to its failure.

       2.3.2.4 Limitations and Correctness.

      The sender-based message logging protocol described above ensures proper recovery of a distributed system as long as a single failure occurs at a time. That is, after a process fails, no other processes fail until the failed process is fully recovered. Note that the protocol cannot cope with two or more concurrent failures. If two or more failures occur concurrently, the determinant for some regular messages (i.e., the rsn values) might be lost, which would lead to orphan processes and the cascading rollback (i.e., the domino effect).

      EXAMPLE 2.7

Schematic illustration of two concurrent failures could result in the loss of determinant information for regular messages.

      Let’s assume a process Pi fails and another process Pj receives a regular message sent by Pi prior to the failure, we need to prove that the message must have been logged at Pi either prior to its failure or will have been logged before the end of the recovery.

      If Pi checkpointed its state after sending the regular message prior to the failure, the message must have been logged in stable storage and is guaranteed to be recoverable. Otherwise, the message itself would have been lost due the failure because it was logged in volatile memory. However, we prove that the message will be regenerated during the recovery.

       2.3.2.5 Discussion.

      As we have mentioned before, unlike the receiver-based pessimistic logging, performing a local checkpointing at a process does not truncate its message log because the log contains messages sent to other processes and they might be needed for the recovery of these other processes. This is rather undesirable. Not only it means unbounded message log size, but it leads to unbounded recovery time as well.

      The sender-based message logging protocol can be modified to at least partially fix the problem. However, it will be at the expense of the locality of local checkpointing. Once a process completes a local checkpoint, СКАЧАТЬ