Title: Declarative Logic Programming
Author: Michael Kifer
Publisher: Ingram
Genre: Computers: Other
Series: ACM Books
ISBN: 9781970001983
All derivations vs. all answers. Prolog and other logic programming approaches generally focused on finding a single solution to a problem or deciding the truth of a statement. (Are X and Y related?) Top-down approaches can be efficient in such cases, because the sequence of rule applications is directed by the current goal, whereas bottom-up approaches can derive many extraneous facts that are not connected to the goal. Datalog targeted database settings, where a collection of answers is expected. (Who are all the people related to Y?) While Prolog-style top-down evaluation can generate multiple answers by backtracking to choice points, this process may be inefficient, because there can be more than one way to derive the same answer. (In deductive terms: there can be more than one proof of some facts.) For example, if persons A and B are related because of a common ancestor C, then their relatedness can also be deduced from any ancestor D of C. Moreover, with some rule sets, there can be more than one way to derive relatedness using C alone. If the number of derivations greatly exceeds the number of facts, Prolog-like approaches can be inefficient for finding all answers. Datalog techniques tried to avoid multiple derivations, for example, by remembering previously derived facts and avoiding re-deriving them (in top-down approaches), or by excluding them when trying to derive new facts (in bottom-up approaches).
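To see concretely how derivations can outnumber answers, consider a small Python sketch (the rule set, data, and names are our own illustrative assumptions, not from the text) that counts distinct proofs of a related fact; a shared ancestor at two levels yields two derivations of a single answer.

    from itertools import product

    # Hypothetical advised(Adviser, Student) facts: d advised c; c advised
    # b1 and b2; b1 advised a1; b2 advised a2.
    advised = {("d", "c"), ("c", "b1"), ("c", "b2"), ("b1", "a1"), ("b2", "a2")}

    def proofs(x, y):
        # Count distinct derivations of related(x, y) under a plausible rule set:
        #   related(X, Y) :- advised(A, X), advised(A, Y).
        #   related(X, Y) :- advised(A, X), advised(B, Y), related(A, B).
        # The recursion terminates here because advised is acyclic.
        n = 0
        for (a, s1), (b, s2) in product(advised, advised):
            if s1 == x and s2 == y:
                if a == b:
                    n += 1               # rule 1: a common adviser
                n += proofs(a, b)        # rule 2: the advisers are related
        return n

    print(proofs("a1", "a2"))            # 2 derivations (via c, and via d) of 1 answer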
Recursion. A key advantage of Datalog over other relational languages, like SQL, was the natural expression of recursion in queries. The top-down, left-to-right evaluation style of Prolog can behave differently depending on how rules are written. Some rules can cause infinite regress in trying to prove a goal. For example, if the first rule for related were left-recursive, say

    related(X, Y) :- related(X, Z), advised(Z, Y).

Prolog evaluation would keep solving the first goal with a recursive application of the same rule, and never actually encounter rules or facts for the advised predicate. Even when rule sets are better constructed, there can still be an unbounded number of search paths, even when there are only finitely many answers. In fact, if the SLD tree is infinite and all answers are desired, no complete search strategy will terminate. Here the bottom-up approaches have a decided advantage, as termination is guaranteed if the set of answers is finite (though bottom-up evaluation can re-derive facts).
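As a toy demonstration of this regress (a sketch of ours, not from the text; the mini-prover, data, and names are assumptions), a direct Python transcription of Prolog's strategy on the left-recursive rule recurses on the first subgoal before ever consulting an advised fact:

    # Toy top-down prover; tries the left-recursive rule first, as Prolog would:
    #   related(X, Y) :- related(X, Z), advised(Z, Y).
    #   related(X, Y) :- advised(A, X), advised(A, Y).
    advised = {("c", "b"), ("b", "a")}      # hypothetical advised facts
    universe = {"a", "b", "c"}

    def related(x, y):
        # Rule 1: the first subgoal re-invokes related immediately,
        # so evaluation never reaches the advised facts below.
        for z in universe:
            if related(x, z) and (z, y) in advised:
                return True
        # Rule 2 (never reached): a common adviser of x and y.
        return any(a1 == a2 and s1 == x and s2 == y
                   for (a1, s1) in advised for (a2, s2) in advised)

    try:
        related("a", "a")
    except RecursionError:
        print("left recursion: the goal never terminates top-down")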
Large fact bases. Prolog implementations largely assumed that rules and facts all fit in memory. However, this assumption was questionable for applications of Datalog to large datasets. If virtual memory is required for the fact base, then accessing facts turns into disk accesses. Different binding patterns for goals lead to different orders of access, so not all disk accesses will be clustered, leading to random reads on disk. If there is alternative storage for facts (in the file system or a database), then there can be costs associated with moving facts into internal memory structures, such as the parsing associated with the Prolog assert command. For efficient access to large fact bases on disk, data needs to be read in chunks and manipulated in its on-disk format.
A great deal of the early work on Datalog (and deductive databases more generally) centered on alternative evaluation mechanisms. These approaches generally started from either basic bottom-up (BU) or top-down (TD) evaluation. The bottom-up methods tried to avoid re-deriving established facts and to provide more direction as to which facts to derive. Top-down methods tried to access facts in groups (rather than one at a time), rearrange the order of subgoals, generate database queries to solve goal lists, and memoize derived facts to break recursive cycles. Theoretical proposals for alternative evaluation methods proliferated, with further and further refinements. However, side-by-side comparisons of actual implementations of these approaches were not common (although there was some comparative analysis of methods [Bancilhon and Ramakrishnan 1986]). The potential for scaling and performance of logic languages was promoted, for example via parallel evaluation in the Japanese Fifth-Generation Project [Moto-oka and Stone 1984] or via database techniques such as indexing and query optimization.
1.5.1 Bottom-Up Methods
In Section 1.1, we described the bottom-up evaluation of a Datalog program P at the level of single instantiations of rules to generate new facts. This derivation process is commonly organized into "rounds," where in one round we consider all satisfied instances of all rules. A round can be viewed as a transformation TP, called the immediate consequence operator, that takes a set of established facts F and produces a set G of newly established facts [Van Emden and Kowalski 1976]. If we let F0 represent the initial EDB facts, then bottom-up evaluation can be seen as producing a sequence of fact-sets F0, F1, F2, … where

    Fi+1 = Fi ∪ TP(Fi), for i ≥ 0.
For a Datalog program, we eventually reach a fixed point Fj where Fj = Fj−1 (that is, a point at which no new facts can be established using rules in program P). Call this fixpoint F*. At this point, a query over the virtual database embodied by P can be answered. This iterative application of TP is called the naïve approach to bottom-up evaluation. There are two big problems with the naïve approach: repeated derivation and irrelevant results, which give rise to more sophisticated evaluation strategies.
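To make the iteration concrete, here is a minimal Python sketch of the naïve loop (our own illustration; the data and predicate encoding are assumptions), run on the same style of program as above. Bottom-up, even the left-recursive rule is harmless: the loop stops once TP adds nothing new.

    # Naive bottom-up evaluation of:
    #   related(X, Y) :- related(X, Z), advised(Z, Y).
    #   related(X, Y) :- advised(A, X), advised(A, Y).
    advised = {("c", "b1"), ("c", "b2"), ("b1", "a1"), ("b2", "a2")}

    def tp(related):
        # TP: every immediate consequence of the current related-facts.
        out = set()
        for (x, z) in related:                  # left-recursive rule
            for (z2, y) in advised:
                if z == z2:
                    out.add((x, y))
        for (a1, x) in advised:                 # common-adviser rule
            for (a2, y) in advised:
                if a1 == a2:
                    out.add((x, y))
        return out

    facts = set()                               # F0: derived related-facts
                                                # (the EDB advised is kept separately)
    while True:
        nxt = facts | tp(facts)                 # Fi+1 = Fi ∪ TP(Fi)
        if nxt == facts:                        # fixpoint F* reached
            break
        facts = nxt

    print(sorted(facts))                        # terminates: the domain is finite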
Repeated Derivation. Note that Fi ⊇ Fi−1 by construction. For a standard Datalog program P, TP is monotone in the sense that adding more facts to the input cannot reduce the number of facts in the output. Thus TP(Fi) ⊇ TP(Fi−1). So each round in the naïve approach re-establishes all the facts from the previous round, which is wasted work. The semi-naïve approach addresses this problem by only using rule instances in a round that have not been tried in previous rounds. It does so by determining which facts are newly generated in a round, then making sure any rule instance used in the next round contains at least one of the new facts. Thus, semi-naïve determines the newly established facts at round i as

    Di = Fi − Fi−1
and restricts TP to only generate facts from rule instances that use at least one fact from Di. Rather than examining rule instances one by one, semi-naïve constructs an alternative rule set Q using versions rN and rO of each predicate r, where rN contains the "new" r-facts from Di and rO the "old" r-facts from Fi−1. So, for example, a rule such as

    r(X, Y) :- r(X, Z), r(Z, Y).

in program P would be replaced by three rules in program Q:

    r(X, Y) :- rN(X, Z), rO(Z, Y).
    r(X, Y) :- rO(X, Z), rN(Z, Y).
    r(X, Y) :- rN(X, Z), rN(Z, Y).

where rN and rO are recomputed by the system at each iteration. Note that the semi-naïve approach does not completely avoid repeated derivation of facts, since distinct rule instances can establish the same fact independently. For example, the rule instances

    r(a, b) :- r(a, c), r(c, b).
    r(a, b) :- r(a, d), r(d, b).

both establish r(a, b).
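For completeness, a minimal Python sketch of the semi-naïve loop for this rule (our own illustration; the base facts are chosen so that, as above, two rule instances independently derive r(a, b)):

    # Semi-naive evaluation of r(X,Y) :- r(X,Z), r(Z,Y). delta plays the role
    # of rN (facts new in the last round) and old plays the role of rO, so each
    # join below uses at least one new fact, mirroring the three rules of Q.
    base = {("a", "c"), ("c", "b"), ("a", "d"), ("d", "b")}   # initial r-facts

    total = set(base)                   # Fi: all r-facts established so far
    delta = set(base)                   # Di: facts first derived last round
    while delta:
        old = total - delta             # rO
        derived = set()
        for (x, z) in delta:            # rN(X,Z) joined with rO ∪ rN (= total)
            for (z2, y) in total:
                if z == z2:
                    derived.add((x, y))
        for (x, z) in old:              # rO(X,Z) joined with rN(Z,Y)
            for (z2, y) in delta:
                if z == z2:
                    derived.add((x, y))
        delta = derived - total         # keep only genuinely new facts
        total |= delta

    print(sorted(total))                # the four base facts plus r(a, b)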
A systematic method based on incrementalization [Liu and Stoller 2009] has been developed to ensure that each combination of facts that makes two subgoals of a rule true simultaneously is considered at most once, and furthermore to provide precise time and space complexity guarantees. The method compiles …