p.48 Score-based methods approach the problem of structure learning as an optimization problem. We define a score function that can score each candidate structure with respect to the training data, and then search for a high-scoring structure... A natural choice for scoring function is the likelihood function... This measures the probability of the data given a model... The problem with the likelihood score is that it overfits the training data. It will learn a model that precisely fits the specifics of the empirical distribution in our training set. This model captures both dependencies that are "true" of the underlying distribution, and dependencies that are artifacts of the specific set of instances that were given as training data. It therefore fails to generalize well to new data cases: these are sampled from the underlying distribution, which is not identical to the empirical distribution in our training set.
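A worked form may help here. In standard notation (assumed, not quoted from the text), the likelihood score of a candidate structure G is the log-likelihood of the data D at the maximum-likelihood parameters for that structure:

```latex
\operatorname{score}_L(\mathcal{G} : \mathcal{D})
  = \ell(\hat{\theta}_{\mathcal{G}} : \mathcal{D})
  = \max_{\theta} \log P(\mathcal{D} \mid \mathcal{G}, \theta)
```

Because adding an edge can only enlarge the family of distributions a structure can represent, this score never decreases as the network gets denser, which is precisely why maximizing it alone overfits.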
p.48 An alternative scoring function is based on Bayesian considerations. Recall that the main principle of the Bayesian approach is that, whenever we have uncertainty over anything, we should place a distribution over it.
p.49 The ability to ascribe a prior over structures gives us a way of preferring some structures over others.
p.49 The Bayesian score is biased toward simpler structures, but as it gets more data, it is willing to recognize that a more complex structure is necessary. In other words, it trades off fit to data with model complexity.
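A standard way to make this trade-off concrete is the large-sample (BIC) approximation to the Bayesian score (the formula is assumed from the general literature, not quoted from these pages):

```latex
\operatorname{score}_{BIC}(\mathcal{G} : \mathcal{D})
  = \ell(\hat{\theta}_{\mathcal{G}} : \mathcal{D})
  - \frac{\log M}{2}\,\mathrm{Dim}[\mathcal{G}]
```

where M is the number of training instances and Dim[G] is the number of independent parameters in the structure. The fit term grows linearly in M while the complexity penalty grows only logarithmically, so with enough data a genuinely better complex structure eventually wins out, exactly the behavior described above.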
p.52 A network is tree-structured if each variable has at most one parent.
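This restriction matters because the optimal tree can be found exactly rather than by local search. A minimal Python sketch of the classic Chow-Liu procedure (maximum-weight spanning tree over empirical pairwise mutual information; the data format is an assumption for illustration):

```python
import itertools
from collections import Counter
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information between columns i and j of `data`,
    a list of tuples of discrete values."""
    n = len(data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    pij = Counter((row[i], row[j]) for row in data)
    return sum((c / n) * log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu_tree(data, num_vars):
    """Kruskal-style maximum-weight spanning tree over pairwise mutual
    information. Rooting the returned undirected tree anywhere gives a
    network in which each variable has at most one parent."""
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in itertools.combinations(range(num_vars), 2)),
                   reverse=True)
    parent = list(range(num_vars))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge connects two components: keep it
            parent[ri] = rj
            tree.append((i, j))
    return tree
```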
p.52 To define the heuristic search algorithm, we must define the search space and search procedure. We can think of a search space as a graph, where each vertex or node is a candidate network structure to be considered, and edges denote possible "moves" that the search procedure can perform. The search procedure defines an algorithm that explores the search space, without necessarily seeing all of it. The simplest search procedure is the greedy one: whenever it is at a node, it moves to the neighbor that has the highest score, until it reaches a node that has a better score than all of its neighbors. [JLJ - yes, but in complex games of strategy the way forward is often unclear, and the results of present interaction relationships are often difficult to predict many moves down the line. One strategy involves finding traps that, when set, look to a simple-minded opponent as if he is advancing comfortably, while instead hidden movements are available which turn the advantage to the other side. Now what? We can alternatively think in terms of building adaptive capacity to mobilize coercion, and estimate this via diagnostic tests. Perhaps we can equivalently think in terms of simulating the playout of the interlocking effects of present forces, and rely on a general-purpose capacity to recover from the unexpected.]
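A minimal sketch of that greedy procedure (the `neighbors` and `score` callables are assumptions, standing in for edge addition/deletion/reversal and for whatever scoring function is in use):

```python
def greedy_search(start, neighbors, score):
    """Greedy local search: repeatedly move to the best-scoring
    neighbor; stop at a structure that scores at least as well as
    every neighbor (a local optimum, not necessarily global, which
    is the weakness the bracketed comment above gestures at)."""
    current = start
    while True:
        best = max(neighbors(current), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current
        current = best
```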
p.93 Conditional Random Fields (CRFs) combine the modeling flexibility of graphical models with the ability to use rich, nonindependent features of the input.
p.93 Relational data has two characteristics: first, statistical dependencies exist between the entities we wish to model, and second, each entity often has a rich set of features that can aid classification.
p.93 Graphical models are a natural formalism for exploiting the dependence structure among entities.
p.93-94 A CRF is simply a conditional distribution p(y|x) with an associated graphical structure. Because the model is conditional, dependencies among the input variables x do not need to be explicitly represented, affording the use of rich, global features of the input.
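For the common linear-chain special case (standard formulation, assumed rather than quoted from these pages), the conditional distribution is

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T}
    \exp\!\Big(\sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, \mathbf{x}, t)\Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{t=1}^{T}
    \exp\!\Big(\sum_{k} \lambda_k\, f_k(y'_t, y'_{t-1}, \mathbf{x}, t)\Big)
```

Since normalization is per-input, each feature f_k may look at the entire observed sequence x without the model ever having to represent a distribution over x itself.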
p.122 CRFs are a natural choice for many relational problems because they allow both graphically representing dependencies between entities, and including rich observed features of entities.
p.129 Probabilistic relational models (PRMs) are a rich representation language for structured statistical models... Bayesian networks have been used with great success in a wide variety of real-world research applications. However, despite their success, Bayesian networks are often inadequate for representing large and complex domains. A Bayesian network for a given domain involves a prespecified set of random variables, whose relationship to each other is fixed in advance. Hence, a Bayesian network cannot be used to deal with domains where we might encounter a varying number of entities in a variety of configurations. This limitation of Bayesian networks is a direct consequence of the fact that they lack the concept of an "object" (or domain entity). Hence, they cannot represent general principles about multiple similar objects, which can then be applied in multiple contexts.
p.168 Regardless of the specific heuristic search algorithm used, an important component of the search is the scoring of candidate structures... However, there are still a very large number of possible structures to consider. We propose a heuristic search algorithm that addresses this issue.
p.176 Consider a collection of hypertext documents that we want to classify using some set of labels. Most naively, we can use a bag-of-words model, classifying each webpage solely using the words that appear on the page. However, hypertext has a very rich structure that this approach loses entirely.
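For contrast, a minimal sketch of that naive baseline (a bag-of-words Naive Bayes classifier; the data format and smoothing constant are assumptions for illustration), which discards the hyperlink structure entirely:

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(pages, labels):
    """`pages` is a list of token lists, one per page; only the words
    on each page are used, ignoring all links between pages."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for tokens, label in zip(pages, labels):
        word_counts[label].update(tokens)
    return word_counts, label_counts

def classify(tokens, word_counts, label_counts, alpha=1.0):
    """Return the label maximizing the Laplace-smoothed log-posterior."""
    vocab = len({w for c in word_counts.values() for w in c})
    total_docs = sum(label_counts.values())

    def log_posterior(label):
        counts = word_counts[label]
        total = sum(counts.values())
        lp = log(label_counts[label] / total_docs)
        for w in tokens:
            lp += log((counts[w] + alpha) / (total + alpha * vocab))
        return lp

    return max(label_counts, key=log_posterior)
```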
p.197 We propose an approach for collective classification and link prediction in relational domains. Our approach provides a coherent probabilistic foundation for the process of collective prediction, where we need to classify multiple entities and links, exploiting the interactions between the variables. We have shown that we can exploit a very rich set of relational patterns in classification, significantly improving the classification accuracy over standard flat classification.
p.453 Using rich sets of features generated from relational data often improves the predictive accuracy of regression models. The number of feature candidates, however, rapidly grows prohibitively large as richer feature spaces are explored. We present a framework, structural generalized linear regression (SGLR), which flexibly integrates feature generation with model selection, allowing:
- augmentation of relational representation with cluster-derived concepts, and
- dynamic control over the search strategy used to generate features.
Clustering increases the expressivity of feature spaces by creating new concepts which contribute to the creation of new features, and can lead to more accurate models. Dynamic feature generation, in which decisions of which features to generate are based on the results of run-time feature selection, can lead to the discovery of accurate models with significantly less computation than generating all features in advance.
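A minimal sketch of that interleaved generate-and-select loop (the `generator` and `evaluate` interfaces are assumptions for illustration, not SGLR's actual API):

```python
def dynamic_feature_generation(generator, evaluate, budget):
    """Interleave feature generation with run-time selection: keep a
    candidate feature only if it improves the model's score, and stop
    once the budget is spent, so most of the prohibitively large
    feature space is never materialized."""
    selected = []
    best = evaluate(selected)  # baseline model with no generated features
    for candidate in generator:
        if len(selected) >= budget:
            break
        trial = evaluate(selected + [candidate])
        if trial > best:  # run-time feature selection
            selected, best = selected + [candidate], trial
    return selected
```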