Authors: Matthias Broecheler, Andrea Pugliese, VS Subrahmanian

Direct link to paper

**Abstract**

The Social Semantic Web (SSW) refers to the mix of RDF data in web content, and social network data associated with those who posted that content. Applications to monitor the SSW are becoming increasingly popular. For instance, marketers want to look for semantic patterns relating to the content of tweets and Facebook posts relating to their products. Such applications allow multiple users to specify patterns of interest, and monitor them in real-time as new data gets added to the web or to a social network. In this paper, we develop the concept of SSW view servers in which all of these types of applications can be simultaneously monitored from such servers. The patterns of interest are views. We show that a given set of views can be compiled in multiple possible ways to take advantage of common substructures, and define the concept of an optimal merge. We develop a very fast MultiView algorithm that scalably and efficiently maintains multiple subgraph views. We show that our algorithm is correct, study its complexity, and experimentally demonstrate that our algorithm can scalably handle updates to hundreds of views on real-world SSW databases with up to 540M edges.

**Poster**

**Title**: Social Network Data Management.

**Abstract**

With the increasing usage of online social networks and the semantic web’s graph structured RDF framework, and the rising adoption of networks in various fields from biology to social science, there is a rapidly growing need for indexing, querying, and analyzing massive graph structured data. Facebook has amassed over 500 million users creating huge volumes of highly connected data. Governments have made RDF datasets containing billions of triples available to the public. In the life sciences, researches have started to connect disparate data sets of research results into one giant network of valuable information. Clearly, networks are becoming increasingly popular and growing rapidly in size, requiring scalable solutions for network data management.

This thesis focuses on the following aspects of network data management. We present a hierarchical index structure for external memory storage of network data that aims to maximize data locality. We propose efficient algorithms to answer subgraph matching queries against network databases and discuss effective pruning strategies to improve performance. We show how adaptive cost models can speed up subgraph matching query answering by assigning budgets to index retrieval operations and adjusting the query plan while executing.

We develop a cloud oriented social network database, COSI, which handles massive network datasets too large for a single computer by partitioning the data across multiple machines and achieving high performance query answering through asynchronous parallelization and cluster-aware heuristics.

Tracking multiple standing queries against a social network database is much faster with our novel multi-view maintenance algorithm, which exploits common substructures between queries.

To capture uncertainty inherent in social network querying, we define probabilistic subgraph matching queries over deterministic graph data and propose algorithms to answer them efficiently.

Finally, we introduce a general relational machine learning framework and rule-based language, Probabilistic Soft Logic, to learn from and probabilistically reason about social network data and describe applications to information integration and information fusion.

]]>Authors: Matthias Broecheler, Andrea Pugliese, VS Subrahmanian

Direct link to paper

**Abstract**

Users querying massive social networks or RDF databases are often not 100% certain about what they are looking for due to the complexity of the query or heterogeneity of the data. In this paper, we propose “probabilistic subgraph” (PS) queries over a graph/network database, which afford users great flexibility in specifying “approximately” what they are looking for. We formally define the probability that a substitution satisfies a PS-query with respect to a graph database. We then present the PMATCH algorithm to answer such queries and prove its correctness. Our experimental evaluation demonstrates that PMATCH is efficient and scales to massive social networks with over a billion edges.

**Presentation**

Authors: Matthias Broecheler, Andrea Pugliese, VS Subrahmanian

Direct link to paper

**Abstract**

As social network and RDF data grow dramatically in size to billions of edges, the ability to scalably answer queries posed over graph datasets becomes increasingly important.

In this paper, we consider subgraph matching queries which are often posed to social networks and RDF databases - for such queries, we want to find all matching instances in a graph database. Past work on subgraph matching queries uses static cost models which can be very inaccurate due to long-tailed degree distributions commonly found in real world networks. We propose the BudgetMatch query answering algorithm. BudgetMatch costs and recosts query parts adaptively as it executes and learns more about the search space. We show that using this strategy, BudgetMatch can quickly answer complex subgraph queries on very large graph data. Specifically, on a real world social media data set consisting of 1.12 billion edges, we can answer complex subgraph queries in under one second and significantly outperform existing subgraph matching algorithms.

**Presentation**

There is a growing interest in methods for exploiting causal or correlational dependencies in structured domains. Exploiting such dependencies often results in improved predictive performance on complex inference tasks in diverse domains such as information integration, natural language processing, and computer vision. In this presentation, we introduce probabilistic soft logic (PSL), a general-purpose framework for expressing, reasoning about and learning structural dependencies.

PSL provides a declarative language tailored to relational domains that require reasoning about similarity and/or probability. Some of the novel aspects of PSL include a representation based on continuous valued random variables, efficient polynomial-time inference algorithms, native support for reasoning about sets, and the ability to estimate confidences values for predictions. This presentation provides a detailed account of PSL covering its mathematical foundation, logic programming semantics, inference and learning algorithms, scalability through parallelization, and different applications. We close with a demonstration of the PSL system implementation.

**Presentation**

Authors: Stephen H. Bach, Matthias Broecheler, Stanley Kok, and Lise Getoor

Direct link to paper

**Abstract**

We introduce the concept of a decision-driven model, a probabilistic model that reasons directly over the uncertain information of interest to a decision maker. We motivate the use of these models from the perspective of personalized medicine. Decision-driven models have a number of benefits that are of particular value in this domain, such as being easily interpretable and naturally quantifying confidences in both evidence and predictions. We show how decision-driven models can easily be constructed using probabilistic soft logic, a recently introduced framework for statistical relational learning and inference which allows the specification of medical domain knowledge in concise first-order-logic rules with assigned confidence values.

The paper is presented as a poster. Follow the link for more information on decision-driven modeling in probabilistic soft logic.

]]>Authors: Matthias Broecheler and Lise Getoor

Direct link to paper

Data set used in the experiments

**Abstract**

Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence.

**Poster Presentation**

Authors: Matthias Broecheler, Paulo Shakarian, and V.S. Subrahmanian

Direct link to paper

**Abstract**

Multiple phenomena often diffuse through a social network, sometimes in competition with one another. Product adoption and political elections are two examples where network diffusion is inherently competitive in nature. For example, individuals may choose to only select one product from a set of competing products (i.e. most people will need only one cell-phone provider) or can only vote for one person in a slate of political candidate (in most electoral systems). We introduce the weighted generalized annotated program (wGAP) framework for expressing competitive diffusion models. Applications are interested in the eventual results from multiple competing diffusion models (e.g. what is the likely number of sales of a given product, or how many people will support a particular candidate). We define the â€œmost probable interpretationâ€ (MPI) problem which technically formalizes this need. We develop algorithms to efficiently solve MPI and show experimentally that our algorithms work on graphs with millions of vertices.

**Presentation**

View more presentations from Matthias Broecheler.

Authors: Matthias BrÃ¶cheler, Andrea Pugliese and V.S. Subrahmanian

Direct link to paper

**3rd Best Paper Award**

**Abstract**

Subgraph matching is a key operation on graph data. Social network (SN) providers may want to find all subgraphs within their social network that â€œmatchâ€ certain query graph patterns. Unfortunately, subgraph matching is NP-complete, making its application to massive SNs a major challenge. Past work has shown how to implement subgraph matching on a single processor when the graph has 10-25M edges. In this paper, we show how to use cloud computing in conjunction with such existing single processor methods to efficiently match complex subgraphs on graphs as large as 778M edges. A cloud consists of one â€œmasterâ€ compute node and k â€œslaveâ€ compute nodes. We first develop a probabilistic method to estimate probabilities that a vertex will be retrieved by a random query and that a pair of vertices will be successively retrieved by a random query. We use these probability estimates to define edge weights in an SN and to compute minimal edge cuts to partition the graph amongst k slave nodes. We develop algorithms for both master and slave nodes that try to minimize communication overhead. The resulting COSI system can answer complex queries over real-world SN data containing over 778M edges very efficiently.

**Presentation at ASONAM 2010**

View more presentations from Matthias Broecheler.

Presented at: Conference on Uncertainty in Artificial Intelligence, held on Santa Catalina Island, CA, USA from July 8th - July 11th, 2010

Authors: Matthias Broecheler, Lilyana Mihalkova, Lise Getoor

Direct link to paper

Data set used in the experiments

**Abstract**

Many machine learning applications require the ability to learn from and reason about noisy multi-relational data. To address this, several effective representations have been developed that provide both a language for expressing the structural regularities of a domain, and principled sup- port for probabilistic inference. In addition to these two aspects, however, many applications also involve a third aspectâ€“the need to reason about similaritiesâ€“which has not been directly supported in existing frameworks. This paper introduces probabilistic similarity logic (PSL), a general-purpose framework for joint reason- ing about similarity in relational domains that incorporates probabilistic reasoning about similarities and relational structure in a principled way. PSL can integrate any existing domain- specific similarity measures and also supports reasoning about similarities between sets of entities. We provide efficient inference and learn- ing techniques for PSL and demonstrate its effectiveness both in common relational tasks and in settings that require reasoning about similarity.