Maximizing Social Influence in Nearly Optimal Time (1212.0884v5)
Abstract: Diffusion is a fundamental graph process, underpinning such phenomena as epidemic disease contagion and the spread of innovation by word-of-mouth. We address the algorithmic problem of finding a set of k initial seed nodes in a network so that the expected size of the resulting cascade is maximized, under the standard independent cascade model of network diffusion. Runtime is a primary consideration for this problem due to the massive size of the relevant input networks. We provide a fast algorithm for the influence maximization problem, obtaining the near-optimal approximation factor of (1 - 1/e - epsilon), for any epsilon > 0, in time O((m+n)k log(n) / epsilon^2). Our algorithm is runtime-optimal (up to a logarithmic factor) and substantially improves upon the previously best-known algorithms which run in time Omega(mnk POLY(1/epsilon)). Furthermore, our algorithm can be modified to allow early termination: if it is terminated after O(beta(m+n)k log(n)) steps for some beta < 1 (which can depend on n), then it returns a solution with approximation factor O(beta). Finally, we show that this runtime is optimal (up to logarithmic factors) for any beta and fixed seed size k.
Summary
- The paper introduces TIM and TIM+ algorithms that use reverse reachable sets to achieve a near-optimal (1-1/e-ε) approximation for influence maximization.
- The methodology significantly reduces computational complexity compared to Monte Carlo-based approaches, scaling efficiently to large networks.
- The approach combines a parameter estimation phase with a greedy node selection strategy, enabling practical, parallelizable influence spread estimation.
The problem of Influence Maximization (IM) seeks to identify a set S of k initial seed nodes in a social network graph G = (V, E) such that the expected number of nodes eventually influenced by S, denoted σ(S), is maximized. The paper "Maximizing Social Influence in Nearly Optimal Time" (1212.0884) focuses on the widely adopted Independent Cascade (IC) model of influence diffusion. In the IC model, when a node u becomes active, it gets one chance to activate each of its currently inactive neighbors v, succeeding independently with probability p_uv. The primary challenge addressed is computational cost: prior algorithms, notably the greedy approach with Monte Carlo simulations, required substantial runtime, often prohibitive for large-scale networks. This work introduces the Two-phase Influence Maximization (TIM and TIM+) algorithms, achieving a near-optimal approximation guarantee with significantly improved time complexity.
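For context, a single forward Monte Carlo estimate of σ(S) under the IC model looks like the sketch below; earlier greedy algorithms ran many such simulations for every candidate seed set, which is the cost this paper avoids. The graph representation (`adj` mapping nodes to out-neighbors, `p` mapping edges to probabilities) is an assumed convention, not taken from the paper.

```python
import random
from collections import deque

def simulate_ic(adj, p, seeds):
    """One forward simulation of the Independent Cascade model.

    adj[u] lists out-neighbors of u; p[(u, v)] is the activation
    probability of edge (u, v). Returns the number of activated nodes.
    """
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            # u gets exactly one chance to activate each inactive neighbor v
            if v not in active and random.random() < p[(u, v)]:
                active.add(v)
                frontier.append(v)
    return len(active)

def estimate_sigma(adj, p, seeds, num_sims=10_000):
    """Monte Carlo estimate of sigma(S): average cascade size over many runs."""
    return sum(simulate_ic(adj, p, seeds) for _ in range(num_sims)) / num_sims
```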
Algorithmic Framework: TIM and TIM+
The core innovation lies in efficiently estimating the influence spread σ(S) without resorting to expensive Monte Carlo simulations for every potential seed set. The approach leverages the concept of Reverse Reachable (RR) sets, which connects influence estimation to reachability probabilities in random subgraphs of the network.
An RR set is generated by first selecting a node v uniformly at random from V. Then, a sample graph g is realized from the distribution induced by the IC model (i.e., each edge (u, v) is included in g independently with probability p_uv). Finally, the RR set consists of all nodes u from which v is reachable in g. Equivalently, this can be viewed as performing a breadth-first search (BFS) starting from v on the graph with edges reversed, where edge inclusion is determined probabilistically. Let R be a collection of θ such RR sets R_1, …, R_θ.
The crucial insight is that the fraction of RR sets in R that are covered by a node set S (i.e., R_i ∩ S ≠ ∅) provides an unbiased estimator for σ(S)/n. Specifically, let K(S, R) = |{i : R_i ∩ S ≠ ∅}|. Then E[(n/θ) · K(S, R)] = σ(S).
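In code, this estimator is a short computation over the sampled collection; a minimal sketch, assuming `rr_sets` is a list of Python sets of node IDs and `S` is a set of node IDs:

```python
def estimate_sigma_rr(rr_sets, S, n):
    """Unbiased estimate of sigma(S): n times the fraction of RR sets hit by S."""
    covered = sum(1 for rr in rr_sets if rr & S)  # K(S, R)
    return n * covered / len(rr_sets)
```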
The IM problem can thus be reformulated as a Maximum Coverage problem: find a set S of size k that maximizes K(S,R). This is a classic set cover variant, solvable greedily with a (1−1/e) approximation factor.
The TIM algorithm operates in two phases:
- Parameter Estimation Phase: Determine the minimum number of RR sets, θ, required to achieve the desired approximation guarantee (1 − 1/e − ϵ) with high probability. This phase involves iterative estimation and refinement based on concentration bounds (e.g., Chernoff bounds), and it establishes a lower bound θ∗ on the number of RR sets needed (a simplified sketch follows this list).
- Node Selection Phase: Generate θ ≥ θ∗ RR sets. Apply the standard greedy algorithm for Maximum Coverage on the collection R = {R_1, …, R_θ} to select the k seed nodes. The greedy algorithm iteratively selects the node u that covers the maximum number of currently uncovered RR sets until k nodes are chosen.
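As an illustration of the first phase, the sketch below computes a sample-size threshold from the bound θ = O((k + log n) · n / (OPT · ϵ²)) discussed later, falling back on the trivial lower bound OPT ≥ k when no better estimate of OPT is available. The constant `c` stands in for the constants hidden by the O-notation and is not taken from the paper.

```python
import math

def estimate_required_rr_sets(n, k, epsilon, opt_lower_bound=None, c=8):
    """Illustrative threshold: theta = c * (k + log n) * n / (OPT * eps^2).

    OPT is unknown in general; the trivial bound OPT >= k is used unless
    a better lower bound is supplied. The constant c is illustrative.
    """
    opt_lb = max(opt_lower_bound or k, 1)
    return math.ceil(c * (k + math.log(n)) * n / (opt_lb * epsilon ** 2))
```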
The TIM+ algorithm refines the parameter estimation phase to further optimize constants and potentially reduce the required number of RR sets θ in practice, while maintaining the same asymptotic guarantees.
The generation of a single RR set involves:
- Selecting a random node v∈V.
- Simulating the random propagation process in reverse: perform a graph traversal (BFS or DFS) from v backwards along incoming edges, activating each encountered edge (u, w) independently with probability p_uw, to find all nodes that can reach v through activated edges.
The original pseudocode, repaired and made runnable in Python (`reverse_adj` maps each node to its in-neighbors and `p` maps edges to activation probabilities; these representations are assumed conventions):

```python
import random
from collections import deque

def generate_rr_set(reverse_adj, n, p):
    """Generate one Reverse Reachable (RR) set.

    reverse_adj[w] lists the in-neighbors u of w (i.e., edges (u, w) in E);
    p[(u, w)] is the activation probability of edge (u, w).
    """
    v = random.randrange(n)          # sample a node v uniformly from V
    rr_set = {v}
    queue = deque([v])
    while queue:
        curr = queue.popleft()
        for u in reverse_adj[curr]:  # each (u, curr) is an edge in E
            # edge (u, curr) activates with probability p[(u, curr)]
            if u not in rr_set and random.random() < p[(u, curr)]:
                rr_set.add(u)
                queue.append(u)
    return rr_set

def tim(reverse_adj, n, p, k, theta):
    """Two-phase influence maximization: sample RR sets, then greedy coverage."""
    # Phase 1: theta is assumed to have been computed from the theoretical
    # bounds (EstimateRequiredRRSets in the original pseudocode).
    # Phase 2: node selection.
    rr_sets = [generate_rr_set(reverse_adj, n, p) for _ in range(theta)]

    seeds = set()
    covered = set()                  # indices of RR sets already covered
    for _ in range(k):
        best_node, max_gain = None, -1
        for u in range(n):           # u ranges over V \ S
            if u in seeds:
                continue
            gain = sum(1 for i, rr in enumerate(rr_sets)
                       if i not in covered and u in rr)
            if gain > max_gain:
                best_node, max_gain = u, gain
        if best_node is None:
            break
        seeds.add(best_node)
        for i, rr in enumerate(rr_sets):
            if best_node in rr:
                covered.add(i)
    return seeds
```
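A toy usage example under the same assumed representation (a hypothetical 4-node diamond graph with uniform edge probability 0.5):

```python
# Toy example: edges 0->1, 0->2, 1->3, 2->3, each activating w.p. 0.5.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
n = 4
reverse_adj = {v: [] for v in range(n)}
for u, v in edges:
    reverse_adj[v].append(u)
p = {e: 0.5 for e in edges}

seeds = tim(reverse_adj, n, p, k=1, theta=1000)
print(seeds)  # most likely {0}: node 0 has the largest expected spread
```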
Theoretical Guarantees and Complexity
The TIM algorithm provides a (1 − 1/e − ϵ)-approximation guarantee for the influence maximization problem under the IC model with high probability (typically 1 − n^(−ℓ) for some constant ℓ).
The primary contribution is the runtime complexity. Generating one RR set takes expected time proportional to the number of edges explored in the reverse BFS, which is bounded by O(m + n), where n = |V| and m = |E|. The parameter estimation phase requires careful analysis, but its runtime is dominated by the second phase, which generates θ RR sets and runs the greedy algorithm for Maximum Coverage. The value of θ derived from the analysis is O((k + log n) · n / (OPT · ϵ²)), where OPT is the influence of the optimal solution. Using the loose lower bound OPT ≥ k, θ becomes O(n log n / ϵ²). A more refined analysis in the paper bounds the total expected work of generating these RR sets by O((m + n)k log n / ϵ²), leading to the overall expected runtime of:
O((m+n)k log(n) / ϵ²)
This complexity represents a significant improvement over the Ω(mnk · poly(1/ϵ)) runtime of previous state-of-the-art greedy algorithms relying on Monte Carlo simulations for influence estimation. The paper demonstrates that this runtime is nearly optimal, matching known lower bounds up to logarithmic factors for algorithms achieving a (1 − 1/e − ϵ) approximation.
Furthermore, TIM exhibits an "early termination" property. If the algorithm is stopped after generating only βθ∗ RR sets for some β < 1, it still returns a seed set S with an approximation guarantee of O(β), albeit potentially weaker than the target (1 − 1/e − ϵ). The runtime in this case becomes O(β(m+n)k log(n)). This property is valuable in scenarios with strict time budgets.
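A minimal sketch of how early termination might be wired in, assuming a wall-clock budget; the paper's guarantee is stated in terms of the number of steps executed, so this particular stopping rule is illustrative:

```python
import time

def generate_rr_sets_with_budget(reverse_adj, n, p, theta, budget_seconds):
    """Generate up to theta RR sets, stopping early if the time budget expires.

    Running the same greedy coverage step on the partial collection yields
    a weaker, roughly O(beta), approximation when only a beta fraction of
    the intended work completes.
    """
    deadline = time.monotonic() + budget_seconds
    rr_sets = []
    for _ in range(theta):
        if time.monotonic() > deadline:
            break
        rr_sets.append(generate_rr_set(reverse_adj, n, p))
    return rr_sets
```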
Implementation and Practical Considerations
Implementing TIM involves several key steps:
- RR Set Generation: This is the most computationally intensive part. Efficient implementation requires fast random node sampling and optimized graph traversal (BFS/DFS) on the reversed graph. The probabilistic edge activation within the traversal is crucial; pre-calculating or efficiently sampling edge activations can improve performance. Storing the graph using adjacency lists is standard. If edge probabilities p_uv are uniform (p_uv = p), sampling can be simplified.
- Data Structures for Maximum Coverage: The greedy algorithm requires efficiently finding the node that covers the most currently uncovered RR sets. Maintaining counts for each node (how many uncovered sets it belongs to) and potentially using data structures like heaps or bucket queues can accelerate the selection process (a sketch follows this list), although a simple linear scan is often sufficient given that the overall complexity is dominated by RR set generation.
- Memory: Storing θ RR sets can require significant memory. Each RR set contains node IDs. The total memory depends on θ and the average size of an RR set. For large graphs and small ϵ, θ can be large. Techniques like sketching or approximate counting for the Maximum Coverage step could be explored if memory becomes a bottleneck, potentially at the cost of theoretical guarantees or increased complexity.
- Parallelism: RR set generation is inherently parallelizable. Multiple RR sets can be generated independently across different cores or machines. The Maximum Coverage greedy algorithm is sequential by nature, but its runtime is typically much smaller than the RR set generation phase for large graphs.
- Parameter Tuning: The parameter ϵ controls the trade-off between approximation quality and runtime/memory. Smaller ϵ yields a better approximation guarantee closer to (1−1/e) but increases θ (quadratically), thus increasing runtime and memory usage. Choosing an appropriate ϵ depends on the specific application requirements and available resources.
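For the data-structures point above, the sketch below implements count-based greedy selection with an inverted index from nodes to the RR sets containing them, so each of the k rounds touches only the affected sets. The inverted index is a standard implementation choice, not something prescribed by the paper.

```python
from collections import defaultdict

def greedy_max_coverage(rr_sets, k):
    """Greedy maximum coverage over RR sets using per-node coverage counts."""
    node_to_sets = defaultdict(list)   # inverted index: node -> RR set indices
    for i, rr in enumerate(rr_sets):
        for u in rr:
            node_to_sets[u].append(i)

    gain = {u: len(idxs) for u, idxs in node_to_sets.items()}
    covered = [False] * len(rr_sets)
    seeds = set()
    for _ in range(k):
        if not gain:
            break
        best = max(gain, key=gain.get)  # node covering most uncovered sets
        seeds.add(best)
        for i in node_to_sets[best]:
            if not covered[i]:
                covered[i] = True
                # every other node in this set loses one unit of marginal gain
                for u in rr_sets[i]:
                    if u != best and u in gain:
                        gain[u] -= 1
        del gain[best]
    return seeds
```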
Comparison and Significance
The TIM/TIM+ algorithms marked a significant advance in the field of influence maximization. Prior approaches, like the Kempe, Kleinberg, and Tardos (KKT) greedy algorithm, offered the optimal (1−1/e) approximation but required repeated, costly Monte Carlo simulations to estimate influence, leading to high polynomial runtimes impractical for web-scale graphs. TIM provides essentially the same approximation guarantee (up to ϵ) but reduces the runtime dramatically by decoupling influence estimation (via RR sets) from the greedy selection process. Its near-linear dependence on graph size (m+n) and polylogarithmic dependence on n make it scalable to massive networks where previous methods failed. The near-optimal runtime complexity established its theoretical importance, while the practical performance demonstrated its applicability.
Conclusion
The work presented in "Maximizing Social Influence in Nearly Optimal Time" (1212.0884) provides a highly efficient and theoretically grounded algorithm (TIM/TIM+) for the Influence Maximization problem under the Independent Cascade model. By leveraging Reverse Reachable sets to efficiently estimate influence, it achieves a (1 − 1/e − ϵ) approximation guarantee in O((m+n)k log(n)/ϵ²) expected time, which is near-optimal. This represents a substantial improvement over previous approaches, enabling practical influence maximization on large-scale networks. Its design allows for parallelization and offers tunable trade-offs between accuracy and computational resources, making it a foundational algorithm in the study of network diffusion processes.
Related Papers
- Seeding with Costly Network Information (2019)
- Influence Maximization: Near-Optimal Time Complexity Meets Practical Efficiency (2014)
- Near-Optimal Spanners for General Graphs in (Nearly) Linear Time (2021)
- Maximizing Influence-based Group Shapley Centrality (2020)
- For-all Sparse Recovery in Near-Optimal Time (2014)