Scientific Article Recommendation: Exploiting Common Author Relations and Historical Preferences (2008.04652v1)

Published 9 Aug 2020 in cs.SI and cs.DL

Abstract: Scientific article recommender systems are playing an increasingly important role for researchers in retrieving scientific articles of interest in the coming era of big scholarly data. Most existing studies have designed unified methods for all target researchers and hence the same algorithms are run to generate recommendations for all researchers no matter which situations they are in. However, different researchers may have their own features and there might be corresponding methods for them resulting in better recommendations. In this paper, we propose a novel recommendation method which incorporates information on common author relations between articles (i.e., two articles with the same author(s)). The rationale underlying our method is that researchers often search articles published by the same author(s). Since not all researchers have such author-based search patterns, we present two features, which are defined based on information about pairwise articles with common author relations and frequently appeared authors, to determine target researchers for recommendation. Extensive experiments we performed on a real-world dataset demonstrate that the defined features are effective to determine relevant target researchers and the proposed method generates more accurate recommendations for relevant researchers when compared to a Baseline method.

Citations (92)

Summary

- The paper introduces CARE, a method that leverages common author relations to enhance article recommendations based on user reading history.
- It selects target researchers using FE1 and FE2 features to identify those following author-based search patterns.
- The graph-based ranking using Random Walk with Restart significantly improves precision, recall, and F1 scores for the selected researcher subset.

This paper presents a novel method called CARE (Common Author relation-based REcommendation) for recommending scientific articles to researchers. The core idea is to leverage the observation that researchers often look for papers written by the same authors they have previously found relevant ("author-based search pattern"). However, the authors recognize that not all researchers exhibit this pattern. Therefore, CARE is designed to first identify suitable target researchers and then apply a specialized recommendation algorithm for them (2008.04652).

Problem Addressed:

- Existing article recommenders often use generic algorithms for all users, ignoring individual search behaviors like focusing on specific authors.
- Content-based methods can be complex and computationally expensive due to the large volume of text in articles.
- Standard collaborative filtering often ignores valuable information like authorship links between papers.

CARE Methodology:

The CARE method consists of two main components:

Target Researcher Selection:
- This module identifies researchers who are likely to have an "author-based search pattern" based on their historical reading preferences (articles saved in their library).
- Two features are defined to quantify this pattern:
  - FE1: The ratio of article pairs within a researcher's library that share common authors to the total number of possible article pairs. A higher ratio suggests the researcher collects papers linked by authorship.
    
    $FE1 = \frac{\text{Number of pairs with common authors}}{\text{Total number of pairs } \binom{N}{2}}$ (where $N$ is the number of articles in the library)
  - FE2: The ratio of articles written by the single most frequently occurring author in the researcher's library to the total number of articles in the library. A higher ratio indicates a focus on a specific author's work.
    
    $FE2 = \frac{\text{Number of articles by the most frequent author}}{\text{Total number of articles } (N)}$

- Researchers whose FE1 or FE2 values exceed predefined thresholds are considered suitable targets for the CARE ranking algorithm.
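
The two features and the threshold test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a library is a list of author-name sets (one per saved article), treats "common authors" as at least one shared name, and uses hypothetical function names and threshold values.

```python
from collections import Counter
from itertools import combinations


def fe1(library):
    """Fraction of article pairs in the library that share at least one author."""
    n = len(library)
    if n < 2:
        return 0.0
    shared = sum(1 for a, b in combinations(library, 2) if set(a) & set(b))
    return shared / (n * (n - 1) / 2)  # denominator is C(N, 2)


def fe2(library):
    """Fraction of the library written by its single most frequent author."""
    counts = Counter(author for authors in library for author in set(authors))
    if not counts:
        return 0.0
    return max(counts.values()) / len(library)


def is_target(library, fe1_threshold, fe2_threshold):
    """A researcher qualifies if either feature exceeds its threshold."""
    return fe1(library) > fe1_threshold or fe2(library) > fe2_threshold


# Hypothetical library of four saved articles.
library = [{"smith", "lee"}, {"smith"}, {"garcia"}, {"lee", "kim"}]
fe1_value = fe1(library)  # 2 of 6 pairs share an author
fe2_value = fe2(library)  # the top author wrote 2 of 4 articles
```

With this toy library, FE1 is 2/6 and FE2 is 2/4, so the researcher would be selected under thresholds such as 0.3 and 0.6 (illustrative values only) because FE1 exceeds its threshold.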

Graph-based Article Ranking:
- For the selected target researchers, a heterogeneous graph $G = (V_R \cup V_A, E_{RA} \cup E_{AA})$ is constructed.
  - $V_R$ : Set of researcher nodes.
  - $V_A$ : Set of article nodes.
  - $E_{RA}$ : Edges representing reading history (researcher $R_i$ read article $A_j$ ).
  - $E_{AA}$ : Edges representing common author relations (article $A_i$ and article $A_j$ share at least one author).
- A Random Walk with Restart (RWR) algorithm is applied to this graph.
- The walk starts at the target researcher node ( $v_0$ ).
- At each step, the walker moves to a neighboring node with probability $\alpha$ (based on calculated transition probabilities $T$ ) or restarts at $v_0$ with probability $1-\alpha$ .
- Transition probabilities ( $T_{RA}$ , $T_{AR}$ , $T_{AA}$ ) are calculated based on the adjacency matrices representing reading relations ( $W_{RA}$ ) and common author relations ( $W_{AA}$ ). For instance, the probability of moving from article $i$ to article $j$ ( $T_{AA}(i,j)$ ) is:
  
  $T_{AA}(i,j) = \frac{W_{AA}(i,j)}{\sum_{k1}W_{AR}(i,k1) + \sum_{k2}W_{AA}(i,k2)}$
  
  This normalizes the probability based on the total number of connections (to researchers or other articles) from article $i$ .
- The algorithm iteratively updates the scores of all nodes until convergence. The final scores of the article nodes ( $ScoreArticle$ ) represent their relevance to the target researcher.
- Top-N ranked articles not already in the researcher's library are recommended.
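
As a rough sketch of the transition probabilities, the snippet below stacks $W_{RA}$, its transpose $W_{AR}$, and $W_{AA}$ into one adjacency matrix and divides each row by its sum. For article rows this reproduces the $T_{AA}$ denominator given above; applying the same row normalization to $T_{RA}$ and $T_{AR}$ is an assumption made here by analogy, and the function name and toy matrices are hypothetical.

```python
def build_transition_matrix(w_ra, w_aa):
    """Row-normalise the combined researcher/article adjacency matrix.

    w_ra: n x m nested list, w_ra[i][j] = 1 if researcher i read article j.
    w_aa: m x m nested list, w_aa[i][j] = 1 if articles i and j share authors.
    Nodes are ordered as the n researchers followed by the m articles.
    """
    n, m = len(w_ra), len(w_aa)
    size = n + m
    w = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            w[i][n + j] = w_ra[i][j]  # researcher -> article (W_RA)
            w[n + j][i] = w_ra[i][j]  # article -> researcher (W_AR)
    for i in range(m):
        for j in range(m):
            w[n + i][n + j] = w_aa[i][j]  # article -> article (W_AA)
    transition = []
    for row in w:
        total = sum(row)
        # Isolated nodes keep an all-zero row instead of dividing by zero.
        transition.append([x / total if total else 0.0 for x in row])
    return transition


# Toy graph: 2 researchers, 3 articles; articles 0 and 2 share an author.
W_RA = [[1, 1, 0],
        [0, 1, 1]]
W_AA = [[0, 0, 1],
        [0, 0, 0],
        [1, 0, 0]]
T = build_transition_matrix(W_RA, W_AA)
```

Article 0 has one reader and one co-authored neighbor, so `T[2][4]` (article 0 to article 2) is 1/2, matching the $T_{AA}$ formula.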

Implementation Considerations:

- Data Requirements: Requires researcher reading history (e.g., from CiteULike libraries) and author information for each article. The authors crawled CiteULike to obtain author data missing from the original dataset version.
- Graph Construction: Building the adjacency matrices $W_{RA}$ and $W_{AA}$ is the first step. $W_{AA}$ requires pairwise comparison of author lists for all articles, which can be computationally intensive for large datasets. Defining "common authors" (e.g., requiring at least two shared authors, as done in the paper) can mitigate noise from common names.
- RWR Parameters: The parameter $\alpha$ (the probability of continuing the walk; the walker restarts with probability $1-\alpha$) and the number of iterations (maxStep) need tuning. The paper found $\alpha=0.8$ worked well.
- Scalability: RWR on large graphs can be computationally demanding. Techniques like graph partitioning or approximation methods might be needed for very large datasets.
- Feature Thresholds: The thresholds for FE1 and FE2 need to be determined, potentially via cross-validation on a hold-out set, to balance the trade-off between the number of targeted researchers and the performance gain.
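
One possible way to reduce the cost of building $W_{AA}$ is to group articles by author first, so that only articles sharing at least one name are ever compared. The sketch below is not taken from the paper: the inverted-index approach, the function name, and the `min_shared` parameter are assumptions used for illustration.

```python
from collections import defaultdict
from itertools import combinations


def common_author_pairs(article_authors, min_shared=1):
    """Map each article pair (i, j), i < j, to its number of shared authors.

    article_authors: dict of article id -> iterable of author names.
    Only pairs with at least `min_shared` common authors are returned.
    """
    by_author = defaultdict(list)
    for article_id, authors in article_authors.items():
        for author in set(authors):
            by_author[author].append(article_id)

    shared = defaultdict(int)
    for articles in by_author.values():
        # Each author contributes one shared name to every pair they wrote.
        for i, j in combinations(sorted(articles), 2):
            shared[(i, j)] += 1
    return {pair: count for pair, count in shared.items() if count >= min_shared}


# Hypothetical author lists for four articles.
articles = {
    0: ["smith", "lee"],
    1: ["smith", "lee", "kim"],
    2: ["lee"],
    3: ["garcia"],
}
loose_edges = common_author_pairs(articles, min_shared=1)
strict_edges = common_author_pairs(articles, min_shared=2)
```

Raising `min_shared` from 1 to 2 keeps only the pair (0, 1) in this toy data, which illustrates how a stricter definition of "common authors" thins out the article-article edges.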

Pseudocode for RWR (Algorithm 1 in paper):

Algorithm Graph-based article ranking
Input: Graph G, continuation probability α (the walk restarts with probability 1 - α), target researcher v0, max iterations maxStep, transition matrix T
Output: Ranking scores for articles ScoreArticle[1..m]

Initialize ScoreAll[1..n+m] = 0
ScoreAll[v0] = 1

for step = 0 to maxStep-1:
  Initialize tmpScore[1..n+m] = 0
  for each node vx in G:
    for each neighbor vy of vx:
      tmpScore[vy] = tmpScore[vy] + α * ScoreAll[vx] * T(vx, vy)
  # Add restart probability
  tmpScore[v0] = tmpScore[v0] + (1 - α)
  ScoreAll = tmpScore

ScoreArticle = ScoreAll[n+1 .. n+m]  # Extract scores for article nodes
Return ScoreArticle
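
The pseudocode above translates fairly directly into Python. The version below is a sketch under a few assumptions: the transition matrix is a dense nested list with researcher nodes first, node indices are 0-based, and the function names, the hard-coded toy matrix, and `max_step=50` are hypothetical. The default `alpha=0.8` follows the value reported to work well.

```python
def random_walk_with_restart(t, v0, alpha=0.8, max_step=50):
    """Run the iterative update from the pseudocode; returns scores for all nodes."""
    size = len(t)
    score = [0.0] * size
    score[v0] = 1.0
    for _ in range(max_step):
        tmp = [0.0] * size
        for vx in range(size):
            for vy in range(size):
                tmp[vy] += alpha * score[vx] * t[vx][vy]
        tmp[v0] += 1.0 - alpha  # restart mass returns to the target researcher
        score = tmp
    return score


def top_n_unread(article_scores, read, top_n):
    """Rank articles by score, skipping those already in the library."""
    candidates = [a for a in range(len(article_scores)) if a not in read]
    candidates.sort(key=lambda a: article_scores[a], reverse=True)
    return candidates[:top_n]


# Toy graph: nodes 0-1 are researchers, nodes 2-4 are articles 0-2.
N_RESEARCHERS = 2
T = [
    [0.0, 0.0, 0.5, 0.5, 0.0],  # researcher 0 read articles 0 and 1
    [0.0, 0.0, 0.0, 0.5, 0.5],  # researcher 1 read articles 1 and 2
    [0.5, 0.0, 0.0, 0.0, 0.5],  # article 0: read by researcher 0, shares an author with article 2
    [0.5, 0.5, 0.0, 0.0, 0.0],  # article 1: read by both researchers
    [0.0, 0.5, 0.5, 0.0, 0.0],  # article 2: read by researcher 1, shares an author with article 0
]
all_scores = random_walk_with_restart(T, v0=0)
article_scores = all_scores[N_RESEARCHERS:]
recommended = top_n_unread(article_scores, read={0, 1}, top_n=1)
```

Because every row of this toy matrix sums to 1, the scores keep a total mass of 1 at each step. Researcher 0 has already saved articles 0 and 1, so article 2 is the only candidate left to recommend.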

Evaluation and Results:

- The experiments were conducted on a CiteULike dataset.
- Key Finding: CARE significantly outperformed the Baseline (RWR without common author relations or researcher selection) only when applied to the researchers selected using FE1 and FE2. When applied to all researchers, its performance was similar to or slightly worse than that of the Baseline.
- This validates the paper's two main hypotheses: (1) incorporating common author relations helps for specific researchers, and (2) the features FE1 and FE2 effectively identify these researchers.
- Increasing the thresholds for FE1 and FE2 generally led to higher precision, recall, and F1 scores for CARE on the selected subset, further confirming the features' relevance.
- Two alternative features (FE3: absolute number of common author pairs; FE4: ratio of authors common to all articles) were tested and found ineffective.
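
For reference, precision, recall, and F1 for a top-N list are commonly computed as sketched below. The exact evaluation protocol is not described in this summary, so treating "relevant" as a researcher's held-out articles is an assumption, and the function name and article ids are hypothetical.

```python
def precision_recall_f1(recommended, relevant):
    """Standard top-N metrics for one researcher.

    recommended: ranked list of recommended article ids.
    relevant: collection of article ids the researcher actually saved (held out).
    """
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for article in recommended if article in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 5 recommendations, 4 held-out articles, 2 hits.
precision, recall, f1 = precision_recall_f1([10, 11, 12, 13, 14], {11, 14, 20, 21})
```

Here two of the five recommendations are hits, giving a precision of 2/5, a recall of 2/4, and an F1 of 4/9.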

Practical Implications:

This research provides a practical approach for enhancing scientific article recommendations by tailoring the algorithm to user behavior. Instead of a single complex model, it proposes a two-stage process: identify users who follow authors, then apply a graph-based method incorporating authorship links for those users. This hybrid strategy can lead to more relevant recommendations for a specific user segment without negatively impacting others, potentially improving user satisfaction on academic platforms.
