- The paper introduces CARE, a method that leverages common author relations to enhance article recommendations based on user reading history.
- It selects target researchers using FE1 and FE2 features to identify those following author-based search patterns.
- The graph-based ranking using Random Walk with Restart significantly improves precision, recall, and F1 scores for the selected researcher subset.
This paper presents a novel method called CARE (Common Author relation-based REcommendation) for recommending scientific articles to researchers. The core idea is to leverage the observation that researchers often look for papers written by the same authors they have previously found relevant ("author-based search pattern"). However, the authors recognize that not all researchers exhibit this pattern. Therefore, CARE is designed to first identify suitable target researchers and then apply a specialized recommendation algorithm for them (2008.04652).
Problem Addressed:
- Existing article recommenders often use generic algorithms for all users, ignoring individual search behaviors like focusing on specific authors.
- Content-based methods can be complex and computationally expensive due to the large volume of text in articles.
- Standard collaborative filtering often ignores valuable information like authorship links between papers.
CARE Methodology:
The CARE method consists of two main components:
- Target Researcher Selection:
* Researchers whose FE1 or FE2 values exceed predefined thresholds are considered suitable targets for the CARE ranking algorithm.
- Graph-based Article Ranking:
- For the selected target researchers, a heterogeneous graph $G = (V_R \cup V_A,\ E_{RA} \cup E_{AA})$ is constructed.
- $V_R$: Set of researcher nodes.
- $V_A$: Set of article nodes.
- $E_{RA}$: Edges representing reading history (researcher $R_i$ read article $A_j$).
- $E_{AA}$: Edges representing common author relations (articles $A_i$ and $A_j$ share at least one author).
- A Random Walk with Restart (RWR) algorithm is applied to this graph.
- The walk starts at the target researcher node ($v_0$).
- At each step, the walker moves to a neighboring node with probability $\alpha$ (following the transition probabilities $T$) or restarts at $v_0$ with probability $1-\alpha$.
- Transition probabilities ($T_{RA}$, $T_{AR}$, $T_{AA}$) are calculated from the adjacency matrices representing reading relations ($W_{RA}$) and common author relations ($W_{AA}$). For instance, the probability of moving from article $i$ to article $j$ is:
$$T_{AA}(i,j) = \frac{W_{AA}(i,j)}{\sum_{k_1} W_{AR}(i,k_1) + \sum_{k_2} W_{AA}(i,k_2)}$$
This normalizes the probability by the total number of connections (to researchers or other articles) from article $i$.
- The algorithm iteratively updates the scores of all nodes until convergence. The final scores of the article nodes (ScoreArticle) represent their relevance to the target researcher.
- Top-N ranked articles not already in the researcher's library are recommended.
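The transition step described above can be sketched with NumPy. This is a minimal illustration, not the paper's code: the function name `build_transition_matrix` and the toy matrices are hypothetical, $W_{AR}$ is assumed to be the transpose of $W_{RA}$, and the researcher rows are assumed to be normalized the same way as the article rows (the summary only spells out the formula for $T_{AA}$).

```python
import numpy as np


def build_transition_matrix(W_RA: np.ndarray, W_AA: np.ndarray) -> np.ndarray:
    """Row-normalize the block adjacency matrix [[0, W_RA], [W_AR, W_AA]].

    Assumption: W_AR is taken to be W_RA transposed.
    """
    n, m = W_RA.shape                      # n researchers, m articles
    W = np.zeros((n + m, n + m))
    W[:n, n:] = W_RA                       # researcher -> article (reading history)
    W[n:, :n] = W_RA.T                     # article -> researcher (assumed W_AR)
    W[n:, n:] = W_AA                       # article -> article (common authors)
    row_sums = W.sum(axis=1, keepdims=True)
    # Nodes with no edges keep an all-zero row instead of dividing by zero.
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)


# Toy data (made up): 2 researchers, 3 articles.
W_RA = np.array([[1, 1, 0],
                 [0, 1, 0]], dtype=float)
W_AA = np.array([[0, 0, 1],
                 [0, 0, 0],
                 [1, 0, 0]], dtype=float)
T = build_transition_matrix(W_RA, W_AA)
```

In this toy graph, article 0 has one reader and one common-author neighbor, so its row splits evenly between them, which matches the $T_{AA}$ normalization: the denominator counts both researcher and article connections.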
Implementation Considerations:
- Data Requirements: Requires researcher reading history (e.g., from CiteULike libraries) and author information for each article. The authors crawled CiteULike to obtain author data missing from the original dataset version.
- Graph Construction: Building the adjacency matrices $W_{RA}$ and $W_{AA}$ is the first step. $W_{AA}$ requires pairwise comparison of author lists for all articles, which can be computationally intensive for large datasets. Defining "common authors" (e.g., requiring at least two shared authors, as done in the paper) can mitigate noise from common names.
- RWR Parameters: The parameter $\alpha$ (the probability of continuing the walk; the walker restarts with probability $1-\alpha$) and the number of iterations (maxStep) need tuning. The paper found $\alpha = 0.8$ worked well.
- Scalability: RWR on large graphs can be computationally demanding. Techniques like graph partitioning or approximation methods might be needed for very large datasets.
- Feature Thresholds: The thresholds for FE1 and FE2 need to be determined, potentially via cross-validation on a hold-out set, to balance the trade-off between the number of targeted researchers and the performance gain.
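Two of the considerations above, graph construction and threshold-based selection, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the function names, the `min_shared` parameter, the author strings, and the feature values are all hypothetical, and the computation of FE1 and FE2 is not shown because this summary does not define it. The sketch uses an inverted index from authors to articles so that only articles sharing an author are ever paired.

```python
from collections import defaultdict
from itertools import combinations


def common_author_edges(article_authors, min_shared=1):
    """Return {(i, j): shared_count} for article pairs with >= min_shared common authors."""
    by_author = defaultdict(set)           # inverted index: author -> articles
    for article, authors in article_authors.items():
        for author in set(authors):        # ignore duplicate names within one article
            by_author[author].add(article)
    shared = defaultdict(int)
    for articles in by_author.values():
        for i, j in combinations(sorted(articles), 2):
            shared[(i, j)] += 1            # one more author shared by i and j
    return {pair: count for pair, count in shared.items() if count >= min_shared}


def select_targets(features, fe1_threshold, fe2_threshold):
    """Keep researchers whose FE1 or FE2 value exceeds its threshold."""
    return [r for r, (fe1, fe2) in features.items()
            if fe1 > fe1_threshold or fe2 > fe2_threshold]


# Toy data (made up for illustration).
article_authors = {
    0: ["A. Lee", "B. Kim"],
    1: ["B. Kim", "A. Lee", "C. Wu"],
    2: ["C. Wu"],
}
edges_any = common_author_edges(article_authors, min_shared=1)
edges_two = common_author_edges(article_authors, min_shared=2)

features = {"r1": (0.6, 0.1), "r2": (0.2, 0.2), "r3": (0.1, 0.9)}
targets = select_targets(features, fe1_threshold=0.5, fe2_threshold=0.5)
```

The inverted index still does quadratic work for each very prolific author, so it reduces rather than removes the cost of building $W_{AA}$. Raising `min_shared` drops the weaker link between articles 1 and 2 in this toy example.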
Pseudocode for RWR (Algorithm 1 in paper):
    Algorithm: Graph-based article ranking
    Input: graph G, walk probability α (restart probability 1 - α), target researcher v0, max iterations maxStep, transition matrix T
    Output: ranking scores for articles ScoreArticle[1..m]

    Initialize ScoreAll[1..n+m] = 0
    ScoreAll[v0] = 1
    for step = 0 to maxStep-1:
        Initialize tmpScore[1..n+m] = 0
        for each node vx in G:
            for each neighbor vy of vx:
                tmpScore[vy] = tmpScore[vy] + α * ScoreAll[vx] * T(vx, vy)
        # Add restart probability
        tmpScore[v0] = tmpScore[v0] + (1 - α)
        ScoreAll = tmpScore
    ScoreArticle = ScoreAll[n+1 .. n+m]   # Extract scores for article nodes
    Return ScoreArticle
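A runnable counterpart to the pseudocode can be written in a few lines of NumPy. This is a sketch, not the authors' code: the function names, the early-stopping tolerance `tol`, and the toy transition matrix are assumptions added for illustration. The inner double loop of the pseudocode becomes the vector-matrix product `scores @ T`, and nodes are assumed to be indexed with researchers first, then articles.

```python
import numpy as np


def rwr_article_scores(T, v0, n_researchers, alpha=0.8, max_step=100, tol=1e-10):
    """Random Walk with Restart from node v0; returns scores of article nodes only."""
    restart = np.zeros(T.shape[0])
    restart[v0] = 1.0
    scores = restart.copy()
    for _ in range(max_step):
        # scores @ T pushes each node's score along its outgoing edges.
        new_scores = alpha * (scores @ T) + (1 - alpha) * restart
        converged = np.abs(new_scores - scores).sum() < tol  # assumed stopping rule
        scores = new_scores
        if converged:
            break
    return scores[n_researchers:]


def recommend(article_scores, read_articles, top_n):
    """Rank articles by score and skip those already in the researcher's library."""
    ranked = np.argsort(-article_scores)
    return [int(a) for a in ranked if int(a) not in read_articles][:top_n]


# Toy row-stochastic matrix (made up): nodes 0-1 are researchers, nodes 2-4 are articles 0-2.
T = np.array([
    [0.0, 0.0, 0.5, 0.5, 0.0],   # researcher 0 read articles 0 and 1
    [0.0, 0.0, 0.0, 1.0, 0.0],   # researcher 1 read article 1
    [0.5, 0.0, 0.0, 0.0, 0.5],   # article 0: one reader, shares an author with article 2
    [0.5, 0.5, 0.0, 0.0, 0.0],   # article 1: two readers
    [0.0, 0.0, 1.0, 0.0, 0.0],   # article 2: linked only to article 0
])
article_scores = rwr_article_scores(T, v0=0, n_researchers=2)
recommendations = recommend(article_scores, read_articles={0, 1}, top_n=1)
```

Article 2 is reachable from researcher 0 only through the common-author edge with article 0, so it receives a nonzero score and is the one unread candidate in this toy graph.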
Evaluation and Results:
- The experiments were conducted on a CiteULike dataset.
- Key Finding: CARE significantly outperformed the Baseline (RWR without common author relations or researcher selection) only when applied to the researchers selected using FE1 and FE2. When applied to all researchers, its performance was similar to or slightly worse than the Baseline.
- This validates the paper's two main hypotheses: (1) incorporating common author relations helps for specific researchers, and (2) the features FE1 and FE2 effectively identify these researchers.
- Increasing the thresholds for FE1 and FE2 generally led to higher precision, recall, and F1 scores for CARE on the selected subset, further confirming the features' relevance.
- Two alternative features (FE3: absolute number of common author pairs; FE4: ratio of authors common to all articles) were tested and found ineffective.
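For reference, the three metrics reported above can be computed per researcher as in the sketch below. It uses the standard definitions of precision, recall, and F1 over a top-N list; the paper's exact evaluation protocol (how articles are held out, which N is used) is not given in this summary, so the function name and the article IDs are purely illustrative.

```python
def precision_recall_f1(recommended, relevant):
    """Standard top-N metrics for one researcher.

    recommended: ranked list of recommended article IDs
    relevant:    held-out article IDs the researcher actually read
    """
    relevant = set(relevant)
    hits = sum(1 for article in recommended if article in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 4 recommendations, 3 held-out articles, 2 hits.
precision, recall, f1 = precision_recall_f1([7, 3, 9, 4], {3, 4, 8})
```

Here two of the four recommendations are relevant (precision 1/2) and two of the three held-out articles are recovered (recall 2/3), giving an F1 of 4/7.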
Practical Implications:
This research provides a practical approach for enhancing scientific article recommendations by tailoring the algorithm to user behavior. Instead of a single complex model, it proposes a two-stage process: identify users who follow authors, then apply a graph-based method incorporating authorship links for those users. This hybrid strategy can lead to more relevant recommendations for a specific user segment without negatively impacting others, potentially improving user satisfaction on academic platforms.