
Talk like a Graph: Encoding Graphs for Large Language Models (2310.04560v1)

Published 6 Oct 2023 in cs.LG

Abstract: Graphs are a powerful tool for representing and analyzing complex relationships in real-world applications such as social networks, recommender systems, and computational finance. Reasoning on graphs is essential for drawing inferences about the relationships between entities in a complex system, and to identify hidden patterns and trends. Despite the remarkable progress in automated reasoning with natural text, reasoning on graphs with LLMs remains an understudied problem. In this work, we perform the first comprehensive study of encoding graph-structured data as text for consumption by LLMs. We show that LLM performance on graph reasoning tasks varies on three fundamental levels: (1) the graph encoding method, (2) the nature of the graph task itself, and (3) interestingly, the very structure of the graph considered. These novel results provide valuable insight on strategies for encoding graphs as text. Using these insights we illustrate how the correct choice of encoders can boost performance on graph reasoning tasks inside LLMs by 4.8% to 61.8%, depending on the task.


Summary

  • The paper introduces a novel framework that converts graph structures into text, enabling LLMs to perform graph-based tasks.
  • It demonstrates that incident encoding and tailored prompt engineering significantly enhance performance on graph reasoning tasks.
  • Results reveal that LLMs struggle with global graph properties, underscoring limitations and pointing to the need for hybrid model approaches.

Encoding Graphs for LLMs: A Comprehensive Study

Introduction

The paper "Talk like a Graph: Encoding Graphs for LLMs" (2310.04560) presents a systematic investigation into the problem of representing graph-structured data as text for consumption by LLMs. The paper addresses a critical gap: while LLMs have demonstrated strong performance on a variety of text-based reasoning tasks, their ability to reason over graph-structured data—ubiquitous in domains such as social networks, recommender systems, and knowledge graphs—remains underexplored. The authors introduce a new benchmark, GraphQA, and conduct extensive experiments to analyze how graph encoding, prompt engineering, and graph structure affect LLM performance on fundamental graph reasoning tasks.

Figure 1: Overview of the framework for reasoning with graphs using LLMs, highlighting the modularity of graph encoding and prompt engineering.

Graph Encoding as Text: Methodological Framework

The core technical challenge addressed is the transformation of arbitrary graphs G = (V, E) into textual sequences W suitable for LLM input. The authors formalize this as the design of a graph encoding function g: G → W and a question rephrasing function q: W → W, such that the LLM f can be queried as A = f(g(G), q(Q)) for a question Q about the graph.

Figure 2: Overview of the framework for encoding graphs via text, illustrating the mapping from graph structure to natural language representations.

The paper systematically explores a taxonomy of graph encoding strategies, varying both node and edge representations. Node encodings include integer indices, English names, character names from popular media, and alphabetic labels. Edge encodings range from explicit adjacency lists to natural language statements of relationships (e.g., "A and B are friends"). The authors also experiment with different prompt engineering heuristics, including zero-shot, few-shot, chain-of-thought (CoT), and bag prompting.
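The contrast between edge-list-style and incident-style encodings can be sketched in a few lines of Python. This is an illustrative paraphrase using networkx, not the paper's verbatim prompt templates:

```python
import networkx as nx

def adjacency_encoding(G):
    # List every edge as a pair, after a brief preamble.
    lines = ["In an undirected graph, (i, j) means node i and node j are connected."]
    lines += [f"({u}, {v})" for u, v in G.edges()]
    return " ".join(lines)

def incident_encoding(G):
    # For each node, list its neighbors in a single statement, so all
    # information relevant to that node sits in adjacent text.
    lines = []
    for n in G.nodes():
        nbrs = ", ".join(str(m) for m in G.neighbors(n))
        lines.append(f"Node {n} is connected to nodes {nbrs}.")
    return " ".join(lines)

G = nx.path_graph(4)  # 0 - 1 - 2 - 3
print(adjacency_encoding(G))
print(incident_encoding(G))
```

Note how, under incident encoding, a node-degree question can be answered from a single sentence of the prompt, whereas the adjacency form scatters that node's edges across the sequence.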

Empirical Evaluation: GraphQA Benchmark

The GraphQA benchmark comprises a suite of basic graph tasks: edge existence, node degree, node count, edge count, connected nodes, cycle check, and disconnected nodes. These tasks are designed to probe both local and global graph reasoning capabilities of LLMs.
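Ground truth for such tasks is straightforward to compute programmatically, which is what makes them a clean probe of LLM reasoning. A minimal sketch with hypothetical solvers (task names paraphrased from the benchmark; networkx assumed):

```python
import networkx as nx

def ground_truth(G):
    # Hypothetical reference solvers for GraphQA-style tasks.
    return {
        "node count": G.number_of_nodes(),
        "edge count": G.number_of_edges(),
        "edge existence": lambda u, v: G.has_edge(u, v),
        "node degree": lambda u: G.degree(u),
        "connected nodes": lambda u: sorted(G.neighbors(u)),
        "cycle check": not nx.is_forest(G),  # True if any cycle exists
        "disconnected nodes": lambda u: sorted(set(G) - set(G.neighbors(u)) - {u}),
    }

G = nx.cycle_graph(5)
answers = ground_truth(G)
print(answers["node count"], answers["edge count"], answers["cycle check"])
# → 5 5 True
```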

Key Findings

1. LLMs Underperform on Basic Graph Tasks

Across all evaluated models and tasks, LLMs exhibit poor performance on basic graph reasoning, often failing to surpass simple majority baselines, especially for tasks such as edge existence and cycle detection. This highlights a fundamental limitation in the ability of LLMs to perform even elementary graph computations when provided with naïve textual encodings.

2. Graph Encoding Function Critically Impacts Performance

The choice of graph encoding function g(·) has a substantial effect on LLM accuracy. For example, incident encoding (where each node lists its neighbors) outperforms adjacency encoding for tasks like node degree and connected nodes, as it places relevant information in closer textual proximity. Integer node encodings improve arithmetic tasks, while named node encodings are advantageous for tasks with non-integer outputs.

3. Prompt Engineering and Question Framing Matter

Prompting strategies significantly influence outcomes. Zero-shot prompting suffices for simple tasks, but few-shot and CoT prompting yield improvements for more complex queries. Notably, rephrasing questions in application-specific language (e.g., "How many friends does Alice have?") consistently outperforms abstract graph-theoretic formulations.
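The rephrasing effect can be illustrated with a toy example; the node names and templates here are hypothetical, chosen to mirror the "friends" framing quoted above:

```python
node_names = {0: "Alice", 1: "Bob", 2: "Carol"}

def abstract_question(node):
    # Graph-theoretic framing of the query.
    return f"What is the degree of node {node}?"

def applied_question(node):
    # Application-specific framing of the same query, which the
    # paper reports consistently performs better.
    return f"How many friends does {node_names[node]} have?"

print(abstract_question(0))  # What is the degree of node 0?
print(applied_question(0))   # How many friends does Alice have?
```

Both questions have the same ground-truth answer; only the surface framing changes, yet the applied framing aligns better with how relational facts appear in LLM training data.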

4. Model Capacity Correlates with Graph Reasoning Ability

Larger LLMs (e.g., PaLM 62B) demonstrate improved performance on graph tasks compared to smaller variants, but even the largest models do not consistently outperform majority baselines on all tasks. The effect of scale is more pronounced for tasks requiring aggregation or multi-hop reasoning.

5. Graph Structure and Generator Influence LLM Performance

The structure of the input graph—determined by the graph generator (Erdős–Rényi, Barabási–Albert, SBM, star, path, complete)—has a marked impact on LLM accuracy. For instance, cycle detection is trivial for complete graphs but challenging for path graphs, reflecting LLMs' strong priors and susceptibility to distractors in the encoding.

Figure 3: Samples of graphs generated with different graph generators, illustrating the diversity of structures in the GraphQA benchmark.
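All of the generator families named above are available off the shelf in networkx; a sketch of how such a pool of benchmark graphs might be produced (the sizes and parameters here are arbitrary, not the paper's):

```python
import networkx as nx

n = 12
generators = {
    "erdos-renyi": nx.erdos_renyi_graph(n, p=0.3, seed=0),
    "barabasi-albert": nx.barabasi_albert_graph(n, m=2, seed=0),
    "sbm": nx.stochastic_block_model([6, 6], [[0.8, 0.05], [0.05, 0.8]], seed=0),
    "star": nx.star_graph(n - 1),  # one hub plus n-1 leaves
    "path": nx.path_graph(n),
    "complete": nx.complete_graph(n),
}
for name, G in generators.items():
    print(f"{name}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
```

Varying the generator shifts properties such as density and degree distribution, which is exactly the axis along which the paper observes LLM accuracy changing.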

6. LLMs Lack a Global Model of the Graph

Tasks requiring reasoning about the absence of edges (e.g., disconnected nodes) expose a critical weakness: LLMs are unable to infer global properties not explicitly encoded in the text, achieving near-zero accuracy.

Analysis of Graph Encoding Strategies

The authors provide a detailed ranking of encoding functions across tasks and prompting methods. Incident encoding is generally optimal for most prompting strategies, except in zero-shot settings where encodings with familiar names (e.g., politicians, fictional characters) perform better. The paper also finds that distractive statements in the encoding degrade performance, especially in dense graphs.

Figure 4: Example graph used to illustrate the output of different graph encoding functions.

Implications and Future Directions

Practical Implications

  • Black-box LLMs: The paper focuses on scenarios where LLM weights are inaccessible, emphasizing the importance of prompt and encoding design for practical deployment.
  • Task-Specific Encoding: Careful selection of encoding and question phrasing can yield performance gains of 4.8% to 61.8% on graph reasoning tasks, underscoring the need for task-aware prompt engineering.
  • Benchmarking: The GraphQA benchmark provides a valuable resource for evaluating and comparing LLMs on structured reasoning tasks.

Theoretical Implications

  • Limitations of Textual Encodings: The inability of LLMs to construct a global model of the graph from text suggests fundamental representational bottlenecks.
  • Inductive Biases: LLMs exhibit strong priors based on training data distributions, which can be maladaptive for synthetic or out-of-distribution graph structures.

Future Research Directions

  • Hybrid Architectures: Integrating explicit graph neural modules or external memory with LLMs may address the observed limitations in global reasoning.
  • Automated Encoding Search: Meta-learning or reinforcement learning approaches to discover optimal graph-to-text encodings could further improve performance.
  • Instruction Tuning: Fine-tuning LLMs on graph-structured data or augmenting pretraining corpora with synthetic graph-text pairs may enhance inductive biases for structured reasoning.

Conclusion

This paper provides a rigorous, empirical foundation for understanding how LLMs process graph-structured data when presented as text. The results demonstrate that LLM performance on graph reasoning is highly sensitive to encoding choices, prompt engineering, and graph structure. While current LLMs are not yet reliable for general graph reasoning in a black-box setting, the insights and benchmarks introduced here lay the groundwork for future advances in structured reasoning with LLMs.
