
Abstract

LLMs have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0 = 1}^{n} \sum_{i_0 = 1}^{d} \left( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} \right)^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product between $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn. $B \in \mathbb{R}^{n \times d}$, and $b_{j_0,i_0} \in \mathbb{R}$ is the entry at the $j_0$-th row and $i_0$-th column of $B$; $Y_{*,i_0} \in \mathbb{R}^{d}$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm to train the loss function $L(X,Y)$ up to $\epsilon$ error that runs in $\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$ time. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix by another $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
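For intuition, the following is a minimal NumPy sketch that evaluates the objective $L(X,Y)$ directly from the definitions in the abstract. It is a brute-force illustration only (the function name and loop structure are our own, not the paper's), and it does not implement the paper's fast $\widetilde{O}(\cdot)$-time greedy algorithm; it assumes $\mathsf{A}_{j_0}$ is the block of $A_1 \otimes A_2$ obtained from the $j_0$-th row of $A_1$.

```python
import numpy as np

def attention_regression_loss(A1, A2, A3, B, X, Y):
    """Evaluate L(X, Y) for the one-layer attention regression objective.

    A1, A2, A3, B: (n, d) matrices; X, Y: (d, d) variables.
    A = kron(A1, A2) has shape (n^2, d^2); its j0-th block A_{j0} is the
    (n, d^2) slice kron(A1[j0, :], A2).
    """
    n, d = A1.shape
    x = X.reshape(d * d)                      # vectorization of X
    loss = 0.0
    for j0 in range(n):
        A_j0 = np.kron(A1[j0:j0 + 1, :], A2)  # j0-th block of A, shape (n, d^2)
        u = np.exp(A_j0 @ x)                  # entrywise exp(A_{j0} x)
        s = u / u.sum()                       # <exp(A_{j0} x), 1_n>^{-1} exp(A_{j0} x)
        pred = s @ (A3 @ Y)                   # length-d vector with entries <s, A3 Y_{*,i0}>
        loss += np.sum((pred - B[j0]) ** 2)   # sum over i0 of squared residuals
    return loss
```

This direct evaluation costs roughly $n \cdot (nd^2 + nd)$ time per call, which is far slower than the stated $\widetilde{O}({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega})$ per-iteration bound; it is meant only to make the softmax-style structure of the loss concrete.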
