\input{../YKY-preamble.tex}
\setmainfont[BoldFont=Alibaba_Sans_Regular.otf,ItalicFont=Alibaba_Sans_Light_Italic.otf]{Alibaba_Sans_Light.otf}
\usepackage[active,tightpage]{preview} % for continuous page(s)
\renewcommand{\PreviewBorder}{0.5cm}
\renewcommand{\thempfootnote}{\arabic{mpfootnote}}
\usepackage[absolute,overlay]{textpos} % for page number on upper left corner
\usepackage{color}
\usepackage{mathtools}
\usepackage[hyperfootnotes=false]{hyperref}
% \usepackage[backend=biber,style=numeric]{biblatex}
% \bibliography{../AGI-book}
% \renewcommand*{\bibfont}{\footnotesize}
\usetikzlibrary{shapes}
\usepackage[export]{adjustbox} % ??
\usepackage{verbatim} % for comments
% \usepackage{newtxtext,newtxmath} % Times New Roman font
% \titleformat{\subsection}[hang]{\bfseries\large\color{blue}}{}{0pt}{}
% \numberwithin{equation}{subsection}
\newcommand{\underdash}[1]{%
\tikz[baseline=(toUnderline.base)]{
\node[inner sep=1pt,outer sep=10pt] (toUnderline) {#1};
\draw[dashed] ([yshift=-0pt]toUnderline.south west) -- ([yshift=-0pt]toUnderline.south east);
}%
}%
%\DeclareSymbolFont{symbolsC}{U}{txsyc}{m}{n}
%\DeclareMathSymbol{\strictif}{\mathrel}{symbolsC}{74}
\DeclareSymbolFont{AMSb}{U}{msb}{m}{n}
\DeclareSymbolFontAlphabet{\mathbb}{AMSb}
% \setmathfont{Latin Modern Math}
\DeclareMathOperator*{\argmin}{arg\,min}
% \usepackage[most]{tcolorbox}
\tcbset{on line,
boxsep=4pt, left=0pt,right=0pt,top=0pt,bottom=0pt,
colframe=red,colback=pink,
highlight math style={enhanced}
}
\newcommand{\atom}{\vcenter{\hbox{\tcbox{\mbox{ }}}}}
\let\oldtextbf\textbf
\renewcommand{\textbf}[1]{\textcolor{blue}{\oldtextbf{#1}}}
\newcommand{\logic}[1]{{\color{violet}{\textit{#1}}}}
\newcommand{\underconst}{\includegraphics[scale=0.5]{../2020/UnderConst.png}}
\newcommand{\KBsymbol}{\vcenter{\hbox{\includegraphics[scale=1]{../KB-symbol.png}}}}
\newcommand{\token}{\vcenter{\hbox{\includegraphics[scale=1]{token.png}}}}
\newcommand{\proposition}{\vcenter{\hbox{\includegraphics[scale=0.8]{proposition.png}}}}
\begin{document}
\begin{preview}
\cc{
\title{\vspace{-1.5cm} \bfseries\color{blue}{\LARGE Transformer 是一种逻辑结构}}
}{
\title{\vspace{-1.5cm} \bfseries\color{blue}{\LARGE Transformer as Logic-Base}}
}
% \author{YKY} % Your name
\date{\vspace{-2cm}} % Date, can be changed to a custom date
\maketitle
\setcounter{section}{-1}
% (1) Circled page number on upper left corner
%\begin{textblock*}{5cm}(2.1cm,2.3cm) % {block width} (coords)
%{\color{red}{\large \textcircled{\small 1}}}
%\end{textblock*}
\begin{minipage}{\textwidth}
\setlength{\parskip}{0.4\baselineskip}
In this infographic I explain a major finding, the culmination of many years of my research: the Transformer is a symbolic-logic machine.
For your convenience, here is a refresher on the Transformer's \textbf{Self-Attention} mechanism:
\begin{equation}
\vcenter{\hbox{\includegraphics[scale=0.9]{../2022/self-attention-2.png}}}
\end{equation}
``Input'' tokens are translated into $Q, K, V$ (query, key, value) vectors via matrix multiplications, which can be regarded as a kind of table look-up, or \textbf{memory store}:
\begin{equation}
\vcenter{\hbox{\includegraphics[scale=0.9]{../2022/QKV-memory-store.png}}}
\end{equation}
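To make the ``memory store'' picture concrete, here is a minimal sketch of one self-attention head in plain Python/NumPy. The shapes and random weights are made up for illustration only; a real Transformer adds masking conventions, residual connections, layer norm, etc.
{\footnotesize
\begin{verbatim}
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One self-attention head: a differentiable (soft) table look-up."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[1])           # how well each query matches each key
    scores -= scores.max(axis=1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax: a soft "address" over the keys
    return weights @ V                               # weighted mixture of the stored values

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                          # 5 input tokens of dimension d
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Y = self_attention(X, Wq, Wk, Wv)                    # one output vector per input token
\end{verbatim}
}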
From an abstract point of view, the Transformer has the following structure, which gives rise to its \textbf{equivariance} property (if the input elements are permuted, the output elements are permuted in the same way):
\begin{equation}
\vcenter{\hbox{\includegraphics[scale=1]{../2022/abstract-self-attention.png}}}
\end{equation}
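Continuing the NumPy sketch above, the equivariance property can be checked numerically: permuting the input rows permutes the output rows in exactly the same way.
{\footnotesize
\begin{verbatim}
# Reusing self_attention, rng, X, Wq, Wk, Wv (and the numpy import) from the sketch above.
P = rng.permutation(len(X))                          # shuffle the input tokens
Y  = self_attention(X,    Wq, Wk, Wv)
Yp = self_attention(X[P], Wq, Wk, Wv)
assert np.allclose(Y[P], Yp)                         # outputs are shuffled the same way
\end{verbatim}
}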
The equivariance property corresponds to the \textbf{exchangeability} of logic propositions:
\begin{equation}
A \wedge B \quad \Leftrightarrow \quad B \wedge A
\end{equation}
For example:
\begin{equation}
\mbox{it's raining} \wedge \mbox{I'm heart-broken} \quad \Leftrightarrow \quad \mbox{I'm heart-broken} \wedge \mbox{it's raining}
\end{equation}
Propositions are made up of \textbf{atomic concepts}, but here, at the sub-propositional level, atoms cannot be permuted freely, e.g.:
\begin{equation}
\mbox{I} \cdot \mbox{love} \cdot \mbox{her} \quad \neq \quad \mbox{she} \cdot \mbox{loves} \cdot \mbox{me}
\end{equation}
otherwise there would be no such thing as heartbreak.
Now let's recall some notions of \textbf{classical logic-based AI}. This is its basic architecture:
\begin{equation}
\vcenter{\hbox{\includegraphics[scale=1]{../2022/LBAI-basic-config-en.png}}}
\end{equation}
The Knowledge Base typically contains a huge number of rules, and the system needs to match these rules one by one against the propositions in the system's \textbf{state} (= working memory):
\begin{equation}
\vcenter{\hbox{\includegraphics[scale=1]{../2022/rete-explained-1b.png}}}
\end{equation}
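As a reminder of how such a rule-base operates, here is a toy sketch of naive forward chaining over a hand-made rule-base; the rules and propositions are invented purely for illustration, and production systems use the Rete algorithm to avoid re-matching every rule from scratch.
{\footnotesize
\begin{verbatim}
# Toy rule-base: each rule is (set of premise propositions, conclusion).
rules = [
    ({"it is raining", "I am outside"}, "I am wet"),
    ({"I am wet"}, "I am cold"),
]
state = {"it is raining", "I am outside"}            # working memory of propositions

changed = True
while changed:                                       # naive forward chaining
    changed = False
    for premises, conclusion in rules:
        if premises <= state and conclusion not in state:
            state.add(conclusion)                    # the rule "fires"
            changed = True

print(state)  # {'it is raining', 'I am outside', 'I am wet', 'I am cold'}
\end{verbatim}
}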
In the Transformer, there is a kind of memory shared \textbf{between} input elements (stored as the $Q, K, V$ matrices), and this memory \textbf{implicitly} plays the role of a logic rule-base:
\begin{equation}
\label{fig:self-attention-as-Rete}
\vcenter{\hbox{\includegraphics[scale=0.9]{../2022/rete-explained-3b.png}}}
\end{equation}
A crucial insight is that the \textbf{Self-Attention} mechanism has its origin in NTMs (\textbf{Neural Turing Machines}), proposed by Graves \textit{et al.} (2014). A Turing machine needs a ``memory tape'', but in the neural setting this memory must be \textit{differentiable}. If the memory is addressed by an index $i \in \mathbb{N}$, it won't be differentiable. So they came up with a \textbf{content-addressable} memory mechanism in which a memory matrix is looked up using the ``query-key-value'' method. A nice explanation of NTMs can be found in the book \textit{Fundamentals of Deep Learning} [Buduma, Locascio 2017].
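Here is a minimal sketch of content-based addressing in that spirit (not the exact NTM read/write equations): the query is compared against every memory row, the similarities are turned into a soft address by a softmax, and the read-out is a differentiable blend of the rows.
{\footnotesize
\begin{verbatim}
import numpy as np

def content_read(memory, query, beta=10.0):
    """Differentiable content-based read, in the spirit of the NTM read head."""
    sims = memory @ query / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(query) + 1e-8)  # cosine similarity
    w = np.exp(beta * sims)
    w /= w.sum()                                     # soft address: a distribution over rows
    return w @ memory                                # blend of rows, dominated by the best match

M = np.eye(4)                                        # a tiny 4-slot memory matrix
print(content_read(M, np.array([0.0, 1.0, 0.0, 0.0])))  # approximately the second row
\end{verbatim}
}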
Now consider LLMs (\textbf{Large Language Models}) such as BERT and GPT.
Given a natural-language sentence, we'd like to convert or \textbf{decompose} it into a list of logic propositions:
\begin{equation}
\label{fig:NL-2-propositions}
\vcenter{\hbox{\includegraphics[scale=1]{../2022/NL-sentence-2-propositions.png}}}
\end{equation}
The structure on the right of (\ref{fig:NL-2-propositions}) is the \textbf{mental state} of a logical AI system. It is composed of (exchangeable) propositions, which are in turn composed of atomic concepts. This 2-level structure is characteristic of all \textbf{symbolic logic} systems.
Surprisingly, we found that the Transformer exhibits exactly this 2-level logic structure.
In its \textbf{first layer}, the Transformer transforms each input word token into one proposition:
\begin{equation}
\vcenter{\hbox{\includegraphics[scale=1]{../2022/Transformer-1st-layer.png}}}
\end{equation}
The crucial point here is that the propositions are composed of atoms ($\atom$); in the Transformer this is achieved by \textbf{adding} vectors (which represent atomic concepts), i.e., by \textbf{superposition}. Note also that the Transformer is equivariant, so we must add a ``positional encoding'' to each word to indicate its position in the sentence.
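To make the superposition idea concrete, here is a rough sketch; the atom vectors below are random placeholders, and how a trained Transformer actually encodes atomic concepts is left open.
{\footnotesize
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Hypothetical atomic concepts, each represented by a vector (random, for illustration).
atoms = {name: rng.normal(size=d) for name in ["I", "love", "her"]}

def positional_encoding(pos, d):
    """One common variant of sinusoidal positional encoding for a single position."""
    i = np.arange(d // 2)
    angles = pos / 10000 ** (2 * i / d)
    return np.concatenate([np.sin(angles), np.cos(angles)])

# A proposition as a superposition (sum) of its atoms;
# at the input layer, each word also gets its positional encoding added.
proposition = atoms["I"] + atoms["love"] + atoms["her"]
word_input = atoms["love"] + positional_encoding(pos=1, d=d)
\end{verbatim}
}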
At \textbf{higher layers}, there is no need for positional encoding, and logic propositions can be freely exchanged, exactly as happens in the Transformer:
\begin{equation}
\vcenter{\hbox{\includegraphics[scale=1]{../2022/Transformer-higher-layers.png}}}
\end{equation}
Note that in the above, every $\Uparrow$ arrow uses the same $(Q,K,V)$ matrices as its ``rule-base'', which may limit the number of rules that can be represented. To circumvent this, \textbf{Multi-Head Attention} allows different $(Q,K,V)$ matrices to be used within the same layer.
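A sketch of how multiple heads enlarge the implicit rule-base: each head gets its own $(Q,K,V)$ projection matrices, and the heads' outputs are concatenated and mixed. The dimensions here are arbitrary, and a real implementation also includes residual connections, layer norm, and feed-forward sub-layers.
{\footnotesize
\begin{verbatim}
import numpy as np

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = np.exp(Q @ K.T / np.sqrt(K.shape[1]))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def multi_head(X, heads, Wo):
    """Each head carries its own (Wq, Wk, Wv), i.e. its own implicit 'rule-base'."""
    outs = [attention(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outs, axis=1) @ Wo         # concatenate the heads, then mix

rng = np.random.default_rng(0)
d, dh, n_heads = 16, 4, 4
X = rng.normal(size=(5, d))                          # 5 input "propositions"
heads = [tuple(rng.normal(size=(d, dh)) for _ in range(3)) for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * dh, d))
Y = multi_head(X, heads, Wo)                         # shape (5, d), one output per input
\end{verbatim}
}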
\end{minipage}
\end{preview}
\begin{comment}
\begin{preview}
\begin{minipage}{\textwidth}
\setlength{\parskip}{0.4\baselineskip}
\begin{textblock*}{20cm}(2.1cm,2cm) % {block width} (coords)
{\color{red}{\large \textcircled{\small 2}}}
\hspace{8cm}
\color{blue}{\footnotesize \cc{逻辑 Transformer}{Logic Transformer}}
\end{textblock*}
\vspace*{0.3cm}
\end{minipage}
\end{preview}
\end{comment}
\end{document}