🔍 Self-Attention Mechanism

  • The self-attention mechanism is the core building block of the Transformer model.
  • It lets the model learn dependencies between the words of a sentence regardless of how far apart they are.
  • A traditional RNN processes words one at a time, whereas self-attention handles all positions in parallel, which greatly improves computational efficiency.

1️⃣ What Is Self-Attention?

Self-attention is a mechanism that computes how strongly each word is related to the other words in the same sentence.

💡 Example sentence:
“The animal didn't cross the street because it was too tired.”
➡ Does “it” refer to “the animal” or to “the street”?
➡ Self-attention helps the model learn this kind of relationship correctly!


2️⃣ Scaled Dot-Product Attention

Attention needs a measure of how similar two words are. In the Transformer, that similarity is the dot product of the two word vectors.

The Transformer's self-attention mechanism is built on Scaled Dot-Product Attention:

\[\text{Attention}(Q, K, V) = \text{softmax} \left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

where:

| Symbol | Meaning |
|---|---|
| \(Q\) (Query) | Representation vector of the current word |
| \(K\) (Key) | Representation vectors of all words in the sentence |
| \(V\) (Value) | Vectors carrying the information of every word |
| \(d_k\) | Dimensionality of the Key vectors |
| Softmax | Converts the attention scores into a probability distribution |
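
To make the shapes concrete, suppose (as an assumption; the symbols \(n\) and \(d_v\) are not used elsewhere in this post) that the sentence has \(n\) tokens and each Value vector has \(d_v\) dimensions. Stacking one vector per row, the shapes would work out as:

\[Q \in \mathbb{R}^{n \times d_k}, \quad K \in \mathbb{R}^{n \times d_k}, \quad V \in \mathbb{R}^{n \times d_v}\]

\[QK^T \in \mathbb{R}^{n \times n}, \quad \text{softmax} \left(\frac{QK^T}{\sqrt{d_k}}\right) V \in \mathbb{R}^{n \times d_v}\]

Row \(i\) of \(QK^T\) holds the scores of word \(i\) against every word in the sentence, and the Softmax is applied to each row separately.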

3️⃣ How Self-Attention Works

🔹 Step 1: Compute the Query, Key, and Value vectors

Each word (token) of the input sentence is turned into three vectors:
Query (\(Q\)), Key (\(K\)), and Value (\(V\))

\[Q = XW_q, \quad K = XW_k, \quad V = XW_v\]

Here \(X\) is the matrix of input token embeddings, and \(W_q, W_k, W_v\) are learnable weight matrices.
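
As a minimal sketch of this projection step in plain Python (no libraries), assume a hypothetical sentence of 3 tokens with 4-dimensional embeddings and hand-picked 4×2 weight matrices. The helper `matmul` and every number below are illustrative assumptions; in a real model the weights are learned.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical embeddings X: 3 tokens, each a 4-dimensional vector.
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Hypothetical projection matrices (4 x 2); learned during training in practice.
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_v = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]

Q = matmul(X, W_q)  # one Query row per token
K = matmul(X, W_k)  # one Key row per token
V = matmul(X, W_v)  # one Value row per token

print(Q)  # [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
```

All three projections start from the same \(X\); only the weight matrices differ, which is what lets one token play three different roles.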


🔹 Step 2: Compute the attention scores (similarity)

  • The similarity between the Query vectors (\(Q\)) and the Key vectors (\(K\)) is computed with a dot product, then scaled:

\[\frac{QK^T}{\sqrt{d_k}}\]
  • Why divide by \( \sqrt{d_k} \)?
    → Dot products grow in magnitude as \(d_k\) grows. Scaling keeps the scores from becoming too large, which would saturate the Softmax and shrink its gradients, so training stays stable.
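
A short derivation suggests why \(\sqrt{d_k}\) is the natural scale. It rests on a simplifying assumption that this post does not state: the components \(q_i\) and \(k_i\) of one Query vector and one Key vector are independent, with mean 0 and variance 1.

\[q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \quad \mathbb{E}[q \cdot k] = 0, \quad \mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k\]

\[\mathrm{Var} \left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1\]

Under that assumption a typical score has magnitude of about \(\sqrt{d_k}\) before scaling and about 1 after it, regardless of the vector size.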

🔹 Step 3: Apply Softmax

  • The attention scores are passed through the Softmax function, row by row, to produce a probability distribution:
\[\text{Softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right)\]
  • The higher a weight, the more the corresponding word contributes to the current word's representation.
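
The sketch below illustrates the Softmax step together with the effect of the \(\sqrt{d_k}\) scaling. The scores and the value `d_k = 16` are made-up numbers for illustration, and `softmax` is a small helper written here, not a library function.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 16                      # hypothetical Key dimension
raw_scores = [8.0, 4.0, 0.0]  # hypothetical dot products of one Query with three Keys

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(w, 3) for w in unscaled])
print([round(w, 3) for w in scaled])
```

Without scaling, the first Key takes roughly 98% of the weight; with scaling it takes roughly two thirds, so the other words still contribute.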

🔹 Step 4: Weighted sum of the Value vectors

  • The weights from the Softmax are multiplied with the Value vectors (\(V\)) to produce the final result:
\[\text{Output} = \text{Softmax} \left(\frac{QK^T}{\sqrt{d_k}}\right) V\]
  • The result is a new vector for each word that reflects how strongly it is related to every other word.
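
A tiny sketch of this weighted sum, assuming two tokens with made-up attention weights and Value vectors (the helper name `weighted_sum` is hypothetical):

```python
# Hypothetical attention weights: one row per query token, each row sums to 1.
weights = [
    [0.5, 0.5],
    [1.0, 0.0],
]
# Hypothetical Value vectors: one row per token.
V = [
    [1.0, 2.0],
    [3.0, 4.0],
]

def weighted_sum(weights, V):
    """Return weights @ V: each output row is a weighted average of the rows of V."""
    return [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]

output = weighted_sum(weights, V)
print(output)  # [[2.0, 3.0], [1.0, 2.0]]
```

The first token attends equally to both tokens, so its output is the average of the two Value vectors; the second attends only to the first token, so its output is a copy of that token's Value vector.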

4️⃣ Self-Attention in Action: An Example

Example sentence

“The cat is sitting on the mat.”

| Word | Query (\(Q\)) | Key (\(K\)) | Value (\(V\)) |
|---|---|---|---|
| cat | \(Q_1\) | \(K_1\) | \(V_1\) |
| sit | \(Q_2\) | \(K_2\) | \(V_2\) |
| mat | \(Q_3\) | \(K_3\) | \(V_3\) |
| on | \(Q_4\) | \(K_4\) | \(V_4\) |

  1. Take the dot product of each word's Query with every word's Key, scaled by \(\sqrt{d_k}\), to get the similarity scores
  2. Normalize the scores with Softmax to get the attention weights
  3. Multiply the attention weights with the Value vectors and sum them to produce the final output vectors

➡ As a result, the model can learn from context that “cat” and “sit” are strongly related!
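
The three steps can be sketched end to end in plain Python. This is an illustration under loud assumptions, not a trained model: the 4-dimensional word vectors are hand-picked so that “cat” and “sit” point in similar directions, and the projections are taken to be the identity, so \(Q = K = V = X\).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of rows; returns (weights, output)."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Steps 1-2: scaled dot product with every Key, then Softmax.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Step 3: weighted sum of the Value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return weights, output

words = ["cat", "sit", "mat", "on"]
# Hypothetical vectors, hand-picked rather than learned.
X = [
    [1.0, 1.0, 0.0, 0.0],  # cat
    [1.0, 0.8, 0.0, 0.2],  # sit
    [0.0, 0.2, 1.0, 0.0],  # mat
    [0.0, 0.0, 0.2, 1.0],  # on
]

# Simplifying assumption: identity projections, so Q = K = V = X.
weights, output = attention(X, X, X)

for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

With these vectors, the row for “cat” puts more weight on “sit” than on “mat” or “on”. In a trained Transformer the same pattern would come from the learned \(W_q, W_k, W_v\) rather than from hand-picked numbers.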


5️⃣ Why Self-Attention Matters

✅ Captures context well (learns stronger relationships than a conventional RNN)
✅ Parallelizable (an RNN processes tokens sequentially, whereas self-attention supports parallel processing)
✅ Learns relationships between distant words even in long sentences
✅ Essential across NLP tasks such as translation, question answering, summarization, and dialogue models