Summary: CBOW Architecture

Architecture

CBOW consists of:

  1. Input Layer – Context words (one-hot vectors)
  2. Hidden Layer (Embedding Layer) – Dense embeddings
  3. Output Layer – Target word prediction

Two weight matrices:

  • W₁ (Input → Hidden) of size V × N
  • W₂ (Hidden → Output) of size N × V

Where:

  • V = Vocabulary size
  • N = Embedding dimension
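As a minimal sketch of this setup (the sizes V = 10 and N = 4 are toy values chosen here for illustration), the two weight matrices can be initialized like this:

```python
import numpy as np

V, N = 10, 4  # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)

W1 = rng.normal(scale=0.1, size=(V, N))  # input -> hidden: one row per word
W2 = rng.normal(scale=0.1, size=(N, V))  # hidden -> output

print(W1.shape, W2.shape)  # (10, 4) (4, 10)
```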

Weight Matrix W₁ (Input → Hidden)

Each row of W₁ represents the embedding of one word.

Embedding Lookup

For a one-hot input word x_i (a 1 × V vector with a 1 at position i), the hidden layer is:

h = x_i W₁

This simply selects row i of W₁:

v_i = [w_i1, w_i2, …, w_iN]

For k context words, the lookup yields k embeddings v₁, v₂, …, v_k.
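The lookup can be sketched in NumPy (toy sizes assumed): multiplying a one-hot row vector by W₁ gives exactly the same result as indexing row i directly, which is why implementations skip the matrix multiplication.

```python
import numpy as np

V, N = 10, 4
W1 = np.arange(V * N, dtype=float).reshape(V, N)  # toy weight matrix

i = 3
x = np.zeros(V)
x[i] = 1.0                    # one-hot vector for word i

v_matmul = x @ W1             # full vector-matrix multiplication
v_lookup = W1[i]              # direct row lookup -- same result, cheaper

print(np.allclose(v_matmul, v_lookup))  # True
```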

Context Aggregation

CBOW averages all k context embeddings:

C = (1/k) (v₁ + v₂ + … + v_k)

This produces a single context vector C of dimension N.
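The averaging step, sketched with NumPy (the embedding values here are illustrative):

```python
import numpy as np

# embeddings of k = 3 context words (toy values, N = 4)
v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([2.0, 0.0, 2.0, 0.0])
v3 = np.array([0.0, 4.0, 1.0, 2.0])

embeddings = np.stack([v1, v2, v3])
C = embeddings.mean(axis=0)   # average over the k context words

print(C)  # [1. 2. 2. 2.]
```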

Weight Matrix W₂ (Hidden → Output)

Output Computation

The context vector is multiplied with W₂:

Z = C W₂

This gives:

Z = [z₁, z₂, …, z_V]

Raw scores for all vocabulary words.
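In NumPy this is a single vector-matrix product (toy sizes and random values assumed):

```python
import numpy as np

N, V = 4, 10
rng = np.random.default_rng(1)

C = rng.normal(size=N)        # context vector from the averaging step
W2 = rng.normal(size=(N, V))  # hidden -> output weights

Z = C @ W2                    # raw scores, one per vocabulary word
print(Z.shape)  # (10,)
```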

Softmax

Softmax:

  1. Takes raw scores z₁, z₂, …, z_V
  2. Converts them to exponentials
  3. Divides each by the total sum
  4. Produces probabilities between 0 and 1
  5. All probabilities sum to 1

Softmax converts scores into probabilities:

P(w_j) = exp(z_j) / Σ_{m=1…V} exp(z_m)

  • z_j → raw score
  • V → vocabulary size
  • P(w_j) → probability of the j-th word

The denominator normalizes the scores so that they sum to 1.

The word with highest probability is predicted.
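A small, numerically stable softmax sketch (subtracting the maximum score before exponentiating avoids overflow without changing the result):

```python
import numpy as np

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

Z = np.array([2.0, 1.0, 0.1])
P = softmax(Z)

print(P.argmax())  # 0 -> the word with the highest score is predicted
```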

Loss Calculation

Cross-entropy loss:  L = −log(P(y_true))
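The loss for a single example, given the softmax output and the index of the true target word (toy probabilities assumed):

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])  # softmax probabilities (toy values)
y_true = 0                      # index of the correct target word

loss = -np.log(P[y_true])       # cross-entropy for a one-hot target
print(round(loss, 4))  # 0.3567
```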

Backpropagation

The error is propagated backward to update:

  • W₂ (hidden → output)
  • W₁ (input → hidden / embeddings)
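A minimal gradient sketch for one training step, under the standard softmax + cross-entropy result dL/dZ = P − one_hot(target); since the context vector is an average, its gradient is split equally across the k context rows of W₁. All sizes and indices here are illustrative:

```python
import numpy as np

V, N, k, lr = 10, 4, 2, 0.05
rng = np.random.default_rng(2)

W1 = rng.normal(scale=0.1, size=(V, N))
W2 = rng.normal(scale=0.1, size=(N, V))

context = [1, 5]                 # indices of the k context words
target = 3                       # index of the true center word

# forward pass
C = W1[context].mean(axis=0)
Z = C @ W2
P = np.exp(Z - Z.max()); P /= P.sum()

# backward pass: for softmax + cross-entropy, dL/dZ = P - one_hot(target)
dZ = P.copy(); dZ[target] -= 1.0
dW2 = np.outer(C, dZ)            # gradient for hidden -> output
dC = W2 @ dZ                     # gradient flowing back into C

W2 -= lr * dW2
for i in context:                # averaging splits dC equally over k rows
    W1[i] -= lr * dC / k

# the loss on this example decreases after the update
C2 = W1[context].mean(axis=0)
Z2 = C2 @ W2
P2 = np.exp(Z2 - Z2.max()); P2 /= P2.sum()
print(-np.log(P2[target]) < -np.log(P[target]))  # True
```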

Repeat

The process is repeated for:

  • all sliding windows
  • all sentences
  • multiple epochs
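Putting all the steps together, here is a minimal end-to-end training sketch; the corpus, window size, and hyperparameters are illustrative assumptions, not from the original, and a real implementation would use negative sampling or hierarchical softmax instead of the full softmax shown here:

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}

V, N, window, lr, epochs = len(vocab), 8, 2, 0.05, 50
rng = np.random.default_rng(3)
W1 = rng.normal(scale=0.1, size=(V, N))
W2 = rng.normal(scale=0.1, size=(N, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

losses = []
for epoch in range(epochs):                       # multiple epochs
    total = 0.0
    for pos, word in enumerate(corpus):           # all sliding windows
        context = [word2idx[corpus[j]]
                   for j in range(max(0, pos - window),
                                  min(len(corpus), pos + window + 1))
                   if j != pos]
        target = word2idx[word]

        C = W1[context].mean(axis=0)              # aggregate context
        P = softmax(C @ W2)                       # predict target word
        total += -np.log(P[target])               # cross-entropy loss

        dZ = P.copy(); dZ[target] -= 1.0          # backpropagate
        dC = W2 @ dZ
        W2 -= lr * np.outer(C, dZ)
        for i in context:
            W1[i] -= lr * dC / len(context)
    losses.append(total)

print(losses[-1] < losses[0])  # loss decreases over epochs -> True
```

After training, the rows of W1 are the learned word embeddings.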