Chomsky Normal Form — Theory of Computation

🌱 What Is Chomsky Normal Form?

A CFG is said to be in Chomsky Normal Form (CNF) when every production rule follows one of only two shapes:

1. A → BC

A non-terminal produces two non-terminals.

2. A → a

A non-terminal produces one terminal symbol.

Special case:

The start symbol may also allow:

S → ε
(only if the language includes the empty string)

That’s all.
No long rules.
No mixing terminals with variables.
No shortcuts.

CNF makes grammar extremely structured.


💡 Why Do We Convert Grammars into CNF?

Even though CNF may look restrictive, it gives us many advantages:

1. Parsing becomes mechanical

Algorithms like CYK work only with CNF.

2. Parse trees become perfectly binary

Every split produces exactly two branches, making tree shapes uniform.

3. It helps in theoretical proofs

Many results in automata theory rely on CNF’s clean structure.

4. It removes unnecessary mess

Rules become short, clear, and easy to follow.

Think of CNF like turning messy handwritten notes into a clean, organized digital document.


🎯 A Grammar That is NOT in CNF

Let’s take a small CFG:

S → aBB
B → bB | bb

This breaks CNF rules because:

  • aBB mixes terminals and variables
  • bB mixes terminal + variable
  • bb contains two terminals
  • Some rules have more than two symbols

CNF allows none of that.


🔧 How CNF Fixes the Grammar

When converting a grammar to CNF, we break down complex rules into small, neat pieces.

For example:

S → aBB
would eventually be rewritten using new helper symbols:

S → X B
X → a B

These new non-terminals exist purely to enforce CNF structure.
The meaning stays the same — only the shape changes.


🌳 A Simple CNF Grammar and Parse Tree

Consider the following clean CNF grammar:

S → AB
A → a
B → b

It generates the string "ab".

Here is its parse tree:

        S
       / \
      A   B
      |   |
      a   b

Every internal node splits into two children.
Every leaf is a terminal.
This neat binary shape is exactly what CNF aims for.


🧠 Important Things to Remember

Here’s a quick checklist in plain English:

✔ Productions must be either A → BC or A → a

Nothing else is allowed.

✔ No mixing terminals and non-terminals on the same right side

So rules like A → aB are not allowed.

✔ Long rules must be broken into pieces

A production like A → BCD must be rewritten step-by-step.

✔ Only the start symbol may produce ε

And only if the language includes the empty string.