🌊 What Does “Floating Point” Even Mean?
Let’s start simple.
When you write the number 12345, that’s an integer — it doesn’t have any fractional part.
But numbers like 3.14, 0.00045, or 2.71828 are real numbers — they include fractions or decimals.
Now, computers are great with integers, but real numbers can get very large or very tiny.
For example:
- Distance from Earth to Sun ≈ 1.496 × 10⁸ km
- Radius of an atom ≈ 5 × 10⁻¹¹ m
That’s a huge range!
Clearly, we can’t cover this whole range with a fixed-point format (a set number of digits before and after the point); we’d run out of bits!
So, computers use floating-point representation — where the decimal (or binary) point “floats” to where it’s needed.
🧮 The Floating-Point Format
A floating-point number is stored in this general scientific form:
$$
N = (-1)^S \times M \times 2^E
$$
Let’s break it down in plain English:
| Part | Name | Meaning |
|---|---|---|
| S | Sign bit | 0 means positive, 1 means negative |
| M | Mantissa (or Significand) | Represents the actual digits of the number |
| E | Exponent | Decides where the binary (or decimal) point “floats” |
🧠 Think of It Like This:
If we write 6.022 × 10²³, the number 6.022 is like the mantissa, and 23 is the exponent.
It’s the same idea in computers — just using base 2 (binary) instead of base 10.
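You can actually peek at this binary scientific notation from Python, whose built-in `float.hex()` method prints a float as a mantissa times a power of two (with the mantissa’s digits written in hex). A quick illustration:

```python
# float.hex() shows a float as (hex mantissa) * 2^exponent,
# i.e. binary scientific notation written with hex digits.
print((13.0).hex())     # 0x1.a000000000000p+3  ->  1.101 (binary) x 2^3
print((0.15625).hex())  # 0x1.4000000000000p-3  ->  1.01 (binary) x 2^-3
```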
💾 IEEE 754 Standard (The Common Format)
Almost every computer today uses the IEEE 754 standard for floating-point representation.
It defines two main types:
| Type | Total Bits | Sign | Exponent | Mantissa | Exponent Bias |
|---|---|---|---|---|---|
| Single Precision | 32 bits | 1 | 8 | 23 | 127 |
| Double Precision | 64 bits | 1 | 11 | 52 | 1023 |
(The bias is added to the true exponent before it’s stored; you’ll see it in action below.)
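Since Python’s `float` is an IEEE 754 double under the hood, you can confirm the double-precision numbers above directly (a minimal check using the standard `sys` module):

```python
import sys

# Python's float is an IEEE 754 double: 53 significant bits
# (52 stored + 1 hidden), hence roughly 15-16 decimal digits.
print(sys.float_info.mant_dig)  # 53
print(sys.float_info.max)       # 1.7976931348623157e+308
print(sys.float_info.epsilon)   # 2.220446049250313e-16
```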
🧩 Example: Single-Precision (32-bit) Format
| Sign | Exponent | Mantissa (Fraction) |
|---|---|---|
| 1 bit | 8 bits | 23 bits |
So a binary number like this:
0 10000010 10100000000000000000000
is interpreted as:
- Sign = 0 → Positive
- Exponent = 10000010₂ = 130₁₀, so the true exponent is 130 − 127 = 3
- Mantissa = 1.101₂ (the stored bits are 10100…0; the leading ‘1’ is hidden and always assumed)
- Actual number = 1.101₂ × 2³ = 1101₂ = 13.0
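You can double-check that decoding with Python’s standard `struct` module: pack the 32-bit pattern into 4 bytes, then reinterpret those bytes as a single-precision float. A quick sketch:

```python
import struct

# The 32-bit pattern from above: sign | exponent | mantissa
bits = 0b0_10000010_10100000000000000000000

# Reinterpret exactly those bits as an IEEE 754 single-precision float.
value = struct.unpack('>f', struct.pack('>I', bits))[0]
print(value)  # 13.0
```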
✏️ Step-by-Step Example
Let’s represent -5.75 in IEEE 754 (single precision):
- Convert to binary:
  5.75 = 101.11₂
- Normalize it:
  101.11₂ = 1.0111₂ × 2²
  → Mantissa = 0111 (padded with zeros to fill 23 bits)
  → Exponent = 2 + 127 = 129 = 10000001₂
  → Sign = 1 (since it’s negative)
- Combine all parts:
  1 | 10000001 | 01110000000000000000000
That’s your 32-bit floating-point representation of -5.75. ✅
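The same `struct` trick works in reverse to verify this by machine: pack −5.75 as a single-precision float and print its raw bits.

```python
import struct

# Pack -5.75 as a 32-bit float, then read back the raw bit pattern.
(bits,) = struct.unpack('>I', struct.pack('>f', -5.75))
s = format(bits, '032b')
print(s[0], s[1:9], s[9:])  # 1 10000001 01110000000000000000000
```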
⚙️ Floating-Point Arithmetic Operations
Just like integers, floating-point numbers can be added, subtracted, multiplied, or divided.
But since the point “floats,” the process takes a few extra steps.
Let’s explore them one by one in easy terms.
➕ Floating-Point Addition (and Subtraction)
- Align the exponents
  The numbers must have the same exponent before adding.
  Example: 1.23 × 10² + 4.56 × 10³ → rewrite 1.23 × 10² as 0.123 × 10³.
- Add (or subtract) the mantissas
  Once the exponents match, add the fractional parts normally.
- Normalize the result
  If the sum is not in normalized form (1.xxx × 2ⁿ), shift and adjust the exponent.
- Round if necessary
  Keep precision within 23 (or 52) mantissa bits.
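Here’s a toy sketch of those steps in Python. It isn’t how hardware does it (hardware works on the raw bit fields); it just mirrors the align → add → normalize sequence using the standard `math.frexp`/`math.ldexp` helpers:

```python
import math

def fp_add(a, b):
    """Toy model of floating-point addition: align, add, normalize."""
    ma, ea = math.frexp(a)  # a == ma * 2**ea, with 0.5 <= |ma| < 1
    mb, eb = math.frexp(b)
    # Step 1: align exponents by shifting the smaller number's mantissa.
    if ea < eb:
        ma, ea = ma / 2 ** (eb - ea), eb
    else:
        mb, eb = mb / 2 ** (ea - eb), ea
    # Step 2: add the mantissas now that the exponents match.
    m = ma + mb
    # Steps 3-4: ldexp rebuilds the float; the float type itself
    # renormalizes and rounds the result.
    return math.ldexp(m, ea)

print(fp_add(1.5, 0.375))  # 1.875
```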
✖️ Floating-Point Multiplication
- Add the exponents (subtracting the bias once, since each stored exponent already includes it)
- Multiply the mantissas
- Normalize and round
- Set sign bit based on rule:
(+) × (+) = (+)
(+) × (–) = (–)
(–) × (–) = (+)
Example:
(1.1 × 2³) × (1.0 × 2²) = (1.1 × 1.0) × 2⁵ = 1.1 × 2⁵
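Sketched in Python with `math.frexp` (which hands back an already unbiased exponent, so plain addition works):

```python
import math

def fp_mul(a, b):
    """Toy model: multiply mantissas, add exponents, renormalize."""
    ma, ea = math.frexp(a)  # a == ma * 2**ea
    mb, eb = math.frexp(b)
    return math.ldexp(ma * mb, ea + eb)

print(fp_mul(12.0, 10.0))  # 120.0
```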
➗ Floating-Point Division
- Subtract the exponents
- Divide the mantissas
- Normalize and round the result
- Set the sign accordingly
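And the mirror-image sketch for division, under the same toy-model caveat:

```python
import math

def fp_div(a, b):
    """Toy model: divide mantissas, subtract exponents."""
    ma, ea = math.frexp(a)
    mb, eb = math.frexp(b)
    return math.ldexp(ma / mb, ea - eb)

print(fp_div(120.0, 10.0))  # 12.0
```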
⚖️ Precision and Rounding
Because mantissa bits are limited (23 or 52), not all decimal numbers can be represented exactly.
For example, 0.1 (decimal) becomes an infinitely repeating binary fraction: 0.000110011001100…₂.
So computers round off the result to fit the available bits — leading to tiny errors called rounding errors.
That’s why in programming you’ll sometimes see weird outputs like 0.1 + 0.2 printing as 0.30000000000000004. It’s not a bug, it’s just rounding at work!
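You can reproduce this in any language that uses IEEE 754 doubles; in Python, the standard `decimal` module even shows the exact binary value that gets stored:

```python
from decimal import Decimal

print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
# Decimal reveals the exact value held by the double nearest to 0.1:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```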
🧭 Floating-Point Representation Diagram
Here’s a simple visual showing how a floating-point number is structured:
```
+--------------+-------------------+---------------------+
| Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits)  |
+--------------+-------------------+---------------------+
       ↓                 ↓                    ↓
   Negative        Controls “float”     Holds the digits
  or Positive       (power of 2)         of the number
```
And in formula form:
$$
\text{Value} = (-1)^{\text{Sign}} \times (1 + \text{Fraction}) \times 2^{\text{Exponent} - \text{Bias}}
$$
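Putting that formula directly into code: a small sketch that slices a 32-bit pattern into its three fields and evaluates the expression above (assuming a normalized number; zero, subnormals, infinity, and NaN follow special rules not handled here):

```python
def decode_float32(bits):
    """Evaluate the IEEE 754 value formula on a normalized 32-bit pattern."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    fraction = (bits & 0x7FFFFF) / 2 ** 23  # stored bits as a fraction < 1
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)

print(decode_float32(0b0_10000010_10100000000000000000000))  # 13.0
print(decode_float32(0b1_10000001_01110000000000000000000))  # -5.75
```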
💬 Everyday Analogy
Think of floating-point numbers like scientific notation in your calculator.
When your calculator shows 3.14E+02, it really means 3.14 × 10² = 314.
Computers do the same — but with base 2 (binary) and fixed-size memory spaces.