Post

🏫 [CS50x 2025] 4-2 File I/O

🏫 [CS50x 2025] 4-2 File I/O

🎯 WHY does a memory address need 8 bytes (64 bits), while an int is only 4 bytes (32 bits)?

① Because: It Depends on Address Space Size

A memory address needs to be big enough to point to any possible location in your system’s RAM (or virtual memory).

  • A 32-bit address can point to 2³² = 4,294,967,296 different memory locations → about 4 GB.
  • A 64-bit address can point to 2⁶⁴ = 18,446,744,073,709,551,616 locations → 18 exabytes of addressable space (yes… that’s exabyte, like… infinity ✨).

👉 So if your CPU uses 64-bit architecture, your pointers must be 64-bit wide, to be able to reference any byte in that massive memory space. That’s why pointers are 8 bytes (64 bits = 8 × 8 bits).


② So how to “calculate” how big a pointer should be?

Just use this rule of thumb:

1
sizeof(pointer_type) == sizeof(address_space_of_your_architecture)
ArchitectureMax Addressable MemoryPointer Size
16-bit2¹⁶ = 64 KB2 bytes
32-bit2³² = 4 GB4 bytes
64-bit2⁶⁴ = 18 EB8 bytes

So on 64-bit system:

1
2
3
4
5
6
7
#include <stdio.h>

int main() {
    int *p;
    printf("%zu\n", sizeof(p));  // Outputs 8
    return 0;
}

Boom 💥 You’ll get 8 as output, because p stores a 64-bit address!


③ Meanwhile… int is still 4 bytes because:

  • You rarely need numbers bigger than 2³² in normal logic (unless you’re doing crypto stuff, ect…).

💬 TL;DR

int stores a number → 4 bytes.
pointer stores an address → 8 bytes on a 64-bit machine, because that’s how big the road map is 🛣️🧭.


🧮 Unit Conversion: Bits ⇄ Bytes ⇄ KB ⇄ MB ⇄ GB ⇄ TB ⇄ PB ⇄ EB

UnitHow Many Bits?Notes
1 byte8 bits1B = 8b 🧠
1 KB1024 bytesNot 1000 (binary-based)
1 MB1024 KB= 1,048,576 bytes
1 GB1024 MB= 1,073,741,824 bytes
1 TB1024 GB= 1,099,511,627,776 bytes
1 PB1024 TB= 1,125,899,906,842,624 bytes
1 EB1024 PB= 1,152,921,504,606,846,976 bytes 😱

🧸 Handy formula:

1
2
3
4
1 byte = 8 bits
1 KB = 1024 bytes = 8,192 bits
1 MB = 1024 × 1024 bytes = 1,048,576 bytes = 8,388,608 bits
1 GB = 1024 × 1024 × 1024 bytes = 1,073,741,824 bytes

So if you’re ever like:

“How many bits in 4GB?”

Then the answer is:

1
2
4 GB = 4 × 1024 × 1024 × 1024 bytes = 4,294,967,296 bytes
→ × 8 = 34,359,738,368 bits 💥

🤯 Q: So my PC has 32GB RAM… that’s not even close to 18 EB?!

EXACTLY! 💡

This just uncovered the difference between physical memory vs. addressable memory space.

🧠 Here’s how it works:

  • You might have 32 GB physical RAM installed in your machine.
  • But your CPU can potentially address 2⁶⁴ memory locations (if it’s 64-bit) → that’s 18 EB of virtual memory address space.

So yes:

🔐 Even though most of that space is unused or unallocated, the pointer still needs to be big enough to reach any possible spot in that giant map.

Think of it like this:

  • Your house has 5 rooms (32 GB RAM), but the city planner gave you a street map for all of Tokyo (18 EB space)-you may not use every block, but you need the full-sized map to know where to go.

🧠 BYTE b; - What does it mean?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// Copies a file

#include <stdio.h>
#include <stdint.h>

typedef uint8_t BYTE;

int main(int argc, char *argv[])
{
    FILE *src = fopen(argv[1], "rb");
    FILE *dst = fopen(argv[2], "wb");

    BYTE b;

    while (fread(&b, sizeof(b), 1, src) != 0)
    {
        fwrite(&b, sizeof(b), 1, dst);
    }

    fclose(dst);
    fclose(src);
}


This line:

1
2
typedef uint8_t BYTE;
BYTE b;

🟢 First, typedef uint8_t BYTE; just means:

“Let’s create a cute nickname BYTE for the uint8_t type.”

And uint8_t is from <stdint.h> - it’s an unsigned 8-bit integer, i.e., a single byte (0-255 range).

So:

1
BYTE b;

is the same as:

1
uint8_t b;

💡 Meaning: you’re declaring a variable b that will hold 1 byte at a time-perfect for copying files byte-by-byte.


💭 Why != 0 in the while (fread(...)) condition?

Here’s the line:

1
while (fread(&b, sizeof(b), 1, src) != 0)

🔍 What does fread do?

1
fread(pointer_to_buffer, size_of_each_item, number_of_items, file_stream);

So here:

  • &b → where to store the data (pointer to our byte variable)
  • sizeof(b) → 1 byte (we’re reading one byte at a time)
    • the second parameter is in bytes
  • 1 → number of items to read
  • src → source file

It returns: the number of items successfully read.

So if it reads 1 byte successfully, it returns 1
If it reaches EOF (end of file) or hits an error, it returns 0

💥 != 0 is saying: “Keep reading bytes as long as fread is returning data.”

🔥 Here’s the important truth:

✅ The 4 in fread(buffer, sizeof(BYTE), 4, src) is not inherently in “bytes” - it is the count of items, not bytes.
But since you’re reading 4 items of size 1 byte → you end up reading 4 bytes.


🎯 fread(ptr, size, count, stream) - What’s that “count”?

That count parameter is:

💬 The number of data items (of size size) to read from the file stream into memory.

So in this line:

1
fread(&b, sizeof(b), 1, src)

You’re telling C:

“Hey, read 1 item that is sizeof(b) = 1 byte long, and put it into &b.”

🧠 So here:

  • size = sizeof(b) = 1
  • count = 1 → means “read 1 unit of that size” = 1 byte

Result: reads 1 byte from the file (not 1 file!)


🪄 Real-Life Analogy

Imagine:

1
fread(buffer, 4, 100, src);

That means:

“Read 100 items, each 4 bytes long = total 400 bytes from file src into buffer.”

So:

  • You’re still reading just one file.
  • But you’re asking: “Give me 100 chunks, each 4 bytes.”
  • fread will return how many chunks it successfully read.

💡 TL;DR

ParameterMeaning
sizeSize of each item in bytes
countHow many items to read
Total bytes read = size * count 
Has nothing to do with how many files! 

🐍 chunked or throttled writing in Python

chunk_size in downloads, not just local file writing

Like requests:

1
2
3
4
5
6
7
8
import requests

url = "https://example.com/bigfile.zip"
with requests.get(url, stream=True) as r:
    with open("bigfile.zip", "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):  # 👈 HERE
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)

✨ That chunk_size=1024 controls how much data gets read in each loop - which directly influences how fast and how often we write to disk. You can slow it down manually to throttle file writes (e.g., for poor disks or cautious writes).

👉 This means:

Read up to 1024 bytes from the HTTP stream and give them to me as chunk each time.

💡 Unit = bytes So:

  • chunk_size=1 → reads 1 byte at a time
  • chunk_size=1024 → reads 1024 bytes = 1 KB at a time
  • chunk_size=1048576 → 1 MB per read

✅ Just like a buffer size. We’re choosing how big each bite of the data is before we swallow it down into the file 🍽️


🧠 ② Emulating Slow Writing: Manual Throttling

You can also control speed by combining .write() with a time delay:

1
2
3
4
5
6
import time

with open("output.txt", "wb") as f:
    for chunk in chunks:  # assume each chunk is bytes
        f.write(chunk)
        time.sleep(0.1)  # ⏳ delay between writes

You can even dynamically adjust that sleep if you’re making a tool that mimics streamed downloads, network latency, or file corruption tools.


💿 ③ Slowing or chunking file saves in other APIs (e.g., shutil, aiofiles)

Python has some wrappers for file copying too:

1
2
3
import shutil

shutil.copyfileobj(src, dst, length=1024)  # 👈 control buffer size with `length`

That length=1024 is like your chunk_size, regulating how much data flows at once.


🧠 Q: In C, Can I fread(&b, sizeof(b), 4, src)?

1
2
BYTE b;  // just 1 byte of space
fread(&b, sizeof(BYTE), 4, src);  // ❌ you're reading 4 bytes into 1 byte

Let’s assume:

  • sizeof(b) = 1 (i.e., b is a BYTE = uint8_t)
  • 4 = count

So this means:

“Read 4 items, each 1 byte, total = 4 bytes into &b

BUT❗

If b is just a single byte, you’re gonna get undefined behavior 💀 because you’re telling C:

“Read 4 bytes into a memory space only big enough for 1 byte 😵”


✅ Correct Form to Read 4 Bytes at Once in C:

Must predefine a large enough array. You’d need a buffer array like this:

1
2
3
4
5
BYTE buffer[4];  // buffer has space for 4 bytes
while (fread(buffer, sizeof(BYTE), 4, src) != 0)
{
    fwrite(buffer, sizeof(BYTE), 4, dst);
}

This is safe and beautiful. You’re reading:

  • 4 items
  • each 1 byte
  • total 4 bytes
  • into a buffer that has room for 4 bytes

🧠 buffer is 4 bytes big → perfect match!

buffer size must be ≥ total bytes you read.

Each loop reads 4 bytes at once → just like chunk_size=4 in Python.


😼 TL;DR

  • Python’s chunk_size=1024 means read 1024 bytes per loop
  • In C, to do the same, you’d use fread(buffer, 1, 1024, src)

    • But remember to declare: BYTE buffer[1024];
  • Units are the same in both: bytes (not bits, not integers-just raw memory bytes)
  • If you mismatch buffer size and read size in C? Boom 💥 undefined behavior.

🌟Can We Use Smaller or bigger-than-byte Units, Like Bits, Kilobytes, Megabytes?

🧠 Can We Use Units Smaller Than a Byte? (Like BIT?)

Nope, not directly 😢 - RAM and CPUs only work with bytes as the minimum addressable unit.

So:

1
BIT buffer[4];  // ❌ INVALID unless you define BIT yourself as a byte type

There’s no built-in BIT in C, because:

  • You can’t actually access a single bit in memory on its own.
  • The smallest chunk memory gives you is 1 byte (8 bits).

BUT! If you want to simulate bit-by-bit storage, you have options like:

1
2
3
4
5
6
typedef struct {
    unsigned int bit0 : 1;
    unsigned int bit1 : 1;
    unsigned int bit2 : 1;
    unsigned int bit3 : 1;
} Bits4;

💡 This uses bit fields - but even then, the whole thing is stored in bytes under the hood.


✅ Can We Use Larger Units (KB, MB) in C?

YES! But manually:

There are no automatic KB/MB units in C, but you can create byte arrays of that size:

1
2
3
4
5
6
#define KB 1024
#define MB (1024 * KB)

BYTE buffer[KB];   // 1 KB buffer (1024 bytes)
BYTE mega[MB];     // 1 MB buffer (1048576 bytes)
// 👆 You can also name it `BYTE buffer[MB];` if you like

You’re still working in bytes, but now you’re defining named constants to make code human-readable.


😽 Real Examples:

1
2
3
4
5
6
#define KB 1024

typedef uint8_t BYTE;

BYTE buffer[KB];  // 1 kilobyte buffer
fread(buffer, sizeof(BYTE), KB, src);  // read 1KB at once

or:

1
BYTE buffer[512];  // 512 bytes (like a disk sector)

💡 Why do we use &b but not &buffer?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// Copies a file

#include <stdio.h>
#include <stdint.h>

typedef uint8_t BYTE;

int main(int argc, char *argv[])
{
    FILE *src = fopen(argv[1], "rb");
    FILE *dst = fopen(argv[2], "wb");

    BYTE b;

    while (fread(&b, sizeof(b), 1, src) != 0)
    {
        fwrite(&b, sizeof(b), 1, dst);
    }

    fclose(dst);
    fclose(src);
}

📍 Case 1: BYTE b; ➤ use &b

1
2
BYTE b; // Just ONE byte (a value)
fread(&b, sizeof(b), 1, src);

b is a single variable, not a pointer.
To give its memory address to fread, you must use &b
Because fread() needs a pointer - a place to store what it reads.

🧠 You’re saying:

“Here’s the address of my 1-byte variable - write into it.”


📦 Case 2: BYTE buffer[4]; ➤ use buffer (not &buffer)

1
2
BYTE buffer[4]; // An array = already a pointer to first element
fread(buffer, sizeof(BYTE), 4, src);

buffer is an array, and in C, an array decays into a pointer to its first element.

So buffer is already equivalent to &buffer[0] → no & needed!

🧠 You’re saying:

“Here’s the address of my 4-byte array - you can start writing from here.”


Pset - Volume

💡 Why does buffer *= factor adjust the volume?

🎧 WAV audio samples are just numbers.

In .wav files:

  • The actual audio data (after the 44-byte header) is a sequence of samples.
  • Each sample is usually stored as a signed 16-bit integer = int16_t

    • Value ranges from -32768 to +32767
  • The louder the sound, the farther the value is from 0.

So:

1
buffer *= factor;

Means:

🎚️ “Multiply the current sample’s amplitude by some factor (e.g., 0.5 or 2.0).”

  • If factor = 0.5 → volume goes down (quieter)
  • If factor = 2.0 → volume goes up (louder)

✨ You’re scaling the wave. Just like stretching or squishing a sound wave on a visualizer.


🧠 Visualization:

Let’s say you read a sample:

1
2
3
Original: buffer = 10000
factor = 1.5
buffer = 10000 * 1.5 = 15000

It gets louder 🎶 (But warning: too high, and it could overflow beyond 32767 = audio distortion)


Understand fread & EOF

✅ If You’re Only Reading the Header Once (like this):

1
2
fread(header, HEADER_SIZE, 1, input);
fwrite(header, HEADER_SIZE, 1, output);

Then you don’t need a loop or an EOF check.

Why?

Because you already know the header is 44 bytes. And you’re just saying: “Read me exactly 1 block that’s 44 bytes.”

🎯 So either:

  • It succeeds → all good ✅
  • Or it fails (e.g., input file too small, corrupted) → then fread() returns less than 1 (likely 0) ❌

If you wanna be a perfectionist, you can always do:

1
2
3
4
5
if (fread(header, HEADER_SIZE, 1, input) != 1)
{
    printf("Failed to read WAV header\n");
    return 1;
}

But no EOF loop needed if you’re reading a known number of bytes just once.


Clamp to 255

1. Use fminf() (for float, from <math.h>)

1
2
3
#include <math.h>

float sepiaRed = fminf(roundf(...), 255.0);

💡 This is clean and safe! Just don’t forget to #include <math.h>

2. Use Ternary Clamp (? : conditional operator)

1
image[i][j].rgbtRed = red > 255 ? 255 : red;

💡 Direct and expressive - fast to write, easy to read. Common in C!

👣 How to perform a swap directly inside main without using a function?

✨ To do this logic:

1
2
3
int x = 1;
int y = 2;
// swap(&x, &y);

But without using swap() as a separate function. Totally doable. But we gotta avoid touching pointers incorrectly.


❌ This is not valid:

1
2
3
int tmp = &x;  // ❌ wrong: &x is a memory address, not an int
&x = &y;       // ❌ double wrong: lvalue of &x can't be assigned
&y = tmp;      // ❌ not assignable

That’s like trying to assign to someone’s home address directly instead of going into the house and changing the contents 🏠


✅ The correct way to do a swap inline in main is simple:

1
2
3
4
5
6
int x = 1;
int y = 2;

int tmp = x;
x = y;
y = tmp;

That’s it!

This is the exact same logic as:

1
2
3
4
5
6
void swap(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

Just without the indirection (*a, *b) because you’re working with the actual variables directly, not through pointers.


💡 So here’s the full thing:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#include <stdio.h>

int main(void)
{
    int x = 1;
    int y = 2;

    printf("x is %i, y is %i\n", x, y);

    // Swap directly here!
    int tmp = x;
    x = y;
    y = tmp;

    printf("x is %i, y is %i\n", x, y);
}

👀 So when do we need pointers?

Only when:

  • You’re passing variables into a function and want to modify the originals (not copies)
  • Or you’re dealing with arrays, strings, or dynamic memory

In main, where x and y are already accessible, you don’t need & or * to swap them.


Why are all the valid fourth bytes from 0xe0 to 0xef, and how does that mean the first four bits are 1110?


🧠 First: What’s a Byte in Binary?

Every hexadecimal number (like 0xe0) represents 8 bits (aka 1 byte).

So:

  • 0xe0 = 1110 0000
  • 0xe1 = 1110 0001
  • 0xe2 = 1110 0010
  • 0xef = 1110 1111

Can you see it now? All of these share the first 4 bits: 1110


💡 So what does it mean when we say:

“The fourth byte’s first four bits are 1110”?

It means:

  • The byte is somewhere between 0xe0 and 0xef (hex range)
  • Because all those bytes start with the binary prefix 1110
  • The last 4 bits (called the “low nibble”) can be anything from 0000 to 1111

In other words:

1
if ((byte & 0xf0) == 0xe0)  // this checks the upper 4 bits

💥 That’s how you detect “fourth byte starts with 1110” - it’s a bitmask trick!


🧠 What it means - softly and clearly:

1
buffer[3] & 0xf0

We’re saying:

“I only care about the first 4 bits (the left ones) - just give me those.”

The last 4 bits (right ones) will be ignored, because we’re masking them out.

So for example:

buffer[3]Binary(buffer[3] & 0xf0)ResultMatch?
0xe011100000111000000xe0
0xe111100001111000000xe0
0xef11101111111000000xe0
0xea11101010111000000xe0
0xd811011000110100000xd0

This == 0xe0 condition only checks the first 4 bits!
It says: “Is this byte part of the 0xe0 to 0xef range?”


💬 Does it ignore the last 4 bits?

YES. The last 4 bits can be:

  • 0000 = 0x0
  • 0001 = 0x1
  • 1111 = 0xf

And ALL of them will still satisfy (buffer[3] & 0xf0) == 0xe0.

You’re only comparing the upper nibble - the “high 4 bits”.


🔍 Why do JPEGs have this pattern?

Because it’s part of the JFIF standard, where image segments (like headers, metadata, compressed data, etc.) start with marker bytes.

  • JPEG start-of-image (SOI) marker: 0xFF 0xD8
  • Followed by a marker segment, usually starting with 0xFF again, then one of those 0xE0-0xEF codes

That’s how JPEG decoders know:

“Yep, this is a valid segment of a JPEG file!”


Pset Recovery

💡 Why do we “add blocks to the current JPEG”?

It’s because a JPEG file is way bigger than 512 bytes.

Let’s say one photo is 1MB. That means:

1MB = 1024 * 1024 = 1,048,576 bytes So… that photo takes up 1,048,576 ÷ 512 = 2048 blocks on the memory card!

So when we’re recovering it, we don’t just copy the first block - we need to copy all 2048 of those 512-byte blocks into a single .jpg file 💡


🧃 Imagine this:

📸 One photo = many 512-byte pieces
  • 📦 Block 0: JPEG start
  • 📦 Block 1: more photo data
  • 📦 Block 2: even more photo data
  • 📦 Block 2047: last part of the photo

You want to copy all these blocks into one JPEG file on your computer.

That’s what we mean by:

“Adding blocks to the current JPEG”


🧠 So let’s say:

“A JPEG starts with a 512-byte block that has a signature.”
BUT - the JPEG continues into more blocks after that.
And we don’t know exactly how many blocks it takes.

So what do we do?


💡 Answer: We keep writing every block to the same file…

…until we find the start of the next JPEG.


💾 Imagine this structure on the memory card:

1
2
3
4
5
6
7
8
Block 0 → JPEG #1 start (0xff 0xd8 0xff ...)
Block 1 → continuation of JPEG #1
Block 2 → continuation of JPEG #1
Block 3 → start of JPEG #2! (new signature)
Block 4 → continuation of JPEG #2
Block 5 → continuation...
Block 6 → JPEG #3 start
...

Your job as the JPEG-saver

  • 👀 See block 0? It’s a JPEG start → open 000.jpg and start writing
  • 👀 Block 1? No signature → keep writing into 000.jpg
  • 👀 Block 2? No signature → still writing…
  • 👀 Block 3? New JPEG start! → close 000.jpg, open 001.jpg, start writing
  • Repeat 🌀
This post is licensed under CC BY 4.0 by the author.