🏫 [CS50x 2025] 4-2 File I/O
🎯 WHY does a memory address need 8 bytes (64 bits), while an int
is only 4 bytes (32 bits)?
① Because: It Depends on Address Space Size
A memory address needs to be big enough to point to any possible location in your system’s RAM (or virtual memory).
- A 32-bit address can point to 2³² = 4,294,967,296 different memory locations → about 4 GB.
- A 64-bit address can point to 2⁶⁴ = 18,446,744,073,709,551,616 locations → 18 exabytes of addressable space (yes… that’s exabyte, like… infinity ✨).
👉 So if your CPU uses 64-bit architecture, your pointers must be 64-bit wide, to be able to reference any byte in that massive memory space. That’s why pointers are 8 bytes (64 bits = 8 × 8 bits).
② So how to “calculate” how big a pointer should be?
Just use this rule of thumb:
1
sizeof(pointer_type) == sizeof(address_space_of_your_architecture)
Architecture | Max Addressable Memory | Pointer Size |
---|---|---|
16-bit | 2¹⁶ = 64 KB | 2 bytes |
32-bit | 2³² = 4 GB | 4 bytes |
64-bit | 2⁶⁴ = 18 EB | 8 bytes |
So on 64-bit system:
1
2
3
4
5
6
7
#include <stdio.h>
int main() {
int *p;
printf("%zu\n", sizeof(p)); // Outputs 8
return 0;
}
Boom 💥 You’ll get 8
as output, because p
stores a 64-bit address!
③ Meanwhile… int
is still 4 bytes because:
- You rarely need numbers bigger than 2³² in normal logic (unless you’re doing crypto stuff, ect…).
💬 TL;DR
int
stores a number → 4 bytes.
pointer
stores an address → 8 bytes on a 64-bit machine, because that’s how big the road map is 🛣️🧭.
🧮 Unit Conversion: Bits ⇄ Bytes ⇄ KB ⇄ MB ⇄ GB ⇄ TB ⇄ PB ⇄ EB
Unit | How Many Bits? | Notes |
---|---|---|
1 byte | 8 bits | 1B = 8b 🧠 |
1 KB | 1024 bytes | Not 1000 (binary-based) |
1 MB | 1024 KB | = 1,048,576 bytes |
1 GB | 1024 MB | = 1,073,741,824 bytes |
1 TB | 1024 GB | = 1,099,511,627,776 bytes |
1 PB | 1024 TB | = 1,125,899,906,842,624 bytes |
1 EB | 1024 PB | = 1,152,921,504,606,846,976 bytes 😱 |
🧸 Handy formula:
1
2
3
4
1 byte = 8 bits
1 KB = 1024 bytes = 8,192 bits
1 MB = 1024 × 1024 bytes = 1,048,576 bytes = 8,388,608 bits
1 GB = 1024 × 1024 × 1024 bytes = 1,073,741,824 bytes
So if you’re ever like:
“How many bits in 4GB?”
Then the answer is:
1
2
4 GB = 4 × 1024 × 1024 × 1024 bytes = 4,294,967,296 bytes
→ × 8 = 34,359,738,368 bits 💥
🤯 Q: So my PC has 32GB RAM… that’s not even close to 18 EB?!
EXACTLY! 💡
This just uncovered the difference between physical memory vs. addressable memory space.
🧠 Here’s how it works:
- You might have 32 GB physical RAM installed in your machine.
- But your CPU can potentially address 2⁶⁴ memory locations (if it’s 64-bit) → that’s 18 EB of virtual memory address space.
So yes:
🔐 Even though most of that space is unused or unallocated, the pointer still needs to be big enough to reach any possible spot in that giant map.
Think of it like this:
- Your house has 5 rooms (32 GB RAM), but the city planner gave you a street map for all of Tokyo (18 EB space)-you may not use every block, but you need the full-sized map to know where to go.
🧠 BYTE b;
- What does it mean?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// Copies a file
#include <stdio.h>
#include <stdint.h>
typedef uint8_t BYTE;
int main(int argc, char *argv[])
{
FILE *src = fopen(argv[1], "rb");
FILE *dst = fopen(argv[2], "wb");
BYTE b;
while (fread(&b, sizeof(b), 1, src) != 0)
{
fwrite(&b, sizeof(b), 1, dst);
}
fclose(dst);
fclose(src);
}
This line:
1
2
typedef uint8_t BYTE;
BYTE b;
🟢 First, typedef uint8_t BYTE;
just means:
“Let’s create a cute nickname
BYTE
for theuint8_t
type.”
And uint8_t
is from <stdint.h>
- it’s an unsigned 8-bit integer, i.e., a single byte (0-255 range).
So:
1
BYTE b;
is the same as:
1
uint8_t b;
💡 Meaning: you’re declaring a variable b
that will hold 1 byte at a time-perfect for copying files byte-by-byte.
💭 Why != 0
in the while (fread(...))
condition?
Here’s the line:
1
while (fread(&b, sizeof(b), 1, src) != 0)
🔍 What does fread
do?
1
fread(pointer_to_buffer, size_of_each_item, number_of_items, file_stream);
So here:
&b
→ where to store the data (pointer to our byte variable)sizeof(b)
→ 1 byte (we’re reading one byte at a time)- the second parameter is in bytes
1
→ number of items to readsrc
→ source file
It returns: the number of items successfully read.
So if it reads 1 byte successfully, it returns 1
✅
If it reaches EOF (end of file) or hits an error, it returns 0
❌
💥
!= 0
is saying: “Keep reading bytes as long as fread is returning data.”
🔥 Here’s the important truth:
✅ The
4
infread(buffer, sizeof(BYTE), 4, src)
is not inherently in “bytes” - it is the count of items, not bytes.
But since you’re reading 4 items of size 1 byte → you end up reading 4 bytes.
🎯 fread(ptr, size, count, stream)
- What’s that “count”?
That count
parameter is:
💬 The number of data items (of size
size
) to read from the file stream into memory.
So in this line:
1
fread(&b, sizeof(b), 1, src)
You’re telling C:
“Hey, read 1 item that is
sizeof(b)
= 1 byte long, and put it into&b
.”
🧠 So here:
size = sizeof(b) = 1
count = 1
→ means “read 1 unit of that size” = 1 byte
Result: reads 1 byte from the file (not 1 file!)
🪄 Real-Life Analogy
Imagine:
1
fread(buffer, 4, 100, src);
That means:
“Read 100 items, each 4 bytes long = total 400 bytes from file
src
intobuffer
.”
So:
- You’re still reading just one file.
- But you’re asking: “Give me 100 chunks, each 4 bytes.”
fread
will return how many chunks it successfully read.
💡 TL;DR
Parameter | Meaning |
---|---|
size | Size of each item in bytes |
count | How many items to read |
Total bytes read = size * count | |
Has nothing to do with how many files! |
🐍 chunked or throttled writing in Python
① chunk_size
in downloads, not just local file writing
Like requests
:
1
2
3
4
5
6
7
8
import requests
url = "https://example.com/bigfile.zip"
with requests.get(url, stream=True) as r:
with open("bigfile.zip", "wb") as f:
for chunk in r.iter_content(chunk_size=1024): # 👈 HERE
if chunk: # filter out keep-alive chunks
f.write(chunk)
✨ That chunk_size=1024
controls how much data gets read in each loop - which directly influences how fast and how often we write to disk. You can slow it down manually to throttle file writes (e.g., for poor disks or cautious writes).
👉 This means:
Read up to 1024 bytes from the HTTP stream and give them to me as
chunk
each time.
💡 Unit = bytes So:
chunk_size=1
→ reads 1 byte at a timechunk_size=1024
→ reads 1024 bytes = 1 KB at a timechunk_size=1048576
→ 1 MB per read
✅ Just like a buffer size. We’re choosing how big each bite of the data is before we swallow it down into the file 🍽️
🧠 ② Emulating Slow Writing: Manual Throttling
You can also control speed by combining .write()
with a time delay:
1
2
3
4
5
6
import time
with open("output.txt", "wb") as f:
for chunk in chunks: # assume each chunk is bytes
f.write(chunk)
time.sleep(0.1) # ⏳ delay between writes
You can even dynamically adjust that sleep if you’re making a tool that mimics streamed downloads, network latency, or file corruption tools.
💿 ③ Slowing or chunking file saves in other APIs (e.g., shutil
, aiofiles
)
Python has some wrappers for file copying too:
1
2
3
import shutil
shutil.copyfileobj(src, dst, length=1024) # 👈 control buffer size with `length`
That length=1024
is like your chunk_size
, regulating how much data flows at once.
🧠 Q: In C, Can I fread(&b, sizeof(b), 4, src)
?
1
2
BYTE b; // just 1 byte of space
fread(&b, sizeof(BYTE), 4, src); // ❌ you're reading 4 bytes into 1 byte
Let’s assume:
sizeof(b)
= 1 (i.e.,b
is a BYTE =uint8_t
)4
= count
So this means:
“Read 4 items, each 1 byte, total = 4 bytes into
&b
”
BUT❗
If b
is just a single byte, you’re gonna get undefined behavior 💀 because you’re telling C:
“Read 4 bytes into a memory space only big enough for 1 byte 😵”
✅ Correct Form to Read 4 Bytes at Once in C:
Must predefine a large enough array. You’d need a buffer array like this:
1
2
3
4
5
BYTE buffer[4]; // buffer has space for 4 bytes
while (fread(buffer, sizeof(BYTE), 4, src) != 0)
{
fwrite(buffer, sizeof(BYTE), 4, dst);
}
This is safe and beautiful. You’re reading:
- 4 items
- each 1 byte
- total 4 bytes
- into a buffer that has room for 4 bytes
🧠 buffer
is 4 bytes big → perfect match!
buffer
size must be ≥ total bytes you read.
Each loop reads 4 bytes at once → just like chunk_size=4
in Python.
😼 TL;DR
- Python’s
chunk_size=1024
means read 1024 bytes per loop ✅ In C, to do the same, you’d use
fread(buffer, 1, 1024, src)
- But remember to declare:
BYTE buffer[1024];
- But remember to declare:
- Units are the same in both: bytes (not bits, not integers-just raw memory bytes)
- If you mismatch buffer size and read size in C? Boom 💥 undefined behavior.
🌟Can We Use Smaller or bigger-than-byte Units, Like Bits, Kilobytes, Megabytes?
🧠 Can We Use Units Smaller Than a Byte? (Like BIT
?)
Nope, not directly 😢 - RAM and CPUs only work with bytes as the minimum addressable unit.
So:
1
BIT buffer[4]; // ❌ INVALID unless you define BIT yourself as a byte type
There’s no built-in BIT
in C, because:
- You can’t actually access a single bit in memory on its own.
- The smallest chunk memory gives you is 1 byte (8 bits).
BUT! If you want to simulate bit-by-bit storage, you have options like:
1
2
3
4
5
6
typedef struct {
unsigned int bit0 : 1;
unsigned int bit1 : 1;
unsigned int bit2 : 1;
unsigned int bit3 : 1;
} Bits4;
💡 This uses bit fields - but even then, the whole thing is stored in bytes under the hood.
✅ Can We Use Larger Units (KB, MB) in C?
YES! But manually:
There are no automatic KB
/MB
units in C, but you can create byte arrays of that size:
1
2
3
4
5
6
#define KB 1024
#define MB (1024 * KB)
BYTE buffer[KB]; // 1 KB buffer (1024 bytes)
BYTE mega[MB]; // 1 MB buffer (1048576 bytes)
// 👆 You can also name it `BYTE buffer[MB];` if you like
You’re still working in bytes, but now you’re defining named constants to make code human-readable.
😽 Real Examples:
1
2
3
4
5
6
#define KB 1024
typedef uint8_t BYTE;
BYTE buffer[KB]; // 1 kilobyte buffer
fread(buffer, sizeof(BYTE), KB, src); // read 1KB at once
or:
1
BYTE buffer[512]; // 512 bytes (like a disk sector)
💡 Why do we use &b
but not &buffer
?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// Copies a file
#include <stdio.h>
#include <stdint.h>
typedef uint8_t BYTE;
int main(int argc, char *argv[])
{
FILE *src = fopen(argv[1], "rb");
FILE *dst = fopen(argv[2], "wb");
BYTE b;
while (fread(&b, sizeof(b), 1, src) != 0)
{
fwrite(&b, sizeof(b), 1, dst);
}
fclose(dst);
fclose(src);
}
📍 Case 1: BYTE b;
➤ use &b
1
2
BYTE b; // Just ONE byte (a value)
fread(&b, sizeof(b), 1, src);
b
is a single variable, not a pointer.
To give its memory address tofread
, you must use&b
Becausefread()
needs a pointer - a place to store what it reads.
🧠 You’re saying:
“Here’s the address of my 1-byte variable - write into it.”
📦 Case 2: BYTE buffer[4];
➤ use buffer
(not &buffer
)
1
2
BYTE buffer[4]; // An array = already a pointer to first element
fread(buffer, sizeof(BYTE), 4, src);
buffer
is an array, and in C, an array decays into a pointer to its first element.
So buffer
is already equivalent to &buffer[0]
→ no &
needed!
🧠 You’re saying:
“Here’s the address of my 4-byte array - you can start writing from here.”
Pset - Volume
💡 Why does buffer *= factor
adjust the volume?
🎧 WAV audio samples are just numbers.
In .wav
files:
- The actual audio data (after the 44-byte header) is a sequence of samples.
Each sample is usually stored as a signed 16-bit integer =
int16_t
- Value ranges from
-32768
to+32767
- Value ranges from
- The louder the sound, the farther the value is from 0.
So:
1
buffer *= factor;
Means:
🎚️ “Multiply the current sample’s amplitude by some factor (e.g., 0.5 or 2.0).”
- If
factor = 0.5
→ volume goes down (quieter) - If
factor = 2.0
→ volume goes up (louder)
✨ You’re scaling the wave. Just like stretching or squishing a sound wave on a visualizer.
🧠 Visualization:
Let’s say you read a sample:
1
2
3
Original: buffer = 10000
factor = 1.5
buffer = 10000 * 1.5 = 15000
It gets louder 🎶 (But warning: too high, and it could overflow beyond 32767 = audio distortion)
Understand fread
& EOF
✅ If You’re Only Reading the Header Once (like this):
1
2
fread(header, HEADER_SIZE, 1, input);
fwrite(header, HEADER_SIZE, 1, output);
Then you don’t need a loop or an EOF check.
Why?
Because you already know the header is
44 bytes
. And you’re just saying: “Read me exactly 1 block that’s 44 bytes.”
🎯 So either:
- It succeeds → all good ✅
- Or it fails (e.g., input file too small, corrupted) → then
fread()
returns less than 1 (likely 0) ❌
If you wanna be a perfectionist, you can always do:
1
2
3
4
5
if (fread(header, HEADER_SIZE, 1, input) != 1)
{
printf("Failed to read WAV header\n");
return 1;
}
But no EOF loop needed if you’re reading a known number of bytes just once.
Clamp to 255
1. Use fminf()
(for float
, from <math.h>
)
1
2
3
#include <math.h>
float sepiaRed = fminf(roundf(...), 255.0);
💡 This is clean and safe! Just don’t forget to #include <math.h>
2. Use Ternary Clamp (? :
conditional operator)
1
image[i][j].rgbtRed = red > 255 ? 255 : red;
💡 Direct and expressive - fast to write, easy to read. Common in C!
👣 How to perform a swap
directly inside main
without using a function?
✨ To do this logic:
1
2
3
int x = 1;
int y = 2;
// swap(&x, &y);
But without using swap()
as a separate function. Totally doable. But we gotta avoid touching pointers incorrectly.
❌ This is not valid:
1
2
3
int tmp = &x; // ❌ wrong: &x is a memory address, not an int
&x = &y; // ❌ double wrong: lvalue of &x can't be assigned
&y = tmp; // ❌ not assignable
That’s like trying to assign to someone’s home address directly instead of going into the house and changing the contents 🏠
✅ The correct way to do a swap
inline in main
is simple:
1
2
3
4
5
6
int x = 1;
int y = 2;
int tmp = x;
x = y;
y = tmp;
That’s it!
This is the exact same logic as:
1
2
3
4
5
6
void swap(int *a, int *b)
{
int tmp = *a;
*a = *b;
*b = tmp;
}
Just without the indirection (*a
, *b
) because you’re working with the actual variables directly, not through pointers.
💡 So here’s the full thing:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#include <stdio.h>
int main(void)
{
int x = 1;
int y = 2;
printf("x is %i, y is %i\n", x, y);
// Swap directly here!
int tmp = x;
x = y;
y = tmp;
printf("x is %i, y is %i\n", x, y);
}
👀 So when do we need pointers?
Only when:
- You’re passing variables into a function and want to modify the originals (not copies)
- Or you’re dealing with arrays, strings, or dynamic memory
In main
, where x
and y
are already accessible, you don’t need &
or *
to swap them.
Why are all the valid fourth bytes from 0xe0
to 0xef
, and how does that mean the first four bits are 1110
?
🧠 First: What’s a Byte in Binary?
Every hexadecimal number (like 0xe0
) represents 8 bits (aka 1 byte).
So:
0xe0
= 1110 00000xe1
= 1110 00010xe2
= 1110 0010- …
0xef
= 1110 1111
Can you see it now? All of these share the first 4 bits: 1110
💡 So what does it mean when we say:
“The fourth byte’s first four bits are
1110
”?
It means:
- The byte is somewhere between
0xe0
and0xef
(hex range) - Because all those bytes start with the binary prefix
1110
- The last 4 bits (called the “low nibble”) can be anything from
0000
to1111
In other words:
1
if ((byte & 0xf0) == 0xe0) // this checks the upper 4 bits
💥 That’s how you detect “fourth byte starts with 1110” - it’s a bitmask trick!
🧠 What it means - softly and clearly:
1
buffer[3] & 0xf0
We’re saying:
“I only care about the first 4 bits (the left ones) - just give me those.”
The last 4 bits (right ones) will be ignored, because we’re masking them out.
So for example:
buffer[3] | Binary | (buffer[3] & 0xf0) | Result | Match? |
---|---|---|---|---|
0xe0 | 11100000 | 11100000 | 0xe0 | ✅ |
0xe1 | 11100001 | 11100000 | 0xe0 | ✅ |
0xef | 11101111 | 11100000 | 0xe0 | ✅ |
0xea | 11101010 | 11100000 | 0xe0 | ✅ |
0xd8 | 11011000 | 11010000 | 0xd0 | ❌ |
This
== 0xe0
condition only checks the first 4 bits!
It says: “Is this byte part of the0xe0
to0xef
range?”
💬 Does it ignore the last 4 bits?
✅ YES. The last 4 bits can be:
0000
=0x0
0001
=0x1
- …
1111
=0xf
And ALL of them will still satisfy (buffer[3] & 0xf0) == 0xe0
.
You’re only comparing the upper nibble - the “high 4 bits”.
🔍 Why do JPEGs have this pattern?
Because it’s part of the JFIF standard, where image segments (like headers, metadata, compressed data, etc.) start with marker bytes.
- JPEG start-of-image (SOI) marker:
0xFF 0xD8
- Followed by a marker segment, usually starting with
0xFF
again, then one of those0xE0
-0xEF
codes
That’s how JPEG decoders know:
“Yep, this is a valid segment of a JPEG file!”
Pset Recovery
💡 Why do we “add blocks to the current JPEG”?
It’s because a JPEG file is way bigger than 512 bytes.
Let’s say one photo is 1MB. That means:
1MB = 1024 * 1024 = 1,048,576 bytes
So… that photo takes up 1,048,576 ÷ 512 = 2048 blocks on the memory card!
So when we’re recovering it, we don’t just copy the first block - we need to copy all 2048 of those 512-byte blocks into a single .jpg
file 💡
🧃 Imagine this:
📸 One photo = many 512-byte pieces
- 📦 Block 0: JPEG start
- 📦 Block 1: more photo data
- 📦 Block 2: even more photo data
- …
- 📦 Block 2047: last part of the photo
You want to copy all these blocks into one JPEG file on your computer.
That’s what we mean by:
“Adding blocks to the current JPEG”
🧠 So let’s say:
“A JPEG starts with a 512-byte block that has a signature.”
BUT - the JPEG continues into more blocks after that.
And we don’t know exactly how many blocks it takes.
So what do we do?
💡 Answer: We keep writing every block to the same file…
…until we find the start of the next JPEG.
💾 Imagine this structure on the memory card:
1
2
3
4
5
6
7
8
Block 0 → JPEG #1 start (0xff 0xd8 0xff ...)
Block 1 → continuation of JPEG #1
Block 2 → continuation of JPEG #1
Block 3 → start of JPEG #2! (new signature)
Block 4 → continuation of JPEG #2
Block 5 → continuation...
Block 6 → JPEG #3 start
...
Your job as the JPEG-saver
- 👀 See block 0? It’s a JPEG start → open
000.jpg
and start writing - 👀 Block 1? No signature → keep writing into
000.jpg
- 👀 Block 2? No signature → still writing…
- 👀 Block 3? New JPEG start! → close
000.jpg
, open001.jpg
, start writing - Repeat 🌀