<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://guozijn.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://guozijn.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-04-15T08:41:43+09:30</updated><id>https://guozijn.github.io/feed.xml</id><title type="html">Tinkerer</title><subtitle></subtitle><entry><title type="html">Transformer</title><link href="https://guozijn.github.io/engineering/2025/10/15/transformer.html" rel="alternate" type="text/html" title="Transformer" /><published>2025-10-15T00:00:00+10:30</published><updated>2025-10-15T00:00:00+10:30</updated><id>https://guozijn.github.io/engineering/2025/10/15/transformer</id><content type="html" xml:base="https://guozijn.github.io/engineering/2025/10/15/transformer.html"><![CDATA[<h2 id="transformer-core-concepts">Transformer Core Concepts</h2>

<h3 id="from-tokens-to-embeddings">From Tokens to Embeddings</h3>

<p>Raw tokens are first mapped to dense vectors through an embedding matrix so that the model can work in a continuous space. The embedding size (<code class="language-plaintext highlighter-rouge">n_embd</code>) defines the dimensionality of this space and controls both the model capacity and its memory footprint.</p>
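<p>As a minimal sketch of this lookup (the sizes here are illustrative, not taken from the post), the embedding table is a single <code class="language-plaintext highlighter-rouge">nn.Embedding</code> call:</p>

```python
import torch
import torch.nn as nn

# Toy sizes chosen for illustration only.
vocab_size, n_embd = 100, 192

embed = nn.Embedding(vocab_size, n_embd)  # one learnable row per token id
tokens = torch.tensor([[5, 42, 7]])       # batch of one 3-token sequence
vectors = embed(tokens)                   # look up each id's dense vector
print(vectors.shape)                      # torch.Size([1, 3, 192])
```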

<h3 id="positional-information">Positional Information</h3>

<p>Because self-attention without positional information is permutation-equivariant, Transformers inject order information with positional encodings. Classical sinusoidal encodings can be evaluated at positions beyond the training length, while learnable embeddings let the model adapt positions during training but are fixed to the learned context range. Modern variants often rely on relative position encodings or rotary embeddings to better capture long-context interactions.</p>
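<p>For concreteness, the sinusoidal variant can be sketched as follows; this is the standard construction, not code from this post:</p>

```python
import math
import torch

def sinusoidal_encoding(max_len: int, n_embd: int) -> torch.Tensor:
    """Fixed sin/cos encodings; each dimension oscillates at its own frequency."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # Frequencies decay geometrically across the even dimensions.
    div = torch.exp(-math.log(10000.0) * torch.arange(0, n_embd, 2).float() / n_embd)
    pe = torch.zeros(max_len, n_embd)
    pe[:, 0::2] = torch.sin(pos * div)  # even dims: sine
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims: cosine
    return pe

pe = sinusoidal_encoding(192, 64)
```

Because the table is computed, not learned, it can be evaluated at any position, which is the extrapolation property mentioned above.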

<h3 id="scaled-dot-product-self-attention">Scaled Dot-Product Self-Attention</h3>

<p>For each token, the model projects embeddings into queries (Q), keys (K), and values (V). Attention weights are computed as <code class="language-plaintext highlighter-rouge">softmax(QKᵀ / sqrt(d_k))</code>, where <code class="language-plaintext highlighter-rouge">d_k</code> is the head dimension; dividing by <code class="language-plaintext highlighter-rouge">sqrt(d_k)</code> keeps large dot products from saturating the softmax. The output is a weighted sum of the value vectors. Encoder attention can gather information from the entire context window (<code class="language-plaintext highlighter-rouge">block_size</code>), while decoder-only language models use a causal mask so each position only attends to itself and earlier positions.</p>
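<p>A minimal single-head version of this computation, with toy dimensions assumed for illustration:</p>

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    """softmax(QK^T / sqrt(d_k)) V for a single head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (T, T) similarities
    if causal:
        T = scores.size(-1)
        future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # hide later positions
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

q = k = v = torch.randn(4, 8)  # toy sequence: 4 tokens, head dim 8
out, w = attention(q, k, v, causal=True)
```

With <code class="language-plaintext highlighter-rouge">causal=True</code>, position 0 assigns zero weight to positions 1–3, which is exactly the decoder-only masking described above.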

<h3 id="multi-head-attention">Multi-Head Attention</h3>

<p>Multiple attention heads run in parallel on different learned projections of the same sequence. This design allows the model to capture heterogeneous relationships (syntax, long-range dependencies, coreference) in the same layer. The concatenated head outputs are linearly projected back into the model dimension.</p>
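<p>The head split is just a reshape of the projected tensor; a sketch using the post's sizes (a 192-dim model split into 6 heads of 32):</p>

```python
import torch

# Toy batch: 2 sequences of 5 tokens, model dim 192, 6 heads of 32 each.
B, T, n_embd, n_head = 2, 5, 192, 6
head_dim = n_embd // n_head

x = torch.randn(B, T, n_embd)
# Split the channel dim so each head attends over its own 32-dim slice.
heads = x.view(B, T, n_head, head_dim).transpose(1, 2)  # (B, n_head, T, head_dim)
# ...per-head attention would run here, in parallel over the head axis...
# Concatenate the heads back into the model dimension.
merged = heads.transpose(1, 2).contiguous().view(B, T, n_embd)
```

In a full block, <code class="language-plaintext highlighter-rouge">merged</code> would then pass through the output projection back into the model dimension.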

<h3 id="position-wise-feed-forward-network">Position-Wise Feed-Forward Network</h3>

<p>Each Transformer block follows attention with a two-layer feed-forward network applied independently to every position. A typical configuration is <code class="language-plaintext highlighter-rouge">Linear(n_embd → 4 × n_embd)</code>, an activation (GELU or ReLU), then <code class="language-plaintext highlighter-rouge">Linear(4 × n_embd → n_embd)</code>. This component mixes features learned by attention and introduces non-linearity.</p>
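<p>A sketch of this sublayer as a standalone module (the class name and dropout default are illustrative):</p>

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: expand 4x, activate, project back down."""
    def __init__(self, n_embd: int, dropout: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # the same weights apply independently at every position

ffn = FeedForward(192)
y = ffn(torch.randn(2, 5, 192))
```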

<h3 id="residual-connections-and-normalisation">Residual Connections and Normalisation</h3>

<p>Skip connections wrap both the attention sublayer and the feed-forward sublayer so that gradients flow directly to earlier blocks. LayerNorm (or RMSNorm in some modern designs) keeps activations well-scaled during training. Variants such as Pre-LN place the normalisation before each sublayer, which improves stability for deeper models.</p>
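<p>The Pre-LN ordering can be sketched with stock PyTorch modules; this is a simplified illustration, not the exact block used in the skeleton below:</p>

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN ordering: normalise first, transform, then add the residual."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.ln2(x))                      # residual around the MLP
        return x

block = PreLNBlock(192, 6)
out = block(torch.randn(2, 5, 192))
```

Because the residual path is never normalised, gradients reach early layers unchanged, which is why Pre-LN tends to train more stably at depth.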

<h3 id="encoder-decoder-vs-decoder-only">Encoder-Decoder vs. Decoder-Only</h3>

<p>The original Transformer pairs an encoder that builds contextualised representations with a decoder that performs autoregressive generation, both stacked with attention and feed-forward modules. Many language models today use only the decoder stack with causal masking, which enforces that each token can only attend to previous positions, enabling left-to-right generation.</p>

<h3 id="training-and-scaling-considerations">Training and Scaling Considerations</h3>

<ul>
  <li><strong>Optimiser choice</strong>: AdamW remains the default, but large models may benefit from learning rate warm-up, cosine decay, and parameter-specific weight decay.</li>
  <li><strong>Regularisation</strong>: Dropout complements attention masking, while techniques such as label smoothing or stochastic depth can help deep stacks converge.</li>
  <li><strong>Precision and compilation</strong>: Training in mixed precision (<code class="language-plaintext highlighter-rouge">bfloat16</code>/<code class="language-plaintext highlighter-rouge">fp16</code>) and enabling compiler optimisations (<code class="language-plaintext highlighter-rouge">torch.compile</code>) significantly reduce memory use and speed up training.</li>
  <li><strong>Scaling laws</strong>: Empirically, model performance improves predictably with more data, parameters, and compute, guiding decisions about <code class="language-plaintext highlighter-rouge">n_layer</code>, <code class="language-plaintext highlighter-rouge">n_head</code>, and dataset size.</li>
</ul>
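<p>The warm-up plus cosine decay mentioned above can be sketched as a small schedule function; the defaults mirror the sample configuration values used elsewhere in these notes:</p>

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup_iters=100, max_iters=300):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup_iters:
        return max_lr * (step + 1) / warmup_iters  # ramp up linearly
    progress = (step - warmup_iters) / max(1, max_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# In a training loop, each param group's lr would be set to lr_at(step)
# before calling optimizer.step().
```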

<h3 id="inference-time-generation">Inference-Time Generation</h3>

<p>During autoregressive generation, the model caches key-value pairs to avoid recomputing attention for past tokens. Sampling strategies such as temperature, top-k, nucleus sampling, and contrastive decoding trade off creativity against determinism. For instruction-following models, alignment training such as RLHF or DPO shapes the model's behaviour before inference, while decoding settings control each generated response.</p>
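<p>A sketch of temperature plus top-k sampling over the final-position logits (a common recipe; the vocabulary size here is arbitrary):</p>

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=50):
    """Pick the next token id from logits of shape (batch, vocab)."""
    logits = logits / temperature  # <1 sharpens the distribution, >1 flattens it
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[..., -1, None]] = float("-inf")  # drop the long tail
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # sample one id per row

next_id = sample_next(torch.randn(1, 100))  # toy 100-token vocabulary
```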

<h3 id="pytorch-skeleton">PyTorch Skeleton</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>


<span class="k">class</span> <span class="nc">TransformerLM</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">vocab_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">block_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">192</span><span class="p">,</span>
        <span class="n">n_embd</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">192</span><span class="p">,</span>
        <span class="n">n_layer</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span>
        <span class="n">n_head</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">6</span><span class="p">,</span>
        <span class="n">dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">,</span>
        <span class="n">tie_weights</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">block_size</span> <span class="o">=</span> <span class="n">block_size</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">token_embed</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">n_embd</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">pos_embed</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">block_size</span><span class="p">,</span> <span class="n">n_embd</span><span class="p">))</span>
        <span class="n">encoder_layer</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">TransformerEncoderLayer</span><span class="p">(</span>
            <span class="n">d_model</span><span class="o">=</span><span class="n">n_embd</span><span class="p">,</span>
            <span class="n">nhead</span><span class="o">=</span><span class="n">n_head</span><span class="p">,</span>
            <span class="n">dim_feedforward</span><span class="o">=</span><span class="mi">4</span> <span class="o">*</span> <span class="n">n_embd</span><span class="p">,</span>
            <span class="n">dropout</span><span class="o">=</span><span class="n">dropout</span><span class="p">,</span>
            <span class="n">activation</span><span class="o">=</span><span class="s">"gelu"</span><span class="p">,</span>
            <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">TransformerEncoder</span><span class="p">(</span><span class="n">encoder_layer</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="n">n_layer</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">norm</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">n_embd</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lm_head</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_embd</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">tie_weights</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">lm_head</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">token_embed</span><span class="p">.</span><span class="n">weight</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">idx</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">idx</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">block_size</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Sequence length exceeds block size."</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">token_embed</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">pos_embed</span><span class="p">[:,</span> <span class="p">:</span> <span class="n">idx</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)]</span>
        <span class="n">seq_len</span> <span class="o">=</span> <span class="n">idx</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">causal_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">triu</span><span class="p">(</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">full</span><span class="p">((</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">),</span> <span class="nb">float</span><span class="p">(</span><span class="s">"-inf"</span><span class="p">),</span> <span class="n">device</span><span class="o">=</span><span class="n">idx</span><span class="p">.</span><span class="n">device</span><span class="p">),</span>
            <span class="n">diagonal</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">causal_mask</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">lm_head</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">training_step</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">scaler</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
    <span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span> <span class="o">=</span> <span class="n">batch</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">(</span><span class="n">set_to_none</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">amp</span><span class="p">.</span><span class="n">autocast</span><span class="p">(</span><span class="n">enabled</span><span class="o">=</span><span class="n">scaler</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">):</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">cross_entropy</span><span class="p">(</span>
            <span class="n">logits</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">logits</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)),</span>
            <span class="n">targets</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span>
        <span class="p">)</span>
    <span class="k">if</span> <span class="n">scaler</span><span class="p">:</span>
        <span class="n">scaler</span><span class="p">.</span><span class="n">scale</span><span class="p">(</span><span class="n">loss</span><span class="p">).</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">scaler</span><span class="p">.</span><span class="n">unscale_</span><span class="p">(</span><span class="n">optimizer</span><span class="p">)</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="mf">1.0</span><span class="p">)</span>
        <span class="n">scaler</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">optimizer</span><span class="p">)</span>
        <span class="n">scaler</span><span class="p">.</span><span class="n">update</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="mf">1.0</span><span class="p">)</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="hyperparameters">Hyperparameters</h2>
<h3 id="minimal-viable-training-config">Minimal Viable Training Config</h3>

<table>
  <thead>
    <tr>
      <th><strong>Parameter</strong></th>
      <th><strong>Sample Value</strong></th>
      <th><strong>Meaning</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">batch_size</code></td>
      <td><code class="language-plaintext highlighter-rouge">48</code></td>
      <td>Samples per optimisation step</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">block_size</code></td>
      <td><code class="language-plaintext highlighter-rouge">192</code></td>
      <td>Context window length</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">max_iters</code></td>
      <td><code class="language-plaintext highlighter-rouge">300</code></td>
      <td>Maximum number of optimisation steps</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">learning_rate</code></td>
      <td><code class="language-plaintext highlighter-rouge">3e-4</code></td>
      <td>Optimiser step size</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">n_embd</code></td>
      <td><code class="language-plaintext highlighter-rouge">192</code></td>
      <td>Transformer embedding dimension</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">n_head</code></td>
      <td><code class="language-plaintext highlighter-rouge">6</code></td>
      <td>Number of attention heads</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">n_layer</code></td>
      <td><code class="language-plaintext highlighter-rouge">3</code></td>
      <td>Number of Transformer layers</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dropout</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.2</code></td>
      <td>Regularisation probability</td>
    </tr>
  </tbody>
</table>
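<p>Collected as one object, the sample values above might look like this (a hypothetical <code class="language-plaintext highlighter-rouge">TrainConfig</code> dataclass, not code from the post):</p>

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Minimal viable training config, mirroring the table above."""
    batch_size: int = 48
    block_size: int = 192
    max_iters: int = 300
    learning_rate: float = 3e-4
    n_embd: int = 192
    n_head: int = 6
    n_layer: int = 3
    dropout: float = 0.2

cfg = TrainConfig()
# Sanity check: the heads must evenly divide the embedding dimension.
assert cfg.n_embd % cfg.n_head == 0
```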

<h3 id="full-training-configuration">Full Training Configuration</h3>

<table>
  <thead>
    <tr>
      <th><strong>Category</strong></th>
      <th><strong>Parameter</strong></th>
      <th><strong>Sample value</strong></th>
      <th><strong>Meaning</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Data</strong></td>
      <td><code class="language-plaintext highlighter-rouge">block_size</code></td>
      <td><code class="language-plaintext highlighter-rouge">192</code></td>
      <td>Context window length</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">vocab_size</code></td>
      <td><em>(auto from tokenizer)</em></td>
      <td>Number of tokens in the vocabulary</td>
    </tr>
    <tr>
      <td><strong>Model</strong></td>
      <td><code class="language-plaintext highlighter-rouge">n_embd</code></td>
      <td><code class="language-plaintext highlighter-rouge">192</code></td>
      <td>Embedding dimension</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">n_head</code></td>
      <td><code class="language-plaintext highlighter-rouge">6</code></td>
      <td>Number of attention heads</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">n_layer</code></td>
      <td><code class="language-plaintext highlighter-rouge">3</code></td>
      <td>Transformer depth</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">dropout</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.2</code></td>
      <td>Dropout probability</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">tie_weights</code></td>
      <td><code class="language-plaintext highlighter-rouge">True</code></td>
      <td>Share token embedding and output projection weights</td>
    </tr>
    <tr>
      <td><strong>Training Loop</strong></td>
      <td><code class="language-plaintext highlighter-rouge">batch_size</code></td>
      <td><code class="language-plaintext highlighter-rouge">48</code></td>
      <td>Number of samples per update</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">max_iters</code></td>
      <td><code class="language-plaintext highlighter-rouge">300</code></td>
      <td>Total training iterations</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">grad_clip</code></td>
      <td><code class="language-plaintext highlighter-rouge">1.0</code></td>
      <td>Gradient norm clipping</td>
    </tr>
    <tr>
      <td><strong>Optimiser</strong></td>
      <td><code class="language-plaintext highlighter-rouge">learning_rate</code></td>
      <td><code class="language-plaintext highlighter-rouge">3e-4</code></td>
      <td>Base learning rate</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">weight_decay</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.1</code></td>
      <td>AdamW weight decay</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">betas</code></td>
      <td><code class="language-plaintext highlighter-rouge">(0.9, 0.95)</code></td>
      <td>AdamW momentum coefficients</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">eps</code></td>
      <td><code class="language-plaintext highlighter-rouge">1e-8</code></td>
      <td>AdamW epsilon</td>
    </tr>
    <tr>
      <td><strong>LR Scheduler</strong></td>
      <td><code class="language-plaintext highlighter-rouge">lr_decay</code></td>
      <td><code class="language-plaintext highlighter-rouge">True</code></td>
      <td>Enable learning rate decay</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">warmup_iters</code></td>
      <td><code class="language-plaintext highlighter-rouge">100</code></td>
      <td>Warm-up steps before decay</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">min_lr</code></td>
      <td><code class="language-plaintext highlighter-rouge">1e-5</code></td>
      <td>Final learning rate after decay</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">scheduler_type</code></td>
      <td><code class="language-plaintext highlighter-rouge">"cosine"</code></td>
      <td>Scheduler function</td>
    </tr>
    <tr>
      <td><strong>Precision / Hardware</strong></td>
      <td><code class="language-plaintext highlighter-rouge">device</code></td>
      <td><code class="language-plaintext highlighter-rouge">"cuda"</code></td>
      <td>Compute device</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">dtype</code></td>
      <td><code class="language-plaintext highlighter-rouge">"bfloat16"</code></td>
      <td>Precision mode</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">compile</code></td>
      <td><code class="language-plaintext highlighter-rouge">True</code></td>
      <td>Enable Torch 2.x compile optimisation</td>
    </tr>
    <tr>
      <td><strong>Validation / Early Stop</strong></td>
      <td><code class="language-plaintext highlighter-rouge">eval_interval</code></td>
      <td><code class="language-plaintext highlighter-rouge">100</code></td>
      <td>Evaluation frequency</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">eval_iters</code></td>
      <td><code class="language-plaintext highlighter-rouge">20</code></td>
      <td>Mini-batches used for validation loss estimation</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">patience</code></td>
      <td><code class="language-plaintext highlighter-rouge">6</code></td>
      <td>Early stopping patience</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">min_delta</code></td>
      <td><code class="language-plaintext highlighter-rouge">1e-3</code></td>
      <td>Minimum improvement threshold</td>
    </tr>
    <tr>
      <td><strong>Checkpoint / Logging</strong></td>
      <td><code class="language-plaintext highlighter-rouge">save_interval</code></td>
      <td><code class="language-plaintext highlighter-rouge">100</code></td>
      <td>Model checkpoint interval</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">log_interval</code></td>
      <td><code class="language-plaintext highlighter-rouge">50</code></td>
      <td>Logging interval</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">wandb_project</code></td>
      <td><code class="language-plaintext highlighter-rouge">"gpt-debug"</code></td>
      <td>Optional logging project name</td>
    </tr>
    <tr>
      <td><strong>Generation</strong></td>
      <td><code class="language-plaintext highlighter-rouge">temperature</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.8</code></td>
      <td>Softmax temperature for sampling</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">top_k</code></td>
      <td><code class="language-plaintext highlighter-rouge">50</code></td>
      <td>Top-K sampling</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">top_p</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.95</code></td>
      <td>Nucleus sampling</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">max_new_tokens</code></td>
      <td><code class="language-plaintext highlighter-rouge">200</code></td>
      <td>Maximum number of new tokens to generate</td>
    </tr>
  </tbody>
</table>]]></content><author><name></name></author><category term="Engineering" /><category term="machine learning" /><category term="transformer" /><summary type="html"><![CDATA[Transformer Core Concepts]]></summary></entry><entry><title type="html">Building the Hack Computer: Learning Notes</title><link href="https://guozijn.github.io/learning/2025/10/06/nand2tetris-notes.html" rel="alternate" type="text/html" title="Building the Hack Computer: Learning Notes" /><published>2025-10-06T00:00:00+10:30</published><updated>2025-10-06T00:00:00+10:30</updated><id>https://guozijn.github.io/learning/2025/10/06/nand2tetris-notes</id><content type="html" xml:base="https://guozijn.github.io/learning/2025/10/06/nand2tetris-notes.html"><![CDATA[<h2 id="boolean-logic">Boolean Logic</h2>
<h3 id="de-morgans-law">De Morgan’s Law</h3>

\[\overline{AB} = \overline{A} + \overline{B}\]

\[\overline{A + B} = \overline{A}\,\overline{B}\]
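<p>Both identities can be checked exhaustively over the two-variable truth table:</p>

```python
from itertools import product

# Check each law on every row of the two-variable truth table.
rows = list(product([False, True], repeat=2))
law1 = all((not (a and b)) == ((not a) or (not b)) for a, b in rows)
law2 = all((not (a or b)) == ((not a) and (not b)) for a, b in rows)
print(law1, law2)  # True True
```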

<h2 id="boolean-arithmetic-combinational-logic">Boolean Arithmetic (Combinational Logic)</h2>
<h4 id="half-adder">Half Adder</h4>

<p><img src="https://images.zjguo.com/half-adder.png" alt="half-adder.png" /></p>

<h4 id="full-adder">Full Adder</h4>

<p><img src="https://images.zjguo.com/full-adder.png" alt="full-adder.png" /></p>

<h4 id="twos-complement">Two’s Complement</h4>
<h5 id="quick-flow">Quick Flow</h5>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>Decimal → (abs) → bin → invert → +1 → Two’s complement
Binary  → (MSB=1?) → invert → +1 → decimal → negative
</pre></td></tr></tbody></table></code></pre></div></div>
<h5 id="binary--decimal">Binary → Decimal</h5>
<p><strong>Concept</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre>if MSB = 0 → positive
     normal binary value

if MSB = 1 → negative
     invert bits → add 1 → decimal → add minus sign


Example:

    11101100₂
    
    invert → 00010011
    +1     → 00010100 = 20
    → -20₁₀
</pre></td></tr></tbody></table></code></pre></div></div>

<p><strong>Formula</strong></p>

\[\text{Range} = [-2^{n-1},\, 2^{n-1} - 1]\]

\[\text{Decimal} = -b_{n-1} \times 2^{n-1} + \sum_{i=0}^{n-2} b_i \times 2^i\]

<p><strong>Example via Formula</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>Directly convert a binary number to decimal using the formula:

    11101100₂
    = -1×128 + (1×64 + 1×32 + 0×16 + 1×8 + 1×4 + 0×2 + 0×1)
    = -128 + 108
    = -20₁₀
</pre></td></tr></tbody></table></code></pre></div></div>
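<p>The formula translates directly into code; a small Python sketch (the helper name <code class="language-plaintext highlighter-rouge">twos_to_decimal</code> is ours, not from the course):</p>

```python
def twos_to_decimal(bits: str) -> int:
    """Interpret a bit string as an n-bit two's-complement number."""
    n = len(bits)
    # The MSB carries weight -2^(n-1); the remaining bits have ordinary binary weights.
    value = -int(bits[0]) * 2 ** (n - 1)
    for i, b in enumerate(bits[1:], start=1):
        value += int(b) * 2 ** (n - 1 - i)
    return value

print(twos_to_decimal("11101100"))  # → -20
```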

<h5 id="decimal--binary">Decimal → Binary</h5>
<p><strong>Concept</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre>choose bit width (e.g., 8 bits)

if positive:
    normal binary → pad zeros

if negative:
    abs(decimal) → binary → invert → add 1

Example:

    -20

    20  → 00010100
    inv → 11101011
    +1  → 11101100
</pre></td></tr></tbody></table></code></pre></div></div>
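<p>The invert-and-add-1 flow can be sketched in Python; masking with <code class="language-plaintext highlighter-rouge">2^width - 1</code> performs the inversion and the +1 in one step (helper name is ours):</p>

```python
def decimal_to_twos(value: int, width: int = 8) -> str:
    """Encode a decimal integer as a two's-complement bit string."""
    lo, hi = -(2 ** (width - 1)), 2 ** (width - 1) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {width} bits")
    # For negative values, value & (2^width - 1) equals 2^width + value,
    # which is exactly invert-then-add-1 applied to abs(value).
    return format(value & (2 ** width - 1), f"0{width}b")

print(decimal_to_twos(-20))  # → 11101100
print(decimal_to_twos(20))   # → 00010100
```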

<p><strong>Example: Positive via Long Division</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre>Example, 20₁₀, target width 8 bits

          ┌─────────── remainder
      2 ) 20           r0
          10           r0
           5           r1
           2           r0
           1           r1
           0  stop

remainders bottom to top, 1 0 1 0 0  →  00010100
</pre></td></tr></tbody></table></code></pre></div></div>
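<p>The long-division procedure maps onto a short loop: collect remainders, then read them bottom to top (i.e. reversed). A sketch, assuming a non-negative input:</p>

```python
def to_unsigned_binary(n: int, width: int = 8) -> str:
    """Convert a non-negative integer to binary via repeated division by 2."""
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)   # quotient continues the division, remainder is a bit
        remainders.append(str(r))
    bits = "".join(reversed(remainders)) or "0"  # remainders read bottom to top
    return bits.zfill(width)  # pad with leading zeros to the target width

print(to_unsigned_binary(20))  # → 00010100
```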

<p><strong>Example: Negative via Long Division</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="rouge-code"><pre>Example, -20₁₀, target width 8 bits

Step A, abs value long divide

          ┌─────────── remainder
      2 ) 20           r0
          10           r0
           5           r1
           2           r0
           1           r1
           0  stop

unsigned, 10100  → pad to width → 00010100

Step B, invert bits
00010100 → 11101011

Step C, add 1
11101011 + 1 → 11101100

Result, -20₁₀ → 11101100₂
</pre></td></tr></tbody></table></code></pre></div></div>

<p><strong>4-bit Table</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="rouge-code"><pre>Decimal | Two’s Complement
--------------------------
   7    | 0111
   6    | 0110
   5    | 0101
   4    | 0100
   3    | 0011
   2    | 0010
   1    | 0001
   0    | 0000
  -1    | 1111
  -2    | 1110
  -3    | 1101
  -4    | 1100
  -5    | 1011
  -6    | 1010
  -7    | 1001
  -8    | 1000
</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="sequential-logic">Sequential Logic</h2>

<h3 id="dff">DFF</h3>

<p><img src="https://images.zjguo.com/dff.png" alt="dff.png" /></p>

<h3 id="bit-1-bit-register">Bit (1-bit register)</h3>

<p><img src="https://images.zjguo.com/1-bit-register.png" alt="1-bit-register.png" /></p>

<h2 id="computer-architecture">Computer Architecture</h2>

<p><img src="https://images.zjguo.com/hack-computer.png" alt="hack-computer.png" /></p>

<h3 id="alu">ALU</h3>
<p><img src="https://images.zjguo.com/alu.png" alt="alu.png" /></p>

<h3 id="alu-notes">ALU Notes</h3>
<ul>
  <li>In two’s complement representation, the bitwise NOT operation satisfies \(!y = -y - 1\); equivalently, \(-y = !y + 1\), which is exactly the “invert, then add 1” negation flow above.</li>
</ul>
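<p>This identity can be confirmed with Python, whose integers behave as infinite-width two’s complement under bitwise NOT:</p>

```python
# In two's complement, flipping every bit of y yields -y - 1.
for y in range(-16, 16):
    assert ~y == -y - 1

# Consequently, negation is "invert, then add 1":
y = 20
assert -y == ~y + 1
print("verified: !y = -y - 1 for all tested values")
```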

<h2 id="assembly-language">Assembly Language</h2>
<h3 id="overview">Overview</h3>
<p>The Hack assembly language contains two instruction types: <strong>A-instruction</strong> and <strong>C-instruction</strong>, plus labels for symbolic addresses. Each instruction is 16 bits long.</p>

<h3 id="a-instruction">A-instruction</h3>
<p><strong>Form</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>@value
@symbol
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Loads a constant or address into the A register. The value also becomes the memory address for <code class="language-plaintext highlighter-rouge">M</code>.</p>

<p><strong>Example</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>@10
D=A
@counter
M=0
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="c-instruction">C-instruction</h3>
<p><img src="https://images.zjguo.com/c-instructions.png" alt="c-instructions.png" /></p>

<p><strong>Form</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>dest=comp;jump
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Performs computation and optionally stores or jumps.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">dest</code>: target (M, D, A, MD, AM, AD, AMD)</li>
  <li><code class="language-plaintext highlighter-rouge">comp</code>: computation (ALU operation)</li>
  <li><code class="language-plaintext highlighter-rouge">jump</code>: condition (JGT, JEQ, JGE, JLT, JNE, JLE, JMP)</li>
</ul>

<p><strong>Example</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>D=M
D;JGT
0;JMP
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="common-comp-values">Common comp Values</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>0, 1, -1
D, A, M, !D, !A, !M, -D, -A, -M
D+1, A+1, M+1, D-1, A-1, M-1
D+A, D+M, D-A, D-M, A-D, M-D
D&amp;A, D&amp;M, D|A, D|M
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="labels-and-symbols">Labels and Symbols</h3>
<p><strong>Label Declaration</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>(LOOP)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Marks a location. The label’s value is the address of the next instruction.</p>

<p><strong>Predefined Symbols</strong></p>

<table>
  <thead>
    <tr>
      <th>Symbol</th>
      <th>Address</th>
      <th>Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SP</strong></td>
      <td><code class="language-plaintext highlighter-rouge">0</code></td>
      <td>Stack pointer (top of the stack)</td>
    </tr>
    <tr>
      <td><strong>LCL</strong></td>
      <td><code class="language-plaintext highlighter-rouge">1</code></td>
      <td>Base address of the local segment</td>
    </tr>
    <tr>
      <td><strong>ARG</strong></td>
      <td><code class="language-plaintext highlighter-rouge">2</code></td>
      <td>Base address of the argument segment</td>
    </tr>
    <tr>
      <td><strong>THIS</strong></td>
      <td><code class="language-plaintext highlighter-rouge">3</code></td>
      <td>Base address of the this segment</td>
    </tr>
    <tr>
      <td><strong>THAT</strong></td>
      <td><code class="language-plaintext highlighter-rouge">4</code></td>
      <td>Base address of the that segment</td>
    </tr>
    <tr>
      <td><strong>R0–R15</strong></td>
      <td><code class="language-plaintext highlighter-rouge">0–15</code></td>
      <td>General purpose registers, aliases for the first 16 RAM addresses</td>
    </tr>
    <tr>
      <td><strong>temp (R5–R12)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">5–12</code></td>
      <td>Fixed temporary segment, used for intermediate storage</td>
    </tr>
    <tr>
      <td><strong>pointer (THIS/THAT)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">3–4</code></td>
      <td>Pointer segment that maps to <code class="language-plaintext highlighter-rouge">THIS</code> (0) and <code class="language-plaintext highlighter-rouge">THAT</code> (1)</td>
    </tr>
    <tr>
      <td><strong>static (FileName.index)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">16+</code></td>
      <td>Static variables unique to each <code class="language-plaintext highlighter-rouge">.vm</code> file, starting from RAM[16]</td>
    </tr>
    <tr>
      <td><strong>SCREEN</strong></td>
      <td><code class="language-plaintext highlighter-rouge">16384</code></td>
      <td>Base address of the screen memory (for display pixels)</td>
    </tr>
    <tr>
      <td><strong>KBD</strong></td>
      <td><code class="language-plaintext highlighter-rouge">24576</code></td>
      <td>Address of the keyboard memory-mapped register</td>
    </tr>
  </tbody>
</table>

<p>Variables (custom symbols) start from address 16.</p>

<h3 id="example-program-sum-12n">Example Program: Sum 1+2+…+n</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre>@i
M=0
@sum
M=0
(LOOP)
  @i
  D=M
  @n
  D=D-M
  @END
  D;JGT
  @i
  D=M
  @sum
  M=M+D
  @i
  M=M+1
  @LOOP
  0;JMP
(END)
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="quick-reference">Quick Reference</h3>
<ul>
  <li><code class="language-plaintext highlighter-rouge">@value</code>: load A</li>
  <li><code class="language-plaintext highlighter-rouge">dest=comp;jump</code>: compute and control flow</li>
  <li>Symbols: variables and labels</li>
  <li>Memory map: R0–R15, SCREEN(16384), KBD(24576)</li>
</ul>

<h3 id="asm-notes">ASM Notes</h3>
<ul>
  <li>First pass: Strip whitespace/comments, walk instructions to build the label table; each non-label command bumps the ROM address counter, while <code class="language-plaintext highlighter-rouge">(LABEL)</code> entries alias the next instruction line.</li>
  <li>Second pass: Revisit the cleaned instruction stream, resolve symbols (allocating RAM addresses from 16 upward for new variables), and emit the 16-bit Hack opcodes for every A- and C-instruction.</li>
</ul>
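<p>The two passes can be sketched in Python. This toy illustrates only the symbol-table bookkeeping (label resolution in pass one, RAM allocation from 16 upward in pass two); binary encoding of A- and C-instructions is omitted, and the function names are ours:</p>

```python
def build_symbol_table(lines):
    """First pass: map (LABEL) declarations to ROM addresses."""
    predefined = {"SP": 0, "LCL": 1, "ARG": 2, "THIS": 3, "THAT": 4,
                  "SCREEN": 16384, "KBD": 24576}
    predefined.update({f"R{i}": i for i in range(16)})
    table, rom_addr = dict(predefined), 0
    for line in lines:
        line = line.split("//")[0].strip()  # strip comments and whitespace
        if not line:
            continue
        if line.startswith("("):
            table[line.strip("()")] = rom_addr  # label aliases the next instruction
        else:
            rom_addr += 1                        # real instructions bump the counter
    return table

def resolve_symbols(lines, table):
    """Second pass: allocate RAM from 16 upward for new @variables."""
    next_ram, resolved = 16, []
    for line in lines:
        line = line.split("//")[0].strip()
        if not line or line.startswith("("):
            continue
        if line.startswith("@") and not line[1:].isdigit():
            sym = line[1:]
            if sym not in table:
                table[sym] = next_ram  # new variable → next free RAM slot
                next_ram += 1
            line = f"@{table[sym]}"
        resolved.append(line)
    return resolved

prog = ["@i", "M=0", "(LOOP)", "@i", "M=M+1", "@LOOP", "0;JMP"]
table = build_symbol_table(prog)
print(resolve_symbols(prog, table))  # @i → @16, @LOOP → @2
```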

<hr />

<blockquote>
  <p>Everything from here on sits at the software level; the sections above belong to the hardware level.</p>
</blockquote>

<h2 id="virtual-machine-language">Virtual Machine Language</h2>

<p>The VM language is a stack-based intermediate language that abstracts away hardware details. It describes computation using stack operations, memory access, branching, and function calls.</p>

<h3 id="stack-and-sp">Stack and SP</h3>
<p>The stack starts at address 256. The <code class="language-plaintext highlighter-rouge">SP</code> (Stack Pointer) always points to the next free slot.</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">push</code> writes to <code class="language-plaintext highlighter-rouge">*SP</code>, then <code class="language-plaintext highlighter-rouge">SP = SP + 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">pop</code> decrements <code class="language-plaintext highlighter-rouge">SP</code>, then reads from <code class="language-plaintext highlighter-rouge">*SP</code></li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre>// push D onto stack
@SP
A=M
M=D
@SP
M=M+1

// pop top of stack into D
@SP
AM=M-1
D=M
</pre></td></tr></tbody></table></code></pre></div></div>
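<p>The same push/pop contract can be modelled with a RAM array and an explicit SP cell, mirroring the assembly above (a simulation sketch, using the Hack conventions of SP at RAM[0] and the stack base at 256):</p>

```python
RAM = [0] * 32768
SP = 0          # RAM[0] holds the stack pointer, as on the real machine
RAM[SP] = 256   # stack base

def push(d):
    RAM[RAM[SP]] = d     # write to *SP ...
    RAM[SP] += 1         # ... then increment SP

def pop():
    RAM[SP] -= 1         # decrement SP first ...
    return RAM[RAM[SP]]  # ... then read *SP

push(7)
push(8)
print(pop(), pop())  # → 8 7
```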

<h3 id="memory-segments">Memory Segments</h3>

<table>
  <thead>
    <tr>
      <th>VM Segment</th>
      <th>Meaning</th>
      <th>Assembly Base</th>
      <th>Address Computation</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>argument</strong></td>
      <td>Function arguments</td>
      <td><code class="language-plaintext highlighter-rouge">ARG</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = M + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">push argument 2</code> → <code class="language-plaintext highlighter-rouge">*(ARG + 2)</code></td>
    </tr>
    <tr>
      <td><strong>local</strong></td>
      <td>Local variables of the current function</td>
      <td><code class="language-plaintext highlighter-rouge">LCL</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = M + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">pop local 0</code> → <code class="language-plaintext highlighter-rouge">*(LCL + 0)</code></td>
    </tr>
    <tr>
      <td><strong>this</strong></td>
      <td>“this” pointer area</td>
      <td><code class="language-plaintext highlighter-rouge">THIS</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = M + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">push this 1</code> → <code class="language-plaintext highlighter-rouge">*(THIS + 1)</code></td>
    </tr>
    <tr>
      <td><strong>that</strong></td>
      <td>“that” pointer area</td>
      <td><code class="language-plaintext highlighter-rouge">THAT</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = M + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">pop that 2</code> → <code class="language-plaintext highlighter-rouge">*(THAT + 2)</code></td>
    </tr>
    <tr>
      <td><strong>temp</strong></td>
      <td>Temporary storage (RAM[5–12])</td>
      <td><code class="language-plaintext highlighter-rouge">5</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = 5 + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">push temp 3</code> → <code class="language-plaintext highlighter-rouge">@8</code></td>
    </tr>
    <tr>
      <td><strong>pointer</strong></td>
      <td>Stores THIS and THAT pointers (RAM[3–4])</td>
      <td><code class="language-plaintext highlighter-rouge">3</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = 3 + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">pop pointer 0</code> → <code class="language-plaintext highlighter-rouge">THIS = *(SP-1)</code></td>
    </tr>
    <tr>
      <td><strong>static</strong></td>
      <td>File-specific static variables</td>
      <td><code class="language-plaintext highlighter-rouge">16</code></td>
      <td><code class="language-plaintext highlighter-rouge">@FileName.index</code></td>
      <td><code class="language-plaintext highlighter-rouge">push static 2</code> → <code class="language-plaintext highlighter-rouge">@Foo.2</code></td>
    </tr>
    <tr>
      <td><strong>constant</strong></td>
      <td>Immediate values, not in RAM</td>
      <td>—</td>
      <td><code class="language-plaintext highlighter-rouge">D = A</code></td>
      <td><code class="language-plaintext highlighter-rouge">push constant 7</code> → <code class="language-plaintext highlighter-rouge">D = 7</code></td>
    </tr>
  </tbody>
</table>
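<p>A translator resolves each (segment, index) pair to a RAM address using the table above. A Python sketch of that mapping (the base-pointer values in the example are placeholders):</p>

```python
def effective_address(segment, index, ram):
    """Resolve a VM (segment, index) pair to a Hack RAM address."""
    pointers = {"local": 1, "argument": 2, "this": 3, "that": 4}  # LCL/ARG/THIS/THAT
    fixed = {"temp": 5, "pointer": 3, "static": 16}
    if segment in pointers:
        return ram[pointers[segment]] + index  # indirect: base stored in RAM
    if segment in fixed:
        return fixed[segment] + index          # direct: base is the address itself
    raise ValueError(f"'{segment}' has no RAM address (e.g. constant)")

ram = {1: 300, 2: 400, 3: 3000, 4: 3010}   # example base-pointer values
print(effective_address("local", 2, ram))  # → 302
print(effective_address("temp", 3, ram))   # → 8
```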

<h3 id="basic-syntax-of-vm-language">Basic Syntax of VM Language</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre>push constant i
push segment i
pop segment i
add | sub | neg | eq | gt | lt | and | or | not
label X
goto X
if-goto X
function f k
call f n
return
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="translations-from-vm-language-to-assembly">Translations from VM language to Assembly</h3>

<p>The C++ <a href="https://github.com/guozijn/compsys/blob/main/prac6/VMTranslator/VMTranslator.cpp"><code class="language-plaintext highlighter-rouge">VMTranslator</code></a> writes structured templates for every VM command. Values destined for the stack are staged in <code class="language-plaintext highlighter-rouge">D</code> and finalised with the shared push tail:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">pop</code> commands targeting base-pointer segments, the absolute address is cached in <code class="language-plaintext highlighter-rouge">R13</code> before the stack value is stored. The assembly snippets below use placeholders such as <code class="language-plaintext highlighter-rouge">index</code> and <code class="language-plaintext highlighter-rouge">FunctionName</code> that the translator substitutes at translation time.</p>

<h4 id="push-segment-index">push segment index</h4>

<ul>
  <li><code class="language-plaintext highlighter-rouge">push constant index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>@index
D=A
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">push local|argument|this|that index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>@index
D=A
@SEG            // SEG ∈ {LCL, ARG, THIS, THAT}
A=M+D
D=M
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">push pointer 0|1</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>@THIS|THAT
D=M
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">push temp index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>@index
D=A
@5
A=D+A
D=M
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">push static index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>@index
D=A
@16
A=D+A
D=M
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
</ul>

<h4 id="pop-segment-index">pop segment index</h4>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pop local|argument|this|that index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre>@index
D=A
@SEG            // SEG ∈ {LCL, ARG, THIS, THAT}
D=M+D
@13
M=D
@SP
AM=M-1
D=M
@13
A=M
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">pop pointer 0</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
@THIS
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">pop pointer 1</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
@THAT
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">pop temp index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre>@index
D=A
@5
D=D+A
@13
M=D
@SP
AM=M-1
D=M
@13
A=M
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">pop static index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre>@index
D=A
@16
D=D+A
@13
M=D
@SP
AM=M-1
D=M
@13
A=M
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
</ul>

<h4 id="arithmetic-and-logic">Arithmetic and logic</h4>

<p><code class="language-plaintext highlighter-rouge">add</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
A=A-1
M=M+D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">sub</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
A=A-1
M=M-D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">neg</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>@SP
A=M-1
M=-M
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">and</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
A=A-1
M=M&amp;D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">or</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
A=A-1
M=M|D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">not</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>@SP
A=M-1
M=!M
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="comparisons">Comparisons</h4>

<p><code class="language-plaintext highlighter-rouge">eq</code>, <code class="language-plaintext highlighter-rouge">gt</code>, and <code class="language-plaintext highlighter-rouge">lt</code> share a helper that emits unique labels (<code class="language-plaintext highlighter-rouge">CMP_TRUE0</code>, <code class="language-plaintext highlighter-rouge">CMP_END0</code>, …). Example output for <code class="language-plaintext highlighter-rouge">eq</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
@SP
AM=M-1
D=M-D
@CMP_TRUE0
D;JEQ
D=0
@CMP_END0
0;JMP
(CMP_TRUE0)
D=-1
(CMP_END0)
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">gt</code> substitutes <code class="language-plaintext highlighter-rouge">JGT</code> and <code class="language-plaintext highlighter-rouge">lt</code> uses <code class="language-plaintext highlighter-rouge">JLT</code> in the conditional jump.</p>
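<p>The helper essentially interpolates a jump mnemonic and a fresh label counter into one template. A hedged Python rendition of that idea (function and variable names are ours, not the translator’s):</p>

```python
JUMPS = {"eq": "JEQ", "gt": "JGT", "lt": "JLT"}
_counter = 0

def emit_comparison(cmd):
    """Emit Hack assembly for eq/gt/lt with unique labels per call."""
    global _counter
    true_label, end_label = f"CMP_TRUE{_counter}", f"CMP_END{_counter}"
    _counter += 1
    return "\n".join([
        "@SP", "AM=M-1", "D=M",       # pop y
        "@SP", "AM=M-1", "D=M-D",     # pop x, compute x - y
        f"@{true_label}", f"D;{JUMPS[cmd]}",
        "D=0",                        # false → 0
        f"@{end_label}", "0;JMP",
        f"({true_label})", "D=-1",    # true → -1 (all bits set)
        f"({end_label})",
        "@SP", "AM=M+1", "A=A-1", "M=D",  # push result
    ])

print(emit_comparison("gt").splitlines()[7])  # → D;JGT
```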

<h4 id="branching-commands">Branching commands</h4>

<ul>
  <li><code class="language-plaintext highlighter-rouge">label X</code>: <code class="language-plaintext highlighter-rouge">(X)</code></li>
  <li><code class="language-plaintext highlighter-rouge">goto X</code>:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>@X
0;JMP
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">if-goto X</code>:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
@X
D;JNE
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
</ul>

<h4 id="function-commands">Function commands</h4>

<p><code class="language-plaintext highlighter-rouge">function FunctionName nLocals</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="rouge-code"><pre>(FunctionName)
@nLocals
D=A
@13
M=D
(FunctionName$initLocalsLoop)
@13
D=M
@FunctionName$initLocalsEnd
D;JEQ
@SP
AM=M+1
A=A-1
M=0
@13
M=M-1
@FunctionName$initLocalsLoop
0;JMP
(FunctionName$initLocalsEnd)
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">call FunctionName nArgs</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
</pre></td><td class="rouge-code"><pre>@FunctionName$ret.0   // counter increments per call site
D=A
@SP
AM=M+1
A=A-1
M=D
@LCL
D=M
@SP
AM=M+1
A=A-1
M=D
@ARG
D=M
@SP
AM=M+1
A=A-1
M=D
@THIS
D=M
@SP
AM=M+1
A=A-1
M=D
@THAT
D=M
@SP
AM=M+1
A=A-1
M=D
@SP
D=M
@5
D=D-A
@nArgs
D=D-A
@ARG
M=D
@SP
D=M
@LCL
M=D
@FunctionName
0;JMP
(FunctionName$ret.0)
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">return</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
</pre></td><td class="rouge-code"><pre>@LCL
D=M
@13
M=D              // frame = LCL
@5
A=D-A
D=M
@14
M=D              // ret = *(frame-5)
@SP
AM=M-1
D=M
@ARG
A=M
M=D              // *ARG = pop()
@ARG
D=M+1
@SP
M=D              // SP = ARG + 1
@13
AM=M-1
D=M
@THAT
M=D
@13
AM=M-1
D=M
@THIS
M=D
@13
AM=M-1
D=M
@ARG
M=D
@13
AM=M-1
D=M
@LCL
M=D
@14
A=M
0;JMP            // goto ret
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">R13</code> and <code class="language-plaintext highlighter-rouge">R14</code> serve as the frame scratch space and cached return address.</p>
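<p>The pointer arithmetic in the return sequence is easy to mis-order, so it helps to simulate the same steps over a dictionary standing in for RAM (frame layout as in the assembly above; the sample frame values are invented for illustration):</p>

```python
def vm_return(ram):
    """Mirror the Hack 'return' sequence: restore the caller frame, reposition SP."""
    SP, LCL, ARG, THIS, THAT = 0, 1, 2, 3, 4
    frame = ram[LCL]                  # R13 in the assembly: frame = LCL
    ret = ram[frame - 5]              # R14: cached return address
    ram[ram[ARG]] = ram[ram[SP] - 1]  # *ARG = pop(): return value for the caller
    ram[SP] = ram[ARG] + 1            # SP = ARG + 1
    ram[THAT] = ram[frame - 1]
    ram[THIS] = ram[frame - 2]
    ram[ARG] = ram[frame - 3]
    ram[LCL] = ram[frame - 4]
    return ret                        # caller resumes here (goto ret)

# Invented frame: saved ret-addr/LCL/ARG/THIS/THAT at 305-309,
# zero locals (LCL = 310), return value 7 on top of the stack.
ram = {0: 311, 1: 310, 2: 304, 3: 9000, 4: 9010,
       304: 0, 305: 42, 306: 100, 307: 200, 308: 3000, 309: 3010, 310: 7}
ret = vm_return(ram)
print(ret, ram[0], ram[304])  # → 42 305 7
```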

<h3 id="program-control">Program Control</h3>
<h4 id="subroutines--functions">Subroutines &amp; Functions</h4>
<h5 id="stack-implementation">Stack Implementation</h5>
<p><img src="https://images.zjguo.com/vm-stack-implementation.png" alt="vm-stack-implementation.png" /></p>

<h5 id="call-implementation">Call Implementation</h5>
<p><img src="https://images.zjguo.com/vm-call-command.png" alt="vm-call-command.png" /></p>

<h5 id="function-implementation">Function Implementation</h5>
<p><img src="https://images.zjguo.com/vm-function-command.png" alt="vm-function-command.png" /></p>

<h5 id="return-implementation">Return Implementation</h5>
<p><img src="https://images.zjguo.com/vm-return-command.png" alt="vm-return-command.png" /></p>

<h3 id="memory-architecture-of-the-hack-virtual-machine">Memory Architecture of the Hack Virtual Machine</h3>
<p>The Hack Virtual Machine is implemented on a stack-based architecture.
All function calls, arguments, and local variables live in RAM.
The CPU executes instructions fetched from ROM, while the stack resides in RAM starting at address 256.
The following diagram summarises the relationship between ROM, RAM, and the stack pointer segments.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
</pre></td><td class="rouge-code"><pre>                    ┌──────────────────────────────────────────┐
                    │                HACK CPU                  │
                    │──────────────────────────────────────────│
                    │  Registers:                              │
                    │   A  → Address register                  │
                    │   D  → Data register                     │
                    │   PC → Program counter                   │
                    │                                          │
                    │  Control signals: use A/D/PC to access   │
                    │  RAM or ROM                              │
                    └──────────────────────────────────────────┘
                                       │
                                       │ (A register provides address)
                                       ▼

    ┌───────────────────────────────────────────┐
    │                   ROM                     │
    │───────────────────────────────────────────│
    │ Stores machine code (.hack instructions)  │
    │ Loaded from compiled .asm file            │
    │ PC fetches sequentially                   │
    │ Read-only memory                          │
    └───────────────────────────────────────────┘
                                       │
                                       ▼
    ┌───────────────────────────────────────────┐
    │                   RAM                     │
    │───────────────────────────────────────────│
    │ Address range: 0 - 32767                  │
    │                                           │
    │ 0–15 : General-purpose registers          │
    │   ├─ R0  = SP    (Stack Pointer)          │
    │   ├─ R1  = LCL   (Local segment base)     │
    │   ├─ R2  = ARG   (Argument segment base)  │
    │   ├─ R3  = THIS  (This segment base)      │
    │   ├─ R4  = THAT  (That segment base)      │
    │   ├─ R5–R12 = Temp segment (8 slots)      │
    │   ├─ R13–R15 = General temporary registers│
    │                                           │
    │ 16–255 : Static variables (per file)      │
    │                                           │
    │ 256–2047 : Stack segment                  │
    │   ↑                                       │
    │   │ push → write at stack top             │
    │   │ pop  → remove from stack top          │
    │   │ SP points to next free slot           │
    │   │-------------------------------------- │
    │   │  ← Stack base (256)                   │
    │   │  [Return address]                     │
    │   │  [Saved LCL, ARG, THIS, THAT]         │
    │   │  [Local variables local 0..n]         │
    │   │  [Working stack / evaluation values]  │
    │   │-------------------------------------- │
    │   ↓                                       │
    │                                           │
    │ 2048–16383 : Heap / arrays / objects      │
    │                                           │
    │ 16384–24575 : Screen memory map           │
    │ 24576–32767 : Keyboard input              │
    └───────────────────────────────────────────┘
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="vm-notes">VM Notes</h3>
<ul>
  <li>The VM provides portability by hiding the Hack memory details.</li>
  <li>SP management ensures correct push/pop order.</li>
  <li>Always use unique labels for comparisons and calls.</li>
  <li>ROM addresses are just sequential instruction numbers; label declarations don’t consume an address, they alias the ROM address of the next instruction.</li>
  <li><code class="language-plaintext highlighter-rouge">call</code>: push return address and segment pointers, then jump to function.</li>
  <li><code class="language-plaintext highlighter-rouge">return</code>: restore caller frame, reposition <code class="language-plaintext highlighter-rouge">SP</code>, and jump back to return address.</li>
  <li><code class="language-plaintext highlighter-rouge">pop constant i</code> is invalid: <code class="language-plaintext highlighter-rouge">constant</code> is a virtual segment with no backing memory to pop into.</li>
</ul>
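<p>The <code>call</code> and <code>return</code> bookkeeping described above can be modelled in Python. This is an illustrative sketch only: the flat <code>ram</code> list and the function signatures are assumptions made for the example, whereas the real VM translator emits Hack assembly.</p>

```python
# Illustrative model of the Hack VM calling convention, not the
# book's translator. `ram` is a flat list standing in for RAM.

def vm_call(ram, sp, return_addr, n_args, lcl, arg, this, that):
    """Push the caller's frame, then set up the callee's ARG and LCL."""
    for value in (return_addr, lcl, arg, this, that):
        ram[sp] = value
        sp += 1
    arg = sp - 5 - n_args      # callee's arguments sit below the frame
    lcl = sp                   # callee's locals start at the stack top
    return sp, lcl, arg

def vm_return(ram, sp, lcl, arg):
    """Restore the caller's frame and leave the return value for it."""
    frame = lcl                # R13 in the assembly implementation
    ret = ram[frame - 5]       # R14: cached return address
    ram[arg] = ram[sp - 1]     # return value replaces argument 0
    sp = arg + 1
    that, this, caller_arg, caller_lcl = (ram[frame - 1], ram[frame - 2],
                                          ram[frame - 3], ram[frame - 4])
    return sp, caller_lcl, caller_arg, this, that, ret
```

<p>Running a call with two arguments at RAM[256..257] pushes the five-word frame, and the matching return restores every caller pointer and reuses argument 0 for the return value.</p>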

<h2 id="high-level-language-jack">High-Level Language (Jack)</h2>
<p>Jack source compiles to VM commands, which in turn map to Hack assembly.</p>

<h3 id="segment-mapping">Segment Mapping</h3>

<table>
  <thead>
    <tr>
      <th>Jack variable kind</th>
      <th>VM segment</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>static</td>
      <td>static</td>
      <td>Per class file scope</td>
    </tr>
    <tr>
      <td>field</td>
      <td>this</td>
      <td>Base in pointer 0</td>
    </tr>
    <tr>
      <td>var, local</td>
      <td>local</td>
      <td>Subroutine private</td>
    </tr>
    <tr>
      <td>argument</td>
      <td>argument</td>
      <td>Call site provided</td>
    </tr>
    <tr>
      <td>array base</td>
      <td>this or local or argument</td>
      <td>Depends on declaration</td>
    </tr>
  </tbody>
</table>

<h3 id="subroutine-kinds">Subroutine Kinds</h3>

<table>
  <thead>
    <tr>
      <th>Jack subroutine</th>
      <th>VM header</th>
      <th>Entry actions</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>function</td>
      <td>function ClassName.func k</td>
      <td>No implicit this</td>
    </tr>
    <tr>
      <td>method</td>
      <td>function ClassName.method k</td>
      <td>push argument 0, pop pointer 0</td>
    </tr>
    <tr>
      <td>constructor</td>
      <td>function ClassName.new k</td>
      <td>push constant fieldCount, call Memory.alloc 1, pop pointer 0</td>
    </tr>
  </tbody>
</table>
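<p>The entry actions in the table can be sketched as a small code-generation helper. The helper name and signature are illustrative assumptions, not the book's compiler API.</p>

```python
# Hedged sketch of the per-kind VM preamble; `subroutine_entry` is an
# illustrative name, not part of the Jack compiler specification.

def subroutine_entry(kind, class_name, name, n_locals, n_fields=0):
    """Emit the VM header plus kind-specific setup of pointer 0."""
    lines = [f"function {class_name}.{name} {n_locals}"]
    if kind == "method":
        # bind `this` to the receiver passed as argument 0
        lines += ["push argument 0", "pop pointer 0"]
    elif kind == "constructor":
        # allocate the object, then bind `this` to its base address
        lines += [f"push constant {n_fields}",
                  "call Memory.alloc 1",
                  "pop pointer 0"]
    return lines
```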

<h3 id="statements-minimal-templates">Statements, minimal templates</h3>
<ol>
  <li>let x = expr
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>... code for expr
pop segment index        // x
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>let a[i] = expr
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre>... code for a
... code for i
add
... code for expr
pop temp 0
pop pointer 1
push temp 0
pop that 0
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>y = a[i]
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>... code for a
... code for i
add
pop pointer 1
push that 0
... assign to y via pop segment index
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>do subCall(args)
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>... push object ref if method call
... push args
call QualName nArgs
pop temp 0               // discard return
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>if (cond) { S1 } else { S2 }
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>... code for cond
if-goto IF_TRUE$n
goto IF_FALSE$n
label IF_TRUE$n
... S1
goto IF_END$n
label IF_FALSE$n
... S2
label IF_END$n
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>while (cond) { S }
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>label WHILE_EXP$n
... code for cond
not
if-goto WHILE_END$n
... S
goto WHILE_EXP$n
label WHILE_END$n
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>return, return expr
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>push constant 0          // void
return
</pre></td></tr></tbody></table></code></pre></div>    </div>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>... code for expr
return
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
</ol>
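<p>The <code>if</code> and <code>while</code> templates above are purely mechanical, so a compiler can emit them from a per-subroutine counter. The following Python helpers are an illustrative sketch of that emission; the names and signatures are assumptions.</p>

```python
# Hypothetical code-generation helpers for the if/while templates;
# `cond_code`, `s1`, `s2` are lists of already-compiled VM commands.

def compile_while(cond_code, body_code, n):
    """Emit the while template using per-subroutine counter n."""
    return ([f"label WHILE_EXP${n}"] + cond_code
            + ["not", f"if-goto WHILE_END${n}"] + body_code
            + [f"goto WHILE_EXP${n}", f"label WHILE_END${n}"])

def compile_if(cond_code, s1, s2, n):
    """Emit the if/else template using per-subroutine counter n."""
    return (cond_code
            + [f"if-goto IF_TRUE${n}", f"goto IF_FALSE${n}",
               f"label IF_TRUE${n}"] + s1
            + [f"goto IF_END${n}", f"label IF_FALSE${n}"] + s2
            + [f"label IF_END${n}"])
```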

<h3 id="expressions-operators">Expressions, operators</h3>

<table>
  <thead>
    <tr>
      <th>Jack</th>
      <th>VM expansion</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>- x</td>
      <td>neg</td>
    </tr>
    <tr>
      <td>not x</td>
      <td>not</td>
    </tr>
    <tr>
      <td>x + y</td>
      <td>add</td>
    </tr>
    <tr>
      <td>x − y</td>
      <td>sub</td>
    </tr>
    <tr>
      <td>x &amp; y</td>
      <td>and</td>
    </tr>
    <tr>
      <td>x | y</td>
      <td>or</td>
    </tr>
    <tr>
      <td>x &lt; y</td>
      <td>lt</td>
    </tr>
    <tr>
      <td>x &gt; y</td>
      <td>gt</td>
    </tr>
    <tr>
      <td>x = y</td>
      <td>eq</td>
    </tr>
    <tr>
      <td>x * y</td>
      <td>call Math.multiply 2</td>
    </tr>
    <tr>
      <td>x / y</td>
      <td>call Math.divide 2</td>
    </tr>
  </tbody>
</table>

<h3 id="literals-and-keywords">Literals and keywords</h3>

<table>
  <thead>
    <tr>
      <th>Jack</th>
      <th>VM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>integer n</td>
      <td>push constant n</td>
    </tr>
    <tr>
      <td>true</td>
      <td>push constant 0, not</td>
    </tr>
    <tr>
      <td>false, null</td>
      <td>push constant 0</td>
    </tr>
    <tr>
      <td>this</td>
      <td>push pointer 0</td>
    </tr>
  </tbody>
</table>

<h3 id="string-literal-abc">String literal “abc”</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre>push constant 3
call String.new 1
push constant 97
call String.appendChar 2
push constant 98
call String.appendChar 2
push constant 99
call String.appendChar 2
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The string length is pushed first, as the argument to <code class="language-plaintext highlighter-rouge">String.new</code>; each character code is then appended with <code class="language-plaintext highlighter-rouge">String.appendChar</code>.</p>
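<p>This scheme generalises to any literal, which a compiler can emit with a few lines of code. A sketch, assuming an illustrative helper name:</p>

```python
# Sketch of string-literal compilation: allocate, then append each
# character code (the helper name is illustrative).

def compile_string(s):
    code = [f"push constant {len(s)}", "call String.new 1"]
    for ch in s:
        code += [f"push constant {ord(ch)}", "call String.appendChar 2"]
    return code
```

<p>For <code>"abc"</code> this reproduces exactly the eight commands listed above.</p>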

<h3 id="calls-qualification">Calls, qualification</h3>

<table>
  <thead>
    <tr>
      <th>Jack form</th>
      <th>VM call name</th>
      <th>Arg0 rule</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>obj.m(a,b)</td>
      <td>ClassName.m</td>
      <td>push obj reference then a,b</td>
    </tr>
    <tr>
      <td>Class.f(a,b)</td>
      <td>Class.f</td>
      <td>no implicit object</td>
    </tr>
    <tr>
      <td>m(a,b) inside a method</td>
      <td>ClassName.m</td>
      <td>push pointer 0 then a,b</td>
    </tr>
  </tbody>
</table>

<h3 id="label-policy">Label policy</h3>
<p>Use per-subroutine counters; labels must be unique within each function, for example <code class="language-plaintext highlighter-rouge">IF_TRUE$n</code>, <code class="language-plaintext highlighter-rouge">WHILE_END$n</code>.</p>

<h3 id="code-generation">Code Generation</h3>

<h4 id="handling-objects">Handling Objects</h4>

<p><img src="https://images.zjguo.com/jack-handling-objects-1.png" alt="jack-handling-objects-1.png" /></p>

<p><img src="https://images.zjguo.com/jack-handling-objects-2.png" alt="jack-handling-objects-2.png" /></p>

<h4 id="handling-arrays">Handling Arrays</h4>

<p><img src="https://images.zjguo.com/jack-handling-arrays-1.png" alt="jack-handling-arrays-1.png" /></p>

<p><img src="https://images.zjguo.com/jack-handling-arrays-2.png" alt="jack-handling-arrays-2.png" /></p>

<h4 id="example">Example</h4>

<p><img src="https://images.zjguo.com/jack-example.png" alt="jack-example.png" /></p>]]></content><author><name></name></author><category term="Learning" /><category term="nand2tetris" /><category term="computer system" /><category term="notes" /><summary type="html"><![CDATA[Boolean Logic De Morgan’s Law]]></summary></entry><entry><title type="html">Revisiting Nand2Tetris: Building a Computer from Scratch</title><link href="https://guozijn.github.io/learning/2025/10/05/nand2tetris.html" rel="alternate" type="text/html" title="Revisiting Nand2Tetris: Building a Computer from Scratch" /><published>2025-10-05T00:00:00+09:30</published><updated>2025-10-05T00:00:00+09:30</updated><id>https://guozijn.github.io/learning/2025/10/05/nand2tetris</id><content type="html" xml:base="https://guozijn.github.io/learning/2025/10/05/nand2tetris.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In an age where computing systems are defined by abstraction layers and pre-built frameworks, the <em>Nand2Tetris</em> project, designed by Noam Nisan and Shimon Schocken, offers a rare opportunity to return to the foundations of computer science. This educational journey begins with a single universal logic gate, the NAND gate, and gradually guides learners toward constructing a fully functioning computer capable of running a high-level programming language and simple applications such as the game Tetris. By bridging hardware architecture, machine language, operating systems, and compiler design, Nand2Tetris provides an integrated understanding of how each layer of computing interacts to form a cohesive whole.</p>

<hr />

<h2 id="chapter-1-boolean-logic">Chapter 1: Boolean Logic</h2>

<p>This chapter begins with <strong>Boolean algebra</strong> and the <strong>NAND gate</strong>, the universal logic gate from which all others can be constructed. Students implement basic gates such as NOT, AND, OR, and XOR, laying the foundation for digital computation.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>A ----\                
       NAND ----&gt; Output
B ----/
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Through these constructions, learners understand how simple gates combine to form complex logical circuits.</p>
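<p>These constructions are easy to check in Python by deriving every gate from NAND alone. This is an illustrative model of the logic, not the course's HDL:</p>

```python
# All gates built from NAND alone; inputs and outputs are 0/1.

def NAND(a, b):
    return 0 if (a and b) else 1

def NOT(a):
    return NAND(a, a)

def AND(a, b):
    return NOT(NAND(a, b))

def OR(a, b):
    return NAND(NOT(a), NOT(b))

def XOR(a, b):
    return AND(OR(a, b), NAND(a, b))
```

<p>Enumerating the four input pairs reproduces each gate's truth table, and also verifies De Morgan's law from the next section.</p>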

<h3 id="de-morgans-law">De Morgan’s Law</h3>

\[\overline{AB} = \overline{A} + \overline{B}\]

\[\overline{A + B} = \overline{A}\,\overline{B}\]

<hr />

<h2 id="chapter-2-boolean-arithmetic">Chapter 2: Boolean Arithmetic</h2>

<p>The second chapter focuses on <strong>binary arithmetic</strong>. Using logic gates, students build half-adders and full-adders, then chain them to construct multi-bit adders capable of handling binary numbers.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>A, B ----&gt; XOR ----&gt; Sum
A, B ----&gt; AND ----&gt; Carry
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This establishes how computers perform arithmetic at the hardware level.</p>
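<p>The chaining described above can be sketched in Python: a half-adder, a full-adder built from two of them, and a 16-bit ripple-carry adder. An illustrative model, not the course's HDL chips:</p>

```python
# Adder chain built up from single-bit pieces.

def half_adder(a, b):
    return a ^ b, a & b                 # (sum, carry)

def full_adder(a, b, c):
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, c)
    return s2, c1 | c2                  # at most one carry can be set

def add16(x, y):
    """Ripple-carry addition of two 16-bit words, wrapping on overflow."""
    carry, out = 0, 0
    for i in range(16):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out
```

<p>Like the hardware, the final carry out of bit 15 is discarded, so the addition wraps modulo 2<sup>16</sup>.</p>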

<hr />

<h2 id="chapter-3-sequential-logic">Chapter 3: Sequential Logic</h2>

<p>Sequential logic introduces <strong>state</strong>—the ability to remember information. Using feedback loops and flip-flops, students design circuits that store data over time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>Input ----&gt; NAND ----&gt; Output
              ^            |
              +------------+
</pre></td></tr></tbody></table></code></pre></div></div>

<p>These principles lead to the design of <strong>registers</strong> and <strong>counters</strong>, key elements of memory systems.</p>

<hr />

<h2 id="chapter-4-machine-language">Chapter 4: Machine Language</h2>

<p>Here, students are introduced to the <strong>Hack machine language</strong>, the instruction set that the computer will eventually execute. They learn how the CPU interprets binary codes as instructions for computation and memory manipulation.</p>

<p>Example program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>@2
D=A
@3
D=D+A
@0
M=D
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This simple example adds two numbers and stores the result in memory.</p>

<hr />

<h2 id="chapter-5-computer-architecture">Chapter 5: Computer Architecture</h2>

<p>This chapter integrates earlier components into a complete <strong>CPU</strong>. Students combine the ALU (Arithmetic Logic Unit), registers, and program counter to build a central processing unit capable of running the Hack machine language.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>Instruction --&gt; Decoder --&gt; Control Bits
Registers --&gt; ALU --&gt; Output + Flags
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This marks the transition from circuit design to system-level computation.</p>

<hr />

<h2 id="chapter-6-assembler">Chapter 6: Assembler</h2>

<p>With the hardware ready, students build an <strong>assembler</strong> to translate Hack assembly language into binary machine code. The assembler resolves symbolic labels and memory variables into numeric addresses.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>@LOOP  →  @4    (label: ROM address of the next instruction)
@i     →  @16   (variable: allocated in RAM from address 16)
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This software bridge allows humans to program the hardware more effectively.</p>
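<p>The heart of this translation is a two-pass symbol table. The following Python sketch illustrates the idea under simplifying assumptions (whitespace and comments already stripped); it is not the book's reference assembler:</p>

```python
# Hypothetical two-pass symbol resolver for Hack assembly.

PREDEFINED = {"SP": 0, "LCL": 1, "ARG": 2, "THIS": 3, "THAT": 4,
              "SCREEN": 16384, "KBD": 24576,
              **{f"R{i}": i for i in range(16)}}

def resolve_symbols(lines):
    symbols = dict(PREDEFINED)
    # Pass 1: a label aliases the ROM address of the next instruction
    # and consumes no address itself.
    instructions, addr = [], 0
    for line in lines:
        if line.startswith("("):
            symbols[line.strip("()")] = addr
        else:
            instructions.append(line)
            addr += 1
    # Pass 2: unknown @symbols are variables, allocated from RAM[16] up.
    next_var, resolved = 16, []
    for line in instructions:
        if line.startswith("@") and not line[1:].isdigit():
            name = line[1:]
            if name not in symbols:
                symbols[name] = next_var
                next_var += 1
            line = f"@{symbols[name]}"
        resolved.append(line)
    return resolved
```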

<hr />

<h2 id="chapter-7-virtual-machine-i--stack-arithmetic">Chapter 7: Virtual Machine I — Stack Arithmetic</h2>

<p>The <strong>Virtual Machine (VM)</strong> language introduces a stack-based computation model that abstracts hardware operations. It sits between the Jack high-level language and the Hack assembly language, providing a clean interface for arithmetic, logic, and memory commands.</p>

<p>All computations occur on a stack using <code class="language-plaintext highlighter-rouge">push</code> and <code class="language-plaintext highlighter-rouge">pop</code> instructions. Operands are pushed onto the stack, an operation (like <code class="language-plaintext highlighter-rouge">add</code> or <code class="language-plaintext highlighter-rouge">sub</code>) is performed, and the result is stored back on top.</p>

<p>Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>push constant 7
push constant 8
add
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Execution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>Stack: [7] → [7,8] → add → [15]
</pre></td></tr></tbody></table></code></pre></div></div>
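<p>The execution trace above can be reproduced with a few lines of Python. This is an illustrative mini-interpreter for just these commands, not the course's VM emulator:</p>

```python
# Minimal stack-machine interpreter for `push constant`, `add`, `sub`.

def run_vm(commands):
    stack = []
    for cmd in commands:
        parts = cmd.split()
        if parts[:2] == ["push", "constant"]:
            stack.append(int(parts[2]))
        elif cmd == "add":
            y, x = stack.pop(), stack.pop()
            stack.append(x + y)
        elif cmd == "sub":
            y, x = stack.pop(), stack.pop()
            stack.append(x - y)
    return stack
```

<p>Note that the second operand popped is <code>x</code>, so <code>sub</code> computes <code>x - y</code> in stack order.</p>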

<p>The VM defines memory segments that map to hardware:</p>

<table>
  <thead>
    <tr>
      <th>Segment</th>
      <th>Purpose</th>
      <th>Hack Mapping</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>constant</td>
      <td>literal values</td>
      <td>none</td>
    </tr>
    <tr>
      <td>local</td>
      <td>function locals</td>
      <td>RAM[LCL]</td>
    </tr>
    <tr>
      <td>argument</td>
      <td>function args</td>
      <td>RAM[ARG]</td>
    </tr>
    <tr>
      <td>this/that</td>
      <td>object refs</td>
      <td>RAM[THIS]/RAM[THAT]</td>
    </tr>
    <tr>
      <td>temp</td>
      <td>temporary</td>
      <td>RAM[5–12]</td>
    </tr>
    <tr>
      <td>pointer</td>
      <td>controls this/that</td>
      <td>RAM[3–4]</td>
    </tr>
    <tr>
      <td>static</td>
      <td>per-file static vars</td>
      <td>RAM[16+]</td>
    </tr>
  </tbody>
</table>

<p>Arithmetic and logic commands include:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>add, sub, neg, eq, gt, lt, and, or, not
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Example translation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>// VM: add
@SP
AM=M-1
D=M
A=A-1
M=M+D
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This translator layer introduces structured computation independent of physical memory layout and prepares for Chapter 8, which adds branching and function control.</p>
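<p>The five-instruction expansion shown for <code>add</code> follows a fixed pattern shared by all binary operators, so a translator can generate it from a table. A sketch, where only the <code>add</code> row is taken from the listing above and the others follow the same pattern:</p>

```python
# One translation scheme for binary stack operators: pop y into D,
# then fold it into x at the new stack top.

BINARY_OPS = {"add": "M=M+D", "sub": "M=M-D", "and": "M=M&D", "or": "M=M|D"}

def translate_binary(op):
    return ["@SP", "AM=M-1", "D=M", "A=A-1", BINARY_OPS[op]]
```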

<hr />

<h2 id="chapter-8-virtual-machine-ii--program-control">Chapter 8: Virtual Machine II — Program Control</h2>

<p>Extending the VM, this chapter adds <strong>program control</strong> capabilities like branching, looping, and function calls. It demonstrates how higher-level logic is implemented atop a stack-based execution model.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>function Main.fibonacci 0
push argument 0
push constant 2
lt
if-goto BASE_CASE
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="chapter-9-high-level-language">Chapter 9: High-Level Language</h2>

<p>Students are introduced to <strong>Jack</strong>, a simple, object-based language. Jack programs are compiled into VM code, showing the bridge from human-readable syntax to machine-executable logic.</p>

<p>Example:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="kd">class</span> <span class="nc">Main</span> <span class="o">{</span>
  <span class="n">function</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">()</span> <span class="o">{</span>
    <span class="k">do</span> <span class="nc">Output</span><span class="o">.</span><span class="na">printString</span><span class="o">(</span><span class="s">"Hello, world!"</span><span class="o">);</span>
    <span class="k">return</span><span class="o">;</span>
  <span class="o">}</span>
<span class="o">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="chapter-10-compiler-i--syntax-analysis">Chapter 10: Compiler I — Syntax Analysis</h2>

<p>Here, the compiler is built. Students first construct a <strong>syntax analyser</strong> that parses Jack programs into structured representations (parse trees). This teaches the foundations of compiler design.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>Jack Source → Tokeniser → Parser → Parse Tree
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="chapter-11-compiler-ii--code-generation">Chapter 11: Compiler II — Code Generation</h2>

<p>In this chapter, students implement <strong>code generation</strong>, translating parsed Jack syntax into executable VM commands. This finalises the high-level language pipeline from source code to VM bytecode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>Jack → VM Code → Assembly → Binary → Execution
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="chapter-12-operating-system">Chapter 12: Operating System</h2>

<p>Students write the <strong>Jack operating system (OS)</strong>, implementing essential libraries like <code class="language-plaintext highlighter-rouge">Math</code>, <code class="language-plaintext highlighter-rouge">Memory</code>, <code class="language-plaintext highlighter-rouge">String</code>, and <code class="language-plaintext highlighter-rouge">Array</code>. These provide higher-level abstractions that simplify application development.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>+------------------+
| Application Code |
| OS Libraries     |
| VM + Compiler    |
| CPU + Memory     |
+------------------+
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The OS marks the final layer of abstraction between hardware and user-level software.</p>

<hr />

<h2 id="chapter-13-postscript--more-fun-to-go">Chapter 13: Postscript — More Fun to Go</h2>

<p>The book concludes with reflections on further exploration. Having built a full computer system—from hardware logic to operating system—students can now explore real-world architectures, programming languages, and computer science research topics.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Nand2Tetris demystifies the complexity of computing by reconstructing it from first principles. It unifies hardware and software learning, empowering students to understand every layer of modern computation, from NAND gates to game programs.</p>]]></content><author><name></name></author><category term="Learning" /><category term="nand2tetris" /><category term="computer system" /><summary type="html"><![CDATA[Introduction In the age where computing systems are defined by abstraction layers and pre-built frameworks, the Nand2Tetris project, which was designed by Noam Nisan and Shimon Schocken, offers a rare opportunity to return to the foundations of computer science. This educational journey begins with the simplest possible logic gate, the NAND gate, and gradually guides learners toward constructing a fully functioning computer capable of running a high-level programming language and simple applications such as the game Tetris. By bridging hardware architecture, machine language, operating systems, and compiler design, Nand2Tetris provides an integrated understanding of how each layer of computing interacts to form a cohesive whole.]]></summary></entry><entry><title type="html">Multi-Layer Perceptron Neural Networks</title><link href="https://guozijn.github.io/engineering/2025/09/27/multi-layer-perceptrons.html" rel="alternate" type="text/html" title="Multi-Layer Perceptron Neural Networks" /><published>2025-09-27T00:00:00+09:30</published><updated>2025-09-27T00:00:00+09:30</updated><id>https://guozijn.github.io/engineering/2025/09/27/multi-layer-perceptrons</id><content type="html" xml:base="https://guozijn.github.io/engineering/2025/09/27/multi-layer-perceptrons.html"><![CDATA[<h2 id="key-concepts">Key Concepts</h2>
<ul>
  <li><strong>Hidden Layers</strong><br />
An MLP contains one or more hidden layers between the input and output.<br />
Each hidden layer applies a linear transformation followed by a non-linear activation function:<br />
\(H_\ell = f_\ell(H_{\ell-1} W_\ell + \mathbf{1} b_\ell^\top)\)<br />
where:
    <ul>
      <li>$H_{\ell-1}$ is the previous layer’s output (with $H_0 = X$).</li>
      <li>$W_\ell, b_\ell$ are the weight matrix and bias vector for layer $\ell$.</li>
      <li>$f_\ell(\cdot)$ is a non-linear activation (e.g., ReLU, tanh, sigmoid).</li>
    </ul>
  </li>
  <li><strong>Deep Representations</strong><br />
Multiple hidden layers allow the network to learn hierarchical feature representations.
    <ul>
      <li>Early layers capture <strong>low-level patterns</strong> (e.g., edges in images).</li>
      <li>Deeper layers capture <strong>higher-level abstractions</strong> (e.g., object shapes).</li>
    </ul>
  </li>
  <li><strong>Activation Functions</strong><br />
Unlike a classic perceptron (which often uses a step function) or logistic regression (which uses sigmoid), MLPs commonly use:
    <ul>
      <li><strong>ReLU</strong>: $f(x) = \max(0, x)$ (default in modern deep learning).</li>
      <li><strong>Tanh</strong>: rescales input to $[-1,1]$.</li>
      <li><strong>Sigmoid</strong>: mainly used in the output layer for binary classification.</li>
    </ul>
  </li>
  <li><strong>Output Layer</strong>
    <ul>
      <li>For <strong>binary classification</strong>: sigmoid function produces $p = \sigma(z)$.</li>
      <li>For <strong>multi-class classification</strong>: softmax produces probability distribution over $K$ classes:<br />
\(p_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)} \quad (k=1,\dots,K)\)</li>
    </ul>
  </li>
  <li><strong>Loss Function (Extension)</strong>
    <ul>
      <li>Binary case: same as SLP (binary cross-entropy).</li>
      <li>Multi-class case: categorical cross-entropy with one-hot labels:<br />
\(\mathcal{L} = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K y_{i,k} \log p_{i,k}\)</li>
    </ul>
  </li>
  <li>
    <p><strong>Backpropagation Through Layers</strong><br />
The error signal is propagated backward through each layer using the <strong>chain rule</strong>, enabling gradient computation for all parameters:<br />
\(\frac{\partial \mathcal{L}}{\partial W_\ell}, \quad \frac{\partial \mathcal{L}}{\partial b_\ell}, \quad \ell = 1,\dots,L\)</p>
  </li>
  <li><strong>Universal Approximation</strong><br />
With enough hidden units, an MLP can approximate any continuous function on a compact domain.<br />
This property underlies its power as a general-purpose function approximator.</li>
</ul>

<h2 id="architecture-of-a-multi-layer-perceptron-neural-network">Architecture of a Multi-Layer Perceptron Neural Network</h2>

<p><img src="https://images.zjguo.com/mlp_architecture.png" alt="mlp_architecture.png" /></p>

<h2 id="formulas">Formulas</h2>

<ul>
  <li><strong>Input</strong>
    <ul>
      <li>Mini-batch input:
\(X \in \mathbb{R}^{m \times n}\)</li>
      <li>where:
        <ul>
          <li>$m$ = batch size</li>
          <li>$n$ = number of features</li>
        </ul>
      </li>
      <li>Parameters:
\(W_1 \in \mathbb{R}^{n \times k_1}, \quad b_1 \in \mathbb{R}^{k_1}\)
\(W_2 \in \mathbb{R}^{k_1 \times k_2}, \quad b_2 \in \mathbb{R}^{k_2}\)
\(w_3 \in \mathbb{R}^{k_2}, \quad b_3 \in \mathbb{R}\)</li>
    </ul>
  </li>
  <li>
    <p><strong>Forward Propagation</strong></p>

    <ol>
      <li>
        <p><strong>Hidden Layer 1</strong>
\(H_1 = f_1\!\big( X W_1 + \mathbf{1} b_1^\top \big) \quad \in \mathbb{R}^{m \times k_1}\)</p>
      </li>
      <li>
        <p><strong>Hidden Layer 2</strong>
\(H_2 = f_2\!\big( H_1 W_2 + \mathbf{1} b_2^\top \big) \quad \in \mathbb{R}^{m \times k_2}\)</p>
      </li>
      <li>
        <p><strong>Output Pre-activation</strong>
\(z = H_2 w_3 + b_3 \mathbf{1} \quad \in \mathbb{R}^{m}\)</p>
      </li>
      <li>
        <p><strong>Sigmoid Activation</strong>
\(p = \sigma(z) = \frac{1}{1 + e^{-z}} \quad \in \mathbb{R}^{m}\)</p>
      </li>
    </ol>
  </li>
  <li>
    <p><strong>Prediction</strong>
\(\hat{y}_i =
\begin{cases}
1, &amp; \text{if } p_i \geq \tau \\
0, &amp; \text{if } p_i &lt; \tau
\end{cases}\)
with threshold $\tau = 0.5$.</p>
  </li>
  <li>
    <p><strong>Loss Function (Binary Cross-Entropy)</strong></p>

    <p>For a batch:
\(\mathcal{L} = -\frac{1}{m} \sum_{i=1}^m \Big( y_i \log(p_i) + (1-y_i)\log(1-p_i) \Big)\)</p>

    <p>Vectorised form:
\(\mathcal{L} = -\frac{1}{m} \Big[ y^\top \log p + (1-y)^\top \log (1-p) \Big]\)</p>
  </li>
  <li>
    <p><strong>Backpropagation (Gradients)</strong></p>

    <ul>
      <li>
        <p>Output layer:
\(\frac{\partial \mathcal{L}}{\partial z} = \frac{1}{m}(p - y) \quad \in \mathbb{R}^m\)</p>
      </li>
      <li>
        <p>Gradients for output weights and bias:
\(\frac{\partial \mathcal{L}}{\partial w_3} = \frac{1}{m} H_2^\top (p-y)\)
\(\frac{\partial \mathcal{L}}{\partial b_3} = \frac{1}{m} \mathbf{1}^\top (p-y)\)</p>
      </li>
      <li>
        <p>Hidden layers (chain rule through the chosen activations $f_1, f_2$; with pre-activation $Z_2 = H_1 W_2 + \mathbf{1} b_2^\top$):
\(\frac{\partial \mathcal{L}}{\partial Z_2} = \Big( \tfrac{1}{m}(p - y)\, w_3^\top \Big) \odot f_2'(Z_2), \quad
\frac{\partial \mathcal{L}}{\partial W_2} = H_1^\top \frac{\partial \mathcal{L}}{\partial Z_2}, \quad
\frac{\partial \mathcal{L}}{\partial b_2} = \Big(\frac{\partial \mathcal{L}}{\partial Z_2}\Big)^{\top} \mathbf{1}\)
with the same pattern applied once more for $W_1, b_1$.</p>
      </li>
    </ul>
  </li>
  <li>
    <p><strong>Parameter Update (Gradient Descent)</strong></p>

    <p>With learning rate $\eta &gt; 0$:
\(W_\ell \leftarrow W_\ell - \eta \frac{\partial \mathcal{L}}{\partial W_\ell}, \quad
b_\ell \leftarrow b_\ell - \eta \frac{\partial \mathcal{L}}{\partial b_\ell}
\quad (\ell = 1,2,3)\)</p>
  </li>
</ul>
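<p>The forward pass and loss above translate almost line-for-line into NumPy. A minimal sketch with small, arbitrary dimensions (all sizes and the random inputs are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k1, k2 = 8, 5, 4, 3          # batch size, features, hidden widths

# Inputs and parameters, matching the shapes in the Formulas section
X = rng.standard_normal((m, n))
W1, b1 = rng.standard_normal((n, k1)), np.zeros(k1)
W2, b2 = rng.standard_normal((k1, k2)), np.zeros(k2)
w3, b3 = rng.standard_normal(k2), 0.0
y = rng.integers(0, 2, size=m).astype(float)

relu = lambda t: np.maximum(0.0, t)

# Forward propagation: H1 -> H2 -> z -> p
H1 = relu(X @ W1 + b1)             # (m, k1); broadcasting adds b1 to every row
H2 = relu(H1 @ W2 + b2)            # (m, k2)
z = H2 @ w3 + b3                   # (m,)
p = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation

# Binary cross-entropy averaged over the batch
eps = 1e-12
loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
print(loss)
```

<p>NumPy's row-wise broadcasting plays the role of the $\mathbf{1} b^\top$ terms in the formulas, so the biases never need to be tiled explicitly.</p>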

<h2 id="implementation-and-explanation">Implementation and Explanation</h2>

<p>This section contrasts a from-scratch NumPy implementation with an equivalent PyTorch model. Both pipelines share the same data preprocessing, hyperparameters, and evaluation workflow so their learning curves can be compared directly. A correct manual implementation should produce broadly similar learning behaviour to PyTorch; large gaps usually point to implementation details such as gradient scaling, initialisation, or optimiser settings rather than to autograd itself.</p>
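<p>One quick way to localise such gaps is a finite-difference check on a single layer's backward pass. The sketch below verifies the standard softmax cross-entropy gradient, $(\text{probs} - \text{labels})/m$, against numerical differentiation, using the same column-major layout (classes &#215; samples) as the custom code that follows; the shapes and random data are illustrative.</p>

```python
import numpy as np

def ce_forward(logits, labels):
    # Softmax over classes (axis 0); each column is one sample
    shifted = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)
    loss = -np.sum(labels * np.log(probs + 1e-15)) / labels.shape[1]
    return loss, probs

rng = np.random.default_rng(1)
logits = rng.standard_normal((3, 4))          # 3 classes, 4 samples
labels = np.eye(3)[rng.integers(0, 3, 4)].T   # one-hot, shape (3, 4)

loss, probs = ce_forward(logits, labels)
grad_analytic = (probs - labels) / labels.shape[1]

# Numerical gradient, one logit entry at a time (central differences)
eps = 1e-6
grad_numeric = np.zeros_like(logits)
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        bump = np.zeros_like(logits)
        bump[i, j] = eps
        lp, _ = ce_forward(logits + bump, labels)
        lm, _ = ce_forward(logits - bump, labels)
        grad_numeric[i, j] = (lp - lm) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # should be tiny
```

<p>If this check passes for every layer in isolation, a remaining accuracy gap is more likely caused by initialisation, learning rate, or data handling than by the gradients themselves.</p>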

<h3 id="custom-version">Custom Version</h3>

<p>The custom network is assembled from lightweight building blocks: <code class="language-plaintext highlighter-rouge">Linear</code>, <code class="language-plaintext highlighter-rouge">ReLU</code>, and <code class="language-plaintext highlighter-rouge">CrossEntropy</code>. Each layer stores the activations it needs for the backward pass, computes gradients manually, and updates its parameters via SGD in the <code class="language-plaintext highlighter-rouge">step</code> routine. Utility helpers handle one-hot encoding, mini-batch iteration, normalisation, and accuracy tracking so the training loop mirrors a framework-driven workflow while keeping every tensor transformation explicit.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Linear</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">in_features</span><span class="p">,</span> <span class="n">out_features</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">out_features</span><span class="p">,</span> <span class="n">in_features</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mf">2.0</span> <span class="o">/</span> <span class="n">in_features</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">out_features</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">W</span> <span class="o">@</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span>

    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad_output</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dW</span> <span class="o">=</span> <span class="n">grad_output</span> <span class="o">@</span> <span class="bp">self</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">T</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">db</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">grad_output</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">W</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">grad_output</span>


<span class="k">class</span> <span class="nc">ReLU</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mask</span> <span class="o">=</span> <span class="n">x</span> <span class="o">&gt;</span> <span class="mi">0</span>
        <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad_output</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">grad_output</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">mask</span>


<span class="k">class</span> <span class="nc">CrossEntropy</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">logits</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
        <span class="n">shifted</span> <span class="o">=</span> <span class="n">logits</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="n">exp_scores</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">shifted</span><span class="p">)</span>
        <span class="n">probs</span> <span class="o">=</span> <span class="n">exp_scores</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">exp_scores</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">probs</span> <span class="o">=</span> <span class="n">probs</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">labels</span> <span class="o">=</span> <span class="n">labels</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">labels</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">probs</span> <span class="o">+</span> <span class="mf">1e-15</span><span class="p">))</span> <span class="o">/</span> <span class="n">labels</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">probs</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">labels</span><span class="p">)</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">labels</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>


<span class="k">class</span> <span class="nc">ThreeLayerNN</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim1</span><span class="p">,</span> <span class="n">hidden_dim2</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim1</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">act1</span> <span class="o">=</span> <span class="n">ReLU</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim1</span><span class="p">,</span> <span class="n">hidden_dim2</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">act2</span> <span class="o">=</span> <span class="n">ReLU</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim2</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">z1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">a1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">act1</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">z1</span><span class="p">)</span>
        <span class="n">z2</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">a1</span><span class="p">)</span>
        <span class="n">a2</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">act2</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">z2</span><span class="p">)</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">a2</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">logits</span>

    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad_output</span><span class="p">):</span>
        <span class="n">grad_hidden2</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_output</span><span class="p">)</span>
        <span class="n">grad_hidden2</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">act2</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_hidden2</span><span class="p">)</span>
        <span class="n">grad_hidden1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_hidden2</span><span class="p">)</span>
        <span class="n">grad_hidden1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">act1</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_hidden1</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_hidden1</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">lr</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">W</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">dW</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">b</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">db</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">W</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">dW</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">b</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">db</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">W</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">dW</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">b</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">db</span>


<span class="k">def</span> <span class="nf">one_hot</span><span class="p">(</span><span class="n">labels</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="n">num_classes</span><span class="p">)[</span><span class="n">labels</span><span class="p">].</span><span class="n">T</span>


<span class="k">def</span> <span class="nf">iterate_minibatches</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
    <span class="n">num_samples</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_samples</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">shuffle</span><span class="p">:</span>
        <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">indices</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">start</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">num_samples</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">):</span>
        <span class="n">batch_idx</span> <span class="o">=</span> <span class="n">indices</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">start</span> <span class="o">+</span> <span class="n">batch_size</span><span class="p">]</span>
        <span class="k">yield</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">batch_idx</span><span class="p">],</span> <span class="n">Y</span><span class="p">[:,</span> <span class="n">batch_idx</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">accuracy</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
    <span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">preds</span> <span class="o">==</span> <span class="n">labels</span><span class="p">)</span>


<span class="n">data_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">.</span><span class="n">home</span><span class="p">()</span> <span class="o">/</span> <span class="s">"Code"</span> <span class="o">/</span> <span class="s">"data"</span>

<span class="n">train_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"train.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">)</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"test.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">)</span>

<span class="n">y_train_full</span> <span class="o">=</span> <span class="n">train_data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">X_train_full</span> <span class="o">=</span> <span class="n">train_data</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>

<span class="n">y_test</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>

<span class="n">train_cutoff</span> <span class="o">=</span> <span class="mi">4000</span>
<span class="n">X_train_raw</span> <span class="o">=</span> <span class="n">X_train_full</span><span class="p">[:</span><span class="n">train_cutoff</span><span class="p">]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">y_train_full</span><span class="p">[:</span><span class="n">train_cutoff</span><span class="p">]</span>
<span class="n">X_val_raw</span> <span class="o">=</span> <span class="n">X_train_full</span><span class="p">[</span><span class="n">train_cutoff</span><span class="p">:]</span>
<span class="n">y_val</span> <span class="o">=</span> <span class="n">y_train_full</span><span class="p">[</span><span class="n">train_cutoff</span><span class="p">:]</span>

<span class="n">mean</span> <span class="o">=</span> <span class="n">X_train_raw</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">std</span> <span class="o">=</span> <span class="n">X_train_raw</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">+</span> <span class="mf">1e-8</span>

<span class="n">X_train_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_train_raw</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>
<span class="n">X_val_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_val_raw</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>
<span class="n">X_test_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_test</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>

<span class="n">X_train_np</span> <span class="o">=</span> <span class="n">X_train_std</span><span class="p">.</span><span class="n">T</span>
<span class="n">X_val_np</span> <span class="o">=</span> <span class="n">X_val_std</span><span class="p">.</span><span class="n">T</span>
<span class="n">X_test_np</span> <span class="o">=</span> <span class="n">X_test_std</span><span class="p">.</span><span class="n">T</span>

<span class="n">num_classes</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">hidden_units</span> <span class="o">=</span> <span class="mi">64</span>

<span class="n">Y_train</span> <span class="o">=</span> <span class="n">one_hot</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>
<span class="n">Y_val</span> <span class="o">=</span> <span class="n">one_hot</span><span class="p">(</span><span class="n">y_val</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>
<span class="n">Y_test</span> <span class="o">=</span> <span class="n">one_hot</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>

<span class="n">hidden_dim1</span> <span class="o">=</span> <span class="n">hidden_units</span>
<span class="n">hidden_dim2</span> <span class="o">=</span> <span class="n">hidden_units</span>

<span class="n">custom_model</span> <span class="o">=</span> <span class="n">ThreeLayerNN</span><span class="p">(</span><span class="n">input_dim</span><span class="o">=</span><span class="n">X_train_np</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
                            <span class="n">hidden_dim1</span><span class="o">=</span><span class="n">hidden_dim1</span><span class="p">,</span>
                            <span class="n">hidden_dim2</span><span class="o">=</span><span class="n">hidden_dim2</span><span class="p">,</span>
                            <span class="n">output_dim</span><span class="o">=</span><span class="n">num_classes</span><span class="p">)</span>
<span class="n">criterion_np</span> <span class="o">=</span> <span class="n">CrossEntropy</span><span class="p">()</span>

<span class="n">epochs</span> <span class="o">=</span> <span class="mi">50</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">64</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.1</span>

<span class="n">custom_history</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">epochs</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
    <span class="n">epoch_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
    <span class="k">for</span> <span class="n">xb</span><span class="p">,</span> <span class="n">yb</span> <span class="ow">in</span> <span class="n">iterate_minibatches</span><span class="p">(</span><span class="n">X_train_np</span><span class="p">,</span> <span class="n">Y_train</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">):</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">xb</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion_np</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">yb</span><span class="p">)</span>
        <span class="n">grad_logits</span> <span class="o">=</span> <span class="n">criterion_np</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">custom_model</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_logits</span><span class="p">)</span>
        <span class="n">custom_model</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">learning_rate</span><span class="p">)</span>
        <span class="n">epoch_loss</span> <span class="o">+=</span> <span class="n">loss</span> <span class="o">*</span> <span class="n">xb</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">epoch_loss</span> <span class="o">/=</span> <span class="n">X_train_np</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">train_acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="p">(</span><span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">X_train_np</span><span class="p">),</span> <span class="n">y_train</span><span class="p">)</span>
    <span class="n">val_acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="p">(</span><span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">X_val_np</span><span class="p">),</span> <span class="n">y_val</span><span class="p">)</span>
    <span class="n">custom_history</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">epoch</span><span class="p">,</span> <span class="n">epoch_loss</span><span class="p">,</span> <span class="n">train_acc</span><span class="p">,</span> <span class="n">val_acc</span><span class="p">))</span>
    <span class="k">if</span> <span class="n">epoch</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">epoch</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="si">:</span><span class="mi">02</span><span class="n">d</span><span class="si">}</span><span class="s">: loss=</span><span class="si">{</span><span class="n">epoch_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> train_acc=</span><span class="si">{</span><span class="n">train_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> val_acc=</span><span class="si">{</span><span class="n">val_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="n">custom_val_acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="p">(</span><span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">X_val_np</span><span class="p">),</span> <span class="n">y_val</span><span class="p">)</span>
<span class="n">custom_test_acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="p">(</span><span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">X_test_np</span><span class="p">),</span> <span class="n">y_test</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Custom validation accuracy: </span><span class="si">{</span><span class="n">custom_val_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Custom test accuracy: </span><span class="si">{</span><span class="n">custom_test_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="training-custom-model">Training Custom Model</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="o">(</span>base<span class="o">)</span> ➜  draft python ml/mlp.py
Epoch 01: <span class="nv">loss</span><span class="o">=</span>0.8111 <span class="nv">train_acc</span><span class="o">=</span>0.4998 <span class="nv">val_acc</span><span class="o">=</span>0.5016
Epoch 10: <span class="nv">loss</span><span class="o">=</span>0.6645 <span class="nv">train_acc</span><span class="o">=</span>0.6025 <span class="nv">val_acc</span><span class="o">=</span>0.6011
Epoch 20: <span class="nv">loss</span><span class="o">=</span>0.5992 <span class="nv">train_acc</span><span class="o">=</span>0.6810 <span class="nv">val_acc</span><span class="o">=</span>0.6613
Epoch 30: <span class="nv">loss</span><span class="o">=</span>0.5376 <span class="nv">train_acc</span><span class="o">=</span>0.7455 <span class="nv">val_acc</span><span class="o">=</span>0.7129
Epoch 40: <span class="nv">loss</span><span class="o">=</span>0.4696 <span class="nv">train_acc</span><span class="o">=</span>0.7965 <span class="nv">val_acc</span><span class="o">=</span>0.7607
Epoch 50: <span class="nv">loss</span><span class="o">=</span>0.3977 <span class="nv">train_acc</span><span class="o">=</span>0.8435 <span class="nv">val_acc</span><span class="o">=</span>0.7991
Custom validation accuracy: 0.7991
Custom <span class="nb">test </span>accuracy: 0.7980
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="pytorch-version">PyTorch Version</h3>

<p>The PyTorch variant recreates the same architecture with <code class="language-plaintext highlighter-rouge">nn.Sequential</code>, letting autograd handle the gradient computation. Dataset splits are wrapped in <code class="language-plaintext highlighter-rouge">TensorDataset</code>/<code class="language-plaintext highlighter-rouge">DataLoader</code>, which provides shuffling and mini-batching for free, and the training loop follows the standard <code class="language-plaintext highlighter-rouge">optimizer.zero_grad() → loss.backward() → optimizer.step()</code> pattern. Reusing the preprocessing from the custom section ensures that any performance differences come from the model and optimization setup rather than from how the data was prepared.</p>
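<p>The three-call training step is easiest to see in isolation on a toy problem. The snippet below is a minimal sketch, not part of the original script: it fits a single <code class="language-plaintext highlighter-rouge">nn.Linear</code> layer to synthetic data purely to illustrate the <code class="language-plaintext highlighter-rouge">zero_grad → backward → step</code> cycle before the full implementation that follows.</p>

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression task (illustrative only): learn y = 2x with a single linear layer.
x = torch.randn(64, 1)
y = 2.0 * x

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = criterion(model(x), y).item()
for _ in range(100):
    optimizer.zero_grad()          # clear gradients accumulated in the previous step
    loss = criterion(model(x), y)  # forward pass
    loss.backward()                # backpropagate to fill param.grad
    optimizer.step()               # apply the SGD update
final_loss = criterion(model(x), y).item()

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

<p>The same skeleton, with a <code class="language-plaintext highlighter-rouge">DataLoader</code> supplying mini-batches instead of the full tensor, is what <code class="language-plaintext highlighter-rouge">train_model</code> below implements.</p>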

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">TensorDataset</span><span class="p">,</span> <span class="n">DataLoader</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>


<span class="c1"># Load data and create train/validation/test splits
</span><span class="n">data_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">.</span><span class="n">home</span><span class="p">()</span> <span class="o">/</span> <span class="s">"Code"</span> <span class="o">/</span> <span class="s">"data"</span>

<span class="n">train_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"train.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">)</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"test.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">)</span>

<span class="n">y_train_full</span> <span class="o">=</span> <span class="n">train_data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">X_train_full</span> <span class="o">=</span> <span class="n">train_data</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>

<span class="n">y_test</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>

<span class="n">train_cutoff</span> <span class="o">=</span> <span class="mi">4000</span>
<span class="n">X_train_raw</span> <span class="o">=</span> <span class="n">X_train_full</span><span class="p">[:</span><span class="n">train_cutoff</span><span class="p">]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">y_train_full</span><span class="p">[:</span><span class="n">train_cutoff</span><span class="p">]</span>
<span class="n">X_val_raw</span> <span class="o">=</span> <span class="n">X_train_full</span><span class="p">[</span><span class="n">train_cutoff</span><span class="p">:]</span>
<span class="n">y_val</span> <span class="o">=</span> <span class="n">y_train_full</span><span class="p">[</span><span class="n">train_cutoff</span><span class="p">:]</span>

<span class="n">mean</span> <span class="o">=</span> <span class="n">X_train_raw</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">std</span> <span class="o">=</span> <span class="n">X_train_raw</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">+</span> <span class="mf">1e-8</span>

<span class="n">X_train_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_train_raw</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>
<span class="n">X_val_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_val_raw</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>
<span class="n">X_test_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_test</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>

<span class="n">X_train_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X_train_std</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">y_train_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">long</span><span class="p">)</span>
<span class="n">X_val_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X_val_std</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">y_val_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y_val</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">long</span><span class="p">)</span>
<span class="n">X_test_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X_test_std</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">y_test_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">long</span><span class="p">)</span>

<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">64</span>
<span class="n">train_dataset</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X_train_tensor</span><span class="p">,</span> <span class="n">y_train_tensor</span><span class="p">)</span>
<span class="n">val_dataset</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X_val_tensor</span><span class="p">,</span> <span class="n">y_val_tensor</span><span class="p">)</span>
<span class="n">test_dataset</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X_test_tensor</span><span class="p">,</span> <span class="n">y_test_tensor</span><span class="p">)</span>

<span class="n">train_loader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">val_loader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">val_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">)</span>
<span class="n">test_loader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">test_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">)</span>

<span class="n">input_dim</span> <span class="o">=</span> <span class="n">X_train_tensor</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">num_classes</span> <span class="o">=</span> <span class="mi">2</span>

<span class="c1"># Define PyTorch MLP and training utilities
</span>
<span class="k">class</span> <span class="nc">TorchMLP</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">net</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">)</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">net</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">evaluate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">data_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">):</span>
    <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
    <span class="n">total_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
    <span class="n">correct</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">xb</span><span class="p">,</span> <span class="n">yb</span> <span class="ow">in</span> <span class="n">data_loader</span><span class="p">:</span>
            <span class="n">xb</span> <span class="o">=</span> <span class="n">xb</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">yb</span> <span class="o">=</span> <span class="n">yb</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">xb</span><span class="p">)</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">yb</span><span class="p">)</span>
            <span class="n">total_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">xb</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
            <span class="n">preds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">correct</span> <span class="o">+=</span> <span class="p">(</span><span class="n">preds</span> <span class="o">==</span> <span class="n">yb</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>
            <span class="n">total</span> <span class="o">+=</span> <span class="n">xb</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">total_loss</span> <span class="o">/</span> <span class="n">total</span><span class="p">,</span> <span class="n">correct</span> <span class="o">/</span> <span class="n">total</span>


<span class="k">def</span> <span class="nf">train_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">train_loader</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">device</span><span class="p">):</span>
    <span class="n">history</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">epochs</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
        <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
        <span class="n">epoch_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">correct</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">for</span> <span class="n">xb</span><span class="p">,</span> <span class="n">yb</span> <span class="ow">in</span> <span class="n">train_loader</span><span class="p">:</span>
            <span class="n">xb</span> <span class="o">=</span> <span class="n">xb</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">yb</span> <span class="o">=</span> <span class="n">yb</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">xb</span><span class="p">)</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">yb</span><span class="p">)</span>
            <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
            <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
            <span class="n">epoch_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">xb</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
            <span class="n">preds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">correct</span> <span class="o">+=</span> <span class="p">(</span><span class="n">preds</span> <span class="o">==</span> <span class="n">yb</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>
            <span class="n">total</span> <span class="o">+=</span> <span class="n">xb</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">train_loss</span> <span class="o">=</span> <span class="n">epoch_loss</span> <span class="o">/</span> <span class="n">total</span>
        <span class="n">train_acc</span> <span class="o">=</span> <span class="n">correct</span> <span class="o">/</span> <span class="n">total</span>
        <span class="n">val_loss</span><span class="p">,</span> <span class="n">val_acc</span> <span class="o">=</span> <span class="n">evaluate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
        <span class="n">history</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">epoch</span><span class="p">,</span> <span class="n">train_loss</span><span class="p">,</span> <span class="n">train_acc</span><span class="p">,</span> <span class="n">val_loss</span><span class="p">,</span> <span class="n">val_acc</span><span class="p">))</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="si">:</span><span class="mi">02</span><span class="n">d</span><span class="si">}</span><span class="s">: train_loss=</span><span class="si">{</span><span class="n">train_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> train_acc=</span><span class="si">{</span><span class="n">train_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> "</span>
              <span class="sa">f</span><span class="s">"val_loss=</span><span class="si">{</span><span class="n">val_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> val_acc=</span><span class="si">{</span><span class="n">val_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">history</span>

<span class="c1"># Train the PyTorch model with the same hyperparameters as the custom implementation
</span>

<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">'cuda'</span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s">'cpu'</span><span class="p">)</span>

<span class="n">hidden_units</span> <span class="o">=</span> <span class="mi">64</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="n">epochs</span> <span class="o">=</span> <span class="mi">50</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">TorchMLP</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_units</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>
<span class="n">criterion</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">)</span>

<span class="n">pytorch_history</span> <span class="o">=</span> <span class="n">train_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">train_loader</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>

<span class="n">pytorch_val_loss</span><span class="p">,</span> <span class="n">pytorch_val_acc</span> <span class="o">=</span> <span class="n">evaluate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
<span class="n">pytorch_test_loss</span><span class="p">,</span> <span class="n">pytorch_test_acc</span> <span class="o">=</span> <span class="n">evaluate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">test_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"PyTorch validation accuracy: </span><span class="si">{</span><span class="n">pytorch_val_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, loss: </span><span class="si">{</span><span class="n">pytorch_val_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"PyTorch test accuracy: </span><span class="si">{</span><span class="n">pytorch_test_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, loss: </span><span class="si">{</span><span class="n">pytorch_test_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="training-pytorch-model">Training PyTorch Model</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
</pre></td><td class="rouge-code"><pre>(base) ➜  draft python ml/mlp_torch.py 
Epoch 01: train_loss=0.6716 train_acc=0.6168 val_loss=0.6073 val_acc=0.7809
Epoch 02: train_loss=0.3701 train_acc=0.8952 val_loss=0.1712 val_acc=0.9540
Epoch 03: train_loss=0.1077 train_acc=0.9695 val_loss=0.1098 val_acc=0.9631
Epoch 04: train_loss=0.0564 train_acc=0.9872 val_loss=0.1032 val_acc=0.9667
Epoch 05: train_loss=0.0335 train_acc=0.9942 val_loss=0.0987 val_acc=0.9700
Epoch 06: train_loss=0.0208 train_acc=0.9978 val_loss=0.0992 val_acc=0.9680
Epoch 07: train_loss=0.0132 train_acc=0.9982 val_loss=0.1018 val_acc=0.9684
Epoch 08: train_loss=0.0079 train_acc=0.9990 val_loss=0.1039 val_acc=0.9682
Epoch 09: train_loss=0.0049 train_acc=1.0000 val_loss=0.1036 val_acc=0.9709
Epoch 10: train_loss=0.0037 train_acc=1.0000 val_loss=0.1046 val_acc=0.9709
Epoch 11: train_loss=0.0029 train_acc=1.0000 val_loss=0.1052 val_acc=0.9709
Epoch 12: train_loss=0.0024 train_acc=1.0000 val_loss=0.1059 val_acc=0.9718
Epoch 13: train_loss=0.0020 train_acc=1.0000 val_loss=0.1067 val_acc=0.9718
Epoch 14: train_loss=0.0017 train_acc=1.0000 val_loss=0.1073 val_acc=0.9722
Epoch 15: train_loss=0.0015 train_acc=1.0000 val_loss=0.1080 val_acc=0.9727
Epoch 16: train_loss=0.0013 train_acc=1.0000 val_loss=0.1087 val_acc=0.9727
Epoch 17: train_loss=0.0012 train_acc=1.0000 val_loss=0.1094 val_acc=0.9731
Epoch 18: train_loss=0.0011 train_acc=1.0000 val_loss=0.1100 val_acc=0.9731
Epoch 19: train_loss=0.0010 train_acc=1.0000 val_loss=0.1105 val_acc=0.9731
Epoch 20: train_loss=0.0009 train_acc=1.0000 val_loss=0.1111 val_acc=0.9731
Epoch 21: train_loss=0.0008 train_acc=1.0000 val_loss=0.1117 val_acc=0.9731
Epoch 22: train_loss=0.0008 train_acc=1.0000 val_loss=0.1122 val_acc=0.9731
Epoch 23: train_loss=0.0007 train_acc=1.0000 val_loss=0.1127 val_acc=0.9731
Epoch 24: train_loss=0.0007 train_acc=1.0000 val_loss=0.1131 val_acc=0.9731
Epoch 25: train_loss=0.0006 train_acc=1.0000 val_loss=0.1136 val_acc=0.9733
Epoch 26: train_loss=0.0006 train_acc=1.0000 val_loss=0.1141 val_acc=0.9731
Epoch 27: train_loss=0.0006 train_acc=1.0000 val_loss=0.1145 val_acc=0.9736
Epoch 28: train_loss=0.0005 train_acc=1.0000 val_loss=0.1149 val_acc=0.9733
Epoch 29: train_loss=0.0005 train_acc=1.0000 val_loss=0.1152 val_acc=0.9733
Epoch 30: train_loss=0.0005 train_acc=1.0000 val_loss=0.1156 val_acc=0.9733
Epoch 31: train_loss=0.0005 train_acc=1.0000 val_loss=0.1160 val_acc=0.9731
Epoch 32: train_loss=0.0004 train_acc=1.0000 val_loss=0.1163 val_acc=0.9733
Epoch 33: train_loss=0.0004 train_acc=1.0000 val_loss=0.1167 val_acc=0.9731
Epoch 34: train_loss=0.0004 train_acc=1.0000 val_loss=0.1170 val_acc=0.9733
Epoch 35: train_loss=0.0004 train_acc=1.0000 val_loss=0.1173 val_acc=0.9731
Epoch 36: train_loss=0.0004 train_acc=1.0000 val_loss=0.1176 val_acc=0.9733
Epoch 37: train_loss=0.0004 train_acc=1.0000 val_loss=0.1179 val_acc=0.9733
Epoch 38: train_loss=0.0003 train_acc=1.0000 val_loss=0.1182 val_acc=0.9733
Epoch 39: train_loss=0.0003 train_acc=1.0000 val_loss=0.1185 val_acc=0.9736
Epoch 40: train_loss=0.0003 train_acc=1.0000 val_loss=0.1188 val_acc=0.9736
Epoch 41: train_loss=0.0003 train_acc=1.0000 val_loss=0.1191 val_acc=0.9736
Epoch 42: train_loss=0.0003 train_acc=1.0000 val_loss=0.1193 val_acc=0.9736
Epoch 43: train_loss=0.0003 train_acc=1.0000 val_loss=0.1196 val_acc=0.9736
Epoch 44: train_loss=0.0003 train_acc=1.0000 val_loss=0.1198 val_acc=0.9736
Epoch 45: train_loss=0.0003 train_acc=1.0000 val_loss=0.1201 val_acc=0.9736
Epoch 46: train_loss=0.0003 train_acc=1.0000 val_loss=0.1203 val_acc=0.9736
Epoch 47: train_loss=0.0003 train_acc=1.0000 val_loss=0.1205 val_acc=0.9736
Epoch 48: train_loss=0.0003 train_acc=1.0000 val_loss=0.1208 val_acc=0.9736
Epoch 49: train_loss=0.0002 train_acc=1.0000 val_loss=0.1210 val_acc=0.9736
Epoch 50: train_loss=0.0002 train_acc=1.0000 val_loss=0.1212 val_acc=0.9736
PyTorch validation accuracy: 0.9736, loss: 0.1212
PyTorch test accuracy: 0.9707, loss: 0.1009
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="summary">Summary</h4>

<p>Side-by-side results highlight how much leverage a mature framework provides: PyTorch removes most manual bookkeeping around gradient calculation, parameter updates, device placement, and batching. The NumPy baseline remains valuable for building intuition about tensor shapes, gradient flow, and training dynamics, but its results should be checked carefully because small scaling mistakes can change the effective learning rate by orders of magnitude.</p>]]></content><author><name></name></author><category term="Engineering" /><category term="machine learning" /><category term="neural network" /><category term="mlp" /><summary type="html"><![CDATA[Key Concepts Hidden Layers An MLP contains one or more hidden layers between the input and output. Each hidden layer applies a linear transformation followed by a non-linear activation function: \(H_\ell = f_\ell(H_{\ell-1} W_\ell + \mathbf{1} b_\ell^\top)\) where: $H_{\ell-1}$ is the previous layer’s output (with $H_0 = X$). $W_\ell, b_\ell$ are the weight matrix and bias vector for layer $\ell$. $f_\ell(\cdot)$ is a non-linear activation (e.g., ReLU, tanh, sigmoid).]]></summary></entry><entry><title type="html">Understanding Single-Layer Neural Networks</title><link href="https://guozijn.github.io/engineering/2025/09/26/single-layer-neural-network.html" rel="alternate" type="text/html" title="Understanding Single-Layer Neural Networks" /><published>2025-09-26T00:00:00+09:30</published><updated>2025-09-26T00:00:00+09:30</updated><id>https://guozijn.github.io/engineering/2025/09/26/single-layer-neural-network</id><content type="html" xml:base="https://guozijn.github.io/engineering/2025/09/26/single-layer-neural-network.html"><![CDATA[<h2 id="key-concepts">Key Concepts</h2>
<ul>
  <li>
    <p><strong>Neuron (Perceptron)</strong><br />
A neuron is the fundamental unit of the network. Each neuron computes a weighted sum of inputs plus a bias, then applies an activation function (e.g. step, sigmoid).</p>
  </li>
  <li>
    <p><strong>Input Layer</strong><br />
The input layer accepts the raw data features. Each input is multiplied by an associated weight and passed to the neuron. It is often represented by a vector $\mathbf{x} \in \mathbb{R}^n$.</p>
  </li>
  <li><strong>Weights and Bias</strong>
    <ul>
      <li>Weights represent the importance of each input feature. They are often represented by a vector $\mathbf{w} \in \mathbb{R}^n$.</li>
      <li>Bias allows shifting the decision boundary away from the origin.</li>
    </ul>
  </li>
  <li>
    <p><strong>Linear Combination</strong><br />
The neuron computes:</p>

\[z = \mathbf{w}^\top \mathbf{x} + b\]

    <p>Where $\mathbf{w}$ = weight, $\mathbf{x}$ = input, $b$ = bias</p>
  </li>
  <li>
    <p><strong>Activation Function</strong><br />
The activation function introduces non-linearity (in the classic perceptron, typically a step function) and determines the output class or value. Without it, the network is just a linear model.</p>
  </li>
  <li>
    <p><strong>Output Layer</strong><br />
It provides the final prediction. In a single-layer network, there is only one set of weights between the input and output (no hidden layer).</p>
  </li>
  <li>
    <p><strong>Decision Boundary</strong><br />
The hyperplane separating classes in the input space. In a single-layer network, this boundary is always linear.</p>
  </li>
  <li>
    <p><strong>Ground Truth</strong><br />
The true label of the data point, denoted as $y_{\text{true}} \in {0,1}$. It represents the actual class assigned in the dataset.</p>
  </li>
  <li>
    <p><strong>Prediction</strong><br />
The model produces $\hat{y}$, derived from the probability $p$. Typically, $\hat{y} = 1$ if $p \geq \tau$, else $\hat{y} = 0$.</p>
  </li>
  <li>
    <p><strong>Loss Function (Cross-Entropy)</strong><br />
To train the model, the predicted probability $p$ is compared with the ground truth $y_{\text{true}}$ using the binary cross-entropy loss:</p>

\[\mathcal{L}(y_{\text{true}}, p) = - \big[ y_{\text{true}} \cdot \log(p) + (1 - y_{\text{true}}) \cdot \log(1 - p) \big]\]

    <p>This loss penalises large differences between the predicted probability and the actual label. Minimising $\mathcal{L}$ adjusts the weights $w$ and bias $b$ to improve classification performance.</p>
  </li>
  <li>
    <p><strong>Backpropagation</strong><br />
Gradients of the loss are propagated back to update the weights and bias.</p>
  </li>
  <li>
    <p><strong>Gradient Descent</strong><br />
The parameters are updated iteratively to minimise the loss:</p>

\[\mathbf{w} := \mathbf{w} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{w}}, \quad b := b - \eta \frac{\partial \mathcal{L}}{\partial b}\]

    <p>where $\eta &gt; 0$ is the learning rate controlling the step size.</p>
  </li>
</ul>
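<p>Putting the concepts above together, the whole training procedure fits in a few lines of NumPy. This is an illustrative sketch rather than code from the post; the toy dataset, learning rate, and epoch count are invented for the example.</p>

```python
import numpy as np

# Toy linearly separable data: label is 1 when x1 + x2 > 1
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

w = np.zeros(2)  # weights
b = 0.0          # bias
eta = 0.5        # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(500):
    z = X @ w + b            # linear combination z = w^T x + b
    p = sigmoid(z)           # predicted probability p = sigma(z)
    # Gradients of the mean binary cross-entropy loss w.r.t. w and b
    grad_w = X.T @ (p - y) / len(y)
    grad_b = float(np.mean(p - y))
    w -= eta * grad_w        # gradient descent updates
    b -= eta * grad_b

y_hat = (sigmoid(X @ w + b) >= 0.5).astype(float)  # threshold tau = 0.5
accuracy = float(np.mean(y_hat == y))
print(f"training accuracy: {accuracy:.3f}")
```

<p>Because the data are linearly separable, the learned boundary $\mathbf{w}^\top \mathbf{x} + b = 0$ approaches the true separating line and the training accuracy climbs toward 1.</p>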

<h2 id="architecture-of-a-single-layer-neural-network">Architecture of a Single-Layer Neural Network</h2>
<p><img src="https://images.zjguo.com/single_layer_nn_matrix.png" alt="single_layer_nn_matrix.png" /></p>

<p>This diagram illustrates a <strong>single-layer neural network for binary classification</strong>. The input is represented as a feature vector $\mathbf{x} \in \mathbb{R}^n$, and the parameters of the model are a weight vector $\mathbf{w} \in \mathbb{R}^n$ and a bias term $b \in \mathbb{R}$. The linear combination is computed as $z = \mathbf{w}^\top \mathbf{x} + b$. This value is then passed through a sigmoid activation function $\sigma(z) = \frac{1}{1+e^{-z}}$, which outputs a probability $p = \Pr(y=1 \mid \mathbf{x}) = \sigma(z)$, representing the likelihood that the class label $y$ equals 1. Finally, the probability is compared to a threshold $\tau$ (e.g., 0.5) to produce the predicted class label $y \in \{0,1\}$. The decision boundary of this model is defined by $\mathbf{w}^\top \mathbf{x} + b = 0$.</p>]]></content><author><name></name></author><category term="Engineering" /><category term="machine learning" /><category term="neural network" /><summary type="html"><![CDATA[Key Concepts Neuron (Perceptron) A neuron is the fundamental unit of the network. Each neuron computes a weighted sum of inputs plus a bias, then applies an activation function (e.g. step, sigmoid).]]></summary></entry><entry><title type="html">A Low-Cost Guide to Building an Apple HomeKit Smart Home</title><link href="https://guozijn.github.io/journal/2024/08/29/smart-home-integration.html" rel="alternate" type="text/html" title="A Low-Cost Guide to Building an Apple HomeKit Smart Home" /><published>2024-08-29T00:00:00+09:30</published><updated>2024-08-29T00:00:00+09:30</updated><id>https://guozijn.github.io/journal/2024/08/29/smart-home-integration</id><content type="html" xml:base="https://guozijn.github.io/journal/2024/08/29/smart-home-integration.html"><![CDATA[<blockquote>
  <p>This post documents the full process of how I built an <strong>Apple HomeKit smart home system</strong> on a budget, from choosing hardware to configuring software, through to the final day-to-day experience. I hope it offers a useful reference for anyone thinking of diving in.</p>
</blockquote>

<h2 id="一起因">1. How It Started</h2>

<p>The ceiling light in my bedroom broke, and I figured I would first swap in a new LED module to see whether the module was the problem, so I spent about ¥40 on one that supports Mi Home WiFi pairing.<br />
Only after installing it did I realise I had bought a 48W light strip. Its brightness was mediocre, dimmer even than a desk lamp, but since it lit up, it would do for now.</p>

<p>That sparked an idea: since I was tinkering anyway, why not give all the lights and appliances in the house a smart makeover? I happened to have an idle <strong>HomePod mini</strong>, so I decided to build the experiment around HomeKit.</p>

<h2 id="二思路与方案">2. Approach and Plan</h2>

<h3 id="1-跨生态问题">1. The Cross-Ecosystem Problem</h3>

<p>Apple HomeKit is not compatible with the protocols of ecosystems such as Mi Home and Tuya.<br />
The solution is <strong>Home Assistant (HA)</strong>, an open-source smart home gateway that bridges different protocols and presents Mi Home, Tuya, and other devices to HomeKit as native accessories, so HomeKit can manage them all in one place.</p>

<h3 id="2-硬件准备">2. Hardware</h3>

<ul>
  <li><strong>Runtime environment</strong>: A Raspberry Pi was too expensive, so I bought a second-hand “智趣盒子” box on the used market (¥75, 2 GB RAM + 16 GB ROM) and flashed it to run HA.</li>
  <li><strong>Network setup</strong>: The main router sits in the living room and the bedroom signal was weak, so I attached a bridged router to the bedroom Ethernet port and plugged the box into it. My phone can now roam across the whole home without noticing WiFi handoffs.</li>
  <li><strong>Updates and debugging</strong>: After flashing, I opened <code class="language-plaintext highlighter-rouge">http://192.168.31.31:8123/</code> and found that HA could not update itself automatically. Logging in over SSH showed it was a Docker deployment, so I pulled the latest image manually and restarted the container, which fixed the problem.</li>
</ul>

<p><strong>Home Assistant Docker Services</strong></p>

<p><img src="https://images.zjguo.com/sfbox01.png" alt="Home Assistant Docker services" width="90%" /></p>

<h2 id="三设备接入">3. Connecting Devices</h2>

<h3 id="米家设备">Mi Home Devices</h3>

<p>Onboarding process:</p>

<ol>
  <li>
    <p>Confirm that you have a <strong>home hub</strong> (HomePod / iPad); one is required for remote control.</p>
  </li>
  <li>
    <p>Install the <strong>Xiaomi Miot Auto</strong> integration from HACS.</p>

    <p><img src="https://images.zjguo.com/ha03.png" alt="Xiaomi Miot Auto" width="90%" /></p>
  </li>
  <li>
    <p>Add your Mi Home account and import your devices.</p>

    <p><img src="https://images.zjguo.com/ha05.png" alt="Device and Service" width="90%" />
 <img src="https://images.zjguo.com/miiot01.png" alt="Miiot" width="90%" /></p>
  </li>
  <li>
    <p>Create a <strong>HomeKit Bridge</strong> and bridge the devices into HomeKit.</p>
    <ul>
      <li>Note: a single bridge can expose only one air conditioner, so I put the bedroom air conditioner on its own bridge.</li>
    </ul>

    <p><img src="https://images.zjguo.com/hb01.png" alt="hb01" width="90%" />
 <img src="https://images.zjguo.com/hb02.png" alt="hb02" width="90%" /></p>
  </li>
</ol>

<p>In the end, the Mi Home devices show up directly in the iOS Home app and can be controlled with buttons or Siri.</p>

<p><img src="https://images.zjguo.com/homeapp.jpeg" alt="homeapp" width="30%" /></p>

<h3 id="涂鸦设备">Tuya Devices</h3>

<p>These mainly cover appliances controlled over infrared / RF, such as the TV and the motorised laundry rack.</p>

<p>Steps:</p>

<ol>
  <li>
    <p>Create a cloud project on the <strong>Tuya developer platform</strong> and configure its permissions.</p>

    <p><img src="https://images.zjguo.com/ty01.png" alt="ty01" width="90%" />
 <img src="https://images.zjguo.com/ty03.png" alt="ty03" width="90%" /></p>
  </li>
  <li>
    <p>Pair the devices and learn the remote-control signals in the Tuya or Smart Life app.</p>

    <p><img src="https://images.zjguo.com/tyapp01.png" alt="tyapp01" width="30%" />
 <img src="https://images.zjguo.com/tyapp02.png" alt="tyapp02" width="30%" /></p>
  </li>
  <li>
    <p>Enter the credentials into HA, then add the corresponding switches to the HomeKit Bridge; the devices will then appear in HomeKit.</p>

    <p><img src="https://images.zjguo.com/ty04.png" alt="ty04" width="90%" /></p>
  </li>
</ol>

<h3 id="其他设备接入方案">Other Device Integration Options</h3>

<ul>
  <li><strong>Lights</strong>:
    <ul>
      <li>Light strip control: colour temperature is adjustable, but settings are lost after a power cut.</li>
      <li>Smart switches: usable both manually and via automation; the better choice.</li>
    </ul>
  </li>
  <li><strong>Air conditioners</strong>:
    <ul>
      <li>IoT-capable units can be connected directly.</li>
      <li>Ordinary units need an <strong>AC companion</strong> (e.g. the Gosund 电小酷).</li>
    </ul>
  </li>
  <li><strong>TV / laundry rack</strong>:
    <ul>
      <li>Use a <strong>universal remote</strong> with IR/RF support (Tuya platform).</li>
    </ul>
  </li>
  <li><strong>Power sockets</strong>:
    <ul>
      <li>A smart plug is enough for on/off power control.</li>
    </ul>
  </li>
  <li><strong>Doorbell</strong>:
    <ul>
      <li>Go high-end with the <code class="language-plaintext highlighter-rouge">Aqara G4</code>, or for value pick the <strong>Xiaomi Smart Doorbell 3</strong>.</li>
    </ul>
  </li>
  <li><strong>Curtains</strong>:
    <ul>
      <li>Options are a track motor or a curtain companion (underwhelming in practice; I returned mine).</li>
    </ul>
  </li>
</ul>

<h2 id="四成本清单">4. Cost Breakdown</h2>

<table>
  <thead>
    <tr>
      <th>Item</th>
      <th>Cost (¥)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Second-hand HomePod mini</td>
      <td>599</td>
    </tr>
    <tr>
      <td>智趣盒子 box</td>
      <td>75</td>
    </tr>
    <tr>
      <td>HomeKit 2-gang switch</td>
      <td>45</td>
    </tr>
    <tr>
      <td>Tuya universal remote</td>
      <td>152</td>
    </tr>
    <tr>
      <td>72W HomeKit light strip</td>
      <td>85</td>
    </tr>
    <tr>
      <td>Mi Home AC companion</td>
      <td>59</td>
    </tr>
    <tr>
      <td>Smart plug</td>
      <td>30</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>1045</strong></td>
    </tr>
  </tbody>
</table>

<h2 id="五未来扩展">5. Future Extensions</h2>

<p>The system already covers most day-to-day needs. Possible next steps:</p>

<ul>
  <li>Smart door lock</li>
  <li>Temperature and humidity sensors</li>
  <li>Robot vacuum, air quality sensors</li>
  <li>Automation scenes (e.g. an “arriving home” mode, temperature/humidity triggers)</li>
</ul>

<p>With HA's integrations and automation features, home control within the Apple ecosystem is already flexible and powerful enough.</p>

<h2 id="总结">Conclusion</h2>

<p>For a modest <strong>¥1045</strong>, I successfully integrated Mi Home, Tuya, and other devices into Apple HomeKit under unified control. The key to the whole process was <strong>Home Assistant acting as the bridge</strong>.</p>

<p>If you want to get into this too, start with one or two devices and expand step by step.<br />
Feel free to share your own smart home setups in the comments. 🚀</p>]]></content><author><name></name></author><category term="Journal" /><category term="homekit" /><category term="home assistant" /><category term="smart home" /><summary type="html"><![CDATA[This post documents the full process of how I built an Apple HomeKit smart home system on a budget, from choosing hardware to configuring software, through to the final day-to-day experience. I hope it offers a useful reference for anyone thinking of diving in.]]></summary></entry><entry><title type="html">Ansible Csv Vars Plugin</title><link href="https://guozijn.github.io/engineering/2020/05/07/ansible-csv-vars-plugin.html" rel="alternate" type="text/html" title="Ansible Csv Vars Plugin" /><published>2020-05-07T00:00:00+09:30</published><updated>2020-05-07T00:00:00+09:30</updated><id>https://guozijn.github.io/engineering/2020/05/07/ansible-csv-vars-plugin</id><content type="html" xml:base="https://guozijn.github.io/engineering/2020/05/07/ansible-csv-vars-plugin.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>When rolling out network configs I kept address assignments in spreadsheets because they are easy to edit. Translating those tables into Ansible host vars manually grew painful, so this plugin lets each CSV row feed host-specific variables automatically. During template rendering every device pulls its unique addresses from the sheet, keeping the workflow spreadsheet-friendly while delivering correct per-host values in playbooks.</p>
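<p>To make the end state concrete, here is a plain-Python stand-in for the template rendering described above: one parsed CSV row supplies per-host values to a config template. The template text and variable names are purely illustrative, and Ansible itself would use Jinja2 placeholders rather than <code class="language-plaintext highlighter-rouge">str.format</code>.</p>

```python
# One CSV row parsed into host variables, keyed by column header
host_vars = {
    "gateway": "192.168.30.254",
    "vlan30": "192.168.30.31",
    "vlan40": "192.168.40.31",
}

# Hypothetical switch-config template; in Ansible this would be a
# Jinja2 template with {{ gateway }}-style placeholders
template = (
    "ip default-gateway {gateway}\n"
    "interface Vlan30\n"
    " ip address {vlan30} 255.255.255.0\n"
    "interface Vlan40\n"
    " ip address {vlan40} 255.255.255.0\n"
)

rendered = template.format(**host_vars)
print(rendered)
```

<p>Each device renders the same template with its own row's values, which is exactly what the vars plugin automates.</p>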

<h2 id="requirements">Requirements</h2>

<p>Parse data from CSV into host variables.</p>

<p>CSV file example:</p>

<pre><code class="language-csv">hostname,gateway,vlan30,vlan40,vlan50,vlan60,vlan70
localhost,192.168.30.254,192.168.30.31,192.168.40.31,192.168.50.31,192.168.60.31,192.168.70.31
</code></pre>

<ul>
  <li>Place the CSV file in the <code class="language-plaintext highlighter-rouge">csv_vars</code> directory under <code class="language-plaintext highlighter-rouge">inventory</code> or <code class="language-plaintext highlighter-rouge">playbook</code>, and it will be automatically parsed.</li>
  <li>The <code class="language-plaintext highlighter-rouge">hostname</code> field in the CSV must match the host name in the inventory.</li>
  <li>It is recommended to name each CSV file GROUP_OR_HOST_NAME.csv. Multiple CSV files are supported, with variable overriding across files.</li>
</ul>

<h2 id="write-the-plugin">Write the Plugin</h2>

<p>File path: <code class="language-plaintext highlighter-rouge">vars_plugins/csv_vars.py</code></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="p">...</span><span class="n">existing</span> <span class="n">code</span><span class="p">...</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<blockquote>
  <p>Code from https://github.com/guozijn/csv_vars</p>
</blockquote>
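<p>The plugin body is elided above; see the linked repository for the real source. Its core idea, stripped of Ansible's plugin machinery, is easy to sketch: scan every CSV file in the <code class="language-plaintext highlighter-rouge">csv_vars</code> directory and collect the row whose <code class="language-plaintext highlighter-rouge">hostname</code> column matches the current host. The helper below is an illustrative reimplementation of that idea, not the repository's code, and its sorted-file override order is an assumption made for the example.</p>

```python
import csv
from pathlib import Path

def load_csv_host_vars(csv_dir, hostname):
    """Merge variables for `hostname` from every CSV file under `csv_dir`.

    Files are read in sorted name order, so rows in later files can
    override variables set by earlier ones (an assumed ordering).
    """
    merged = {}
    for csv_file in sorted(Path(csv_dir).glob("*.csv")):
        with csv_file.open(newline="") as fh:
            for row in csv.DictReader(fh):
                if row.get("hostname") == hostname:
                    # Every column except `hostname` becomes a host variable
                    merged.update({k: v for k, v in row.items() if k != "hostname"})
    return merged
```

<p>In a real vars plugin, the <code class="language-plaintext highlighter-rouge">get_vars()</code> entry point would call a helper like this for each inventory host and return the resulting dictionary to Ansible.</p>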

<h2 id="csv-file">CSV File</h2>

<p>Place the CSV file in the <code class="language-plaintext highlighter-rouge">csv_vars</code> directory under inventory or playbook.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>mkdir csv_vars
<span class="nb">cat</span> <span class="o">&lt;&lt;</span> <span class="no">EOF</span><span class="sh"> &gt;&gt; csv_vars/nodes.csv
hostname,gateway,vlan30,vlan40,vlan50,vlan60,vlan70
192.168.77.130,192.168.30.254,192.168.30.31,192.168.40.31,192.168.50.31,192.168.60.31,192.168.70.31
</span><span class="no">EOF
</span></pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="run-playbook">Run Playbook</h2>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="c1"># cat test_vars.yml</span>

<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">192.168.77.130</span>
  <span class="na">gather_facts</span><span class="pi">:</span> <span class="s">no</span>
  <span class="na">tasks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">debug</span><span class="pi">:</span>
        <span class="na">msg</span><span class="pi">:</span> <span class="s2">"</span><span class="s">{{</span><span class="nv"> </span><span class="s">lookup('vars',</span><span class="nv"> </span><span class="s">item)</span><span class="nv"> </span><span class="s">}}"</span>
      <span class="na">loop</span><span class="pi">:</span> <span class="s2">"</span><span class="s">{{</span><span class="nv"> </span><span class="s">hostvars[inventory_hostname].keys()</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">select('match',</span><span class="nv"> </span><span class="s">'^vlan.*$|gateway')</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">list</span><span class="nv"> </span><span class="s">}}"</span>

</pre></td></tr></tbody></table></code></pre></div></div>

<p>Execution result:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
</pre></td><td class="rouge-code"><pre><span class="c"># ansible-playbook test_vars.yml</span>

PLAY <span class="o">[</span>192.168.77.130] <span class="k">************************************************************************************************</span>

TASK <span class="o">[</span>debug] <span class="k">*********************************************************************************************************</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>gateway<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.30.254"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan60<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.60.31"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan30<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.30.31"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan40<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.40.31"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan70<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.70.31"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan50<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.50.31"</span>
<span class="o">}</span>

PLAY RECAP <span class="k">***********************************************************************************************************</span>
192.168.77.130             : <span class="nv">ok</span><span class="o">=</span>1    <span class="nv">changed</span><span class="o">=</span>0    <span class="nv">unreachable</span><span class="o">=</span>0    <span class="nv">failed</span><span class="o">=</span>0    <span class="nv">skipped</span><span class="o">=</span>0    <span class="nv">rescued</span><span class="o">=</span>0    <span class="nv">ignored</span><span class="o">=</span>0   

</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="run-ad-hoc">Run ad-hoc</h2>

<p>Configure <code class="language-plaintext highlighter-rouge">ansible.cfg</code> to set the custom plugin directory:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nn">[defaults]</span>
<span class="py">vars_plugins</span>       <span class="p">=</span> <span class="s">/etc/ansible/vars_plugins</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="c"># ansible 192.168.77.130 -m debug -a 'var=gateway'</span>
192.168.77.130 | SUCCESS <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"gateway"</span>: <span class="s2">"192.168.30.254"</span>
<span class="o">}</span>

<span class="c"># ansible 192.168.77.130 -m debug -a 'var=vlan30'</span>
192.168.77.130 | SUCCESS <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"vlan30"</span>: <span class="s2">"192.168.30.31"</span>
<span class="o">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>]]></content><author><name></name></author><category term="Engineering" /><category term="ansible" /><category term="devops" /><summary type="html"><![CDATA[Introduction When rolling out network configs I kept address assignments in spreadsheets because they are easy to edit. Translating those tables into Ansible host vars manually grew painful, so this plugin lets each CSV row feed host-specific variables automatically. During template rendering every device pulls its unique addresses from the sheet, keeping the workflow spreadsheet-friendly while delivering correct per-host values in playbooks.]]></summary></entry></feed>