<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://guozijn.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://guozijn.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-04-15T08:41:43+09:30</updated><id>https://guozijn.github.io/feed.xml</id><title type="html">Tinkerer</title><subtitle></subtitle><entry><title type="html">Transformer</title><link href="https://guozijn.github.io/engineering/2025/10/15/transformer.html" rel="alternate" type="text/html" title="Transformer" /><published>2025-10-15T00:00:00+10:30</published><updated>2025-10-15T00:00:00+10:30</updated><id>https://guozijn.github.io/engineering/2025/10/15/transformer</id><content type="html" xml:base="https://guozijn.github.io/engineering/2025/10/15/transformer.html"><![CDATA[<h2 id="transformer-core-concepts">Transformer Core Concepts</h2>

<h3 id="from-tokens-to-embeddings">From Tokens to Embeddings</h3>

<p>Raw tokens are first mapped to dense vectors through an embedding matrix so that the model can work in a continuous space. The embedding size (<code class="language-plaintext highlighter-rouge">n_embd</code>) defines the dimensionality of this space and controls both the model capacity and its memory footprint.</p>
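<p>As a minimal sketch of this lookup (the sizes here are illustrative, not taken from the post), the embedding table is a single <code class="language-plaintext highlighter-rouge">nn.Embedding</code> call:</p>

```python
import torch
import torch.nn as nn

# Toy sizes chosen for illustration only.
vocab_size, n_embd = 100, 192

embed = nn.Embedding(vocab_size, n_embd)  # one learnable row per token id
tokens = torch.tensor([[5, 42, 7]])       # batch of one 3-token sequence
vectors = embed(tokens)                   # look up each id's dense vector
print(vectors.shape)                      # torch.Size([1, 3, 192])
```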

<h3 id="positional-information">Positional Information</h3>

<p>Because self-attention without positional information is permutation-equivariant, Transformers inject order information with positional encodings. Classical sinusoidal encodings can be evaluated at positions beyond the training length, while learnable embeddings let the model adapt positions during training but are fixed to the learned context range. Modern variants often rely on relative position encodings or rotary embeddings to better capture long-context interactions.</p>
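<p>For concreteness, the sinusoidal variant can be sketched as follows; this is the standard construction, not code from this post:</p>

```python
import math
import torch

def sinusoidal_encoding(max_len: int, n_embd: int) -> torch.Tensor:
    """Fixed sin/cos encodings; each dimension oscillates at its own frequency."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # Frequencies decay geometrically across the even dimensions.
    div = torch.exp(-math.log(10000.0) * torch.arange(0, n_embd, 2).float() / n_embd)
    pe = torch.zeros(max_len, n_embd)
    pe[:, 0::2] = torch.sin(pos * div)  # even dims: sine
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims: cosine
    return pe

pe = sinusoidal_encoding(192, 64)
```

Because the table is computed, not learned, it can be evaluated at any position, which is the extrapolation property mentioned above.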

<h3 id="scaled-dot-product-self-attention">Scaled Dot-Product Self-Attention</h3>

<p>For each token, the model projects embeddings into queries (Q), keys (K), and values (V). Attention weights are computed as <code class="language-plaintext highlighter-rouge">softmax(QKᵀ / sqrt(d_k))</code>, where <code class="language-plaintext highlighter-rouge">d_k</code> is the head dimension; dividing by <code class="language-plaintext highlighter-rouge">sqrt(d_k)</code> keeps large dot products from saturating the softmax. The output is a weighted sum of the value vectors. Encoder attention can gather information from the entire context window (<code class="language-plaintext highlighter-rouge">block_size</code>), while decoder-only language models use a causal mask so each position only attends to itself and earlier positions.</p>
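<p>A minimal single-head version of this computation, with toy dimensions assumed for illustration:</p>

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    """softmax(QK^T / sqrt(d_k)) V for a single head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (T, T) similarities
    if causal:
        T = scores.size(-1)
        future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # hide later positions
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

q = k = v = torch.randn(4, 8)  # toy sequence: 4 tokens, head dim 8
out, w = attention(q, k, v, causal=True)
```

With <code class="language-plaintext highlighter-rouge">causal=True</code>, position 0 assigns zero weight to positions 1–3, which is exactly the decoder-only masking described above.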

<h3 id="multi-head-attention">Multi-Head Attention</h3>

<p>Multiple attention heads run in parallel on different learned projections of the same sequence. This design allows the model to capture heterogeneous relationships (syntax, long-range dependencies, coreference) in the same layer. The concatenated head outputs are linearly projected back into the model dimension.</p>
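<p>The head split is just a reshape of the projected tensor; a sketch using the post's sizes (a 192-dim model split into 6 heads of 32):</p>

```python
import torch

# Toy batch: 2 sequences of 5 tokens, model dim 192, 6 heads of 32 each.
B, T, n_embd, n_head = 2, 5, 192, 6
head_dim = n_embd // n_head

x = torch.randn(B, T, n_embd)
# Split the channel dim so each head attends over its own 32-dim slice.
heads = x.view(B, T, n_head, head_dim).transpose(1, 2)  # (B, n_head, T, head_dim)
# ...per-head attention would run here, in parallel over the head axis...
# Concatenate the heads back into the model dimension.
merged = heads.transpose(1, 2).contiguous().view(B, T, n_embd)
```

In a full block, <code class="language-plaintext highlighter-rouge">merged</code> would then pass through the output projection back into the model dimension.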

<h3 id="position-wise-feed-forward-network">Position-Wise Feed-Forward Network</h3>

<p>Each Transformer block follows attention with a two-layer feed-forward network applied independently to every position. A typical configuration is <code class="language-plaintext highlighter-rouge">Linear(n_embd → 4 × n_embd)</code>, an activation (GELU or ReLU), then <code class="language-plaintext highlighter-rouge">Linear(4 × n_embd → n_embd)</code>. This component mixes features learned by attention and introduces non-linearity.</p>
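<p>A sketch of this sublayer as a standalone module (the class name and dropout default are illustrative):</p>

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: expand 4x, activate, project back down."""
    def __init__(self, n_embd: int, dropout: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # the same weights apply independently at every position

ffn = FeedForward(192)
y = ffn(torch.randn(2, 5, 192))
```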

<h3 id="residual-connections-and-normalisation">Residual Connections and Normalisation</h3>

<p>Skip connections wrap both the attention sublayer and the feed-forward sublayer so that gradients flow directly to earlier blocks. LayerNorm (or RMSNorm in some modern designs) keeps activations well-scaled during training. Variants such as Pre-LN place the normalisation before each sublayer, which improves stability for deeper models.</p>
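<p>The Pre-LN ordering can be sketched with stock PyTorch modules; this is a simplified illustration, not the exact block used in the skeleton below:</p>

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN ordering: normalise first, transform, then add the residual."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.ln2(x))                      # residual around the MLP
        return x

block = PreLNBlock(192, 6)
out = block(torch.randn(2, 5, 192))
```

Because the residual path is never normalised, gradients reach early layers unchanged, which is why Pre-LN tends to train more stably at depth.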

<h3 id="encoder-decoder-vs-decoder-only">Encoder-Decoder vs. Decoder-Only</h3>

<p>The original Transformer pairs an encoder that builds contextualised representations with a decoder that performs autoregressive generation, both stacked with attention and feed-forward modules. Many language models today use only the decoder stack with causal masking, which enforces that each token can only attend to previous positions, enabling left-to-right generation.</p>

<h3 id="training-and-scaling-considerations">Training and Scaling Considerations</h3>

<ul>
  <li><strong>Optimiser choice</strong>: AdamW remains the default, but large models may benefit from learning rate warm-up, cosine decay, and parameter-specific weight decay.</li>
  <li><strong>Regularisation</strong>: Dropout complements attention masking, while techniques such as label smoothing or stochastic depth can help deep stacks converge.</li>
  <li><strong>Precision and compilation</strong>: Training in mixed precision (<code class="language-plaintext highlighter-rouge">bfloat16</code>/<code class="language-plaintext highlighter-rouge">fp16</code>) and enabling compiler optimisations (<code class="language-plaintext highlighter-rouge">torch.compile</code>) significantly reduce memory use and speed up training.</li>
  <li><strong>Scaling laws</strong>: Empirically, model performance improves predictably with more data, parameters, and compute, guiding decisions about <code class="language-plaintext highlighter-rouge">n_layer</code>, <code class="language-plaintext highlighter-rouge">n_head</code>, and dataset size.</li>
</ul>
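<p>The warm-up plus cosine decay mentioned above can be sketched as a small schedule function; the defaults mirror the sample configuration values used elsewhere in these notes:</p>

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup_iters=100, max_iters=300):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup_iters:
        return max_lr * (step + 1) / warmup_iters  # ramp up linearly
    progress = (step - warmup_iters) / max(1, max_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# In a training loop, each param group's lr would be set to lr_at(step)
# before calling optimizer.step().
```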

<h3 id="inference-time-generation">Inference-Time Generation</h3>

<p>During autoregressive generation, the model caches key-value pairs to avoid recomputing attention for past tokens. Sampling strategies such as temperature, top-k, nucleus sampling, and contrastive decoding trade off creativity against determinism. For instruction-following models, alignment training such as RLHF or DPO shapes the model's behaviour before inference, while decoding settings control each generated response.</p>
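<p>A sketch of temperature plus top-k sampling over the final-position logits (a common recipe; the vocabulary size here is arbitrary):</p>

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=50):
    """Pick the next token id from logits of shape (batch, vocab)."""
    logits = logits / temperature  # <1 sharpens the distribution, >1 flattens it
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[..., -1, None]] = float("-inf")  # drop the long tail
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # sample one id per row

next_id = sample_next(torch.randn(1, 100))  # toy 100-token vocabulary
```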

<h3 id="pytorch-skeleton">PyTorch Skeleton</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>


<span class="k">class</span> <span class="nc">TransformerLM</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">vocab_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">block_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">192</span><span class="p">,</span>
        <span class="n">n_embd</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">192</span><span class="p">,</span>
        <span class="n">n_layer</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span>
        <span class="n">n_head</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">6</span><span class="p">,</span>
        <span class="n">dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">,</span>
        <span class="n">tie_weights</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">block_size</span> <span class="o">=</span> <span class="n">block_size</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">token_embed</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">n_embd</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">pos_embed</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">block_size</span><span class="p">,</span> <span class="n">n_embd</span><span class="p">))</span>
        <span class="n">encoder_layer</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">TransformerEncoderLayer</span><span class="p">(</span>
            <span class="n">d_model</span><span class="o">=</span><span class="n">n_embd</span><span class="p">,</span>
            <span class="n">nhead</span><span class="o">=</span><span class="n">n_head</span><span class="p">,</span>
            <span class="n">dim_feedforward</span><span class="o">=</span><span class="mi">4</span> <span class="o">*</span> <span class="n">n_embd</span><span class="p">,</span>
            <span class="n">dropout</span><span class="o">=</span><span class="n">dropout</span><span class="p">,</span>
            <span class="n">activation</span><span class="o">=</span><span class="s">"gelu"</span><span class="p">,</span>
            <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">TransformerEncoder</span><span class="p">(</span><span class="n">encoder_layer</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="n">n_layer</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">norm</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">n_embd</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lm_head</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_embd</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">tie_weights</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">lm_head</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">token_embed</span><span class="p">.</span><span class="n">weight</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">idx</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">idx</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">block_size</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Sequence length exceeds block size."</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">token_embed</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">pos_embed</span><span class="p">[:,</span> <span class="p">:</span> <span class="n">idx</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)]</span>
        <span class="n">seq_len</span> <span class="o">=</span> <span class="n">idx</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">causal_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">triu</span><span class="p">(</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">full</span><span class="p">((</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">),</span> <span class="nb">float</span><span class="p">(</span><span class="s">"-inf"</span><span class="p">),</span> <span class="n">device</span><span class="o">=</span><span class="n">idx</span><span class="p">.</span><span class="n">device</span><span class="p">),</span>
            <span class="n">diagonal</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">causal_mask</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">lm_head</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">training_step</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">scaler</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
    <span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span> <span class="o">=</span> <span class="n">batch</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">(</span><span class="n">set_to_none</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">amp</span><span class="p">.</span><span class="n">autocast</span><span class="p">(</span><span class="n">enabled</span><span class="o">=</span><span class="n">scaler</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">):</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">cross_entropy</span><span class="p">(</span>
            <span class="n">logits</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">logits</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)),</span>
            <span class="n">targets</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span>
        <span class="p">)</span>
    <span class="k">if</span> <span class="n">scaler</span><span class="p">:</span>
        <span class="n">scaler</span><span class="p">.</span><span class="n">scale</span><span class="p">(</span><span class="n">loss</span><span class="p">).</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">scaler</span><span class="p">.</span><span class="n">unscale_</span><span class="p">(</span><span class="n">optimizer</span><span class="p">)</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="mf">1.0</span><span class="p">)</span>
        <span class="n">scaler</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">optimizer</span><span class="p">)</span>
        <span class="n">scaler</span><span class="p">.</span><span class="n">update</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="mf">1.0</span><span class="p">)</span>
        <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="hyperparameters">Hyperparameters</h2>
<h3 id="minimal-viable-training-config">Minimal Viable Training Config</h3>

<table>
  <thead>
    <tr>
      <th><strong>Parameter</strong></th>
      <th><strong>Sample Value</strong></th>
      <th><strong>Meaning</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">batch_size</code></td>
      <td><code class="language-plaintext highlighter-rouge">48</code></td>
      <td>Samples per optimisation step</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">block_size</code></td>
      <td><code class="language-plaintext highlighter-rouge">192</code></td>
      <td>Context window length</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">max_iters</code></td>
      <td><code class="language-plaintext highlighter-rouge">300</code></td>
      <td>Maximum number of optimisation steps</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">learning_rate</code></td>
      <td><code class="language-plaintext highlighter-rouge">3e-4</code></td>
      <td>Optimiser step size</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">n_embd</code></td>
      <td><code class="language-plaintext highlighter-rouge">192</code></td>
      <td>Transformer embedding dimension</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">n_head</code></td>
      <td><code class="language-plaintext highlighter-rouge">6</code></td>
      <td>Number of attention heads</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">n_layer</code></td>
      <td><code class="language-plaintext highlighter-rouge">3</code></td>
      <td>Number of Transformer layers</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dropout</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.2</code></td>
      <td>Regularisation probability</td>
    </tr>
  </tbody>
</table>
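<p>Collected as one object, the sample values above might look like this (a hypothetical <code class="language-plaintext highlighter-rouge">TrainConfig</code> dataclass, not code from the post):</p>

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Minimal viable training config, mirroring the table above."""
    batch_size: int = 48
    block_size: int = 192
    max_iters: int = 300
    learning_rate: float = 3e-4
    n_embd: int = 192
    n_head: int = 6
    n_layer: int = 3
    dropout: float = 0.2

cfg = TrainConfig()
# Sanity check: the heads must evenly divide the embedding dimension.
assert cfg.n_embd % cfg.n_head == 0
```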

<h3 id="full-training-configuration">Full Training Configuration</h3>

<table>
  <thead>
    <tr>
      <th><strong>Category</strong></th>
      <th><strong>Parameter</strong></th>
      <th><strong>Sample value</strong></th>
      <th><strong>Meaning</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Data</strong></td>
      <td><code class="language-plaintext highlighter-rouge">block_size</code></td>
      <td><code class="language-plaintext highlighter-rouge">192</code></td>
      <td>Context window length</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">vocab_size</code></td>
      <td><em>(auto from tokenizer)</em></td>
      <td>Number of tokens in the vocabulary</td>
    </tr>
    <tr>
      <td><strong>Model</strong></td>
      <td><code class="language-plaintext highlighter-rouge">n_embd</code></td>
      <td><code class="language-plaintext highlighter-rouge">192</code></td>
      <td>Embedding dimension</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">n_head</code></td>
      <td><code class="language-plaintext highlighter-rouge">6</code></td>
      <td>Number of attention heads</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">n_layer</code></td>
      <td><code class="language-plaintext highlighter-rouge">3</code></td>
      <td>Transformer depth</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">dropout</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.2</code></td>
      <td>Dropout probability</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">tie_weights</code></td>
      <td><code class="language-plaintext highlighter-rouge">True</code></td>
      <td>Share token embedding and output projection weights</td>
    </tr>
    <tr>
      <td><strong>Training Loop</strong></td>
      <td><code class="language-plaintext highlighter-rouge">batch_size</code></td>
      <td><code class="language-plaintext highlighter-rouge">48</code></td>
      <td>Number of samples per update</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">max_iters</code></td>
      <td><code class="language-plaintext highlighter-rouge">300</code></td>
      <td>Total training iterations</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">grad_clip</code></td>
      <td><code class="language-plaintext highlighter-rouge">1.0</code></td>
      <td>Gradient norm clipping</td>
    </tr>
    <tr>
      <td><strong>Optimiser</strong></td>
      <td><code class="language-plaintext highlighter-rouge">learning_rate</code></td>
      <td><code class="language-plaintext highlighter-rouge">3e-4</code></td>
      <td>Base learning rate</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">weight_decay</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.1</code></td>
      <td>AdamW weight decay</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">betas</code></td>
      <td><code class="language-plaintext highlighter-rouge">(0.9, 0.95)</code></td>
      <td>AdamW momentum coefficients</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">eps</code></td>
      <td><code class="language-plaintext highlighter-rouge">1e-8</code></td>
      <td>AdamW epsilon</td>
    </tr>
    <tr>
      <td><strong>LR Scheduler</strong></td>
      <td><code class="language-plaintext highlighter-rouge">lr_decay</code></td>
      <td><code class="language-plaintext highlighter-rouge">True</code></td>
      <td>Enable learning rate decay</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">warmup_iters</code></td>
      <td><code class="language-plaintext highlighter-rouge">100</code></td>
      <td>Warm-up steps before decay</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">min_lr</code></td>
      <td><code class="language-plaintext highlighter-rouge">1e-5</code></td>
      <td>Final learning rate after decay</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">scheduler_type</code></td>
      <td><code class="language-plaintext highlighter-rouge">"cosine"</code></td>
      <td>Scheduler function</td>
    </tr>
    <tr>
      <td><strong>Precision / Hardware</strong></td>
      <td><code class="language-plaintext highlighter-rouge">device</code></td>
      <td><code class="language-plaintext highlighter-rouge">"cuda"</code></td>
      <td>Compute device</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">dtype</code></td>
      <td><code class="language-plaintext highlighter-rouge">"bfloat16"</code></td>
      <td>Precision mode</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">compile</code></td>
      <td><code class="language-plaintext highlighter-rouge">True</code></td>
      <td>Enable Torch 2.x compile optimisation</td>
    </tr>
    <tr>
      <td><strong>Validation / Early Stop</strong></td>
      <td><code class="language-plaintext highlighter-rouge">eval_interval</code></td>
      <td><code class="language-plaintext highlighter-rouge">100</code></td>
      <td>Evaluation frequency</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">eval_iters</code></td>
      <td><code class="language-plaintext highlighter-rouge">20</code></td>
      <td>Mini-batches used for validation loss estimation</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">patience</code></td>
      <td><code class="language-plaintext highlighter-rouge">6</code></td>
      <td>Early stopping patience</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">min_delta</code></td>
      <td><code class="language-plaintext highlighter-rouge">1e-3</code></td>
      <td>Minimum improvement threshold</td>
    </tr>
    <tr>
      <td><strong>Checkpoint / Logging</strong></td>
      <td><code class="language-plaintext highlighter-rouge">save_interval</code></td>
      <td><code class="language-plaintext highlighter-rouge">100</code></td>
      <td>Model checkpoint interval</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">log_interval</code></td>
      <td><code class="language-plaintext highlighter-rouge">50</code></td>
      <td>Logging interval</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">wandb_project</code></td>
      <td><code class="language-plaintext highlighter-rouge">"gpt-debug"</code></td>
      <td>Optional logging project name</td>
    </tr>
    <tr>
      <td><strong>Generation</strong></td>
      <td><code class="language-plaintext highlighter-rouge">temperature</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.8</code></td>
      <td>Softmax temperature for sampling</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">top_k</code></td>
      <td><code class="language-plaintext highlighter-rouge">50</code></td>
      <td>Top-K sampling</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">top_p</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.95</code></td>
      <td>Nucleus sampling</td>
    </tr>
    <tr>
      <td> </td>
      <td><code class="language-plaintext highlighter-rouge">max_new_tokens</code></td>
      <td><code class="language-plaintext highlighter-rouge">200</code></td>
      <td>Maximum number of new tokens to generate</td>
    </tr>
  </tbody>
</table>]]></content><author><name></name></author><category term="Engineering" /><category term="machine learning" /><category term="transformer" /><summary type="html"><![CDATA[Transformer Core Concepts]]></summary></entry><entry><title type="html">Building the Hack Computer: Learning Notes</title><link href="https://guozijn.github.io/learning/2025/10/06/nand2tetris-notes.html" rel="alternate" type="text/html" title="Building the Hack Computer: Learning Notes" /><published>2025-10-06T00:00:00+10:30</published><updated>2025-10-06T00:00:00+10:30</updated><id>https://guozijn.github.io/learning/2025/10/06/nand2tetris-notes</id><content type="html" xml:base="https://guozijn.github.io/learning/2025/10/06/nand2tetris-notes.html"><![CDATA[<h2 id="boolean-logic">Boolean Logic</h2>
<h3 id="de-morgans-law">De Morgan’s Law</h3>

\[\overline{AB} = \overline{A} + \overline{B}\]

\[\overline{A + B} = \overline{A}\,\overline{B}\]
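<p>Both identities can be checked exhaustively over the two-variable truth table:</p>

```python
from itertools import product

# Check each law on every row of the two-variable truth table.
rows = list(product([False, True], repeat=2))
law1 = all((not (a and b)) == ((not a) or (not b)) for a, b in rows)
law2 = all((not (a or b)) == ((not a) and (not b)) for a, b in rows)
print(law1, law2)  # True True
```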

<h2 id="boolean-arithmetic-combinational-logic">Boolean Arithmetic (Combinational Logic)</h2>
<h4 id="half-adder">Half Adder</h4>

<p><img src="https://images.zjguo.com/half-adder.png" alt="half-adder.png" /></p>

<h4 id="full-adder">Full Adder</h4>

<p><img src="https://images.zjguo.com/full-adder.png" alt="full-adder.png" /></p>

<h4 id="twos-complement">Two’s Complement</h4>
<h5 id="quick-flow">Quick Flow</h5>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>Decimal → (abs) → bin → invert → +1 → Two’s complement
Binary  → (MSB=1?) → invert → +1 → decimal → negative
</pre></td></tr></tbody></table></code></pre></div></div>
<h5 id="binary--decimal">Binary → Decimal</h5>
<p><strong>Concept</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre>if MSB = 0 → positive
     normal binary value

if MSB = 1 → negative
     invert bits → add 1 → decimal → add minus sign


Example:

    11101100₂
    
    invert → 00010011
    +1     → 00010100 = 20
    → -20₁₀
</pre></td></tr></tbody></table></code></pre></div></div>

<p><strong>Formula</strong></p>

\[\text{Range} = [-2^{n-1},\, 2^{n-1} - 1]\]

\[\text{Decimal} = -b_{n-1} \times 2^{n-1} + \sum_{i=0}^{n-2} b_i \times 2^i\]

<p><strong>Example via Formula</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>Directly convert a binary number to decimal using the formula:

    11101100₂
    = -1×128 + (1×64 + 1×32 + 0×16 + 1×8 + 1×4 + 0×2 + 0×1)
    = -128 + 108
    = -20₁₀
</pre></td></tr></tbody></table></code></pre></div></div>
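<p>The formula translates directly into code; a small Python sketch (the helper name <code class="language-plaintext highlighter-rouge">twos_to_decimal</code> is ours, not from the course):</p>

```python
def twos_to_decimal(bits: str) -> int:
    """Interpret a bit string as an n-bit two's-complement number."""
    n = len(bits)
    # The MSB carries weight -2^(n-1); the remaining bits have ordinary binary weights.
    value = -int(bits[0]) * 2 ** (n - 1)
    for i, b in enumerate(bits[1:], start=1):
        value += int(b) * 2 ** (n - 1 - i)
    return value

print(twos_to_decimal("11101100"))  # → -20
```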

<h5 id="decimal--binary">Decimal → Binary</h5>
<p><strong>Concept</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre>choose bit width (e.g., 8 bits)

if positive:
    normal binary → pad zeros

if negative:
    abs(decimal) → binary → invert → add 1

Example:

    -20

    20  → 00010100
    inv → 11101011
    +1  → 11101100
</pre></td></tr></tbody></table></code></pre></div></div>
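<p>The invert-and-add-1 flow can be sketched in Python; masking with <code class="language-plaintext highlighter-rouge">2^width - 1</code> performs the inversion and the +1 in one step (helper name is ours):</p>

```python
def decimal_to_twos(value: int, width: int = 8) -> str:
    """Encode a decimal integer as a two's-complement bit string."""
    lo, hi = -(2 ** (width - 1)), 2 ** (width - 1) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {width} bits")
    # For negative values, value & (2^width - 1) equals 2^width + value,
    # which is exactly invert-then-add-1 applied to abs(value).
    return format(value & (2 ** width - 1), f"0{width}b")

print(decimal_to_twos(-20))  # → 11101100
print(decimal_to_twos(20))   # → 00010100
```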

<p><strong>Example: Positive via Long Division</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre>Example, 20₁₀, target width 8 bits

          ┌─────────── remainder
      2 ) 20           r0
          10           r0
           5           r1
           2           r0
           1           r1
           0  stop

remainders bottom to top, 1 0 1 0 0  →  00010100
</pre></td></tr></tbody></table></code></pre></div></div>
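<p>The long-division procedure maps onto a short loop: collect remainders, then read them bottom to top (i.e. reversed). A sketch, assuming a non-negative input:</p>

```python
def to_unsigned_binary(n: int, width: int = 8) -> str:
    """Convert a non-negative integer to binary via repeated division by 2."""
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)   # quotient continues the division, remainder is a bit
        remainders.append(str(r))
    bits = "".join(reversed(remainders)) or "0"  # remainders read bottom to top
    return bits.zfill(width)  # pad with leading zeros to the target width

print(to_unsigned_binary(20))  # → 00010100
```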

<p><strong>Example: Negative via Long Division</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="rouge-code"><pre>Example, -20₁₀, target width 8 bits

Step A, abs value long divide

          ┌─────────── remainder
      2 ) 20           r0
          10           r0
           5           r1
           2           r0
           1           r1
           0  stop

unsigned, 10100  → pad to width → 00010100

Step B, invert bits
00010100 → 11101011

Step C, add 1
11101011 + 1 → 11101100

Result, -20₁₀ → 11101100₂
</pre></td></tr></tbody></table></code></pre></div></div>

<p><strong>4-bit Table</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="rouge-code"><pre>Decimal | Two’s Complement
--------------------------
   7    | 0111
   6    | 0110
   5    | 0101
   4    | 0100
   3    | 0011
   2    | 0010
   1    | 0001
   0    | 0000
  -1    | 1111
  -2    | 1110
  -3    | 1101
  -4    | 1100
  -5    | 1011
  -6    | 1010
  -7    | 1001
  -8    | 1000
</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="sequential-logic">Sequential Logic</h2>

<h3 id="dff">DFF</h3>

<p><img src="https://images.zjguo.com/dff.png" alt="dff.png" /></p>

<h3 id="bit-1-bit-register">Bit (1-bit register)</h3>

<p><img src="https://images.zjguo.com/1-bit-register.png" alt="1-bit-register.png" /></p>

<h2 id="computer-architecture">Computer Architecture</h2>

<p><img src="https://images.zjguo.com/hack-computer.png" alt="hack-computer.png" /></p>

<h3 id="alu">ALU</h3>
<p><img src="https://images.zjguo.com/alu.png" alt="alu.png" /></p>

<h3 id="alu-notes">ALU Notes</h3>
<ul>
  <li>In two’s complement representation, the bitwise NOT operation satisfies \(!y = -y - 1\); equivalently, \(-y = !y + 1\), which is exactly the “invert, then add 1” negation flow above.</li>
</ul>
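<p>This identity can be confirmed with Python, whose integers behave as infinite-width two’s complement under bitwise NOT:</p>

```python
# In two's complement, flipping every bit of y yields -y - 1.
for y in range(-16, 16):
    assert ~y == -y - 1

# Consequently, negation is "invert, then add 1":
y = 20
assert -y == ~y + 1
print("verified: !y = -y - 1 for all tested values")
```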

<h2 id="assembly-language">Assembly Language</h2>
<h3 id="overview">Overview</h3>
<p>The Hack assembly language contains two instruction types: <strong>A-instruction</strong> and <strong>C-instruction</strong>, plus labels for symbolic addresses. Each instruction is 16 bits long.</p>

<h3 id="a-instruction">A-instruction</h3>
<p><strong>Form</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>@value
@symbol
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Loads a constant or address into the A register. The value also becomes the memory address for <code class="language-plaintext highlighter-rouge">M</code>.</p>

<p><strong>Example</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>@10
D=A
@counter
M=0
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="c-instruction">C-instruction</h3>
<p><img src="https://images.zjguo.com/c-instructions.png" alt="c-instructions.png" /></p>

<p><strong>Form</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>dest=comp;jump
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Performs computation and optionally stores or jumps.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">dest</code>: target (M, D, A, MD, AM, AD, AMD)</li>
  <li><code class="language-plaintext highlighter-rouge">comp</code>: computation (ALU operation)</li>
  <li><code class="language-plaintext highlighter-rouge">jump</code>: condition (JGT, JEQ, JGE, JLT, JNE, JLE, JMP)</li>
</ul>

<p><strong>Example</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>D=M
D;JGT
0;JMP
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="common-comp-values">Common comp Values</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>0, 1, -1
D, A, M, !D, !A, !M, -D, -A, -M
D+1, A+1, M+1, D-1, A-1, M-1
D+A, D+M, D-A, D-M, A-D, M-D
D&amp;A, D&amp;M, D|A, D|M
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="labels-and-symbols">Labels and Symbols</h3>
<p><strong>Label Declaration</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>(LOOP)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Marks a location. The label’s value is the address of the next instruction.</p>

<p><strong>Predefined Symbols</strong></p>

<table>
  <thead>
    <tr>
      <th>Symbol</th>
      <th>Address</th>
      <th>Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SP</strong></td>
      <td><code class="language-plaintext highlighter-rouge">0</code></td>
      <td>Stack pointer (top of the stack)</td>
    </tr>
    <tr>
      <td><strong>LCL</strong></td>
      <td><code class="language-plaintext highlighter-rouge">1</code></td>
      <td>Base address of the local segment</td>
    </tr>
    <tr>
      <td><strong>ARG</strong></td>
      <td><code class="language-plaintext highlighter-rouge">2</code></td>
      <td>Base address of the argument segment</td>
    </tr>
    <tr>
      <td><strong>THIS</strong></td>
      <td><code class="language-plaintext highlighter-rouge">3</code></td>
      <td>Base address of the this segment</td>
    </tr>
    <tr>
      <td><strong>THAT</strong></td>
      <td><code class="language-plaintext highlighter-rouge">4</code></td>
      <td>Base address of the that segment</td>
    </tr>
    <tr>
      <td><strong>R0–R15</strong></td>
      <td><code class="language-plaintext highlighter-rouge">0–15</code></td>
      <td>General purpose registers, aliases for the first 16 RAM addresses</td>
    </tr>
    <tr>
      <td><strong>temp (R5–R12)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">5–12</code></td>
      <td>Fixed temporary segment, used for intermediate storage</td>
    </tr>
    <tr>
      <td><strong>pointer (THIS/THAT)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">3–4</code></td>
      <td>Pointer segment that maps to <code class="language-plaintext highlighter-rouge">THIS</code> (0) and <code class="language-plaintext highlighter-rouge">THAT</code> (1)</td>
    </tr>
    <tr>
      <td><strong>static (FileName.index)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">16+</code></td>
      <td>Static variables unique to each <code class="language-plaintext highlighter-rouge">.vm</code> file, starting from RAM[16]</td>
    </tr>
    <tr>
      <td><strong>SCREEN</strong></td>
      <td><code class="language-plaintext highlighter-rouge">16384</code></td>
      <td>Base address of the screen memory (for display pixels)</td>
    </tr>
    <tr>
      <td><strong>KBD</strong></td>
      <td><code class="language-plaintext highlighter-rouge">24576</code></td>
      <td>Address of the keyboard memory-mapped register</td>
    </tr>
  </tbody>
</table>

<p>Variables (custom symbols) start from address 16.</p>

<h3 id="example-program-sum-12n">Example Program: Sum 1+2+…+n</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre>@i
M=0
@sum
M=0
(LOOP)
  @i
  D=M
  @n
  D=D-M
  @END
  D;JGT
  @i
  D=M
  @sum
  M=M+D
  @i
  M=M+1
  @LOOP
  0;JMP
(END)
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="quick-reference">Quick Reference</h3>
<ul>
  <li><code class="language-plaintext highlighter-rouge">@value</code>: load A</li>
  <li><code class="language-plaintext highlighter-rouge">dest=comp;jump</code>: compute and control flow</li>
  <li>Symbols: variables and labels</li>
  <li>Memory map: R0–R15, SCREEN(16384), KBD(24576)</li>
</ul>

<h3 id="asm-notes">ASM Notes</h3>
<ul>
  <li>First pass: Strip whitespace/comments, walk instructions to build the label table; each non-label command bumps the ROM address counter, while <code class="language-plaintext highlighter-rouge">(LABEL)</code> entries alias the next instruction line.</li>
  <li>Second pass: Revisit the cleaned instruction stream, resolve symbols (allocating RAM addresses from 16 upward for new variables), and emit the 16-bit Hack opcodes for every A- and C-instruction.</li>
</ul>
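<p>The two passes can be sketched in Python. This toy illustrates only the symbol-table bookkeeping (label resolution in pass one, RAM allocation from 16 upward in pass two); binary encoding of A- and C-instructions is omitted, and the function names are ours:</p>

```python
def build_symbol_table(lines):
    """First pass: map (LABEL) declarations to ROM addresses."""
    predefined = {"SP": 0, "LCL": 1, "ARG": 2, "THIS": 3, "THAT": 4,
                  "SCREEN": 16384, "KBD": 24576}
    predefined.update({f"R{i}": i for i in range(16)})
    table, rom_addr = dict(predefined), 0
    for line in lines:
        line = line.split("//")[0].strip()  # strip comments and whitespace
        if not line:
            continue
        if line.startswith("("):
            table[line.strip("()")] = rom_addr  # label aliases the next instruction
        else:
            rom_addr += 1                        # real instructions bump the counter
    return table

def resolve_symbols(lines, table):
    """Second pass: allocate RAM from 16 upward for new @variables."""
    next_ram, resolved = 16, []
    for line in lines:
        line = line.split("//")[0].strip()
        if not line or line.startswith("("):
            continue
        if line.startswith("@") and not line[1:].isdigit():
            sym = line[1:]
            if sym not in table:
                table[sym] = next_ram  # new variable → next free RAM slot
                next_ram += 1
            line = f"@{table[sym]}"
        resolved.append(line)
    return resolved

prog = ["@i", "M=0", "(LOOP)", "@i", "M=M+1", "@LOOP", "0;JMP"]
table = build_symbol_table(prog)
print(resolve_symbols(prog, table))  # @i → @16, @LOOP → @2
```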

<hr />

<blockquote>
  <p>Everything from here on sits at the software level; the sections above belong to the hardware level.</p>
</blockquote>

<h2 id="virtual-machine-language">Virtual Machine Language</h2>

<p>The VM language is a stack-based intermediate language that abstracts away hardware details. It describes computation using stack operations, memory access, branching, and function calls.</p>

<h3 id="stack-and-sp">Stack and SP</h3>
<p>The stack starts at address 256. The <code class="language-plaintext highlighter-rouge">SP</code> (Stack Pointer) always points to the next free slot.</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">push</code> writes to <code class="language-plaintext highlighter-rouge">*SP</code>, then <code class="language-plaintext highlighter-rouge">SP = SP + 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">pop</code> decrements <code class="language-plaintext highlighter-rouge">SP</code>, then reads from <code class="language-plaintext highlighter-rouge">*SP</code></li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre>// push D onto stack
@SP
A=M
M=D
@SP
M=M+1

// pop top of stack into D
@SP
AM=M-1
D=M
</pre></td></tr></tbody></table></code></pre></div></div>
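<p>The same push/pop contract can be modelled with a RAM array and an explicit SP cell, mirroring the assembly above (a simulation sketch, using the Hack conventions of SP at RAM[0] and the stack base at 256):</p>

```python
RAM = [0] * 32768
SP = 0          # RAM[0] holds the stack pointer, as on the real machine
RAM[SP] = 256   # stack base

def push(d):
    RAM[RAM[SP]] = d     # write to *SP ...
    RAM[SP] += 1         # ... then increment SP

def pop():
    RAM[SP] -= 1         # decrement SP first ...
    return RAM[RAM[SP]]  # ... then read *SP

push(7)
push(8)
print(pop(), pop())  # → 8 7
```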

<h3 id="memory-segments">Memory Segments</h3>

<table>
  <thead>
    <tr>
      <th>VM Segment</th>
      <th>Meaning</th>
      <th>Assembly Base</th>
      <th>Address Computation</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>argument</strong></td>
      <td>Function arguments</td>
      <td><code class="language-plaintext highlighter-rouge">ARG</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = M + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">push argument 2</code> → <code class="language-plaintext highlighter-rouge">*(ARG + 2)</code></td>
    </tr>
    <tr>
      <td><strong>local</strong></td>
      <td>Local variables of the current function</td>
      <td><code class="language-plaintext highlighter-rouge">LCL</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = M + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">pop local 0</code> → <code class="language-plaintext highlighter-rouge">*(LCL + 0)</code></td>
    </tr>
    <tr>
      <td><strong>this</strong></td>
      <td>“this” pointer area</td>
      <td><code class="language-plaintext highlighter-rouge">THIS</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = M + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">push this 1</code> → <code class="language-plaintext highlighter-rouge">*(THIS + 1)</code></td>
    </tr>
    <tr>
      <td><strong>that</strong></td>
      <td>“that” pointer area</td>
      <td><code class="language-plaintext highlighter-rouge">THAT</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = M + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">pop that 2</code> → <code class="language-plaintext highlighter-rouge">*(THAT + 2)</code></td>
    </tr>
    <tr>
      <td><strong>temp</strong></td>
      <td>Temporary storage (RAM[5–12])</td>
      <td><code class="language-plaintext highlighter-rouge">5</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = 5 + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">push temp 3</code> → <code class="language-plaintext highlighter-rouge">@8</code></td>
    </tr>
    <tr>
      <td><strong>pointer</strong></td>
      <td>Stores THIS and THAT pointers (RAM[3–4])</td>
      <td><code class="language-plaintext highlighter-rouge">3</code></td>
      <td><code class="language-plaintext highlighter-rouge">A = 3 + index</code></td>
      <td><code class="language-plaintext highlighter-rouge">pop pointer 0</code> → <code class="language-plaintext highlighter-rouge">THIS = *(SP-1)</code></td>
    </tr>
    <tr>
      <td><strong>static</strong></td>
      <td>File-specific static variables</td>
      <td><code class="language-plaintext highlighter-rouge">16</code></td>
      <td><code class="language-plaintext highlighter-rouge">@FileName.index</code></td>
      <td><code class="language-plaintext highlighter-rouge">push static 2</code> → <code class="language-plaintext highlighter-rouge">@Foo.2</code></td>
    </tr>
    <tr>
      <td><strong>constant</strong></td>
      <td>Immediate values, not in RAM</td>
      <td>—</td>
      <td><code class="language-plaintext highlighter-rouge">D = A</code></td>
      <td><code class="language-plaintext highlighter-rouge">push constant 7</code> → <code class="language-plaintext highlighter-rouge">D = 7</code></td>
    </tr>
  </tbody>
</table>
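<p>A translator resolves each (segment, index) pair to a RAM address using the table above. A Python sketch of that mapping (the base-pointer values in the example are placeholders):</p>

```python
def effective_address(segment, index, ram):
    """Resolve a VM (segment, index) pair to a Hack RAM address."""
    pointers = {"local": 1, "argument": 2, "this": 3, "that": 4}  # LCL/ARG/THIS/THAT
    fixed = {"temp": 5, "pointer": 3, "static": 16}
    if segment in pointers:
        return ram[pointers[segment]] + index  # indirect: base stored in RAM
    if segment in fixed:
        return fixed[segment] + index          # direct: base is the address itself
    raise ValueError(f"'{segment}' has no RAM address (e.g. constant)")

ram = {1: 300, 2: 400, 3: 3000, 4: 3010}   # example base-pointer values
print(effective_address("local", 2, ram))  # → 302
print(effective_address("temp", 3, ram))   # → 8
```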

<h3 id="basic-syntax-of-vm-language">Basic Syntax of VM Language</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre>push constant i
push segment i
pop segment i
add | sub | neg | eq | gt | lt | and | or | not
label X
goto X
if-goto X
function f k
call f n
return
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="translations-from-vm-language-to-assembly">Translations from VM language to Assembly</h3>

<p>The C++ <a href="https://github.com/guozijn/compsys/blob/main/prac6/VMTranslator/VMTranslator.cpp"><code class="language-plaintext highlighter-rouge">VMTranslator</code></a> writes structured templates for every VM command. Values destined for the stack are staged in <code class="language-plaintext highlighter-rouge">D</code> and finalised with the shared push tail:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">pop</code> commands targeting base-pointer segments, the absolute address is cached in <code class="language-plaintext highlighter-rouge">R13</code> before the stack value is stored. The assembly snippets below use placeholders such as <code class="language-plaintext highlighter-rouge">index</code> and <code class="language-plaintext highlighter-rouge">FunctionName</code> that the translator substitutes at translation time.</p>

<h4 id="push-segment-index">push segment index</h4>

<ul>
  <li><code class="language-plaintext highlighter-rouge">push constant index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>@index
D=A
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">push local|argument|this|that index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>@index
D=A
@SEG            // SEG ∈ {LCL, ARG, THIS, THAT}
A=M+D
D=M
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">push pointer 0|1</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>@THIS|THAT
D=M
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">push temp index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>@index
D=A
@5
A=D+A
D=M
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">push static index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>@index
D=A
@16
A=D+A
D=M
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
</ul>

<h4 id="pop-segment-index">pop segment index</h4>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pop local|argument|this|that index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre>@index
D=A
@SEG            // SEG ∈ {LCL, ARG, THIS, THAT}
D=M+D
@13
M=D
@SP
AM=M-1
D=M
@13
A=M
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">pop pointer 0</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
@THIS
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">pop pointer 1</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
@THAT
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">pop temp index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre>@index
D=A
@5
D=D+A
@13
M=D
@SP
AM=M-1
D=M
@13
A=M
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">pop static index</code>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre>@index
D=A
@16
D=D+A
@13
M=D
@SP
AM=M-1
D=M
@13
A=M
M=D
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
</ul>

<h4 id="arithmetic-and-logic">Arithmetic and logic</h4>

<p><code class="language-plaintext highlighter-rouge">add</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
A=A-1
M=M+D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">sub</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
A=A-1
M=M-D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">neg</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>@SP
A=M-1
M=-M
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">and</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
A=A-1
M=M&amp;D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">or</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
A=A-1
M=M|D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">not</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>@SP
A=M-1
M=!M
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="comparisons">Comparisons</h4>

<p><code class="language-plaintext highlighter-rouge">eq</code>, <code class="language-plaintext highlighter-rouge">gt</code>, and <code class="language-plaintext highlighter-rouge">lt</code> share a helper that emits unique labels (<code class="language-plaintext highlighter-rouge">CMP_TRUE0</code>, <code class="language-plaintext highlighter-rouge">CMP_END0</code>, …). Example output for <code class="language-plaintext highlighter-rouge">eq</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
@SP
AM=M-1
D=M-D
@CMP_TRUE0
D;JEQ
D=0
@CMP_END0
0;JMP
(CMP_TRUE0)
D=-1
(CMP_END0)
@SP
AM=M+1
A=A-1
M=D
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">gt</code> substitutes <code class="language-plaintext highlighter-rouge">JGT</code> and <code class="language-plaintext highlighter-rouge">lt</code> uses <code class="language-plaintext highlighter-rouge">JLT</code> in the conditional jump.</p>
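<p>The helper essentially interpolates a jump mnemonic and a fresh label counter into one template. A hedged Python rendition of that idea (function and variable names are ours, not the translator’s):</p>

```python
JUMPS = {"eq": "JEQ", "gt": "JGT", "lt": "JLT"}
_counter = 0

def emit_comparison(cmd):
    """Emit Hack assembly for eq/gt/lt with unique labels per call."""
    global _counter
    true_label, end_label = f"CMP_TRUE{_counter}", f"CMP_END{_counter}"
    _counter += 1
    return "\n".join([
        "@SP", "AM=M-1", "D=M",       # pop y
        "@SP", "AM=M-1", "D=M-D",     # pop x, compute x - y
        f"@{true_label}", f"D;{JUMPS[cmd]}",
        "D=0",                        # false → 0
        f"@{end_label}", "0;JMP",
        f"({true_label})", "D=-1",    # true → -1 (all bits set)
        f"({end_label})",
        "@SP", "AM=M+1", "A=A-1", "M=D",  # push result
    ])

print(emit_comparison("gt").splitlines()[7])  # → D;JGT
```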

<h4 id="branching-commands">Branching commands</h4>

<ul>
  <li><code class="language-plaintext highlighter-rouge">label X</code>: <code class="language-plaintext highlighter-rouge">(X)</code></li>
  <li><code class="language-plaintext highlighter-rouge">goto X</code>:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>@X
0;JMP
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">if-goto X</code>:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>@SP
AM=M-1
D=M
@X
D;JNE
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
</ul>

<h4 id="function-commands">Function commands</h4>

<p><code class="language-plaintext highlighter-rouge">function FunctionName nLocals</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="rouge-code"><pre>(FunctionName)
@nLocals
D=A
@13
M=D
(FunctionName$initLocalsLoop)
@13
D=M
@FunctionName$initLocalsEnd
D;JEQ
@SP
AM=M+1
A=A-1
M=0
@13
M=M-1
@FunctionName$initLocalsLoop
0;JMP
(FunctionName$initLocalsEnd)
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">call FunctionName nArgs</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
</pre></td><td class="rouge-code"><pre>@FunctionName$ret.0   // counter increments per call site
D=A
@SP
AM=M+1
A=A-1
M=D
@LCL
D=M
@SP
AM=M+1
A=A-1
M=D
@ARG
D=M
@SP
AM=M+1
A=A-1
M=D
@THIS
D=M
@SP
AM=M+1
A=A-1
M=D
@THAT
D=M
@SP
AM=M+1
A=A-1
M=D
@SP
D=M
@5
D=D-A
@nArgs
D=D-A
@ARG
M=D
@SP
D=M
@LCL
M=D
@FunctionName
0;JMP
(FunctionName$ret.0)
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">return</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
</pre></td><td class="rouge-code"><pre>@LCL
D=M
@13
M=D              // frame = LCL
@5
A=D-A
D=M
@14
M=D              // ret = *(frame-5)
@SP
AM=M-1
D=M
@ARG
A=M
M=D              // *ARG = pop()
@ARG
D=M+1
@SP
M=D              // SP = ARG + 1
@13
AM=M-1
D=M
@THAT
M=D
@13
AM=M-1
D=M
@THIS
M=D
@13
AM=M-1
D=M
@ARG
M=D
@13
AM=M-1
D=M
@LCL
M=D
@14
A=M
0;JMP            // goto ret
</pre></td></tr></tbody></table></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">R13</code> and <code class="language-plaintext highlighter-rouge">R14</code> serve as the frame scratch space and cached return address.</p>
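<p>The pointer arithmetic in the return sequence is easy to mis-order, so it helps to simulate the same steps over a dictionary standing in for RAM (frame layout as in the assembly above; the sample frame values are invented for illustration):</p>

```python
def vm_return(ram):
    """Mirror the Hack 'return' sequence: restore the caller frame, reposition SP."""
    SP, LCL, ARG, THIS, THAT = 0, 1, 2, 3, 4
    frame = ram[LCL]                  # R13 in the assembly: frame = LCL
    ret = ram[frame - 5]              # R14: cached return address
    ram[ram[ARG]] = ram[ram[SP] - 1]  # *ARG = pop(): return value for the caller
    ram[SP] = ram[ARG] + 1            # SP = ARG + 1
    ram[THAT] = ram[frame - 1]
    ram[THIS] = ram[frame - 2]
    ram[ARG] = ram[frame - 3]
    ram[LCL] = ram[frame - 4]
    return ret                        # caller resumes here (goto ret)

# Invented frame: saved ret-addr/LCL/ARG/THIS/THAT at 305-309,
# zero locals (LCL = 310), return value 7 on top of the stack.
ram = {0: 311, 1: 310, 2: 304, 3: 9000, 4: 9010,
       304: 0, 305: 42, 306: 100, 307: 200, 308: 3000, 309: 3010, 310: 7}
ret = vm_return(ram)
print(ret, ram[0], ram[304])  # → 42 305 7
```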

<h3 id="program-control">Program Control</h3>
<h4 id="subroutines--functions">Subroutines &amp; Functions</h4>
<h5 id="stack-implementation">Stack Implementation</h5>
<p><img src="https://images.zjguo.com/vm-stack-implementation.png" alt="vm-stack-implementation.png" /></p>

<h5 id="call-implementation">Call Implementation</h5>
<p><img src="https://images.zjguo.com/vm-call-command.png" alt="vm-call-command.png" /></p>

<h5 id="function-implementation">Function Implementation</h5>
<p><img src="https://images.zjguo.com/vm-function-command.png" alt="vm-function-command.png" /></p>

<h5 id="return-implementation">Return Implementation</h5>
<p><img src="https://images.zjguo.com/vm-return-command.png" alt="vm-return-command.png" /></p>

<h3 id="memory-architecture-of-the-hack-virtual-machine">Memory Architecture of the Hack Virtual Machine</h3>
<p>The Hack Virtual Machine is implemented on a stack-based architecture.
All function calls, arguments, and local variables live in RAM.
The CPU executes instructions fetched from ROM, while the stack resides in RAM starting at address 256.
The following diagram summarises the relationship between ROM, RAM, and the stack pointer segments.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
</pre></td><td class="rouge-code"><pre>                    ┌──────────────────────────────────────────┐
                    │                HACK CPU                  │
                    │──────────────────────────────────────────│
                    │  Registers:                              │
                    │   A  → Address register                  │
                    │   D  → Data register                     │
                    │   PC → Program counter                   │
                    │                                          │
                    │  Control signals: use A/D/PC to access   │
                    │  RAM or ROM                              │
                    └──────────────────────────────────────────┘
                                       │
                                       │ (A register provides address)
                                       ▼

    ┌───────────────────────────────────────────┐
    │                   ROM                     │
    │───────────────────────────────────────────│
    │ Stores machine code (.hack instructions)  │
    │ Loaded from compiled .asm file            │
    │ PC fetches sequentially                   │
    │ Read-only memory                          │
    └───────────────────────────────────────────┘
                                       │
                                       ▼
    ┌───────────────────────────────────────────┐
    │                   RAM                     │
    │───────────────────────────────────────────│
    │ Address range: 0 - 32767                  │
    │                                           │
    │ 0–15 : General-purpose registers          │
    │   ├─ R0  = SP    (Stack Pointer)          │
    │   ├─ R1  = LCL   (Local segment base)     │
    │   ├─ R2  = ARG   (Argument segment base)  │
    │   ├─ R3  = THIS  (This segment base)      │
    │   ├─ R4  = THAT  (That segment base)      │
    │   ├─ R5–R12 = Temp segment (8 slots)      │
    │   ├─ R13–R15 = General temporary registers│
    │                                           │
    │ 16–255 : Static variables (per file)      │
    │                                           │
    │ 256–2047 : Stack segment                  │
    │   ↑                                       │
    │   │ push → write at stack top             │
    │   │ pop  → remove from stack top          │
    │   │ SP points to next free slot           │
    │   │-------------------------------------- │
    │   │  ← Stack base (256)                   │
    │   │  [Return address]                     │
    │   │  [Saved LCL, ARG, THIS, THAT]         │
    │   │  [Local variables local 0..n]         │
    │   │  [Working stack / evaluation values]  │
    │   │-------------------------------------- │
    │   ↓                                       │
    │                                           │
    │ 2048–16383 : Heap / arrays / objects      │
    │                                           │
    │ 16384–24575 : Screen memory map           │
    │ 24576–32767 : Keyboard input              │
    └───────────────────────────────────────────┘
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="vm-notes">VM Notes</h3>
<ul>
  <li>The VM provides portability by hiding the Hack memory details.</li>
  <li>SP management ensures correct push/pop order.</li>
  <li>Always use unique labels for comparisons and calls.</li>
  <li>ROM addresses are just sequential instruction numbers; label declarations don’t consume an address, they alias the ROM address of the next instruction.</li>
  <li><code class="language-plaintext highlighter-rouge">call</code>: push return address and segment pointers, then jump to function.</li>
  <li><code class="language-plaintext highlighter-rouge">return</code>: restore caller frame, reposition <code class="language-plaintext highlighter-rouge">SP</code>, and jump back to return address.</li>
  <li><code class="language-plaintext highlighter-rouge">pop constant i</code> is invalid: <code class="language-plaintext highlighter-rouge">constant</code> is a virtual segment with no backing memory to pop into.</li>
</ul>
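<p>The <code>call</code> and <code>return</code> bookkeeping described above can be modelled in Python. This is an illustrative sketch only: the flat <code>ram</code> list and the function signatures are assumptions made for the example, whereas the real VM translator emits Hack assembly.</p>

```python
# Illustrative model of the Hack VM calling convention, not the
# book's translator. `ram` is a flat list standing in for RAM.

def vm_call(ram, sp, return_addr, n_args, lcl, arg, this, that):
    """Push the caller's frame, then set up the callee's ARG and LCL."""
    for value in (return_addr, lcl, arg, this, that):
        ram[sp] = value
        sp += 1
    arg = sp - 5 - n_args      # callee's arguments sit below the frame
    lcl = sp                   # callee's locals start at the stack top
    return sp, lcl, arg

def vm_return(ram, sp, lcl, arg):
    """Restore the caller's frame and leave the return value for it."""
    frame = lcl                # R13 in the assembly implementation
    ret = ram[frame - 5]       # R14: cached return address
    ram[arg] = ram[sp - 1]     # return value replaces argument 0
    sp = arg + 1
    that, this, caller_arg, caller_lcl = (ram[frame - 1], ram[frame - 2],
                                          ram[frame - 3], ram[frame - 4])
    return sp, caller_lcl, caller_arg, this, that, ret
```

<p>Running a call with two arguments at RAM[256..257] pushes the five-word frame, and the matching return restores every caller pointer and reuses argument 0 for the return value.</p>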

<h2 id="high-level-language-jack">High-Level Language (Jack)</h2>
<p>Jack source compiles to VM commands, which in turn map to Hack assembly.</p>

<h3 id="segment-mapping">Segment Mapping</h3>

<table>
  <thead>
    <tr>
      <th>Jack variable kind</th>
      <th>VM segment</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>static</td>
      <td>static</td>
      <td>Per class file scope</td>
    </tr>
    <tr>
      <td>field</td>
      <td>this</td>
      <td>Base in pointer 0</td>
    </tr>
    <tr>
      <td>var, local</td>
      <td>local</td>
      <td>Subroutine private</td>
    </tr>
    <tr>
      <td>argument</td>
      <td>argument</td>
      <td>Call site provided</td>
    </tr>
    <tr>
      <td>array base</td>
      <td>this or local or argument</td>
      <td>Depends on declaration</td>
    </tr>
  </tbody>
</table>

<h3 id="subroutine-kinds">Subroutine Kinds</h3>

<table>
  <thead>
    <tr>
      <th>Jack subroutine</th>
      <th>VM header</th>
      <th>Entry actions</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>function</td>
      <td>function ClassName.func k</td>
      <td>No implicit this</td>
    </tr>
    <tr>
      <td>method</td>
      <td>function ClassName.method k</td>
      <td>push argument 0, pop pointer 0</td>
    </tr>
    <tr>
      <td>constructor</td>
      <td>function ClassName.new k</td>
      <td>push constant fieldCount, call Memory.alloc 1, pop pointer 0</td>
    </tr>
  </tbody>
</table>
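<p>The entry actions in the table can be sketched as a small code-generation helper. The helper name and signature are illustrative assumptions, not the book's compiler API.</p>

```python
# Hedged sketch of the per-kind VM preamble; `subroutine_entry` is an
# illustrative name, not part of the Jack compiler specification.

def subroutine_entry(kind, class_name, name, n_locals, n_fields=0):
    """Emit the VM header plus kind-specific setup of pointer 0."""
    lines = [f"function {class_name}.{name} {n_locals}"]
    if kind == "method":
        # bind `this` to the receiver passed as argument 0
        lines += ["push argument 0", "pop pointer 0"]
    elif kind == "constructor":
        # allocate the object, then bind `this` to its base address
        lines += [f"push constant {n_fields}",
                  "call Memory.alloc 1",
                  "pop pointer 0"]
    return lines
```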

<h3 id="statements-minimal-templates">Statements, minimal templates</h3>
<ol>
  <li>let x = expr
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>... code for expr
pop segment index        // x
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>let a[i] = expr
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre>... code for a
... code for i
add
... code for expr
pop temp 0
pop pointer 1
push temp 0
pop that 0
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>y = a[i]
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>... code for a
... code for i
add
pop pointer 1
push that 0
... assign to y via pop segment index
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>do subCall(args)
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>... push object ref if method call
... push args
call QualName nArgs
pop temp 0               // discard return
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>if (cond) { S1 } else { S2 }
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>... code for cond
if-goto IF_TRUE$n
goto IF_FALSE$n
label IF_TRUE$n
... S1
goto IF_END$n
label IF_FALSE$n
... S2
label IF_END$n
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>while (cond) { S }
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>label WHILE_EXP$n
... code for cond
not
if-goto WHILE_END$n
... S
goto WHILE_EXP$n
label WHILE_END$n
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
  <li>return, return expr
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>push constant 0          // void
return
</pre></td></tr></tbody></table></code></pre></div>    </div>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>... code for expr
return
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </li>
</ol>
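<p>The <code>if</code> and <code>while</code> templates above are purely mechanical, so a compiler can emit them from a per-subroutine counter. The following Python helpers are an illustrative sketch of that emission; the names and signatures are assumptions.</p>

```python
# Hypothetical code-generation helpers for the if/while templates;
# `cond_code`, `s1`, `s2` are lists of already-compiled VM commands.

def compile_while(cond_code, body_code, n):
    """Emit the while template using per-subroutine counter n."""
    return ([f"label WHILE_EXP${n}"] + cond_code
            + ["not", f"if-goto WHILE_END${n}"] + body_code
            + [f"goto WHILE_EXP${n}", f"label WHILE_END${n}"])

def compile_if(cond_code, s1, s2, n):
    """Emit the if/else template using per-subroutine counter n."""
    return (cond_code
            + [f"if-goto IF_TRUE${n}", f"goto IF_FALSE${n}",
               f"label IF_TRUE${n}"] + s1
            + [f"goto IF_END${n}", f"label IF_FALSE${n}"] + s2
            + [f"label IF_END${n}"])
```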

<h3 id="expressions-operators">Expressions, operators</h3>

<table>
  <thead>
    <tr>
      <th>Jack</th>
      <th>VM expansion</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>- x</td>
      <td>neg</td>
    </tr>
    <tr>
      <td>not x</td>
      <td>not</td>
    </tr>
    <tr>
      <td>x + y</td>
      <td>add</td>
    </tr>
    <tr>
      <td>x − y</td>
      <td>sub</td>
    </tr>
    <tr>
      <td>x &amp; y</td>
      <td>and</td>
    </tr>
    <tr>
      <td>x | y</td>
      <td>or</td>
    </tr>
    <tr>
      <td>x &lt; y</td>
      <td>lt</td>
    </tr>
    <tr>
      <td>x &gt; y</td>
      <td>gt</td>
    </tr>
    <tr>
      <td>x = y</td>
      <td>eq</td>
    </tr>
    <tr>
      <td>x * y</td>
      <td>call Math.multiply 2</td>
    </tr>
    <tr>
      <td>x / y</td>
      <td>call Math.divide 2</td>
    </tr>
  </tbody>
</table>

<h3 id="literals-and-keywords">Literals and keywords</h3>

<table>
  <thead>
    <tr>
      <th>Jack</th>
      <th>VM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>integer n</td>
      <td>push constant n</td>
    </tr>
    <tr>
      <td>true</td>
      <td>push constant 0, not</td>
    </tr>
    <tr>
      <td>false, null</td>
      <td>push constant 0</td>
    </tr>
    <tr>
      <td>this</td>
      <td>push pointer 0</td>
    </tr>
  </tbody>
</table>

<h3 id="string-literal-abc">String literal “abc”</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre>push constant 3
call String.new 1
push constant 97
call String.appendChar 2
push constant 98
call String.appendChar 2
push constant 99
call String.appendChar 2
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The string length is pushed first, as the argument to <code class="language-plaintext highlighter-rouge">String.new</code>; each character code is then appended with <code class="language-plaintext highlighter-rouge">String.appendChar</code>.</p>
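<p>This scheme generalises to any literal, which a compiler can emit with a few lines of code. A sketch, assuming an illustrative helper name:</p>

```python
# Sketch of string-literal compilation: allocate, then append each
# character code (the helper name is illustrative).

def compile_string(s):
    code = [f"push constant {len(s)}", "call String.new 1"]
    for ch in s:
        code += [f"push constant {ord(ch)}", "call String.appendChar 2"]
    return code
```

<p>For <code>"abc"</code> this reproduces exactly the eight commands listed above.</p>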

<h3 id="calls-qualification">Calls, qualification</h3>

<table>
  <thead>
    <tr>
      <th>Jack form</th>
      <th>VM call name</th>
      <th>Arg0 rule</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>obj.m(a,b)</td>
      <td>ClassName.m</td>
      <td>push obj reference then a,b</td>
    </tr>
    <tr>
      <td>Class.f(a,b)</td>
      <td>Class.f</td>
      <td>no implicit object</td>
    </tr>
    <tr>
      <td>m(a,b) inside a method</td>
      <td>ClassName.m</td>
      <td>push pointer 0 then a,b</td>
    </tr>
  </tbody>
</table>

<h3 id="label-policy">Label policy</h3>
<p>Use per-subroutine counters; labels must be unique within each function, for example <code class="language-plaintext highlighter-rouge">IF_TRUE$n</code>, <code class="language-plaintext highlighter-rouge">WHILE_END$n</code>.</p>

<h3 id="code-generation">Code Generation</h3>

<h4 id="handling-objects">Handling Objects</h4>

<p><img src="https://images.zjguo.com/jack-handling-objects-1.png" alt="jack-handling-objects-1.png" /></p>

<p><img src="https://images.zjguo.com/jack-handling-objects-2.png" alt="jack-handling-objects-2.png" /></p>

<h4 id="handling-arrays">Handling Arrays</h4>

<p><img src="https://images.zjguo.com/jack-handling-arrays-1.png" alt="jack-handling-arrays-1.png" /></p>

<p><img src="https://images.zjguo.com/jack-handling-arrays-2.png" alt="jack-handling-arrays-2.png" /></p>

<h4 id="example">Example</h4>

<p><img src="https://images.zjguo.com/jack-example.png" alt="jack-example.png" /></p>]]></content><author><name></name></author><category term="Learning" /><category term="nand2tetris" /><category term="computer system" /><category term="notes" /><summary type="html"><![CDATA[Boolean Logic De Morgan’s Law]]></summary></entry><entry><title type="html">Revisiting Nand2Tetris: Building a Computer from Scratch</title><link href="https://guozijn.github.io/learning/2025/10/05/nand2tetris.html" rel="alternate" type="text/html" title="Revisiting Nand2Tetris: Building a Computer from Scratch" /><published>2025-10-05T00:00:00+09:30</published><updated>2025-10-05T00:00:00+09:30</updated><id>https://guozijn.github.io/learning/2025/10/05/nand2tetris</id><content type="html" xml:base="https://guozijn.github.io/learning/2025/10/05/nand2tetris.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In an age where computing systems are defined by abstraction layers and pre-built frameworks, the <em>Nand2Tetris</em> project, designed by Noam Nisan and Shimon Schocken, offers a rare opportunity to return to the foundations of computer science. This educational journey begins with a single universal logic gate, the NAND gate, and gradually guides learners toward constructing a fully functioning computer capable of running a high-level programming language and simple applications such as the game Tetris. By bridging hardware architecture, machine language, operating systems, and compiler design, Nand2Tetris provides an integrated understanding of how each layer of computing interacts to form a cohesive whole.</p>

<hr />

<h2 id="chapter-1-boolean-logic">Chapter 1: Boolean Logic</h2>

<p>This chapter begins with <strong>Boolean algebra</strong> and the <strong>NAND gate</strong>, the universal logic gate from which all others can be constructed. Students implement basic gates such as NOT, AND, OR, and XOR, laying the foundation for digital computation.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>A ----\                
       NAND ----&gt; Output
B ----/
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Through these constructions, learners understand how simple gates combine to form complex logical circuits.</p>
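<p>These constructions are easy to check in Python by deriving every gate from NAND alone. This is an illustrative model of the logic, not the course's HDL:</p>

```python
# All gates built from NAND alone; inputs and outputs are 0/1.

def NAND(a, b):
    return 0 if (a and b) else 1

def NOT(a):
    return NAND(a, a)

def AND(a, b):
    return NOT(NAND(a, b))

def OR(a, b):
    return NAND(NOT(a), NOT(b))

def XOR(a, b):
    return AND(OR(a, b), NAND(a, b))
```

<p>Enumerating the four input pairs reproduces each gate's truth table, and also verifies De Morgan's law from the next section.</p>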

<h3 id="de-morgans-law">De Morgan’s Law</h3>

\[\overline{AB} = \overline{A} + \overline{B}\]

\[\overline{A + B} = \overline{A}\,\overline{B}\]

<hr />

<h2 id="chapter-2-boolean-arithmetic">Chapter 2: Boolean Arithmetic</h2>

<p>The second chapter focuses on <strong>binary arithmetic</strong>. Using logic gates, students build half-adders and full-adders, then chain them to construct multi-bit adders capable of handling binary numbers.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>A, B ----&gt; XOR ----&gt; Sum
A, B ----&gt; AND ----&gt; Carry
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This establishes how computers perform arithmetic at the hardware level.</p>
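<p>The chaining described above can be sketched in Python: a half-adder, a full-adder built from two of them, and a 16-bit ripple-carry adder. An illustrative model, not the course's HDL chips:</p>

```python
# Adder chain built up from single-bit pieces.

def half_adder(a, b):
    return a ^ b, a & b                 # (sum, carry)

def full_adder(a, b, c):
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, c)
    return s2, c1 | c2                  # at most one carry can be set

def add16(x, y):
    """Ripple-carry addition of two 16-bit words, wrapping on overflow."""
    carry, out = 0, 0
    for i in range(16):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out
```

<p>Like the hardware, the final carry out of bit 15 is discarded, so the addition wraps modulo 2<sup>16</sup>.</p>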

<hr />

<h2 id="chapter-3-sequential-logic">Chapter 3: Sequential Logic</h2>

<p>Sequential logic introduces <strong>state</strong>—the ability to remember information. Using feedback loops and flip-flops, students design circuits that store data over time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>Input ----&gt; NAND ----&gt; Output
              ^            |
              +------------+
</pre></td></tr></tbody></table></code></pre></div></div>

<p>These principles lead to the design of <strong>registers</strong> and <strong>counters</strong>, key elements of memory systems.</p>

<hr />

<h2 id="chapter-4-machine-language">Chapter 4: Machine Language</h2>

<p>Here, students are introduced to the <strong>Hack machine language</strong>, the instruction set that the computer will eventually execute. They learn how the CPU interprets binary codes as instructions for computation and memory manipulation.</p>

<p>Example program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>@2
D=A
@3
D=D+A
@0
M=D
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This simple example adds two numbers and stores the result in memory.</p>

<hr />

<h2 id="chapter-5-computer-architecture">Chapter 5: Computer Architecture</h2>

<p>This chapter integrates earlier components into a complete <strong>CPU</strong>. Students combine the ALU (Arithmetic Logic Unit), registers, and program counter to build a central processing unit capable of running the Hack machine language.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>Instruction --&gt; Decoder --&gt; Control Bits
Registers --&gt; ALU --&gt; Output + Flags
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This marks the transition from circuit design to system-level computation.</p>

<hr />

<h2 id="chapter-6-assembler">Chapter 6: Assembler</h2>

<p>With the hardware ready, students build an <strong>assembler</strong> to translate Hack assembly language into binary machine code. The assembler resolves symbolic labels and memory variables into numeric addresses.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>@LOOP  →  @4    (label: ROM address of the next instruction)
@i     →  @16   (variable: allocated in RAM from address 16)
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This software bridge allows humans to program the hardware more effectively.</p>
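<p>The heart of this translation is a two-pass symbol table. The following Python sketch illustrates the idea under simplifying assumptions (whitespace and comments already stripped); it is not the book's reference assembler:</p>

```python
# Hypothetical two-pass symbol resolver for Hack assembly.

PREDEFINED = {"SP": 0, "LCL": 1, "ARG": 2, "THIS": 3, "THAT": 4,
              "SCREEN": 16384, "KBD": 24576,
              **{f"R{i}": i for i in range(16)}}

def resolve_symbols(lines):
    symbols = dict(PREDEFINED)
    # Pass 1: a label aliases the ROM address of the next instruction
    # and consumes no address itself.
    instructions, addr = [], 0
    for line in lines:
        if line.startswith("("):
            symbols[line.strip("()")] = addr
        else:
            instructions.append(line)
            addr += 1
    # Pass 2: unknown @symbols are variables, allocated from RAM[16] up.
    next_var, resolved = 16, []
    for line in instructions:
        if line.startswith("@") and not line[1:].isdigit():
            name = line[1:]
            if name not in symbols:
                symbols[name] = next_var
                next_var += 1
            line = f"@{symbols[name]}"
        resolved.append(line)
    return resolved
```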

<hr />

<h2 id="chapter-7-virtual-machine-i--stack-arithmetic">Chapter 7: Virtual Machine I — Stack Arithmetic</h2>

<p>The <strong>Virtual Machine (VM)</strong> language introduces a stack-based computation model that abstracts hardware operations. It sits between the Jack high-level language and the Hack assembly language, providing a clean interface for arithmetic, logic, and memory commands.</p>

<p>All computations occur on a stack using <code class="language-plaintext highlighter-rouge">push</code> and <code class="language-plaintext highlighter-rouge">pop</code> instructions. Operands are pushed onto the stack, an operation (like <code class="language-plaintext highlighter-rouge">add</code> or <code class="language-plaintext highlighter-rouge">sub</code>) is performed, and the result is stored back on top.</p>

<p>Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>push constant 7
push constant 8
add
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Execution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>Stack: [7] → [7,8] → add → [15]
</pre></td></tr></tbody></table></code></pre></div></div>
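<p>The execution trace above can be reproduced with a few lines of Python. This is an illustrative mini-interpreter for just these commands, not the course's VM emulator:</p>

```python
# Minimal stack-machine interpreter for `push constant`, `add`, `sub`.

def run_vm(commands):
    stack = []
    for cmd in commands:
        parts = cmd.split()
        if parts[:2] == ["push", "constant"]:
            stack.append(int(parts[2]))
        elif cmd == "add":
            y, x = stack.pop(), stack.pop()
            stack.append(x + y)
        elif cmd == "sub":
            y, x = stack.pop(), stack.pop()
            stack.append(x - y)
    return stack
```

<p>Note that the second operand popped is <code>x</code>, so <code>sub</code> computes <code>x - y</code> in stack order.</p>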

<p>The VM defines memory segments that map to hardware:</p>

<table>
  <thead>
    <tr>
      <th>Segment</th>
      <th>Purpose</th>
      <th>Hack Mapping</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>constant</td>
      <td>literal values</td>
      <td>none</td>
    </tr>
    <tr>
      <td>local</td>
      <td>function locals</td>
      <td>RAM[LCL]</td>
    </tr>
    <tr>
      <td>argument</td>
      <td>function args</td>
      <td>RAM[ARG]</td>
    </tr>
    <tr>
      <td>this/that</td>
      <td>object refs</td>
      <td>RAM[THIS]/RAM[THAT]</td>
    </tr>
    <tr>
      <td>temp</td>
      <td>temporary</td>
      <td>RAM[5–12]</td>
    </tr>
    <tr>
      <td>pointer</td>
      <td>controls this/that</td>
      <td>RAM[3–4]</td>
    </tr>
    <tr>
      <td>static</td>
      <td>per-file static vars</td>
      <td>RAM[16+]</td>
    </tr>
  </tbody>
</table>

<p>Arithmetic and logic commands include:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>add, sub, neg, eq, gt, lt, and, or, not
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Example translation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>// VM: add
@SP
AM=M-1
D=M
A=A-1
M=M+D
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This translator layer introduces structured computation independent of physical memory layout and prepares for Chapter 8, which adds branching and function control.</p>
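<p>The five-instruction expansion shown for <code>add</code> follows a fixed pattern shared by all binary operators, so a translator can generate it from a table. A sketch, where only the <code>add</code> row is taken from the listing above and the others follow the same pattern:</p>

```python
# One translation scheme for binary stack operators: pop y into D,
# then fold it into x at the new stack top.

BINARY_OPS = {"add": "M=M+D", "sub": "M=M-D", "and": "M=M&D", "or": "M=M|D"}

def translate_binary(op):
    return ["@SP", "AM=M-1", "D=M", "A=A-1", BINARY_OPS[op]]
```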

<hr />

<h2 id="chapter-8-virtual-machine-ii--program-control">Chapter 8: Virtual Machine II — Program Control</h2>

<p>Extending the VM, this chapter adds <strong>program control</strong> capabilities like branching, looping, and function calls. It demonstrates how higher-level logic is implemented atop a stack-based execution model.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>function Main.fibonacci 0
push argument 0
push constant 2
lt
if-goto BASE_CASE
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="chapter-9-high-level-language">Chapter 9: High-Level Language</h2>

<p>Students are introduced to <strong>Jack</strong>, a simple, object-based language. Jack programs are compiled into VM code, showing the bridge from human-readable syntax to machine-executable logic.</p>

<p>Example:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="kd">class</span> <span class="nc">Main</span> <span class="o">{</span>
  <span class="n">function</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">()</span> <span class="o">{</span>
    <span class="k">do</span> <span class="nc">Output</span><span class="o">.</span><span class="na">printString</span><span class="o">(</span><span class="s">"Hello, world!"</span><span class="o">);</span>
    <span class="k">return</span><span class="o">;</span>
  <span class="o">}</span>
<span class="o">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="chapter-10-compiler-i--syntax-analysis">Chapter 10: Compiler I — Syntax Analysis</h2>

<p>Here, the compiler is built. Students first construct a <strong>syntax analyser</strong> that parses Jack programs into structured representations (parse trees). This teaches the foundations of compiler design.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>Jack Source → Tokeniser → Parser → Parse Tree
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="chapter-11-compiler-ii--code-generation">Chapter 11: Compiler II — Code Generation</h2>

<p>In this chapter, students implement <strong>code generation</strong>, translating parsed Jack syntax into executable VM commands. This finalises the high-level language pipeline from source code to VM bytecode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>Jack → VM Code → Assembly → Binary → Execution
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="chapter-12-operating-system">Chapter 12: Operating System</h2>

<p>Students write the <strong>Jack operating system (OS)</strong>, implementing essential libraries like <code class="language-plaintext highlighter-rouge">Math</code>, <code class="language-plaintext highlighter-rouge">Memory</code>, <code class="language-plaintext highlighter-rouge">String</code>, and <code class="language-plaintext highlighter-rouge">Array</code>. These provide higher-level abstractions that simplify application development.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre>+------------------+
| Application Code |
| OS Libraries     |
| VM + Compiler    |
| CPU + Memory     |
+------------------+
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The OS marks the final layer of abstraction between hardware and user-level software.</p>

<hr />

<h2 id="chapter-13-postscript--more-fun-to-go">Chapter 13: Postscript — More Fun to Go</h2>

<p>The book concludes with reflections on further exploration. Having built a full computer system—from hardware logic to operating system—students can now explore real-world architectures, programming languages, and computer science research topics.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Nand2Tetris demystifies the complexity of computing by reconstructing it from first principles. It unifies hardware and software learning, empowering students to understand every layer of modern computation, from NAND gates to game programs.</p>]]></content><author><name></name></author><category term="Learning" /><category term="nand2tetris" /><category term="computer system" /><summary type="html"><![CDATA[Introduction In the age where computing systems are defined by abstraction layers and pre-built frameworks, the Nand2Tetris project, which was designed by Noam Nisan and Shimon Schocken, offers a rare opportunity to return to the foundations of computer science. This educational journey begins with the simplest possible logic gate, the NAND gate, and gradually guides learners toward constructing a fully functioning computer capable of running a high-level programming language and simple applications such as the game Tetris. By bridging hardware architecture, machine language, operating systems, and compiler design, Nand2Tetris provides an integrated understanding of how each layer of computing interacts to form a cohesive whole.]]></summary></entry><entry><title type="html">Multi-Layer Perceptron Neural Networks</title><link href="https://guozijn.github.io/engineering/2025/09/27/multi-layer-perceptrons.html" rel="alternate" type="text/html" title="Multi-Layer Perceptron Neural Networks" /><published>2025-09-27T00:00:00+09:30</published><updated>2025-09-27T00:00:00+09:30</updated><id>https://guozijn.github.io/engineering/2025/09/27/multi-layer-perceptrons</id><content type="html" xml:base="https://guozijn.github.io/engineering/2025/09/27/multi-layer-perceptrons.html"><![CDATA[<h2 id="key-concepts">Key Concepts</h2>
<ul>
  <li><strong>Hidden Layers</strong><br />
An MLP contains one or more hidden layers between the input and output.<br />
Each hidden layer applies a linear transformation followed by a non-linear activation function:<br />
\(H_\ell = f_\ell(H_{\ell-1} W_\ell + \mathbf{1} b_\ell^\top)\)<br />
where:
    <ul>
      <li>$H_{\ell-1}$ is the previous layer’s output (with $H_0 = X$).</li>
      <li>$W_\ell, b_\ell$ are the weight matrix and bias vector for layer $\ell$.</li>
      <li>$f_\ell(\cdot)$ is a non-linear activation (e.g., ReLU, tanh, sigmoid).</li>
    </ul>
  </li>
  <li><strong>Deep Representations</strong><br />
Multiple hidden layers allow the network to learn hierarchical feature representations.
    <ul>
      <li>Early layers capture <strong>low-level patterns</strong> (e.g., edges in images).</li>
      <li>Deeper layers capture <strong>higher-level abstractions</strong> (e.g., object shapes).</li>
    </ul>
  </li>
  <li><strong>Activation Functions</strong><br />
Unlike a classic perceptron (which often uses a step function) or logistic regression (which uses sigmoid), MLPs commonly use:
    <ul>
      <li><strong>ReLU</strong>: $f(x) = \max(0, x)$ (default in modern deep learning).</li>
      <li><strong>Tanh</strong>: rescales input to $[-1,1]$.</li>
      <li><strong>Sigmoid</strong>: mainly used in the output layer for binary classification.</li>
    </ul>
  </li>
  <li><strong>Output Layer</strong>
    <ul>
      <li>For <strong>binary classification</strong>: sigmoid function produces $p = \sigma(z)$.</li>
      <li>For <strong>multi-class classification</strong>: softmax produces probability distribution over $K$ classes:<br />
\(p_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)} \quad (k=1,\dots,K)\)</li>
    </ul>
  </li>
  <li><strong>Loss Function (Extension)</strong>
    <ul>
      <li>Binary case: same as SLP (binary cross-entropy).</li>
      <li>Multi-class case: categorical cross-entropy with one-hot labels:<br />
\(\mathcal{L} = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K y_{i,k} \log p_{i,k}\)</li>
    </ul>
  </li>
  <li>
    <p><strong>Backpropagation Through Layers</strong><br />
The error signal is propagated backward through each layer using the <strong>chain rule</strong>, enabling gradient computation for all parameters:<br />
\(\frac{\partial \mathcal{L}}{\partial W_\ell}, \quad \frac{\partial \mathcal{L}}{\partial b_\ell}, \quad \ell = 1,\dots,L\)</p>
  </li>
  <li><strong>Universal Approximation</strong><br />
With enough hidden units, an MLP can approximate any continuous function on a compact domain.<br />
This property underlies its power as a general-purpose function approximator.</li>
</ul>

<h2 id="architecture-of-a-multi-layer-perceptron-neural-network">Architecture of a Multi-Layer Perceptron Neural Network</h2>

<p><img src="https://images.zjguo.com/mlp_architecture.png" alt="mlp_architecture.png" /></p>

<h2 id="formulas">Formulas</h2>

<ul>
  <li><strong>Input</strong>
    <ul>
      <li>Mini-batch input:
\(X \in \mathbb{R}^{m \times n}\)</li>
      <li>where:
        <ul>
          <li>$m$ = batch size</li>
          <li>$n$ = number of features</li>
        </ul>
      </li>
      <li>Parameters:
\(W_1 \in \mathbb{R}^{n \times k_1}, \quad b_1 \in \mathbb{R}^{k_1}\)
\(W_2 \in \mathbb{R}^{k_1 \times k_2}, \quad b_2 \in \mathbb{R}^{k_2}\)
\(w_3 \in \mathbb{R}^{k_2}, \quad b_3 \in \mathbb{R}\)</li>
    </ul>
  </li>
  <li>
    <p><strong>Forward Propagation</strong></p>

    <ol>
      <li>
        <p><strong>Hidden Layer 1</strong>
\(H_1 = f_1\!\big( X W_1 + \mathbf{1} b_1^\top \big) \quad \in \mathbb{R}^{m \times k_1}\)</p>
      </li>
      <li>
        <p><strong>Hidden Layer 2</strong>
\(H_2 = f_2\!\big( H_1 W_2 + \mathbf{1} b_2^\top \big) \quad \in \mathbb{R}^{m \times k_2}\)</p>
      </li>
      <li>
        <p><strong>Output Pre-activation</strong>
\(z = H_2 w_3 + b_3 \mathbf{1} \quad \in \mathbb{R}^{m}\)</p>
      </li>
      <li>
        <p><strong>Sigmoid Activation</strong>
\(p = \sigma(z) = \frac{1}{1 + e^{-z}} \quad \in \mathbb{R}^{m}\)</p>
      </li>
    </ol>
  </li>
  <li>
    <p><strong>Prediction</strong>
\(\hat{y}_i =
\begin{cases}
1, &amp; \text{if } p_i \geq \tau \\
0, &amp; \text{if } p_i &lt; \tau
\end{cases}\)
with threshold $\tau = 0.5$.</p>
  </li>
  <li>
    <p><strong>Loss Function (Binary Cross-Entropy)</strong></p>

    <p>For a batch:
\(\mathcal{L} = -\frac{1}{m} \sum_{i=1}^m \Big( y_i \log(p_i) + (1-y_i)\log(1-p_i) \Big)\)</p>

    <p>Vectorised form:
\(\mathcal{L} = -\frac{1}{m} \Big[ y^\top \log p + (1-y)^\top \log (1-p) \Big]\)</p>
  </li>
  <li>
    <p><strong>Backpropagation (Gradients)</strong></p>

    <ul>
      <li>
        <p>Output layer:
\(\frac{\partial \mathcal{L}}{\partial z} = \frac{1}{m}(p - y) \quad \in \mathbb{R}^m\)</p>
      </li>
      <li>
        <p>Gradients for output weights and bias:
\(\frac{\partial \mathcal{L}}{\partial w_3} = \frac{1}{m} H_2^\top (p-y)\)
\(\frac{\partial \mathcal{L}}{\partial b_3} = \frac{1}{m} \mathbf{1}^\top (p-y)\)</p>
      </li>
      <li>
        <p>Hidden layers (chain rule through the chosen activations $f_1, f_2$; with pre-activation $Z_2 = H_1 W_2 + \mathbf{1} b_2^\top$):
\(\frac{\partial \mathcal{L}}{\partial Z_2} = \Big( \tfrac{1}{m}(p - y)\, w_3^\top \Big) \odot f_2'(Z_2), \quad
\frac{\partial \mathcal{L}}{\partial W_2} = H_1^\top \frac{\partial \mathcal{L}}{\partial Z_2}, \quad
\frac{\partial \mathcal{L}}{\partial b_2} = \Big(\frac{\partial \mathcal{L}}{\partial Z_2}\Big)^{\top} \mathbf{1}\)
with the same pattern applied once more for $W_1, b_1$.</p>
      </li>
    </ul>
  </li>
  <li>
    <p><strong>Parameter Update (Gradient Descent)</strong></p>

    <p>With learning rate $\eta &gt; 0$:
\(W_\ell \leftarrow W_\ell - \eta \frac{\partial \mathcal{L}}{\partial W_\ell}, \quad
b_\ell \leftarrow b_\ell - \eta \frac{\partial \mathcal{L}}{\partial b_\ell}
\quad (\ell = 1,2,3)\)</p>
  </li>
</ul>
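<p>The forward pass and loss above translate almost line-for-line into NumPy. A minimal sketch with small, arbitrary dimensions (all sizes and the random inputs are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k1, k2 = 8, 5, 4, 3          # batch size, features, hidden widths

# Inputs and parameters, matching the shapes in the Formulas section
X = rng.standard_normal((m, n))
W1, b1 = rng.standard_normal((n, k1)), np.zeros(k1)
W2, b2 = rng.standard_normal((k1, k2)), np.zeros(k2)
w3, b3 = rng.standard_normal(k2), 0.0
y = rng.integers(0, 2, size=m).astype(float)

relu = lambda t: np.maximum(0.0, t)

# Forward propagation: H1 -> H2 -> z -> p
H1 = relu(X @ W1 + b1)             # (m, k1); broadcasting adds b1 to every row
H2 = relu(H1 @ W2 + b2)            # (m, k2)
z = H2 @ w3 + b3                   # (m,)
p = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation

# Binary cross-entropy averaged over the batch
eps = 1e-12
loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
print(loss)
```

<p>NumPy's row-wise broadcasting plays the role of the $\mathbf{1} b^\top$ terms in the formulas, so the biases never need to be tiled explicitly.</p>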

<h2 id="implementation-and-explanation">Implementation and Explanation</h2>

<p>This section contrasts a from-scratch NumPy implementation with an equivalent PyTorch model. Both pipelines share the same data preprocessing, hyperparameters, and evaluation workflow so their learning curves can be compared directly. A correct manual implementation should produce broadly similar learning behaviour to PyTorch; large gaps usually point to implementation details such as gradient scaling, initialisation, or optimiser settings rather than to autograd itself.</p>
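<p>One quick way to localise such gaps is a finite-difference check on a single layer's backward pass. The sketch below verifies the standard softmax cross-entropy gradient, $(\text{probs} - \text{labels})/m$, against numerical differentiation, using the same column-major layout (classes &#215; samples) as the custom code that follows; the shapes and random data are illustrative.</p>

```python
import numpy as np

def ce_forward(logits, labels):
    # Softmax over classes (axis 0); each column is one sample
    shifted = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)
    loss = -np.sum(labels * np.log(probs + 1e-15)) / labels.shape[1]
    return loss, probs

rng = np.random.default_rng(1)
logits = rng.standard_normal((3, 4))          # 3 classes, 4 samples
labels = np.eye(3)[rng.integers(0, 3, 4)].T   # one-hot, shape (3, 4)

loss, probs = ce_forward(logits, labels)
grad_analytic = (probs - labels) / labels.shape[1]

# Numerical gradient, one logit entry at a time (central differences)
eps = 1e-6
grad_numeric = np.zeros_like(logits)
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        bump = np.zeros_like(logits)
        bump[i, j] = eps
        lp, _ = ce_forward(logits + bump, labels)
        lm, _ = ce_forward(logits - bump, labels)
        grad_numeric[i, j] = (lp - lm) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # should be tiny
```

<p>If this check passes for every layer in isolation, a remaining accuracy gap is more likely caused by initialisation, learning rate, or data handling than by the gradients themselves.</p>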

<h3 id="custom-version">Custom Version</h3>

<p>The custom network is assembled from lightweight building blocks: <code class="language-plaintext highlighter-rouge">Linear</code>, <code class="language-plaintext highlighter-rouge">ReLU</code>, and <code class="language-plaintext highlighter-rouge">CrossEntropy</code>. Each layer stores the activations it needs for the backward pass, computes gradients manually, and updates its parameters via SGD in the <code class="language-plaintext highlighter-rouge">step</code> routine. Utility helpers handle one-hot encoding, mini-batch iteration, normalisation, and accuracy tracking so the training loop mirrors a framework-driven workflow while keeping every tensor transformation explicit.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Linear</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">in_features</span><span class="p">,</span> <span class="n">out_features</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">out_features</span><span class="p">,</span> <span class="n">in_features</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mf">2.0</span> <span class="o">/</span> <span class="n">in_features</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">out_features</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">W</span> <span class="o">@</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span>

    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad_output</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dW</span> <span class="o">=</span> <span class="n">grad_output</span> <span class="o">@</span> <span class="bp">self</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">T</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">db</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">grad_output</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">W</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">grad_output</span>


<span class="k">class</span> <span class="nc">ReLU</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mask</span> <span class="o">=</span> <span class="n">x</span> <span class="o">&gt;</span> <span class="mi">0</span>
        <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad_output</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">grad_output</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">mask</span>


<span class="k">class</span> <span class="nc">CrossEntropy</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">logits</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
        <span class="n">shifted</span> <span class="o">=</span> <span class="n">logits</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="n">exp_scores</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">shifted</span><span class="p">)</span>
        <span class="n">probs</span> <span class="o">=</span> <span class="n">exp_scores</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">exp_scores</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">probs</span> <span class="o">=</span> <span class="n">probs</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">labels</span> <span class="o">=</span> <span class="n">labels</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">labels</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">probs</span> <span class="o">+</span> <span class="mf">1e-15</span><span class="p">))</span> <span class="o">/</span> <span class="n">labels</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">probs</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">labels</span><span class="p">)</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">labels</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>


<span class="k">class</span> <span class="nc">ThreeLayerNN</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim1</span><span class="p">,</span> <span class="n">hidden_dim2</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim1</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">act1</span> <span class="o">=</span> <span class="n">ReLU</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim1</span><span class="p">,</span> <span class="n">hidden_dim2</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">act2</span> <span class="o">=</span> <span class="n">ReLU</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim2</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">z1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">a1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">act1</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">z1</span><span class="p">)</span>
        <span class="n">z2</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">a1</span><span class="p">)</span>
        <span class="n">a2</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">act2</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">z2</span><span class="p">)</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">a2</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">logits</span>

    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad_output</span><span class="p">):</span>
        <span class="n">grad_hidden2</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_output</span><span class="p">)</span>
        <span class="n">grad_hidden2</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">act2</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_hidden2</span><span class="p">)</span>
        <span class="n">grad_hidden1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_hidden2</span><span class="p">)</span>
        <span class="n">grad_hidden1</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">act1</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_hidden1</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_hidden1</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">lr</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">W</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">dW</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">b</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">.</span><span class="n">db</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">W</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">dW</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">b</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">.</span><span class="n">db</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">W</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">dW</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">b</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">.</span><span class="n">db</span>


<span class="k">def</span> <span class="nf">one_hot</span><span class="p">(</span><span class="n">labels</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="n">num_classes</span><span class="p">)[</span><span class="n">labels</span><span class="p">].</span><span class="n">T</span>


<span class="k">def</span> <span class="nf">iterate_minibatches</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
    <span class="n">num_samples</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_samples</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">shuffle</span><span class="p">:</span>
        <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">indices</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">start</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">num_samples</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">):</span>
        <span class="n">batch_idx</span> <span class="o">=</span> <span class="n">indices</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">start</span> <span class="o">+</span> <span class="n">batch_size</span><span class="p">]</span>
        <span class="k">yield</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">batch_idx</span><span class="p">],</span> <span class="n">Y</span><span class="p">[:,</span> <span class="n">batch_idx</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">accuracy</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
    <span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">preds</span> <span class="o">==</span> <span class="n">labels</span><span class="p">)</span>


<span class="n">data_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">.</span><span class="n">home</span><span class="p">()</span> <span class="o">/</span> <span class="s">"Code"</span> <span class="o">/</span> <span class="s">"data"</span>

<span class="n">train_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"train.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">)</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"test.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">)</span>

<span class="n">y_train_full</span> <span class="o">=</span> <span class="n">train_data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">X_train_full</span> <span class="o">=</span> <span class="n">train_data</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>

<span class="n">y_test</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>

<span class="n">train_cutoff</span> <span class="o">=</span> <span class="mi">4000</span>
<span class="n">X_train_raw</span> <span class="o">=</span> <span class="n">X_train_full</span><span class="p">[:</span><span class="n">train_cutoff</span><span class="p">]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">y_train_full</span><span class="p">[:</span><span class="n">train_cutoff</span><span class="p">]</span>
<span class="n">X_val_raw</span> <span class="o">=</span> <span class="n">X_train_full</span><span class="p">[</span><span class="n">train_cutoff</span><span class="p">:]</span>
<span class="n">y_val</span> <span class="o">=</span> <span class="n">y_train_full</span><span class="p">[</span><span class="n">train_cutoff</span><span class="p">:]</span>

<span class="n">mean</span> <span class="o">=</span> <span class="n">X_train_raw</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">std</span> <span class="o">=</span> <span class="n">X_train_raw</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">+</span> <span class="mf">1e-8</span>

<span class="n">X_train_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_train_raw</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>
<span class="n">X_val_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_val_raw</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>
<span class="n">X_test_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_test</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>

<span class="n">X_train_np</span> <span class="o">=</span> <span class="n">X_train_std</span><span class="p">.</span><span class="n">T</span>
<span class="n">X_val_np</span> <span class="o">=</span> <span class="n">X_val_std</span><span class="p">.</span><span class="n">T</span>
<span class="n">X_test_np</span> <span class="o">=</span> <span class="n">X_test_std</span><span class="p">.</span><span class="n">T</span>

<span class="n">num_classes</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">hidden_units</span> <span class="o">=</span> <span class="mi">64</span>

<span class="n">Y_train</span> <span class="o">=</span> <span class="n">one_hot</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>
<span class="n">Y_val</span> <span class="o">=</span> <span class="n">one_hot</span><span class="p">(</span><span class="n">y_val</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>
<span class="n">Y_test</span> <span class="o">=</span> <span class="n">one_hot</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>

<span class="n">hidden_dim1</span> <span class="o">=</span> <span class="n">hidden_units</span>
<span class="n">hidden_dim2</span> <span class="o">=</span> <span class="n">hidden_units</span>

<span class="n">custom_model</span> <span class="o">=</span> <span class="n">ThreeLayerNN</span><span class="p">(</span><span class="n">input_dim</span><span class="o">=</span><span class="n">X_train_np</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
                            <span class="n">hidden_dim1</span><span class="o">=</span><span class="n">hidden_dim1</span><span class="p">,</span>
                            <span class="n">hidden_dim2</span><span class="o">=</span><span class="n">hidden_dim2</span><span class="p">,</span>
                            <span class="n">output_dim</span><span class="o">=</span><span class="n">num_classes</span><span class="p">)</span>
<span class="n">criterion_np</span> <span class="o">=</span> <span class="n">CrossEntropy</span><span class="p">()</span>

<span class="n">epochs</span> <span class="o">=</span> <span class="mi">50</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">64</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.1</span>

<span class="n">custom_history</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">epochs</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
    <span class="n">epoch_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
    <span class="k">for</span> <span class="n">xb</span><span class="p">,</span> <span class="n">yb</span> <span class="ow">in</span> <span class="n">iterate_minibatches</span><span class="p">(</span><span class="n">X_train_np</span><span class="p">,</span> <span class="n">Y_train</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">):</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">xb</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion_np</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">yb</span><span class="p">)</span>
        <span class="n">grad_logits</span> <span class="o">=</span> <span class="n">criterion_np</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">custom_model</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad_logits</span><span class="p">)</span>
        <span class="n">custom_model</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">learning_rate</span><span class="p">)</span>
        <span class="n">epoch_loss</span> <span class="o">+=</span> <span class="n">loss</span> <span class="o">*</span> <span class="n">xb</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">epoch_loss</span> <span class="o">/=</span> <span class="n">X_train_np</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">train_acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="p">(</span><span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">X_train_np</span><span class="p">),</span> <span class="n">y_train</span><span class="p">)</span>
    <span class="n">val_acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="p">(</span><span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">X_val_np</span><span class="p">),</span> <span class="n">y_val</span><span class="p">)</span>
    <span class="n">custom_history</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">epoch</span><span class="p">,</span> <span class="n">epoch_loss</span><span class="p">,</span> <span class="n">train_acc</span><span class="p">,</span> <span class="n">val_acc</span><span class="p">))</span>
    <span class="k">if</span> <span class="n">epoch</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">epoch</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="si">:</span><span class="mi">02</span><span class="n">d</span><span class="si">}</span><span class="s">: loss=</span><span class="si">{</span><span class="n">epoch_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> train_acc=</span><span class="si">{</span><span class="n">train_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> val_acc=</span><span class="si">{</span><span class="n">val_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="n">custom_val_acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="p">(</span><span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">X_val_np</span><span class="p">),</span> <span class="n">y_val</span><span class="p">)</span>
<span class="n">custom_test_acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="p">(</span><span class="n">custom_model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">X_test_np</span><span class="p">),</span> <span class="n">y_test</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Custom validation accuracy: </span><span class="si">{</span><span class="n">custom_val_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Custom test accuracy: </span><span class="si">{</span><span class="n">custom_test_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="training-custom-model">Training Custom Model</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="o">(</span>base<span class="o">)</span> ➜  draft python ml/mlp.py
Epoch 01: <span class="nv">loss</span><span class="o">=</span>0.8111 <span class="nv">train_acc</span><span class="o">=</span>0.4998 <span class="nv">val_acc</span><span class="o">=</span>0.5016
Epoch 10: <span class="nv">loss</span><span class="o">=</span>0.6645 <span class="nv">train_acc</span><span class="o">=</span>0.6025 <span class="nv">val_acc</span><span class="o">=</span>0.6011
Epoch 20: <span class="nv">loss</span><span class="o">=</span>0.5992 <span class="nv">train_acc</span><span class="o">=</span>0.6810 <span class="nv">val_acc</span><span class="o">=</span>0.6613
Epoch 30: <span class="nv">loss</span><span class="o">=</span>0.5376 <span class="nv">train_acc</span><span class="o">=</span>0.7455 <span class="nv">val_acc</span><span class="o">=</span>0.7129
Epoch 40: <span class="nv">loss</span><span class="o">=</span>0.4696 <span class="nv">train_acc</span><span class="o">=</span>0.7965 <span class="nv">val_acc</span><span class="o">=</span>0.7607
Epoch 50: <span class="nv">loss</span><span class="o">=</span>0.3977 <span class="nv">train_acc</span><span class="o">=</span>0.8435 <span class="nv">val_acc</span><span class="o">=</span>0.7991
Custom validation accuracy: 0.7991
Custom <span class="nb">test </span>accuracy: 0.7980
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="pytorch-version">PyTorch Version</h3>

<p>The PyTorch variant recreates the same architecture with <code class="language-plaintext highlighter-rouge">nn.Sequential</code>, letting autograd handle the gradient computation. Dataset splits are wrapped in <code class="language-plaintext highlighter-rouge">TensorDataset</code>/<code class="language-plaintext highlighter-rouge">DataLoader</code>, which provides shuffling and mini-batching for free, and the training loop follows the standard <code class="language-plaintext highlighter-rouge">optimizer.zero_grad() → loss.backward() → optimizer.step()</code> pattern. Reusing the preprocessing from the custom section ensures that any performance differences come from the model and optimization setup rather than from how the data was prepared.</p>
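<p>The three-call training step is easiest to see in isolation on a toy problem. The snippet below is a minimal sketch, not part of the original script: it fits a single <code class="language-plaintext highlighter-rouge">nn.Linear</code> layer to synthetic data purely to illustrate the <code class="language-plaintext highlighter-rouge">zero_grad → backward → step</code> cycle before the full implementation that follows.</p>

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression task (illustrative only): learn y = 2x with a single linear layer.
x = torch.randn(64, 1)
y = 2.0 * x

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = criterion(model(x), y).item()
for _ in range(100):
    optimizer.zero_grad()          # clear gradients accumulated in the previous step
    loss = criterion(model(x), y)  # forward pass
    loss.backward()                # backpropagate to fill param.grad
    optimizer.step()               # apply the SGD update
final_loss = criterion(model(x), y).item()

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

<p>The same skeleton, with a <code class="language-plaintext highlighter-rouge">DataLoader</code> supplying mini-batches instead of the full tensor, is what <code class="language-plaintext highlighter-rouge">train_model</code> below implements.</p>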

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">TensorDataset</span><span class="p">,</span> <span class="n">DataLoader</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>


<span class="c1"># Load data and create train/validation/test splits
</span><span class="n">data_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">.</span><span class="n">home</span><span class="p">()</span> <span class="o">/</span> <span class="s">"Code"</span> <span class="o">/</span> <span class="s">"data"</span>

<span class="n">train_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"train.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">)</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"test.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">)</span>

<span class="n">y_train_full</span> <span class="o">=</span> <span class="n">train_data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">X_train_full</span> <span class="o">=</span> <span class="n">train_data</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>

<span class="n">y_test</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>

<span class="n">train_cutoff</span> <span class="o">=</span> <span class="mi">4000</span>
<span class="n">X_train_raw</span> <span class="o">=</span> <span class="n">X_train_full</span><span class="p">[:</span><span class="n">train_cutoff</span><span class="p">]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">y_train_full</span><span class="p">[:</span><span class="n">train_cutoff</span><span class="p">]</span>
<span class="n">X_val_raw</span> <span class="o">=</span> <span class="n">X_train_full</span><span class="p">[</span><span class="n">train_cutoff</span><span class="p">:]</span>
<span class="n">y_val</span> <span class="o">=</span> <span class="n">y_train_full</span><span class="p">[</span><span class="n">train_cutoff</span><span class="p">:]</span>

<span class="n">mean</span> <span class="o">=</span> <span class="n">X_train_raw</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">std</span> <span class="o">=</span> <span class="n">X_train_raw</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">+</span> <span class="mf">1e-8</span>

<span class="n">X_train_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_train_raw</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>
<span class="n">X_val_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_val_raw</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>
<span class="n">X_test_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_test</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">std</span>

<span class="n">X_train_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X_train_std</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">y_train_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">long</span><span class="p">)</span>
<span class="n">X_val_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X_val_std</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">y_val_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y_val</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">long</span><span class="p">)</span>
<span class="n">X_test_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X_test_std</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">y_test_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">long</span><span class="p">)</span>

<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">64</span>
<span class="n">train_dataset</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X_train_tensor</span><span class="p">,</span> <span class="n">y_train_tensor</span><span class="p">)</span>
<span class="n">val_dataset</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X_val_tensor</span><span class="p">,</span> <span class="n">y_val_tensor</span><span class="p">)</span>
<span class="n">test_dataset</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X_test_tensor</span><span class="p">,</span> <span class="n">y_test_tensor</span><span class="p">)</span>

<span class="n">train_loader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">val_loader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">val_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">)</span>
<span class="n">test_loader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">test_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">)</span>

<span class="n">input_dim</span> <span class="o">=</span> <span class="n">X_train_tensor</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">num_classes</span> <span class="o">=</span> <span class="mi">2</span>

<span class="c1"># Define PyTorch MLP and training utilities
</span>
<span class="k">class</span> <span class="nc">TorchMLP</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">net</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">)</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">net</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">evaluate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">data_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">):</span>
    <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
    <span class="n">total_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
    <span class="n">correct</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">xb</span><span class="p">,</span> <span class="n">yb</span> <span class="ow">in</span> <span class="n">data_loader</span><span class="p">:</span>
            <span class="n">xb</span> <span class="o">=</span> <span class="n">xb</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">yb</span> <span class="o">=</span> <span class="n">yb</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">xb</span><span class="p">)</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">yb</span><span class="p">)</span>
            <span class="n">total_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">xb</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
            <span class="n">preds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">correct</span> <span class="o">+=</span> <span class="p">(</span><span class="n">preds</span> <span class="o">==</span> <span class="n">yb</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>
            <span class="n">total</span> <span class="o">+=</span> <span class="n">xb</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">total_loss</span> <span class="o">/</span> <span class="n">total</span><span class="p">,</span> <span class="n">correct</span> <span class="o">/</span> <span class="n">total</span>


<span class="k">def</span> <span class="nf">train_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">train_loader</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">device</span><span class="p">):</span>
    <span class="n">history</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">epochs</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
        <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
        <span class="n">epoch_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">correct</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">for</span> <span class="n">xb</span><span class="p">,</span> <span class="n">yb</span> <span class="ow">in</span> <span class="n">train_loader</span><span class="p">:</span>
            <span class="n">xb</span> <span class="o">=</span> <span class="n">xb</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">yb</span> <span class="o">=</span> <span class="n">yb</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">xb</span><span class="p">)</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">yb</span><span class="p">)</span>
            <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
            <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
            <span class="n">epoch_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">xb</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
            <span class="n">preds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">correct</span> <span class="o">+=</span> <span class="p">(</span><span class="n">preds</span> <span class="o">==</span> <span class="n">yb</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>
            <span class="n">total</span> <span class="o">+=</span> <span class="n">xb</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">train_loss</span> <span class="o">=</span> <span class="n">epoch_loss</span> <span class="o">/</span> <span class="n">total</span>
        <span class="n">train_acc</span> <span class="o">=</span> <span class="n">correct</span> <span class="o">/</span> <span class="n">total</span>
        <span class="n">val_loss</span><span class="p">,</span> <span class="n">val_acc</span> <span class="o">=</span> <span class="n">evaluate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
        <span class="n">history</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">epoch</span><span class="p">,</span> <span class="n">train_loss</span><span class="p">,</span> <span class="n">train_acc</span><span class="p">,</span> <span class="n">val_loss</span><span class="p">,</span> <span class="n">val_acc</span><span class="p">))</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="si">:</span><span class="mi">02</span><span class="n">d</span><span class="si">}</span><span class="s">: train_loss=</span><span class="si">{</span><span class="n">train_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> train_acc=</span><span class="si">{</span><span class="n">train_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> "</span>
              <span class="sa">f</span><span class="s">"val_loss=</span><span class="si">{</span><span class="n">val_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s"> val_acc=</span><span class="si">{</span><span class="n">val_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">history</span>

<span class="c1"># Train the PyTorch model with the same hyperparameters as the custom implementation
</span>

<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">'cuda'</span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s">'cpu'</span><span class="p">)</span>

<span class="n">hidden_units</span> <span class="o">=</span> <span class="mi">64</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="n">epochs</span> <span class="o">=</span> <span class="mi">50</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">TorchMLP</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_units</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>
<span class="n">criterion</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">)</span>

<span class="n">pytorch_history</span> <span class="o">=</span> <span class="n">train_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">train_loader</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>

<span class="n">pytorch_val_loss</span><span class="p">,</span> <span class="n">pytorch_val_acc</span> <span class="o">=</span> <span class="n">evaluate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
<span class="n">pytorch_test_loss</span><span class="p">,</span> <span class="n">pytorch_test_acc</span> <span class="o">=</span> <span class="n">evaluate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">test_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"PyTorch validation accuracy: </span><span class="si">{</span><span class="n">pytorch_val_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, loss: </span><span class="si">{</span><span class="n">pytorch_val_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"PyTorch test accuracy: </span><span class="si">{</span><span class="n">pytorch_test_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, loss: </span><span class="si">{</span><span class="n">pytorch_test_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="training-pytorch-model">Training PyTorch Model</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
</pre></td><td class="rouge-code"><pre>(base) ➜  draft python ml/mlp_torch.py 
Epoch 01: train_loss=0.6716 train_acc=0.6168 val_loss=0.6073 val_acc=0.7809
Epoch 02: train_loss=0.3701 train_acc=0.8952 val_loss=0.1712 val_acc=0.9540
Epoch 03: train_loss=0.1077 train_acc=0.9695 val_loss=0.1098 val_acc=0.9631
Epoch 04: train_loss=0.0564 train_acc=0.9872 val_loss=0.1032 val_acc=0.9667
Epoch 05: train_loss=0.0335 train_acc=0.9942 val_loss=0.0987 val_acc=0.9700
Epoch 06: train_loss=0.0208 train_acc=0.9978 val_loss=0.0992 val_acc=0.9680
Epoch 07: train_loss=0.0132 train_acc=0.9982 val_loss=0.1018 val_acc=0.9684
Epoch 08: train_loss=0.0079 train_acc=0.9990 val_loss=0.1039 val_acc=0.9682
Epoch 09: train_loss=0.0049 train_acc=1.0000 val_loss=0.1036 val_acc=0.9709
Epoch 10: train_loss=0.0037 train_acc=1.0000 val_loss=0.1046 val_acc=0.9709
Epoch 11: train_loss=0.0029 train_acc=1.0000 val_loss=0.1052 val_acc=0.9709
Epoch 12: train_loss=0.0024 train_acc=1.0000 val_loss=0.1059 val_acc=0.9718
Epoch 13: train_loss=0.0020 train_acc=1.0000 val_loss=0.1067 val_acc=0.9718
Epoch 14: train_loss=0.0017 train_acc=1.0000 val_loss=0.1073 val_acc=0.9722
Epoch 15: train_loss=0.0015 train_acc=1.0000 val_loss=0.1080 val_acc=0.9727
Epoch 16: train_loss=0.0013 train_acc=1.0000 val_loss=0.1087 val_acc=0.9727
Epoch 17: train_loss=0.0012 train_acc=1.0000 val_loss=0.1094 val_acc=0.9731
Epoch 18: train_loss=0.0011 train_acc=1.0000 val_loss=0.1100 val_acc=0.9731
Epoch 19: train_loss=0.0010 train_acc=1.0000 val_loss=0.1105 val_acc=0.9731
Epoch 20: train_loss=0.0009 train_acc=1.0000 val_loss=0.1111 val_acc=0.9731
Epoch 21: train_loss=0.0008 train_acc=1.0000 val_loss=0.1117 val_acc=0.9731
Epoch 22: train_loss=0.0008 train_acc=1.0000 val_loss=0.1122 val_acc=0.9731
Epoch 23: train_loss=0.0007 train_acc=1.0000 val_loss=0.1127 val_acc=0.9731
Epoch 24: train_loss=0.0007 train_acc=1.0000 val_loss=0.1131 val_acc=0.9731
Epoch 25: train_loss=0.0006 train_acc=1.0000 val_loss=0.1136 val_acc=0.9733
Epoch 26: train_loss=0.0006 train_acc=1.0000 val_loss=0.1141 val_acc=0.9731
Epoch 27: train_loss=0.0006 train_acc=1.0000 val_loss=0.1145 val_acc=0.9736
Epoch 28: train_loss=0.0005 train_acc=1.0000 val_loss=0.1149 val_acc=0.9733
Epoch 29: train_loss=0.0005 train_acc=1.0000 val_loss=0.1152 val_acc=0.9733
Epoch 30: train_loss=0.0005 train_acc=1.0000 val_loss=0.1156 val_acc=0.9733
Epoch 31: train_loss=0.0005 train_acc=1.0000 val_loss=0.1160 val_acc=0.9731
Epoch 32: train_loss=0.0004 train_acc=1.0000 val_loss=0.1163 val_acc=0.9733
Epoch 33: train_loss=0.0004 train_acc=1.0000 val_loss=0.1167 val_acc=0.9731
Epoch 34: train_loss=0.0004 train_acc=1.0000 val_loss=0.1170 val_acc=0.9733
Epoch 35: train_loss=0.0004 train_acc=1.0000 val_loss=0.1173 val_acc=0.9731
Epoch 36: train_loss=0.0004 train_acc=1.0000 val_loss=0.1176 val_acc=0.9733
Epoch 37: train_loss=0.0004 train_acc=1.0000 val_loss=0.1179 val_acc=0.9733
Epoch 38: train_loss=0.0003 train_acc=1.0000 val_loss=0.1182 val_acc=0.9733
Epoch 39: train_loss=0.0003 train_acc=1.0000 val_loss=0.1185 val_acc=0.9736
Epoch 40: train_loss=0.0003 train_acc=1.0000 val_loss=0.1188 val_acc=0.9736
Epoch 41: train_loss=0.0003 train_acc=1.0000 val_loss=0.1191 val_acc=0.9736
Epoch 42: train_loss=0.0003 train_acc=1.0000 val_loss=0.1193 val_acc=0.9736
Epoch 43: train_loss=0.0003 train_acc=1.0000 val_loss=0.1196 val_acc=0.9736
Epoch 44: train_loss=0.0003 train_acc=1.0000 val_loss=0.1198 val_acc=0.9736
Epoch 45: train_loss=0.0003 train_acc=1.0000 val_loss=0.1201 val_acc=0.9736
Epoch 46: train_loss=0.0003 train_acc=1.0000 val_loss=0.1203 val_acc=0.9736
Epoch 47: train_loss=0.0003 train_acc=1.0000 val_loss=0.1205 val_acc=0.9736
Epoch 48: train_loss=0.0003 train_acc=1.0000 val_loss=0.1208 val_acc=0.9736
Epoch 49: train_loss=0.0002 train_acc=1.0000 val_loss=0.1210 val_acc=0.9736
Epoch 50: train_loss=0.0002 train_acc=1.0000 val_loss=0.1212 val_acc=0.9736
PyTorch validation accuracy: 0.9736, loss: 0.1212
PyTorch test accuracy: 0.9707, loss: 0.1009
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="summary">Summary</h4>

<p>Side-by-side results highlight how much leverage a mature framework provides: PyTorch removes most manual bookkeeping around gradient calculation, parameter updates, device placement, and batching. The NumPy baseline remains valuable for building intuition about tensor shapes, gradient flow, and training dynamics, but its results should be checked carefully because small scaling mistakes can change the effective learning rate by orders of magnitude.</p>]]></content><author><name></name></author><category term="Engineering" /><category term="machine learning" /><category term="neural network" /><category term="mlp" /><summary type="html"><![CDATA[Key Concepts Hidden Layers An MLP contains one or more hidden layers between the input and output. Each hidden layer applies a linear transformation followed by a non-linear activation function: \(H_\ell = f_\ell(H_{\ell-1} W_\ell + \mathbf{1} b_\ell^\top)\) where: $H_{\ell-1}$ is the previous layer’s output (with $H_0 = X$). $W_\ell, b_\ell$ are the weight matrix and bias vector for layer $\ell$. $f_\ell(\cdot)$ is a non-linear activation (e.g., ReLU, tanh, sigmoid).]]></summary></entry><entry><title type="html">Understanding Single-Layer Neural Networks</title><link href="https://guozijn.github.io/engineering/2025/09/26/single-layer-neural-network.html" rel="alternate" type="text/html" title="Understanding Single-Layer Neural Networks" /><published>2025-09-26T00:00:00+09:30</published><updated>2025-09-26T00:00:00+09:30</updated><id>https://guozijn.github.io/engineering/2025/09/26/single-layer-neural-network</id><content type="html" xml:base="https://guozijn.github.io/engineering/2025/09/26/single-layer-neural-network.html"><![CDATA[<h2 id="key-concepts">Key Concepts</h2>
<ul>
  <li>
    <p><strong>Neuron (Perceptron)</strong><br />
A neuron is the fundamental unit of the network. Each neuron computes a weighted sum of inputs plus a bias, then applies an activation function (e.g. step, sigmoid).</p>
  </li>
  <li>
    <p><strong>Input Layer</strong><br />
The input layer accepts the raw data features. Each input is multiplied by an associated weight and passed to the neuron. It is often represented by a vector $\mathbf{x} \in \mathbb{R}^n$.</p>
  </li>
  <li><strong>Weights and Bias</strong>
    <ul>
      <li>Weights represent the importance of each input feature. They are often represented by a vector $\mathbf{w} \in \mathbb{R}^n$.</li>
      <li>Bias allows shifting the decision boundary away from the origin.</li>
    </ul>
  </li>
  <li>
    <p><strong>Linear Combination</strong><br />
The neuron computes:</p>

\[z = \mathbf{w}^\top \mathbf{x} + b\]

    <p>Where $\mathbf{w}$ = weight, $\mathbf{x}$ = input, $b$ = bias</p>
  </li>
  <li>
    <p><strong>Activation Function</strong><br />
The activation function introduces non-linearity (in the classic perceptron, typically a step function) and determines the output class or value. Without it, the network is just a linear model.</p>
  </li>
  <li>
    <p><strong>Output Layer</strong><br />
It provides the final prediction. In a single-layer network, there is only one set of weights between the input and output (no hidden layer).</p>
  </li>
  <li>
    <p><strong>Decision Boundary</strong><br />
The hyperplane separating classes in the input space. In a single-layer network, this boundary is always linear.</p>
  </li>
  <li>
    <p><strong>Ground Truth</strong><br />
The true label of the data point, denoted as $y_{\text{true}} \in {0,1}$. It represents the actual class assigned in the dataset.</p>
  </li>
  <li>
    <p><strong>Prediction</strong><br />
The model produces $\hat{y}$, derived from the probability $p$. Typically, $\hat{y} = 1$ if $p \geq \tau$, else $\hat{y} = 0$.</p>
  </li>
  <li>
    <p><strong>Loss Function (Cross-Entropy)</strong><br />
To train the model, the predicted probability $p$ is compared with the ground truth $y_{\text{true}}$ using the binary cross-entropy loss:</p>

\[\mathcal{L}(y_{\text{true}}, p) = - \big[ y_{\text{true}} \cdot \log(p) + (1 - y_{\text{true}}) \cdot \log(1 - p) \big]\]

    <p>This loss penalises large differences between the predicted probability and the actual label. Minimising $\mathcal{L}$ adjusts the weights $w$ and bias $b$ to improve classification performance.</p>
  </li>
  <li>
    <p><strong>Backpropagation</strong><br />
Gradients of the loss are propagated back to update the weights and bias.</p>
  </li>
  <li>
    <p><strong>Gradient Descent</strong><br />
The parameters are updated iteratively to minimise the loss:</p>

\[\mathbf{w} := \mathbf{w} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{w}}, \quad b := b - \eta \frac{\partial \mathcal{L}}{\partial b}\]

    <p>where $\eta &gt; 0$ is the learning rate controlling the step size.</p>
  </li>
</ul>
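<p>Putting the concepts above together, the whole training procedure fits in a few lines of NumPy. This is an illustrative sketch rather than code from the post; the toy dataset, learning rate, and epoch count are invented for the example.</p>

```python
import numpy as np

# Toy linearly separable data: label is 1 when x1 + x2 > 1
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

w = np.zeros(2)  # weights
b = 0.0          # bias
eta = 0.5        # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(500):
    z = X @ w + b            # linear combination z = w^T x + b
    p = sigmoid(z)           # predicted probability p = sigma(z)
    # Gradients of the mean binary cross-entropy loss w.r.t. w and b
    grad_w = X.T @ (p - y) / len(y)
    grad_b = float(np.mean(p - y))
    w -= eta * grad_w        # gradient descent updates
    b -= eta * grad_b

y_hat = (sigmoid(X @ w + b) >= 0.5).astype(float)  # threshold tau = 0.5
accuracy = float(np.mean(y_hat == y))
print(f"training accuracy: {accuracy:.3f}")
```

<p>Because the data are linearly separable, the learned boundary $\mathbf{w}^\top \mathbf{x} + b = 0$ approaches the true separating line and the training accuracy climbs toward 1.</p>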

<h2 id="architecture-of-a-single-layer-neural-network">Architecture of a Single-Layer Neural Network</h2>
<p><img src="https://images.zjguo.com/single_layer_nn_matrix.png" alt="single_layer_nn_matrix.png" /></p>

<p>This diagram illustrates a <strong>single-layer neural network for binary classification</strong>. The input is represented as a feature vector $\mathbf{x} \in \mathbb{R}^n$, and the parameters of the model are a weight vector $\mathbf{w} \in \mathbb{R}^n$ and a bias term $b \in \mathbb{R}$. The linear combination is computed as $z = \mathbf{w}^\top \mathbf{x} + b$. This value is then passed through a sigmoid activation function $\sigma(z) = \frac{1}{1+e^{-z}}$, which outputs a probability $p = \Pr(y=1 \mid \mathbf{x}) = \sigma(z)$, representing the likelihood that the class label $y$ equals 1. Finally, the probability is compared to a threshold $\tau$ (e.g., 0.5) to produce the predicted class label $y \in \{0,1\}$. The decision boundary of this model is defined by $\mathbf{w}^\top \mathbf{x} + b = 0$.</p>]]></content><author><name></name></author><category term="Engineering" /><category term="machine learning" /><category term="neural network" /><summary type="html"><![CDATA[Key Concepts Neuron (Perceptron) A neuron is the fundamental unit of the network. Each neuron computes a weighted sum of inputs plus a bias, then applies an activation function (e.g. step, sigmoid).]]></summary></entry><entry><title type="html">A Low-Cost Guide to Building an Apple HomeKit Smart Home</title><link href="https://guozijn.github.io/journal/2024/08/29/smart-home-integration.html" rel="alternate" type="text/html" title="A Low-Cost Guide to Building an Apple HomeKit Smart Home" /><published>2024-08-29T00:00:00+09:30</published><updated>2024-08-29T00:00:00+09:30</updated><id>https://guozijn.github.io/journal/2024/08/29/smart-home-integration</id><content type="html" xml:base="https://guozijn.github.io/journal/2024/08/29/smart-home-integration.html"><![CDATA[<blockquote>
  <p>This post documents the full process of how I built an <strong>Apple HomeKit smart home system</strong> on a budget, from choosing hardware to configuring software, through to the final day-to-day experience. I hope it offers a useful reference for anyone thinking of diving in.</p>
</blockquote>

<h2 id="一起因">1. How It Started</h2>

<p>The ceiling light in my bedroom broke, and I figured I would first swap in a new LED module to see whether the module was the problem, so I spent about ¥40 on one that supports Mi Home WiFi pairing.<br />
Only after installing it did I realise I had bought a 48W light strip. Its brightness was mediocre, dimmer even than a desk lamp, but since it lit up, it would do for now.</p>

<p>That sparked an idea: since I was tinkering anyway, why not give all the lights and appliances in the house a smart makeover? I happened to have an idle <strong>HomePod mini</strong>, so I decided to build the experiment around HomeKit.</p>

<h2 id="二思路与方案">2. Approach and Plan</h2>

<h3 id="1-跨生态问题">1. The Cross-Ecosystem Problem</h3>

<p>Apple HomeKit is not compatible with the protocols of ecosystems such as Mi Home and Tuya.<br />
The solution is <strong>Home Assistant (HA)</strong>, an open-source smart home gateway that bridges different protocols and presents Mi Home, Tuya, and other devices to HomeKit as native accessories, so HomeKit can manage them all in one place.</p>

<h3 id="2-硬件准备">2. Hardware</h3>

<ul>
  <li><strong>Runtime environment</strong>: A Raspberry Pi was too expensive, so I bought a second-hand “智趣盒子” box on the used market (¥75, 2 GB RAM + 16 GB ROM) and flashed it to run HA.</li>
  <li><strong>Network setup</strong>: The main router sits in the living room and the bedroom signal was weak, so I attached a bridged router to the bedroom Ethernet port and plugged the box into it. My phone can now roam across the whole home without noticing WiFi handoffs.</li>
  <li><strong>Updates and debugging</strong>: After flashing, I opened <code class="language-plaintext highlighter-rouge">http://192.168.31.31:8123/</code> and found that HA could not update itself automatically. Logging in over SSH showed it was a Docker deployment, so I pulled the latest image manually and restarted the container, which fixed the problem.</li>
</ul>

<p><strong>Home Assistant Docker Services</strong></p>

<p><img src="https://images.zjguo.com/sfbox01.png" alt="Home Assistant Docker services" width="90%" /></p>

<h2 id="三设备接入">3. Connecting Devices</h2>

<h3 id="米家设备">Mi Home Devices</h3>

<p>Onboarding process:</p>

<ol>
  <li>
    <p>Confirm that you have a <strong>home hub</strong> (HomePod / iPad); one is required for remote control.</p>
  </li>
  <li>
    <p>Install the <strong>Xiaomi Miot Auto</strong> integration from HACS.</p>

    <p><img src="https://images.zjguo.com/ha03.png" alt="Xiaomi Miot Auto" width="90%" /></p>
  </li>
  <li>
    <p>Add your Mi Home account and import your devices.</p>

    <p><img src="https://images.zjguo.com/ha05.png" alt="Device and Service" width="90%" />
 <img src="https://images.zjguo.com/miiot01.png" alt="Miiot" width="90%" /></p>
  </li>
  <li>
    <p>Create a <strong>HomeKit Bridge</strong> and bridge the devices into HomeKit.</p>
    <ul>
      <li>Note: a single bridge can expose only one air conditioner, so I put the bedroom air conditioner on its own bridge.</li>
    </ul>

    <p><img src="https://images.zjguo.com/hb01.png" alt="hb01" width="90%" />
 <img src="https://images.zjguo.com/hb02.png" alt="hb02" width="90%" /></p>
  </li>
</ol>

<p>In the end, the Mi Home devices show up directly in the iOS Home app and can be controlled with buttons or Siri.</p>

<p><img src="https://images.zjguo.com/homeapp.jpeg" alt="homeapp" width="30%" /></p>

<h3 id="涂鸦设备">Tuya Devices</h3>

<p>These mainly cover appliances controlled over infrared / RF, such as the TV and the motorised laundry rack.</p>

<p>Steps:</p>

<ol>
  <li>
    <p>Create a cloud project on the <strong>Tuya developer platform</strong> and configure its permissions.</p>

    <p><img src="https://images.zjguo.com/ty01.png" alt="ty01" width="90%" />
 <img src="https://images.zjguo.com/ty03.png" alt="ty03" width="90%" /></p>
  </li>
  <li>
    <p>Pair the devices and learn the remote-control signals in the Tuya or Smart Life app.</p>

    <p><img src="https://images.zjguo.com/tyapp01.png" alt="tyapp01" width="30%" />
 <img src="https://images.zjguo.com/tyapp02.png" alt="tyapp02" width="30%" /></p>
  </li>
  <li>
    <p>Enter the credentials into HA, then add the corresponding switches to the HomeKit Bridge; the devices will then appear in HomeKit.</p>

    <p><img src="https://images.zjguo.com/ty04.png" alt="ty04" width="90%" /></p>
  </li>
</ol>

<h3 id="其他设备接入方案">Other Device Integration Options</h3>

<ul>
  <li><strong>Lights</strong>:
    <ul>
      <li>Light strip control: colour temperature is adjustable, but settings are lost after a power cut.</li>
      <li>Smart switches: usable both manually and via automation; the better choice.</li>
    </ul>
  </li>
  <li><strong>Air conditioners</strong>:
    <ul>
      <li>IoT-capable units can be connected directly.</li>
      <li>Ordinary units need an <strong>AC companion</strong> (e.g. the Gosund 电小酷).</li>
    </ul>
  </li>
  <li><strong>TV / laundry rack</strong>:
    <ul>
      <li>Use a <strong>universal remote</strong> with IR/RF support (Tuya platform).</li>
    </ul>
  </li>
  <li><strong>Power sockets</strong>:
    <ul>
      <li>A smart plug is enough for on/off power control.</li>
    </ul>
  </li>
  <li><strong>Doorbell</strong>:
    <ul>
      <li>Go high-end with the <code class="language-plaintext highlighter-rouge">Aqara G4</code>, or for value pick the <strong>Xiaomi Smart Doorbell 3</strong>.</li>
    </ul>
  </li>
  <li><strong>Curtains</strong>:
    <ul>
      <li>Options are a track motor or a curtain companion (underwhelming in practice; I returned mine).</li>
    </ul>
  </li>
</ul>

<h2 id="四成本清单">4. Cost Breakdown</h2>

<table>
  <thead>
    <tr>
      <th>Item</th>
      <th>Cost (¥)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Second-hand HomePod mini</td>
      <td>599</td>
    </tr>
    <tr>
      <td>智趣盒子 box</td>
      <td>75</td>
    </tr>
    <tr>
      <td>HomeKit 2-gang switch</td>
      <td>45</td>
    </tr>
    <tr>
      <td>Tuya universal remote</td>
      <td>152</td>
    </tr>
    <tr>
      <td>72W HomeKit light strip</td>
      <td>85</td>
    </tr>
    <tr>
      <td>Mi Home AC companion</td>
      <td>59</td>
    </tr>
    <tr>
      <td>Smart plug</td>
      <td>30</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>1045</strong></td>
    </tr>
  </tbody>
</table>

<h2 id="五未来扩展">5. Future Extensions</h2>

<p>The system already covers most day-to-day needs. Possible next steps:</p>

<ul>
  <li>Smart door lock</li>
  <li>Temperature and humidity sensors</li>
  <li>Robot vacuum, air quality sensors</li>
  <li>Automation scenes (e.g. an “arriving home” mode, temperature/humidity triggers)</li>
</ul>

<p>With HA's integrations and automation features, home control within the Apple ecosystem is already flexible and powerful enough.</p>

<h2 id="总结">Conclusion</h2>

<p>For a modest <strong>¥1045</strong>, I successfully integrated Mi Home, Tuya, and other devices into Apple HomeKit under unified control. The key to the whole process was <strong>Home Assistant acting as the bridge</strong>.</p>

<p>If you want to get into this too, start with one or two devices and expand step by step.<br />
Feel free to share your own smart home setups in the comments. 🚀</p>]]></content><author><name></name></author><category term="Journal" /><category term="homekit" /><category term="home assistant" /><category term="smart home" /><summary type="html"><![CDATA[This post documents the full process of how I built an Apple HomeKit smart home system on a budget, from choosing hardware to configuring software, through to the final day-to-day experience. I hope it offers a useful reference for anyone thinking of diving in.]]></summary></entry><entry><title type="html">Ansible Csv Vars Plugin</title><link href="https://guozijn.github.io/engineering/2020/05/07/ansible-csv-vars-plugin.html" rel="alternate" type="text/html" title="Ansible Csv Vars Plugin" /><published>2020-05-07T00:00:00+09:30</published><updated>2020-05-07T00:00:00+09:30</updated><id>https://guozijn.github.io/engineering/2020/05/07/ansible-csv-vars-plugin</id><content type="html" xml:base="https://guozijn.github.io/engineering/2020/05/07/ansible-csv-vars-plugin.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>When rolling out network configs I kept address assignments in spreadsheets because they are easy to edit. Translating those tables into Ansible host vars manually grew painful, so this plugin lets each CSV row feed host-specific variables automatically. During template rendering every device pulls its unique addresses from the sheet, keeping the workflow spreadsheet-friendly while delivering correct per-host values in playbooks.</p>
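<p>To make the end state concrete, here is a plain-Python stand-in for the template rendering described above: one parsed CSV row supplies per-host values to a config template. The template text and variable names are purely illustrative, and Ansible itself would use Jinja2 placeholders rather than <code class="language-plaintext highlighter-rouge">str.format</code>.</p>

```python
# One CSV row parsed into host variables, keyed by column header
host_vars = {
    "gateway": "192.168.30.254",
    "vlan30": "192.168.30.31",
    "vlan40": "192.168.40.31",
}

# Hypothetical switch-config template; in Ansible this would be a
# Jinja2 template with {{ gateway }}-style placeholders
template = (
    "ip default-gateway {gateway}\n"
    "interface Vlan30\n"
    " ip address {vlan30} 255.255.255.0\n"
    "interface Vlan40\n"
    " ip address {vlan40} 255.255.255.0\n"
)

rendered = template.format(**host_vars)
print(rendered)
```

<p>Each device renders the same template with its own row's values, which is exactly what the vars plugin automates.</p>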

<h2 id="requirements">Requirements</h2>

<p>Parse data from CSV into host variables.</p>

<p>CSV file example:</p>

<pre><code class="language-csv">hostname,gateway,vlan30,vlan40,vlan50,vlan60,vlan70
localhost,192.168.30.254,192.168.30.31,192.168.40.31,192.168.50.31,192.168.60.31,192.168.70.31
</code></pre>

<ul>
  <li>Place the CSV file in the <code class="language-plaintext highlighter-rouge">csv_vars</code> directory under <code class="language-plaintext highlighter-rouge">inventory</code> or <code class="language-plaintext highlighter-rouge">playbook</code>, and it will be automatically parsed.</li>
  <li>The <code class="language-plaintext highlighter-rouge">hostname</code> field in the CSV must match the host name in the inventory.</li>
  <li>It is recommended to name each CSV file GROUP_OR_HOST_NAME.csv. Multiple CSV files are supported, with variable overriding across files.</li>
</ul>

<h2 id="write-the-plugin">Write the Plugin</h2>

<p>File path: <code class="language-plaintext highlighter-rouge">vars_plugins/csv_vars.py</code></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="p">...</span><span class="n">existing</span> <span class="n">code</span><span class="p">...</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<blockquote>
  <p>Code from https://github.com/guozijn/csv_vars</p>
</blockquote>
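<p>The plugin body is elided above; see the linked repository for the real source. Its core idea, stripped of Ansible's plugin machinery, is easy to sketch: scan every CSV file in the <code class="language-plaintext highlighter-rouge">csv_vars</code> directory and collect the row whose <code class="language-plaintext highlighter-rouge">hostname</code> column matches the current host. The helper below is an illustrative reimplementation of that idea, not the repository's code, and its sorted-file override order is an assumption made for the example.</p>

```python
import csv
from pathlib import Path

def load_csv_host_vars(csv_dir, hostname):
    """Merge variables for `hostname` from every CSV file under `csv_dir`.

    Files are read in sorted name order, so rows in later files can
    override variables set by earlier ones (an assumed ordering).
    """
    merged = {}
    for csv_file in sorted(Path(csv_dir).glob("*.csv")):
        with csv_file.open(newline="") as fh:
            for row in csv.DictReader(fh):
                if row.get("hostname") == hostname:
                    # Every column except `hostname` becomes a host variable
                    merged.update({k: v for k, v in row.items() if k != "hostname"})
    return merged
```

<p>In a real vars plugin, the <code class="language-plaintext highlighter-rouge">get_vars()</code> entry point would call a helper like this for each inventory host and return the resulting dictionary to Ansible.</p>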

<h2 id="csv-file">CSV File</h2>

<p>Place the CSV file in the <code class="language-plaintext highlighter-rouge">csv_vars</code> directory under inventory or playbook.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>mkdir csv_vars
<span class="nb">cat</span> <span class="o">&lt;&lt;</span> <span class="no">EOF</span><span class="sh"> &gt;&gt; csv_vars/nodes.csv
hostname,gateway,vlan30,vlan40,vlan50,vlan60,vlan70
192.168.77.130,192.168.30.254,192.168.30.31,192.168.40.31,192.168.50.31,192.168.60.31,192.168.70.31
</span><span class="no">EOF
</span></pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="run-playbook">Run Playbook</h2>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="c1"># cat test_vars.yml</span>

<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">192.168.77.130</span>
  <span class="na">gather_facts</span><span class="pi">:</span> <span class="s">no</span>
  <span class="na">tasks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">debug</span><span class="pi">:</span>
        <span class="na">msg</span><span class="pi">:</span> <span class="s2">"</span><span class="s">{{</span><span class="nv"> </span><span class="s">lookup('vars',</span><span class="nv"> </span><span class="s">item)</span><span class="nv"> </span><span class="s">}}"</span>
      <span class="na">loop</span><span class="pi">:</span> <span class="s2">"</span><span class="s">{{</span><span class="nv"> </span><span class="s">hostvars[inventory_hostname].keys()</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">select('match',</span><span class="nv"> </span><span class="s">'^vlan.*$|gateway')</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">list</span><span class="nv"> </span><span class="s">}}"</span>

</pre></td></tr></tbody></table></code></pre></div></div>

<p>Execution result:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
</pre></td><td class="rouge-code"><pre><span class="c"># ansible-playbook test_vars.yml</span>

PLAY <span class="o">[</span>192.168.77.130] <span class="k">************************************************************************************************</span>

TASK <span class="o">[</span>debug] <span class="k">*********************************************************************************************************</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>gateway<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.30.254"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan60<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.60.31"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan30<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.30.31"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan40<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.40.31"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan70<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.70.31"</span>
<span class="o">}</span>
ok: <span class="o">[</span>192.168.77.130] <span class="o">=&gt;</span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>vlan50<span class="o">)</span> <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"msg"</span>: <span class="s2">"192.168.50.31"</span>
<span class="o">}</span>

PLAY RECAP <span class="k">***********************************************************************************************************</span>
192.168.77.130             : <span class="nv">ok</span><span class="o">=</span>1    <span class="nv">changed</span><span class="o">=</span>0    <span class="nv">unreachable</span><span class="o">=</span>0    <span class="nv">failed</span><span class="o">=</span>0    <span class="nv">skipped</span><span class="o">=</span>0    <span class="nv">rescued</span><span class="o">=</span>0    <span class="nv">ignored</span><span class="o">=</span>0   

</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="run-ad-hoc">Run ad-hoc</h2>

<p>Configure <code class="language-plaintext highlighter-rouge">ansible.cfg</code> to set the custom plugin directory:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nn">[defaults]</span>
<span class="py">vars_plugins</span>       <span class="p">=</span> <span class="s">/etc/ansible/vars_plugins</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="c"># ansible 192.168.77.130 -m debug -a 'var=gateway'</span>
192.168.77.130 | SUCCESS <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"gateway"</span>: <span class="s2">"192.168.30.254"</span>
<span class="o">}</span>

<span class="c"># ansible 192.168.77.130 -m debug -a 'var=vlan30'</span>
192.168.77.130 | SUCCESS <span class="o">=&gt;</span> <span class="o">{</span>
    <span class="s2">"vlan30"</span>: <span class="s2">"192.168.30.31"</span>
<span class="o">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>]]></content><author><name></name></author><category term="Engineering" /><category term="ansible" /><category term="devops" /><summary type="html"><![CDATA[Introduction When rolling out network configs I kept address assignments in spreadsheets because they are easy to edit. Translating those tables into Ansible host vars manually grew painful, so this plugin lets each CSV row feed host-specific variables automatically. During template rendering every device pulls its unique addresses from the sheet, keeping the workflow spreadsheet-friendly while delivering correct per-host values in playbooks.]]></summary></entry></feed>