Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-04-29 17:20:21 -05:00)
fix(pdf): cap Mermaid figure sizes and fix nested code fences
Mermaid diagrams were oversized in PDF output. Reduced the viewport width from 800 to 600 and added a LaTeX preamble override that caps Mermaid figures at 0.75\linewidth. Also fixed 7 admonition blocks across 5 ABOUT.md files where nested triple-backtick code fences broke the MyST parser, causing raw Markdown to render in the PDF output.
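For reference, the MyST fencing rule the ABOUT.md fixes rely on: when an admonition body contains a fenced code block, the outer fence must use more backticks than any fence nested inside it; otherwise the inner ``` closes the admonition early and everything after it renders as raw Markdown, which is the symptom that showed up in the PDF. A minimal sketch of a correctly nested block (the admonition title and the Python body here are illustrative, not taken from the book):

````{admonition} Answer
:class: dropdown

```python
# inner block uses three backticks, so the outer admonition fence needs four
print("example")
```
````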
@@ -57,8 +57,9 @@ sphinx:
 # --pdfFit scales PDF to fit the diagram (not full page)
 # --scale 1.0 keeps diagrams at natural size (1.5 was too large for tall diagrams)
 mermaid_output_format: "pdf"
-# Width 800 constrains diagram width; scale must be integer (1 = natural size)
-mermaid_params: ['--pdfFit', '--scale', '1', '--width', '800', '--backgroundColor', 'white']
+# Width 600 constrains diagram viewport; scale 1 = natural size
+# Smaller viewport + pdfcrop produces tighter diagrams that don't stretch to full page width
+mermaid_params: ['--pdfFit', '--scale', '1', '--width', '600', '--backgroundColor', 'white']
 # Use pdfcrop to trim whitespace from mermaid PDFs
 mermaid_pdfcrop: "pdfcrop"
 # Use professional sans-serif font for mermaid diagrams to match document
@@ -91,6 +92,9 @@ sphinx:
 papersize: 'letterpaper'
 pointsize: '10pt'
 figure_align: 'H'
+# Pass 'export' option to adjustbox before Sphinx loads it (avoids option clash).
+# This enables max width/height keys in \includegraphics for mermaid figure capping.
+passoptionstopackages: '\PassOptionsToPackage{export}{adjustbox}'
 fontpkg: |
 % Professional academic font stack (TeX Gyre - available in TeX Live)
 \usepackage{fontspec}
@@ -111,6 +115,27 @@ sphinx:
 \usepackage{hyperref}
 \usepackage{float}
 
+% Cap Mermaid diagram width at 75% of text width.
+% sphinxcontrib-mermaid hardcodes width=\linewidth for all diagrams,
+% which stretches small flowcharts to full page width. This override
+% intercepts \includegraphics and uses adjustbox's max width for
+% mermaid-*.pdf files while passing other images through unchanged.
+% Note: adjustbox 'export' option passed via passoptionstopackages above.
+\let\OrigIncludeGraphics\includegraphics
+\makeatletter
+\renewcommand{\includegraphics}[2][]{%
+  \begingroup
+  \def\@mermaidtest{mermaid-}%
+  \@expandtwoargs\in@{\@mermaidtest}{#2}%
+  \ifin@
+    \OrigIncludeGraphics[max width=0.75\linewidth,max height=0.45\textheight,keepaspectratio]{#2}%
+  \else
+    \OrigIncludeGraphics[#1]{#2}%
+  \fi
+  \endgroup
+}
+\makeatother
+
 % Better figure placement - keep figures inline with text
 \renewcommand{\topfraction}{0.9}
 \renewcommand{\bottomfraction}{0.9}
@@ -732,7 +732,7 @@ Your training profile shows: Forward pass 80ms, Loss computation 120ms, Backward
 
 Your model outputs logits `[50, 100, 150]`. Without the log-sum-exp trick, what happens when you compute softmax? With the trick, what values are actually computed?
 
-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown
 
 **Without the trick (naive softmax):**
@@ -756,13 +756,13 @@ log_softmax = shifted - log_sum_exp = [-100, -50, 0]
 **Result**: Valid log-probabilities, stable training.
 
 **Key insight**: Subtracting max makes largest value 0, so `exp(0) = 1.0` is always safe. Smaller values underflow to 0, but that's fine - they contribute negligibly anyway. This is why **you must use log-sum-exp for any softmax computation**.
-```
+````
 
 **Q4: Loss Function Selection - Classification Problem**
 
 You're building a medical diagnosis system with 5 disease categories. Should you use BinaryCrossEntropyLoss or CrossEntropyLoss? What if the categories aren't mutually exclusive (patient can have multiple diseases)?
 
-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown
 
 **Case 1: Mutually exclusive diseases** (patient has exactly one)
@@ -789,13 +789,13 @@ loss = BinaryCrossEntropyLoss()(probs, targets)
 ```
 
 **Critical medical consideration**: Multi-label is more realistic - patients often have comorbidities!
-```
+````
 
 **Q5: Batch Size Impact - Memory and Gradients**
 
 You train with batch size 32, using 4GB GPU memory. You want to increase to batch size 128. Will memory usage be 16GB? What happens to the loss value and gradient quality?
 
-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown
 
 **Memory usage**: Yes, approximately **16GB** (4× increase)
@@ -828,7 +828,7 @@ optimizer.step() # Update once with accumulated gradients (4×32 = 128 effectiv
 ```
 
 This gives you the gradient quality of batch 128 with only the memory cost of batch 32!
-```
+````
 
 ## Further Reading
 
@@ -881,7 +881,7 @@ Starting high (0.1) provides fast early progress. Gradual decay (0.1 → 0.01) a
 
 You're training for 100 epochs. Each checkpoint is 1 GB. Checkpointing every epoch creates 100 GB of storage. Checkpointing every 10 epochs risks losing 10 epochs of work if training crashes. Design a checkpointing strategy that balances fault tolerance and storage costs.
 
-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown
 
 **Strategy: Keep last N + best + milestones**
@@ -908,7 +908,7 @@ if is_best_validation: # Best
 ```
 
 **Production systems** use this strategy plus cloud storage for off-site backup.
-```
+````
 
 **Q5: Global Norm Clipping Analysis**
 
@@ -871,7 +871,7 @@ This is why you can't arbitrarily increase embedding dimensions. Each doubling d
 
 You trained a model with sinusoidal positional encoding and `max_seq_len=512`. Can you process sequences of length 1024 at inference time? What about with learned positional encoding?
 
-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown
 
 **Sinusoidal PE: Yes** - can extrapolate to length 1024 (or any length)
@@ -889,7 +889,7 @@ Learned PE creates a fixed embedding table of shape `(max_seq_len, embed_dim)`.
 - Truncate sequences to 512 tokens
 
 This is why many production models use sinusoidal or relative positional encodings that can handle variable lengths.
-```
+````
 
 ## Further Reading
 
@@ -815,7 +815,7 @@ Why use 8 heads of 64 dimensions instead of 1 head of 512 dimensions? Parameters
 
 Causal masking zeros out the upper triangle (roughly half the attention matrix). Do we save computation, or just ensure correctness?
 
-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown
 
 **In your implementation: NO computation saved**
@@ -842,7 +842,7 @@ scores = scores + adder_mask_tensor # Masking happens after
 - Sparse attention (BigBird, Longformer): Actually skips computation for sparse patterns
 
 **Memory could be saved:** Store only lower triangle (n²/2 elements), but requires custom indexing
-```
+````
 
 **Q5: Gradient Memory**
 
@@ -860,7 +860,7 @@ Speedup = baseline_latency / optimized_latency = 20ms / 5ms = **4.0x**
 
 Why does the submission schema require `accuracy` as float in [0, 1] instead of allowing any format?
 
-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown
 
 **Type safety enables automation.**
@@ -889,7 +889,7 @@ With schema:
 4. **APIs** - Other tools can consume submissions without custom parsers
 
 **Real example:** Papers with Code leaderboards require strict schemas. Thousands of submissions from different teams aggregate automatically because everyone follows the same format.
-```
+````
 
 ## Further Reading
 