feat(pdf): pipe table cell line breaks with <br> and makecell

- Add Lua filter to convert <br> in table cells to \makecell for PDF
- Use pipe table with • and <br> in hw_acceleration hardware evolution table
- Add makecell package; set arraystretch to 1.6; top-align makecell cells
- Register filter in PDF config
Vijay Janapa Reddi
2026-02-21 15:50:37 -05:00
parent eaa545f115
commit 99661315eb
4 changed files with 144 additions and 34 deletions


@@ -2,5 +2,6 @@ filters:
- filters/sidenote.lua
- filters/inject_parts.lua
- filters/dropcap.lua
- filters/table-cell-linebreaks.lua
- pandoc-ext/diagram
- mlsysbook-ext/custom-numbered-blocks


@@ -98,11 +98,9 @@ $$ Speedup = \frac{1}{(1 - p) + \frac{p}{S}} $$ {#eq-amdahl}
\index{Acceleration Wall!diminishing returns}
Amdahl's Law is not merely theoretical: it explains *why* many GPU upgrades disappoint in practice. The following heatmap (@fig-iron-law-heatmap) visualizes the *Acceleration Wall*—the diminishing returns from faster hardware when serial bottlenecks persist—showing that unless your workload is highly parallelizable ($p > 0.99$), investing in faster hardware yields diminishing returns. The contour values are illustrative ranges for intuition.
::: {#fig-iron-law-heatmap fig-env="figure" fig-pos="htb" fig-cap="**The Iron Law Heatmap**: Total system speedup as a function of Accelerator Speed ($S$) and Parallel Fraction ($p$). The 'Acceleration Wall' at the top reveals that if a workload is even slightly serial ($p < 0.9$), increasing hardware speed yields almost no benefit. Contours span roughly 1×–500× speedup." fig-alt="Heatmap of Speedup vs Accelerator Speed and Parallel Fraction. High speedup (green/yellow) is only achieved in the bottom right corner where Parallel Fraction is near 1.0. The rest of the map is dominated by blue (low speedup), showing the serial bottleneck."}
```{python}
#| label: fig-iron-law-heatmap
#| echo: false
#| fig-cap: "**The Iron Law Heatmap**: Total system speedup as a function of Accelerator Speed ($S$) and Parallel Fraction ($p$). The 'Acceleration Wall' at the top reveals that if a workload is even slightly serial ($p < 0.9$), increasing hardware speed yields almost no benefit. Contours span roughly 1×–500× speedup."
#| fig-alt: "Heatmap of Speedup vs Accelerator Speed and Parallel Fraction. High speedup (green/yellow) is only achieved in the bottom right corner where Parallel Fraction is near 1.0. The rest of the map is dominated by blue (low speedup), showing the serial bottleneck."
# ┌─────────────────────────────────────────────────────────────────────────────
# │ IRON LAW HEATMAP (FIGURE)
# ├─────────────────────────────────────────────────────────────────────────────
@@ -149,6 +147,7 @@ ax.text(100, 0.98, "Compute Bound", color='black', ha='center', va='top', fontwe
ax.text(100, 0.82, "Serial Bound", color='white', ha='center', va='bottom', fontweight='bold', fontsize=9, bbox=dict(facecolor='black', alpha=0.6, edgecolor='none', pad=0.5))
plt.show()
```
:::
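The wall in the heatmap follows directly from @eq-amdahl; a minimal Python sketch (illustrative values only) makes the saturation concrete:

```python
# Amdahl's Law: total speedup given parallel fraction p and accelerator speedup S.
def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# Even an "infinitely" fast accelerator is capped at 1 / (1 - p) by the serial fraction:
for p in (0.50, 0.90, 0.99):
    print(f"p={p}: S=10 -> {amdahl_speedup(p, 10):.1f}x, "
          f"S=1000 -> {amdahl_speedup(p, 1000):.1f}x, "
          f"cap -> {1 / (1 - p):.0f}x")
```

At $p = 0.9$, a 1000× accelerator still delivers under 10× end-to-end: the serial 10% dominates.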
Before examining specific hardware architectures, test your intuition about these physical limits.
@@ -412,11 +411,9 @@ The scale of this challenge becomes stark in @fig-systems-gap, which plots the *
The plot is normalized to a 2012 baseline to emphasize relative growth. Notice how the purple-shaded region between the curves keeps widening — this gap cannot be closed by waiting for faster chips; it requires architectural innovation.
::: {#fig-systems-gap fig-env="figure" fig-pos="htb" fig-cap="**The Systems Gap**: Relative compute growth (log scale) comparing model demand to hardware supply, normalized to 2012 = 1.0. The gray dotted line (CPU) and blue dashed line (GPU) reflect hardware progress, which lags the exponential red solid line (Model Demand). The purple region is the 'Systems Gap' that must be bridged through parallelism and co-design." fig-alt="Log-scale line chart from 2012 to 2024. Red line (Model Demand) rises steeply. Blue line (GPU Supply) rises moderately. Gray line (CPU Trend) rises slowly. A large purple shaded area between Red and Blue is labeled 'THE SYSTEMS GAP'."}
```{python}
#| label: fig-systems-gap
#| echo: false
#| fig-cap: "**The Systems Gap**: Relative compute growth (log scale) comparing model demand to hardware supply, normalized to 2012 = 1.0. The gray dotted line (CPU) and blue dashed line (GPU) reflect hardware progress, which lags the exponential red solid line (Model Demand). The purple region is the 'Systems Gap' that must be bridged through parallelism and co-design."
#| fig-alt: "Log-scale line chart from 2012 to 2024. Red line (Model Demand) rises steeply. Blue line (GPU Supply) rises moderately. Gray line (CPU Trend) rises slowly. A large purple shaded area between Red and Blue is labeled 'THE SYSTEMS GAP'."
# ┌─────────────────────────────────────────────────────────────────────────────
# │ SYSTEMS GAP (FIGURE)
# ├─────────────────────────────────────────────────────────────────────────────
@@ -478,6 +475,7 @@ for y, v, l in [(2012, 1.0, "AlexNet"), (2017, 10**(demand_slope*5), "Transforme
ax.legend(loc='lower right', fontsize=8)
plt.show()
```
:::
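The widening of the purple region is just compound growth at different doubling times. A sketch under stated assumptions: demand doubling roughly every 3.4 months (OpenAI's widely cited "AI and Compute" estimate) versus hardware roughly every 24 months (a Moore's-Law-style cadence); the exact figures are illustrative, not taken from the chart's data.

```python
# Illustrative gap growth: compare two exponential curves with different
# doubling times (months). Values are assumptions for intuition only.
def growth(years: float, doubling_months: float) -> float:
    return 2.0 ** (years * 12.0 / doubling_months)

years = 5
demand = growth(years, 3.4)    # assumed model-demand doubling time
supply = growth(years, 24.0)   # assumed hardware doubling time
print(f"after {years}y: demand x{demand:.0f}, supply x{supply:.1f}, "
      f"gap x{demand / supply:.0f}")
```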
[^fn-dsa]: **Domain-Specific Architectures (DSA)**: Computing architectures optimized for specific application domains rather than general-purpose computation. Unlike CPUs designed for flexibility, DSAs sacrifice programmability for dramatic efficiency gains. Google's TPU achieves 15–30 $\times$ better performance per watt than GPUs for neural networks, while video codecs provide 100–1000 $\times$ improvements over software decoding. The 2018 Turing Award recognized this shift as the defining trend in modern computer architecture.
@@ -494,10 +492,8 @@ To understand the gravity of this transition, we must view it through the lens o
Look at the two overlapping curves in @fig-tech-s-curve: general-purpose computing has entered its saturation phase, and the industry is now riding the steep take-off of a new S-curve driven by domain-specific architectures.
::: {#fig-tech-s-curve fig-env="figure" fig-pos="htb" fig-cap="**The Twin S-Curves of Modern Computing**. General-purpose CPUs (gray) enjoyed decades of exponential growth driven by Moore's Law and Dennard Scaling. As physics constrained this curve around 2010 (Saturation), the industry was forced to jump to a new curve: Domain Specific Architectures (blue). We are currently in the **Take-off** phase of this new paradigm, where massive efficiency gains come from specializing hardware for linear algebra, albeit at the cost of general programmability." fig-alt="Two overlapping S-curves plotting performance over time. Gray curve shows general-purpose CPUs reaching saturation around 2010. Blue curve shows domain-specific architectures in take-off phase starting 2015."}
```{python}
#| label: fig-tech-s-curve
#| fig-cap: "**The Twin S-Curves of Modern Computing**. General-purpose CPUs (gray) enjoyed decades of exponential growth driven by Moore's Law and Dennard Scaling. As physics constrained this curve around 2010 (Saturation), the industry was forced to jump to a new curve: Domain Specific Architectures (blue). We are currently in the **Take-off** phase of this new paradigm, where massive efficiency gains come from specializing hardware for linear algebra, albeit at the cost of general programmability."
#| fig-alt: "Two overlapping S-curves plotting performance over time. Gray curve shows general-purpose CPUs reaching saturation around 2010. Blue curve shows domain-specific architectures in take-off phase starting 2015."
#| echo: false
#| warning: false
# ┌─────────────────────────────────────────────────────────────────────────────
@@ -560,6 +556,7 @@ ax.set_ylabel('Performance / Efficiency (Log Scale)')
ax.legend(loc='upper left', fontsize=10)
plt.show()
```
:::
\index{Moore's Law!slowdown impact}
The "easy" gains from shrinking transistors are gone. To sustain the exponential growth required by AI models (which are growing 4–10 $\times$ faster than Moore's Law), we cannot simply wait for the next CPU generation. We must shift to a new curve, one defined not by clock speed but by *architecture*. To understand how we reached this inflection point, we must first examine the mechanics of the scaling laws that once fueled the general-purpose era.
@@ -684,11 +681,11 @@ This historical progression reveals a key pattern: each wave of hardware special
| **Era** | **Computational Pattern** | **Architecture Examples** | **Characteristics** |
|:----------|:-----------------------------------|:--------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------|
| **1980s** | Floating-Point & Signal Processing | FPU, DSP | <ul><li>Single-purpose engines</li> <li>Focused instruction sets</li> <li>Coprocessor interfaces</li></ul> |
| **1990s** | 3D Graphics & Multimedia | GPU, SIMD Units | <ul><li>Many identical compute units</li> <li>Regular data patterns</li> <li>Wide memory interfaces</li></ul> |
| **2000s** | Real-time Media Coding | Media Codecs, Network Processors | <ul><li>Fixed-function pipelines</li> <li>High throughput processing</li> <li>Power-performance optimization</li></ul> |
| **2010s** | Deep Learning Tensor Operations | TPU, GPU Tensor Cores | <ul><li>Matrix multiplication units</li> <li>Massive parallelism</li> <li>Memory bandwidth optimization</li></ul> |
| **2020s** | Application-Specific Acceleration | ML Engines, Smart NICs, Domain Accelerators | <ul><li>Workload-specific datapaths</li> <li>Customized memory hierarchies</li> <li>Application-optimized designs</li></ul> |
| **1980s** | Floating-Point & Signal Processing | FPU, DSP | Single-purpose engines<br>• Focused instruction sets<br>• Coprocessor interfaces |
| **1990s** | 3D Graphics & Multimedia | GPU, SIMD Units | Many identical compute units<br>• Regular data patterns<br>• Wide memory interfaces |
| **2000s** | Real-time Media Coding | Media Codecs, Network Processors | Fixed-function pipelines<br>• High throughput processing<br>• Power-performance optimization |
| **2010s** | Deep Learning Tensor Operations | TPU, GPU Tensor Cores | Matrix multiplication units<br>• Massive parallelism<br>• Memory bandwidth optimization |
| **2020s** | Application-Specific Acceleration | ML Engines, Smart NICs, Domain Accelerators | Workload-specific datapaths<br>• Customized memory hierarchies<br>• Application-optimized designs |
: **Hardware Specialization Trends.** Successive computing eras progressively integrate specialized hardware to accelerate prevalent workloads, moving from general-purpose CPUs to domain-specific architectures and ultimately to customizable AI accelerators. Tailoring hardware to computational patterns improves performance and energy efficiency, driving innovation in machine learning systems. {#tbl-hw-evolution}
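For the PDF build, the new Lua filter rewrites each multi-line cell into a `\makecell`. For example, the first characteristics cell above becomes roughly the following (a sketch of the filter's output, not a verbatim capture):

```latex
% One table cell after the filter: each <br> becomes \\ inside a
% top-left-aligned \makecell, which works in l/c/r tabular columns.
\makecell[tl]{Single-purpose engines \\
  • Focused instruction sets \\
  • Coprocessor interfaces}
```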
@@ -963,16 +960,16 @@ Vector processing units solve this by operating on multiple data elements simult
::: {#lst-riscv_vector_mac lst-cap="**Vectorized Multiply-Accumulate Loop**: This loop showcases how RISC-V vector instructions enable efficient batch processing by performing 8 multiply-add operations simultaneously, reducing computational latency in neural network training. [@riscv_manual]"}
```{.c}
vsetvli t0, a0, e32 # <1>
vsetvli t0, a0, e32
loop_batch:
loop_neuron:
vxor.vv v0, v0, v0 # <2>
vxor.vv v0, v0, v0
loop_feature:
vle32.v v1, (in_ptr) # <3>
vle32.v v2, (wt_ptr) # <3>
vfmacc.vv v0, v1, v2 # <4>
add in_ptr, in_ptr, 32 # <5>
add wt_ptr, wt_ptr, 32 # <5>
vle32.v v1, (in_ptr)
vle32.v v2, (wt_ptr)
vfmacc.vv v0, v1, v2
add in_ptr, in_ptr, 32
add wt_ptr, wt_ptr, 32
bnez feature_cnt, loop_feature
```
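For readers less fluent in RISC-V assembly, the inner loop above is equivalent to this scalar Python sketch (8 lanes assumed, matching the listing's description; lane count is set by `vsetvli` on real hardware):

```python
# Scalar model of the vectorized inner loop: each vfmacc.vv performs 8
# independent multiply-accumulates, one per vector lane.
LANES = 8

def dot_product_vectorized(inputs, weights):
    assert len(inputs) == len(weights) and len(inputs) % LANES == 0
    v0 = [0.0] * LANES                      # vxor.vv v0, v0, v0 (zero accumulator)
    for base in range(0, len(inputs), LANES):
        v1 = inputs[base:base + LANES]      # vle32.v v1, (in_ptr)
        v2 = weights[base:base + LANES]     # vle32.v v2, (wt_ptr)
        for lane in range(LANES):           # vfmacc.vv v0, v1, v2
            v0[lane] += v1[lane] * v2[lane]
    return sum(v0)                          # final horizontal reduction

print(dot_product_vectorized([1.0] * 16, [2.0] * 16))  # 32.0
```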
@@ -2249,11 +2246,9 @@ For a breakdown of the specific latency costs (e.g., L1 Cache vs. HBM vs. Networ
[^fn-von-neumann]: **John von Neumann**: Hungarian-American mathematician at Princeton's Institute for Advanced Study who described the stored-program computer architecture in his 1945 "First Draft of a Report on the EDVAC." The Von Neumann architecture---where instructions and data share the same memory and bus---enabled programmable computing but created an inherent throughput constraint. John Backus named this the "Von Neumann Bottleneck" in his 1977 Turing Award lecture; it is precisely the constraint that modern AI accelerators fight against with on-chip SRAM, systolic arrays, and near-memory computing. To grasp the severity, study the bar chart in @fig-energy-hierarchy — the "Horowitz Numbers"\index{Horowitz Numbers!energy hierarchy}\index{Energy Hierarchy!silicon costs} lay bare the immutable energy constants of silicon, and the gap between a simple arithmetic operation and a DRAM fetch is staggering.
::: {#fig-energy-hierarchy fig-env="figure" fig-pos="htb" fig-cap="**The Energy Hierarchy**: Energy cost per operation (Log Scale) based on the 'Horowitz Numbers.' Fetching data from off-chip DRAM costs ~128× more energy than an SRAM access and ~20,000× more than an INT8 addition. This stark physical disparity dictates that AI accelerators must prioritize data locality (keeping weights in SRAM/Registers) over raw arithmetic throughput to remain within power budgets." fig-alt="Horizontal bar chart of Energy (pJ) per operation on log scale. INT8 Add is tiny (0.03). DRAM Read is huge (640). An arrow highlights the massive gap between computation and memory access."}
```{python}
#| label: fig-energy-hierarchy
#| echo: false
#| fig-cap: "**The Energy Hierarchy**: Energy cost per operation (Log Scale) based on the 'Horowitz Numbers.' Fetching data from off-chip DRAM costs ~128× more energy than an SRAM access and ~20,000× more than an INT8 addition. This stark physical disparity dictates that AI accelerators must prioritize data locality (keeping weights in SRAM/Registers) over raw arithmetic throughput to remain within power budgets."
#| fig-alt: "Horizontal bar chart of Energy (pJ) per operation on log scale. INT8 Add is tiny (0.03). DRAM Read is huge (640). An arrow highlights the massive gap between computation and memory access."
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ENERGY HIERARCHY (FIGURE)
# ├─────────────────────────────────────────────────────────────────────────────
@@ -2305,6 +2300,7 @@ ax.annotate("", xy=(640, 3), xytext=(12, 3),
ax.text(80, 3.3, "~128× Cost\n(The Memory Wall)", color=COLORS['RedLine'], ha='center', fontsize=9, bbox=dict(facecolor='white', alpha=0.8, edgecolor='none', pad=0.5))
plt.show()
```
:::
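The ratios in the caption fall straight out of the Horowitz numbers; a quick check using the approximate per-operation energies (pJ) cited above:

```python
# Approximate Horowitz energy costs per operation (pJ), as in the figure above.
ENERGY_PJ = {
    "int8_add": 0.03,    # simple arithmetic
    "sram_read": 5.0,    # on-chip SRAM access (approx.)
    "dram_read": 640.0,  # off-chip DRAM access
}

dram = ENERGY_PJ["dram_read"]
print(f"DRAM vs SRAM:     ~{dram / ENERGY_PJ['sram_read']:.0f}x")
print(f"DRAM vs INT8 add: ~{dram / ENERGY_PJ['int8_add']:,.0f}x")
```

One DRAM fetch costs as much energy as roughly twenty thousand INT8 additions, which is why accelerators spend silicon on keeping data on-chip rather than on more ALUs.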
#### Quantifying the Compute-Memory Performance Gap {#sec-hardware-acceleration-quantifying-computememory-performance-gap-1526}
@@ -2317,11 +2313,9 @@ The memory wall manifests through three critical constraints.[^fn-bw-analogy] Fi
The divergence between these two scaling rates is quantified in @fig-compute-memory-imbalance: watch how the gap between the compute curve and the bandwidth curve widens year over year, confirming that memory bandwidth — not compute — is the primary constraint in AI acceleration. The values are illustrative to emphasize the divergence trend.
::: {#fig-compute-memory-imbalance fig-env="figure" fig-pos="htb" fig-cap="**The Compute-Bandwidth Divergence**: Compute throughput (FLOPs) and memory bandwidth (GB/s) plotted on a log scale (2000–2025). While arithmetic throughput has grown exponentially, bandwidth has improved more slowly. Values are illustrative to show the widening AI Memory Wall." fig-alt="Line graph comparing compute performance and memory bandwidth from 2000 to 2025 on log scale. Compute grows exponentially; bandwidth grows linearly. Shaded gap labeled Memory Wall widens over time."}
```{python}
#| label: fig-compute-memory-imbalance
#| echo: false
#| fig-cap: "**The Compute-Bandwidth Divergence**: Compute throughput (FLOPs) and memory bandwidth (GB/s) plotted on a log scale (2000–2025). While arithmetic throughput has grown exponentially, bandwidth has improved more slowly. Values are illustrative to show the widening AI Memory Wall."
#| fig-alt: "Line graph comparing compute performance and memory bandwidth from 2000 to 2025 on log scale. Compute grows exponentially; bandwidth grows linearly. Shaded gap labeled Memory Wall widens over time."
# ┌─────────────────────────────────────────────────────────────────────────────
# │ COMPUTE-MEMORY IMBALANCE (FIGURE)
# ├─────────────────────────────────────────────────────────────────────────────
@@ -2363,14 +2357,13 @@ ax.set_ylabel('Performance (FLOPs or GB/s, log scale)')
ax.legend(loc='upper left', frameon=True, edgecolor=COLORS['grid'])
plt.show()
```
:::
This imbalance has a direct architectural consequence visible in @fig-rising-ridge: the hardware's "Ridge Point" — the arithmetic intensity required to fully saturate the chip — has skyrocketed, pushing sparse and low-reuse operations further into the memory-bound regime with each new accelerator generation.
::: {#fig-rising-ridge fig-env="figure" fig-pos="htb" fig-cap="**The Rising Ridge**: Hardware arithmetic intensity (FLOP/byte) over time. As compute capability grows faster than memory bandwidth, the 'Ridge Point' (the intensity required to saturate the chip) skyrockets. This trend explains why architectures with high data reuse flourish while low-reuse workloads face a growing hardware tax." fig-alt="Line plot showing the Arithmetic Intensity Ridge Point growing from ~140 in 2017 (V100) to over 500 in 2024 (B200). Shaded regions indicate 'Memory-Rich' and 'Compute-Dense' zones."}
```{python}
#| label: fig-rising-ridge
#| echo: false
#| fig-cap: "**The Rising Ridge**: Hardware arithmetic intensity (FLOP/byte) over time. As compute capability grows faster than memory bandwidth, the 'Ridge Point' (the intensity required to saturate the chip) skyrockets. This trend explains why architectures with high data reuse flourish while low-reuse workloads face a growing hardware tax."
#| fig-alt: "Line plot showing the Arithmetic Intensity Ridge Point growing from ~140 in 2017 (V100) to over 500 in 2024 (B200). Shaded regions indicate 'Memory-Rich' and 'Compute-Dense' zones."
# ┌─────────────────────────────────────────────────────────────────────────────
# │ RISING RIDGE (FIGURE)
# ├─────────────────────────────────────────────────────────────────────────────
@@ -2417,6 +2410,7 @@ ax.set_ylim(0, 650)
ax.set_xticks(years)
plt.show()
```
:::
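The "Ridge Point" is simply peak arithmetic throughput divided by peak memory bandwidth. A sketch using approximate published dense-FP16 specs (the values below are assumptions for illustration, not the figure's source data):

```python
# Ridge point (FLOP/byte) = peak FLOP/s divided by peak bytes/s.
# Specs are approximate public figures, for illustration only.
chips = {
    "V100 (2017)": (125e12, 900e9),    # ~125 TFLOP/s FP16, ~900 GB/s HBM2
    "A100 (2020)": (312e12, 1.56e12),  # ~312 TFLOP/s, ~1.6 TB/s HBM2e
    "H100 (2022)": (990e12, 3.35e12),  # ~990 TFLOP/s, ~3.35 TB/s HBM3
}
for name, (flops, bw) in chips.items():
    print(f"{name}: ridge ~ {flops / bw:.0f} FLOP/byte")
```

Workloads whose arithmetic intensity falls below the ridge (most sparse and low-reuse kernels) are memory-bound regardless of how many FLOPs the chip advertises.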
Beyond performance limitations, memory access imposes a steep energy cost. Fetching data from off-chip DRAM consumes far more energy than performing arithmetic operations [@horowitz2014computing]. This inefficiency is particularly evident in machine learning models, where large parameter sizes, frequent memory accesses, and non-uniform data movement patterns exacerbate memory bottlenecks. The energy differential drives architectural decisions: Google's TPU achieves 30–83 $\times$ better energy efficiency than contemporary GPUs by minimizing data movement through systolic arrays and large on-chip memory. These design choices demonstrate that energy constraints, not computational limits, often determine practical deployment feasibility.
@@ -2533,10 +2527,8 @@ where the number of floating-point operations (FLOPs) is divided by the peak har
\index{Memory Access!irregular patterns}
Unlike traditional computing workloads, where memory access follows well-structured and predictable patterns, machine learning models often exhibit irregular memory access behaviors that make efficient data retrieval a challenge. These irregularities arise due to the nature of ML computations, where memory access patterns are influenced by factors such as batch size, layer type, and sparsity. As a result, standard caching mechanisms and memory hierarchies often struggle to optimize performance, leading to increased memory latency and inefficient bandwidth utilization.
::: {#fig-memory-wall fig-env="figure" fig-pos="htb" fig-cap="**Model Size vs. Hardware Bandwidth.** Model parameter counts and hardware memory bandwidth plotted from 2012 to 2025, showing how model growth from AlexNet to trillion-parameter models has far outpaced bandwidth improvements across GPU and TPU generations." fig-alt="Scatter plot with trend lines comparing AI model parameters (red) and hardware bandwidth (blue) from 2012 to 2024. Models grow from AlexNet to Gemini 1. Shaded gap shows widening memory wall."}
```{python}
#| label: fig-memory-wall
#| fig-cap: "**Model Size vs. Hardware Bandwidth.** Model parameter counts and hardware memory bandwidth plotted from 2012 to 2025, showing how model growth from AlexNet to trillion-parameter models has far outpaced bandwidth improvements across GPU and TPU generations."
#| fig-alt: "Scatter plot with trend lines comparing AI model parameters (red) and hardware bandwidth (blue) from 2012 to 2024. Models grow from AlexNet to Gemini 1. Shaded gap shows widening memory wall."
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MEMORY WALL (FIGURE)
@@ -2605,6 +2597,7 @@ ax.set_ylabel('Log Scale (Base 10)')
ax.set_xlim(2011, 2027)
plt.show()
```
:::
Comparing ML memory access patterns against traditional computing workloads reveals the scale of the challenge. Traditional workloads, such as scientific computing, general-purpose CPU applications, and database processing, typically exhibit well-defined memory access characteristics that benefit from standard caching and prefetching techniques. ML workloads, on the other hand, introduce highly dynamic access patterns (@tbl-traditional-vs-ml-mem) that challenge conventional memory optimization strategies.


@@ -0,0 +1,115 @@
-- =============================================================================
-- TABLE CELL LINE BREAKS (PDF/LaTeX)
-- =============================================================================
-- Converts table cells containing <br> (LineBreak) to use LaTeX \makecell{}
-- so that line breaks render correctly in PDF. HTML/EPUB keep <br> as-is.
--
-- Without this filter, \newline from Pandoc's default conversion does not
-- produce visible line breaks in standard tabular columns (l, c, r).
-- \makecell{line1 \\ line2} works with any column type.
-- =============================================================================
local function is_latex_format()
if quarto and quarto.doc and quarto.doc.is_format then
return quarto.doc.is_format("latex") or
quarto.doc.is_format("pdf") or
quarto.doc.is_format("titlepage-pdf") or
quarto.doc.is_format("beamer")
end
if FORMAT then
return FORMAT:match("latex") or FORMAT:match("pdf") or FORMAT:match("beamer")
end
return false
end
-- Check if a cell (list of Blocks) contains LineBreak or <br> (RawInline)
local function cell_has_linebreak(cell)
for _, block in ipairs(cell) do
if block.content then
for _, inline in ipairs(block.content) do
if inline.t == "LineBreak" then
return true
end
if inline.t == "RawInline" and inline.format == "html" then
local raw = inline.text or ""
if raw:match("^<br%s*/?>$") or raw == "<br>" then
return true
end
end
end
end
end
return false
end
-- Replace RawInline <br> with LineBreak so pandoc.write produces \newline
local function normalize_br_in_block(block)
if not block.content then return block end
local new_content = pandoc.List()
for _, inline in ipairs(block.content) do
if inline.t == "RawInline" and inline.format == "html" then
local raw = inline.text or ""
if raw:match("^<br%s*/?>$") or raw == "<br>" then
new_content:insert(pandoc.LineBreak())
else
new_content:insert(inline)
end
else
new_content:insert(inline)
end
end
return pandoc.Plain(new_content)
end
-- Convert cell content to LaTeX and wrap in \makecell, replacing \newline with \\
local function convert_cell_to_makecell(cell)
-- Normalize <br> to LineBreak so pandoc.write produces \newline
local normalized = pandoc.List()
for _, block in ipairs(cell) do
if block.content then
normalized:insert(normalize_br_in_block(block))
else
normalized:insert(block)
end
end
local doc = pandoc.Pandoc(normalized)
local latex = pandoc.write(doc, "latex")
-- Pandoc outputs \newline for LineBreak; \makecell needs \\
latex = latex:gsub("\\newline", "\\\\")
-- Remove trailing newline from pandoc.write
latex = latex:gsub("\n$", "")
return pandoc.RawBlock("latex", "\\makecell[tl]{" .. latex .. "}")
end
-- Process a single cell (Blocks list), return modified Blocks
local function process_cell(cell)
if not cell_has_linebreak(cell) then
return cell
end
return { convert_cell_to_makecell(cell) }
end
-- Table filter: use simple table for easy cell iteration
local function Table(tbl)
if not is_latex_format() then
return nil
end
local simple = pandoc.utils.to_simple_table(tbl)
-- Process header cells
for i, cell in ipairs(simple.headers) do
simple.headers[i] = process_cell(cell)
end
-- Process body cells
for i, row in ipairs(simple.rows) do
for j, cell in ipairs(row) do
simple.rows[i][j] = process_cell(cell)
end
end
return pandoc.utils.from_simple_table(simple)
end
return { Table = Table }
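The filter's `<br>` detection hinges on the Lua pattern `^<br%s*/?>$`. An equivalent check in Python, useful for quickly sanity-testing which raw HTML strings the filter will treat as line breaks (an illustrative translation, not part of the filter itself):

```python
import re

# Python equivalent of the Lua pattern ^<br%s*/?>$ used by the filter.
# Matching is case-sensitive, mirroring Lua's behavior.
BR_RE = re.compile(r"^<br\s*/?>$")

for raw in ("<br>", "<br/>", "<br />", "<BR>", "x<br>"):
    print(f"{raw!r}: {'match' if BR_RE.match(raw) else 'no match'}")
```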


@@ -26,6 +26,7 @@
\usepackage{afterpage} % Execute commands after page break
\usepackage{morefloats} % Increase number of floats
\usepackage{array} % Enhanced table column formatting
\usepackage{makecell} % Line breaks in table cells (\makecell)
\usepackage{atbegshi} % Insert content at page beginning
%\usepackage{changepage} % Change page dimensions mid-document
\usepackage{emptypage} % Clear headers/footers on empty pages
@@ -961,7 +962,7 @@ align=right,font={\fontsize{40pt}{40}\selectfont}]
\AtBeginEnvironment{longtable}{\footnotesize}
% Increase vertical spacing in table cells (default is 1.0)
\renewcommand{\arraystretch}{1.5}
\renewcommand{\arraystretch}{1.6}
% Prefer placing figures and tables at the top of pages
\makeatletter