mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-06 09:38:33 -05:00
The same path-prefix bug that broke Lander on dev preview affected the other
13 games too. Fixing all of them in one batch so the entire catalog works
on /cs249r_book_dev/, mlsysbook.ai/, and localhost equally.

Pattern applied:
  .qmd include-in-header script:
    import "/assets/games/X.mjs" → import "../assets/games/X.mjs"
  .mjs ES imports:
    from "/assets/games/runtime.mjs" → from "./runtime.mjs"
    from "/assets/games/vendor/pixi.min.mjs" → from "./vendor/pixi.min.mjs"
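
Why the relative form survives a path prefix can be checked with the standard `URL` constructor, which resolves a specifier against a base URL much as a browser resolves a module import. This is a minimal sketch, not code from the repository: the `example.org` host, the `games/cluster.html` page path, and the `resolve` helper are assumptions made for illustration.

```javascript
// Hypothetical page URLs. The host and the games/cluster.html path are
// assumptions; only the /cs249r_book_dev/ prefix comes from the commit message.
const pages = [
  "https://example.org/cs249r_book_dev/games/cluster.html", // served under a path prefix
  "https://example.org/games/cluster.html",                 // served from the site root
];

// Resolve an import specifier against the page URL and return the path
// the browser would request.
function resolve(specifier, pageUrl) {
  return new URL(specifier, pageUrl).pathname;
}

const results = pages.map((page) => ({
  page,
  absolute: resolve("/assets/games/cluster.mjs", page),
  relative: resolve("../assets/games/cluster.mjs", page),
}));

for (const r of results) {
  console.log(r.page);
  console.log("  absolute ->", r.absolute);
  console.log("  relative ->", r.relative);
}
```

For the prefixed page, the absolute specifier resolves to `/assets/games/cluster.mjs` and loses `/cs249r_book_dev/`, while the relative one resolves to `/cs249r_book_dev/assets/games/cluster.mjs`. For the root-served page both forms resolve to the same path, which is why the bug only showed up on the dev preview.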
Files touched (10 .mjs + 13 .qmd):
  .mjs: allreduce, batch, cluster, kvcache, moe, oom, pipeline, prune,
        quantization, topology
  .qmd: allreduce, batch, checkpoint, cluster, kvcache, loader, moe, oom,
        pipeline, prune, quantization, roofline, topology
  (checkpoint, loader, roofline .mjs already used 'import * as runtime from
  ./runtime.mjs'; only their .qmd files needed updating)

Verification: all 14 games rendered locally (quarto render games/), served
via python3 -m http.server, swept with Playwright headless Chromium.
Result: 14/14 pass: canvas mounted, MLSP runtime ready, game registered,
no JS errors, no 4xx network requests. Visual screenshots confirm each
game's HUD/title/content paints correctly.
69 lines
2.9 KiB
Plaintext
---
title: "Cluster Commander"
subtitle: "Schedule jobs without fragmenting the fleet."
description: "A browser mini-game from MLSysBook Playground. Manage GPU cluster scheduling and head-of-line blocking."
page-layout: article
format:
  html:
    include-in-header:
      - text: |
          <link rel="stylesheet" href="/assets/games/common.css">
          <script type="module">
            import "../assets/games/runtime.mjs";
            import "../assets/games/cluster.mjs";
          </script>
---

```{=html}
<div class="mlsp-game-container" role="region" aria-label="Cluster Commander mini-game">
  <canvas id="mlsp-canvas" class="mlsp-game-canvas" width="680" height="460"></canvas>
  <div class="mlsp-game-hud">
    <span class="mlsp-score">score <span id="mlsp-score">0</span></span>
    <span>Click empty space to schedule · Avoid blocking the queue · <kbd>R</kbd> retry</span>
    <button type="button" class="mlsp-fullscreen-btn" onclick="this.closest('.mlsp-game-container').requestFullscreen()" title="Full Screen" aria-label="Full Screen">⛶</button>
  </div>
</div>

<div id="mlsp-aha-slot"></div>

<script>
(function bootCluster(){
  function tryBoot() {
    // Retry every 30 ms until MLSP.games.cluster is available.
    if (!window.MLSP || !MLSP.games || !MLSP.games.cluster) return setTimeout(tryBoot, 30);
    var canvas = document.getElementById("mlsp-canvas");
    var $score = document.getElementById("mlsp-score");
    var ahaSlot = document.getElementById("mlsp-aha-slot");
    // Game over can fire before the promise below resolves, so the result
    // is held in pendingResult until the API object is available.
    var pendingResult = null;
    var resolvedApi = null;

    Promise.resolve(MLSP.games.cluster(canvas, {
      onScoreChange: function(s) { $score.textContent = s.score; },
      onGameOver: function(result) {
        if (resolvedApi) attachAha(resolvedApi, result);
        else pendingResult = result;
      },
      onRetry: function() { window.location.reload(); }
    })).then(function(api) {
      resolvedApi = api;
      if (pendingResult) attachAha(api, pendingResult);
    });

    // Show the "aha" card in its slot once the game has ended.
    function attachAha(api, result) {
      MLSP.showAhaCard(ahaSlot, api.ahaLabel, api.ahaText, api.ahaLink);
    }
  }
  tryBoot();
})();
</script>
```

## How to play
You have an 8x8 GPU cluster. Jobs appear in the queue with varying shapes (1x1 inference, 2x2 fine-tuning, 4x4 pre-training). Click a valid empty space in the grid to schedule the next job. If the grid becomes too fragmented to fit a large job, that job blocks the entire queue!
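
The placement rule above can be sketched as a bounds-and-occupancy check on the grid. This is an illustrative model, not the game's actual code in `cluster.mjs`: the boolean grid and the `makeGrid`, `fits`, `place`, and `canSchedule` names are all hypothetical.

```javascript
// Hypothetical model of the 8x8 cluster: grid[row][col] is true when that GPU is busy.
const SIZE = 8;

function makeGrid() {
  return Array.from({ length: SIZE }, () => Array(SIZE).fill(false));
}

// True if a size x size job fits with its top-left corner at (row, col).
function fits(grid, row, col, size) {
  if (row < 0 || col < 0 || row + size > SIZE || col + size > SIZE) return false;
  for (let r = row; r < row + size; r++) {
    for (let c = col; c < col + size; c++) {
      if (grid[r][c]) return false;
    }
  }
  return true;
}

// Marks the job's cells as busy. Returns false and leaves the grid
// unchanged if the job does not fit at that position.
function place(grid, row, col, size) {
  if (!fits(grid, row, col, size)) return false;
  for (let r = row; r < row + size; r++) {
    for (let c = col; c < col + size; c++) grid[r][c] = true;
  }
  return true;
}

// True if the job fits anywhere on the grid. False means the job at the
// head of the queue cannot be scheduled.
function canSchedule(grid, size) {
  for (let r = 0; r + size <= SIZE; r++) {
    for (let c = 0; c + size <= SIZE; c++) {
      if (fits(grid, r, c, size)) return true;
    }
  }
  return false;
}

// Four scattered 1x1 jobs are enough to rule out every 4x4 placement.
const grid = makeGrid();
for (const [r, c] of [[3, 3], [3, 7], [7, 3], [7, 7]]) place(grid, r, c, 1);
const largeBlocked = !canSchedule(grid, 4);
const smallStillFits = canSchedule(grid, 2);
console.log({ largeBlocked, smallStillFits });
```

In this layout only 4 of 64 GPUs are busy, yet every 4x4 window contains one of them, so a 4x4 job cannot be placed while 2x2 and 1x1 jobs still can.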

## The Systems Concept
Scheduling diverse jobs on a shared GPU cluster (managed by a scheduler such as Slurm or Kubernetes) often leads to fleet fragmentation. When many small jobs scatter across the cluster, a large job that needs a contiguous block (e.g., a massive pre-training run requiring 16 interconnected GPUs) might be unable to start even though enough GPUs are free in total. That job then sits at the head of the queue, everything behind it waits as well, and cluster utilization plummets. Modern schedulers use defragmentation and backfilling to mitigate this.
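
The difference between strict first-in-first-out scheduling and backfilling can be sketched on the same kind of grid. This is a deliberately simplified model under stated assumptions, not how Slurm, Kubernetes, or the game implement it: jobs are square, have no runtimes, and the `findSlot`, `occupy`, `fragmentedGrid`, `scheduleFifo`, and `scheduleBackfill` names are hypothetical.

```javascript
// Illustrative model only: square jobs on an 8x8 grid, true = busy GPU.
const SIZE = 8;

// Returns the first [row, col] where a size x size block is free, or null.
function findSlot(grid, size) {
  for (let r = 0; r + size <= SIZE; r++) {
    for (let c = 0; c + size <= SIZE; c++) {
      let free = true;
      for (let dr = 0; dr < size && free; dr++) {
        for (let dc = 0; dc < size && free; dc++) {
          if (grid[r + dr][c + dc]) free = false;
        }
      }
      if (free) return [r, c];
    }
  }
  return null;
}

// Marks a size x size block starting at [row, col] as busy.
function occupy(grid, [row, col], size) {
  for (let r = row; r < row + size; r++) {
    for (let c = col; c < col + size; c++) grid[r][c] = true;
  }
}

// Four scattered 1x1 jobs: 60 of 64 GPUs are free, but no 4x4 block is.
function fragmentedGrid() {
  const grid = Array.from({ length: SIZE }, () => Array(SIZE).fill(false));
  for (const [r, c] of [[3, 3], [3, 7], [7, 3], [7, 7]]) grid[r][c] = true;
  return grid;
}

// Strict FIFO: stop at the first job that does not fit (head-of-line blocking).
function scheduleFifo(grid, queue) {
  const started = [];
  while (queue.length > 0) {
    const slot = findSlot(grid, queue[0].size);
    if (!slot) break;
    occupy(grid, slot, queue[0].size);
    started.push(queue.shift().name);
  }
  return started;
}

// Naive backfill: skip jobs that do not fit and start later ones that do.
function scheduleBackfill(grid, queue) {
  const started = [];
  let i = 0;
  while (i < queue.length) {
    const slot = findSlot(grid, queue[i].size);
    if (slot) {
      occupy(grid, slot, queue[i].size);
      started.push(queue.splice(i, 1)[0].name);
    } else {
      i++;
    }
  }
  return started;
}

const makeQueue = () => [
  { name: "pretrain", size: 4 },
  { name: "finetune", size: 2 },
  { name: "inference", size: 1 },
];

const fifoStarted = scheduleFifo(fragmentedGrid(), makeQueue());
const backfillStarted = scheduleBackfill(fragmentedGrid(), makeQueue());
console.log({ fifoStarted, backfillStarted });
```

Under these assumptions FIFO starts nothing, because the 4x4 job at the head of the queue does not fit, while the backfill pass starts the fine-tuning and inference jobs and leaves the pre-training job waiting. Production backfill schedulers typically also reserve resources for the blocked job so that backfilled work does not delay it; this sketch omits runtimes and reservations.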

---

*Part of [MLSysBook Playground](/games/). Found a bug? [Report an issue](https://github.com/harvard-edge/cs249r_book/issues/new?labels=bug&title=Bug+in+Game).*