Contributing to StaffML
Thanks for your interest. This guide covers contributing to the StaffML vault
(question corpus) and the site (interviews/staffml/).
For the full architecture of the vault pipeline, see
vault/ARCHITECTURE.md. For the review ledger behind
it, see vault/REVIEWS.md.
Quick start — from clone to first-question-visible
# 1. Clone and pick the staffml worktree
git clone https://github.com/harvard-edge/cs249r_book.git
cd cs249r_book/interviews
# 2. Install vault-cli (Python 3.12+)
pip install -e vault-cli/[dev]
vault --version
pytest vault-cli/tests/
# 3. Phase-aware explore
# At Phase 0 only --version/--help exist. At Phase 1 the corpus is split
# and `build` / `check` work end-to-end. At Phase 2+, `stats` / `verify`
# / `export-paper` exist. See §Phase-by-phase scope at the bottom of this
# file for exactly which subcommands are live at your checkout.
vault --help # always works — shows subcommands live at your phase
# 4. Run the local API shim (requires a prior `vault build` so there's a vault.db)
# Serves the Worker endpoint surface from a local vault.db so you don't
# need a Cloudflare account to develop the site. Available from Phase 1 onward.
vault build # produces vault/vault.db (interviews/vault/vault.db from the repo root)
vault api --db vault/vault.db --port 8002 &
# 5. Run the site against your local API
cd staffml/
cp .env.example .env.local
# edit .env.local: NEXT_PUBLIC_VAULT_API=http://localhost:8002
pnpm install
pnpm dev
# visit http://localhost:3000
The goal is clone → question-visible in under 10 minutes on a fresh machine. If it takes longer, file an issue titled "CONTRIBUTING.md getting-started friction".
What can I contribute?
| Contribution type | Where | How |
|---|---|---|
| New question | `vault/questions/<track>/<level>/<zone>/` | `vault new` |
| Fix a question | same | `vault edit <id>` |
| Reclassify a question | same | `vault move <id> --to <track>/<level>/<zone>` |
| New topic | `vault/taxonomy.yaml` | PR with a §7 entry in EVOLUTION.md |
| Website UX | `interviews/staffml/src/` | Next.js; see AGENTS.md in staffml if present |
| Worker API | `interviews/staffml-vault-worker/` (Phase 3+) | Wrangler project |
| Schema evolution | `vault/schema/` | RFC-style PR per EVOLUTION.md |
| New vault-cli subcommand | `vault-cli/src/vault_cli/commands/` | Land with tests + docs update |
Workflow
Branching
- Start from `dev` for standalone work: `git checkout -b feat/short-description dev`.
- One logical change per branch. Atomic commits. No `git add -A` on vault changes.
- No `Co-Authored-By` tags, no "made with " footers. Commit messages read like regular engineering work.
Before opening a PR
vault check --strict # invariants
pytest vault-cli/tests/ # unit + integration + contract
vault codegen --check # LinkML ↔ Pydantic/DDL/TS drift check (Phase 1+)
CI re-runs these. PRs are merge-blocked on red CI.
PR review
- Corpus PRs: at least one maintainer review, CI green.
- Code PRs (vault-cli, worker): one review, CI green.
- Schema-evolution PRs: two reviews required once external-contributor onboarding opens (Phase 7+).
- Schema-breaking PRs must include a migration script under `vault-cli/migrations/`.
Provenance honesty
Every question's provenance field must honestly reflect how it was made:
- `human`: written from scratch by a human.
- `llm-draft`: produced by `vault generate`; not yet human-reviewed.
- `llm-then-human-edited`: an LLM draft substantially revised by a human (the common case). `generation_meta.human_reviewed_at` records when.
- `imported`: from an external source (e.g., book, published paper). Include source in `tags`.
Misattributing LLM content as human is a correctness bug, not a style nit.
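To make the provenance rules concrete, here is a rough sketch of a check over one question record loaded as a plain dict. It is not the vault-cli validator: `provenance_problems` is a hypothetical helper, and the two cross-field rules (a review timestamp for edited drafts, a non-empty `tags` for imports) are an assumed reading of the list above rather than documented invariants.

```python
# The four provenance values described above.
ALLOWED_PROVENANCE = {"human", "llm-draft", "llm-then-human-edited", "imported"}


def provenance_problems(question: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record passes."""
    provenance = question.get("provenance")
    if provenance not in ALLOWED_PROVENANCE:
        return [f"unknown provenance: {provenance!r}"]

    problems = []
    meta = question.get("generation_meta") or {}
    # Assumption: an edited LLM draft should record when the human review happened.
    if provenance == "llm-then-human-edited" and not meta.get("human_reviewed_at"):
        problems.append(
            "llm-then-human-edited requires generation_meta.human_reviewed_at"
        )
    # Assumption: an imported question needs at least one tag naming its source.
    if provenance == "imported" and not question.get("tags"):
        problems.append("imported requires a source in tags")
    return problems
```

A check like this catches only structural slips. Whether the label is truthful is still on the author and the reviewer.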
Author attribution
`vault new` populates `authors` from your `git config user.email` via
`vault/contributors.yaml` (mapping from email → handle). Submit a PR updating
that file to add yourself.
For external PRs: a commit signature (GPG/SSH) or a GitHub-verified email match is
required; CI rejects `authors:` claims that don't match the committer
identity.
Style
- Keep questions focused on a single concept. "Good napkin math" is usually a signal; "grab-bag of facts" is usually a code smell.
- Use realistic hardware specs: check `mlsysbook/constants.py` and the `vault/schema/models.yaml` registry for the canonical values.
- Paper-cite URL format only: `https://mlsysbook.ai/book/chapters/<slug>`.
- Scenarios are plaintext; solutions and napkin math can use restricted Markdown + KaTeX.
Things that block external PRs from merging
- Provenance lie: `vault mark-exemplar` or `vault promote --reviewed-by` fields that don't match git committer.
- Registry mutation: any commit that deletes lines from `id-registry.yaml`. The registry is append-only.
- Schema mixing: questions at different `schema_version` in the same PR.
- Unsigned schema-evolution PR: schema bumps require two maintainer approvals.
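Two of the blockers above are mechanical enough to sketch. The helpers below are hypothetical and are not the CI implementation: `deleted_registry_lines` compares the old and new text of a registry file line by line, and `mixes_schema_versions` looks at the `schema_version` of each question dict touched by a PR. Treating the registry as plain lines (rather than parsed YAML) is an assumption made to keep the example short.

```python
from collections import Counter


def deleted_registry_lines(old_text: str, new_text: str) -> list[str]:
    """Return non-blank lines present in the old registry but missing from the new one.

    An append-only change yields an empty list. Reordering alone is not
    reported, because only line counts are compared.
    """
    old = Counter(line for line in old_text.splitlines() if line.strip())
    new = Counter(line for line in new_text.splitlines() if line.strip())
    return sorted((old - new).elements())


def mixes_schema_versions(questions: list[dict]) -> bool:
    """True when the questions in one PR carry more than one schema_version."""
    versions = {question.get("schema_version") for question in questions}
    return len(versions) > 1
```

Note that editing a registry line in place counts as a deletion here, which matches the append-only rule as stated.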
Phase-by-phase scope
As of Phase 0, the vault pipeline is scaffolded but not operational end-to-end. What works today:
- `vault --version`, `vault --help`.
- `pip install -e vault-cli/[dev]` and `pytest`.
- Documentation (ARCHITECTURE.md, REVIEWS.md, TESTING.md, EVOLUTION.md, this file).
What's coming per Phase (see vault/ARCHITECTURE.md §14):
- Phase 1: `new`, `edit`, `move`, `rm`, `restore`, `build`, `check`, `serve`, `api`. YAML split lands.
- Phase 2: `publish` + primitives, paper-exporter rewrite, rollback-symmetry CI.
- Phase 3: D1 + Worker + `@staffml/vault-types` + FTS5 load-test gate.
- Phase 4: Website cutover + service worker + rollback drill.
- Phase 5: Chain pre-reveal indicator + instrumentation.
- Phase 6: About-page paper prominence.
External contributions to vault/questions/ become feasible at Phase 1 exit.
License
The corpus under interviews/vault/ is licensed CC-BY-NC-4.0 — see
vault/questions/LICENSE. Summary:
- Share and adapt for non-commercial purposes with attribution.
- Commercial use (training a paid product on the corpus, selling access,
building paid services around it) requires separate written permission
from the copyright holders. Contact: vjreddi@g.harvard.edu.
- Contributions to the corpus are assumed to be offered under the same CC-BY-NC-4.0 terms. Do not submit content you are not entitled to license this way.
The interviews/vault-cli/ tooling is a separate artifact; its license is
unchanged from the repository's historical state.
Asking for help
- Architecture questions → read `vault/ARCHITECTURE.md`.
- "Why was X decided this way?" → check `vault/REVIEWS.md`. Most non-obvious decisions map to a reviewer finding.
- "Is this a bug or intended?" → open an issue with the command you ran and the output.
Thanks for contributing.