# Pilot Study Protocol: Evaluating mlsysim as a Teaching Tool for ML Systems Reasoning

## Study Overview

**Title:** First-Principles Performance Modeling Improves ML Systems Reasoning: A Pre/Post Assessment of the mlsysim Tutorial at ISCA 2026

**Target venue:** SIGCSE Technical Symposium 2027 or L@S 2027

**Study type:** Single-group pre/post quasi-experiment (within-subjects design)

**Research question:** Does a 6-hour hands-on tutorial using first-principles analytical modeling (mlsysim) improve attendees' ability to reason quantitatively about ML system performance, as measured by transfer questions that do not require the tool?

**Key distinction:** The quiz tests mental models, not tool proficiency. Questions are answerable with mental arithmetic alone. This isolates the pedagogical contribution of the framework from the utility of the software.

---

## Study Design

### Design: Within-Subjects Pre/Post

Each participant serves as their own control. The same 12-point quiz is administered immediately before the tutorial (pre-test) and immediately after (post-test).

**Notation:** O1 X O2

- O1 = Pre-test (9:00 AM, before any instruction)
- X = Treatment (6-hour tutorial)
- O2 = Post-test (4:50 PM, after closing reflection)

### Why No Control Group

A control group is not feasible or necessary for the first publication:

1. **Logistical constraint:** ISCA tutorial attendees self-select into the room. There is no natural control population attending the same conference who would agree to take the quiz without attending the tutorial.
2. **Sufficient for initial evidence:** Pre/post designs are the standard for first reports on educational interventions in computing (see: Software Carpentry evaluations, CSEd workshop studies at SIGCSE). The within-subjects design controls for individual differences in background knowledge.
3. **Threats to validity are addressable:** The main threat is maturation/history (would scores improve just from thinking about the topics for 8 hours?). This is mitigated by the transfer nature of the questions -- they require specific frameworks taught in the tutorial, not general familiarity.
4. **Future work:** A controlled study (tutorial group vs. self-study group) is planned for the second offering, where classroom deployment makes random assignment feasible.

### Threats to Internal Validity

| Threat | Severity | Mitigation |
|--------|----------|------------|
| Testing effect (pre-test sensitizes attendees to post-test) | Moderate | Questions test transfer, not recall. The pre-test does not reveal correct answers or teach the framework. |
| History (external events during the day) | Low | The tutorial runs continuously; attendees are in the same room all day. |
| Maturation (natural improvement from 8 hours of thought) | Low | Questions require specific quantitative frameworks (Iron Law, Roofline) that are not common knowledge. |
| Attrition (attendees leave before post-test) | Moderate | Administer post-test before the final 5 minutes of closing. Offer it at 4:50 PM, not 5:00 PM. Report attrition rate. |
| Selection (self-selected ISCA attendees are not representative) | High | Acknowledged as a limitation. Results generalize to "motivated computer architects" not "all engineers." |

---

## Sample Size Justification

### Power Analysis

**Parameters:**

- Test: Paired (two-sided) t-test on total quiz score
- Expected effect size: Cohen's d = 1.0 (conservative estimate; we expect d ~ 1.2 based on similar workshop studies, but power for d = 1.0 provides a safety margin)
- Significance level: alpha = 0.05
- Desired power: 1 - beta = 0.80

**Required sample size:** N = 10 participants (paired t-test, d = 1.0, alpha = 0.05, power = 0.80).
Computed via G*Power or the formula:

```
N = ((z_alpha/2 + z_beta) / d)^2 + 1
N = ((1.96 + 0.84) / 1.0)^2 + 1
N = 7.84 + 1 = 8.84 -> 9
```

The normal approximation rounds up to 9; the exact t-distribution calculation (as performed by G*Power) requires N = 10, which is the value adopted here.

**Adjusted for attrition:** Assume 20% attrition (attendees who complete pre-test but leave before post-test). Required enrollment: N = 10 / 0.80 = 12.5, rounded up to 13 participants minimum.

**Expected enrollment:** 40--80 attendees (ISCA tutorial capacity). Even with 50% participation in the research component, we expect 20--40 paired responses. This provides power > 0.99 for d = 1.0 and allows meaningful per-question analysis.

### Per-Question Analysis Power

Per-question McNemar tests (binary: correct/incorrect) need at least 10 discordant pairs per question, that is, participants who change from wrong-to-right or right-to-wrong. With N = 30 and an expected pre-test accuracy of 40--50% rising to 70--80%, we expect 12--18 discordant pairs per question, which is sufficient for McNemar's test.

---

## Analysis Plan

### Primary Analysis: Overall Learning Gain

**Test:** Paired two-sided t-test on total quiz scores (pre vs. post).

**Outcome variable:** Total score (0--12 scale).

**Hypothesis:** H1: mean(post) > mean(pre). H0: mean(post) = mean(pre).

**Reporting:** Mean pre-score, mean post-score, mean gain, 95% CI on the gain, paired t-statistic, p-value, Cohen's d with 95% CI.

**Normality check:** Shapiro-Wilk test on the gain scores. If non-normal (p < 0.05), supplement with the Wilcoxon signed-rank test. Report both if normality is violated.

### Secondary Analysis: Per-Understanding-Goal Gains

For each understanding goal (U1--U6), compute a 2-point sub-score (two points per goal, following the question-to-goal mapping in `quiz.md`).
Report:

- Mean pre and post sub-scores per goal
- Paired t-test per goal (with Bonferroni correction: alpha = 0.05/6 = 0.0083)
- Rank goals by effect size to identify which understandings the tutorial teaches most effectively

### Tertiary Analysis: Per-Question McNemar Tests

For each multiple-choice question (Q1--Q5, Q8--Q10), construct a 2x2 table:

| | Post-correct | Post-incorrect |
|--|-------------|----------------|
| **Pre-correct** | a | b |
| **Pre-incorrect** | c | d |

**Test:** McNemar's exact test on the discordant cells (b, c).

**Reporting:** For each question: pre-accuracy, post-accuracy, number of positive changes (c), number of negative changes (b), McNemar p-value, odds ratio.

**Purpose:** Identifies which specific questions show the strongest learning signal and which (if any) show negative transfer (post-test regression).

For short-answer questions (Q6, Q7), use the Wilcoxon signed-rank test on the 0/1/2 scores.

### Exploratory Analysis: Misconception Tracking

Using the distractor analysis from `quiz.md`, compute the prevalence of each named misconception at pre-test and post-test:

- "More FLOPS = faster" (Q1a + Q2b selection rate)
- "Just add more GPUs" (Q3a + Q4c selection rate)
- "Quantization is just a latency trick" (Q5b selection rate)
- "Carbon = energy efficiency" (Q8a selection rate)
- "Benchmark first, decide later" (Q9d + Q10c selection rate)

Report the reduction in misconception prevalence with 95% CIs. This is the most publishable aspect for a SIGCSE audience -- it connects learning gains to specific conceptual changes.

### Exploratory Analysis: Demographic Subgroups

If the optional demographic survey yields sufficient responses (N >= 10 per subgroup), compare learning gains by:

- Career stage (PhD student vs. industry engineer vs. faculty)
- Self-reported ML systems experience (none / some / extensive)
- Architecture background (strong / moderate / weak)

Use independent t-tests or Mann-Whitney U on gain scores.
These are exploratory and will be reported as such (no multiple comparison correction; findings used to generate hypotheses for the controlled study).

---

## Data Collection Instruments

| Instrument | When | Required/Optional | Contains PII |
|------------|------|-------------------|--------------|
| Pre-test quiz | 9:00 AM | Required for research | No (participant ID only) |
| Post-test quiz | 4:50 PM | Required for research | No (participant ID only) |
| Demographic survey | Registration or lunch | Optional | Minimal (career stage, years experience, institution type) |
| Consent form | 9:00 AM (top of pre-test) | Required for research | No (opt-out model) |

All instruments are in the `assessment/` directory.

---

## Timeline

### Pre-Tutorial (8+ weeks before ISCA)

| Week | Task |
|------|------|
| T-10 | File IRB application (exempt category). Include all instruments. |
| T-8 | Receive IRB approval (or clarification requests). |
| T-6 | Finalize quiz wording. Pilot with 3--5 colleagues for timing and clarity. |
| T-5 | Revise quiz based on pilot feedback. Finalize digital form (Google Forms). |
| T-4 | Generate participant ID codes (100 six-digit random numbers). Print ID cards. |
| T-3 | Print paper quiz forms (100 copies, double-sided). |
| T-2 | Prepare data analysis scripts (R or Python). Pre-register analysis plan on OSF. |
| T-1 | Dry-run the quiz administration with a practice group. Time it. |

### Day of Tutorial (ISCA 2026)

| Time | Task |
|------|------|
| 8:30 AM | Set up: distribute ID cards at seats. Prepare quiz links/forms. |
| 8:55 AM | Read consent statement aloud. Display quiz link/form. |
| 9:00 AM | Start pre-test timer (5 minutes). |
| 9:05 AM | Collect pre-test forms. Begin tutorial. |
| 4:50 PM | Distribute post-test (same quiz). Start 5-minute timer. |
| 4:55 PM | Collect post-test forms. Proceed to closing. |
| 5:00 PM | Distribute optional demographic survey (paper or QR code). |

### Post-Tutorial (1--8 weeks after ISCA)

| Week | Task |
|------|------|
| T+1 | Enter paper responses into spreadsheet (if paper forms used). |
| T+1 | Destroy name-to-ID mapping. Data is now permanently de-identified. |
| T+2 | Run primary analysis (paired t-test). Check assumptions. |
| T+3 | Run secondary and tertiary analyses. Generate figures. |
| T+4 | Draft results section. Compute all CIs and effect sizes. |
| T+6 | Complete manuscript draft. |
| T+8 | Submit to SIGCSE 2027 (September deadline) or L@S 2027. |

---

## Expected Results

### Hypotheses

**H1 (Primary):** The mean post-test score will be significantly higher than the mean pre-test score (paired t-test, p < 0.05), with a large effect size (Cohen's d > 0.8).

**Rationale:** The tutorial is 6 hours of intensive, hands-on instruction using the predict-code-reflect cycle. Similar computing workshops (Software Carpentry, CS Unplugged) report d = 0.8--1.5 for pre/post designs. The ISCA audience is highly motivated and technically sophisticated, which should amplify learning gains.

**H2 (Secondary):** The largest per-goal gains will be on U4 (compression as architecture) and U5 (carbon geography), because these represent the most novel content for an architecture audience.

**Rationale:** ISCA attendees likely already understand compute vs. memory bottlenecks (U1) and parallelism (U3) from their architecture training. The Roofline model is widely taught. However, the fleet-level implications of quantization and the dominance of grid carbon intensity over hardware efficiency are not standard architecture curriculum. These are the "aha moments" most likely to produce score gains.

**H3 (Tertiary):** The misconception "more FLOPS = faster" will decrease by at least 50% from pre to post.

**Rationale:** This is directly addressed by Aha Moment #1 at 10:00 AM, with the predict-then-reveal structure designed to create cognitive conflict. It is the single most targeted misconception in the tutorial.
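The H1 and H3 criteria reduce to a few lines of arithmetic, sketched below as a minimal, standard-library-only illustration. It rests on several assumptions: Cohen's d is computed as d_z (mean gain divided by the standard deviation of the gains), which this protocol does not specify; H3's "at least 50%" is read as a relative reduction in prevalence; the helper names `paired_summary` and `relative_reduction` are hypothetical; and all scores and rates are made up for illustration, not study data. The p-value itself is omitted; in the actual analysis scripts it would presumably come from a library routine such as `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_summary(pre, post):
    """Mean gain, paired t statistic, and Cohen's d_z for matched pre/post scores."""
    if len(pre) != len(post):
        raise ValueError("pre and post must list the same participants in the same order")
    gains = [after - before for before, after in zip(pre, post)]
    n = len(gains)
    mean_gain = statistics.mean(gains)
    sd_gain = statistics.stdev(gains)  # sample SD of the gain scores
    return {
        "n": n,
        "mean_gain": mean_gain,
        "t": mean_gain / (sd_gain / math.sqrt(n)),  # paired t statistic, df = n - 1
        "d_z": mean_gain / sd_gain,                 # assumed effect-size definition
    }


def relative_reduction(pre_rate, post_rate):
    """Fractional drop in misconception prevalence from pre-test to post-test."""
    if pre_rate <= 0:
        raise ValueError("pre_rate must be positive")
    return (pre_rate - post_rate) / pre_rate


# Illustrative scores on the 0--12 scale (NOT study data).
pre_scores = [5, 6, 4, 7, 5, 5, 6, 4, 8, 5]
post_scores = [8, 7, 9, 7, 9, 11, 8, 7, 7, 12]

summary = paired_summary(pre_scores, post_scores)
h1_effect_ok = summary["d_z"] > 0.8  # H1 effect-size criterion

# Illustrative "more FLOPS = faster" prevalence: 45% at pre-test, 15% at post-test.
h3_ok = relative_reduction(0.45, 0.15) >= 0.5  # H3 criterion

print(summary, h1_effect_ok, h3_ok)
```

Using d_z keeps the effect size on the same footing as a paired-design power analysis; if the manuscript reports a different variant of d (for example, one standardized by the pre-test SD), the H1 threshold should be checked against that variant instead.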
### Expected Quantitative Results

| Metric | Expected value | Basis for estimate |
|--------|---------------|-------------------|
| Pre-test mean | 5.5 / 12 (46%) | ISCA audience: strong architecture, partial ML systems |
| Post-test mean | 8.5 / 12 (71%) | 6 hours of targeted instruction on exactly these topics |
| Mean gain | 3.0 points | Difference |
| Cohen's d | 1.0--1.3 | Comparable workshop studies |
| Pre-test "more FLOPS = faster" prevalence | 40--50% | Common misconception even among architects |
| Post-test "more FLOPS = faster" prevalence | 10--15% | After Aha #1 and extensive Roofline practice |
| Attrition rate | 10--20% | Typical for full-day ISCA tutorials |
| Research participation rate | 60--80% | Opt-out consent model with no compensation |

### What Would Be Surprising

- **Pre-test mean > 8:** Would indicate the ISCA audience already reasons this way, reducing the tutorial's contribution. The quiz may need harder questions.
- **No gain on U1 (Roofline):** Would suggest the Roofline model is already well-known to this audience (possible, since it is taught in architecture courses).
- **Negative gain on any question:** Would indicate the tutorial introduced a new misconception. This requires immediate investigation and tutorial revision.
- **Attrition > 30%:** Would threaten statistical power and suggest engagement problems in the afternoon sessions.

---

## Manuscript Outline (SIGCSE 2027)

For planning purposes, the target paper structure:

1. **Introduction:** The need for quantitative ML systems reasoning; the gap between architecture education and ML systems practice.
2. **Related Work:** Roofline model pedagogy (Williams et al.), ML systems courses (Stanford CS229S, CMU 10-414), computing education assessment (ITiCSE working groups on concept inventories).
3. **The mlsysim Tutorial:** Design principles, the six understanding goals, the predict-code-reflect cycle. Reference the `DESIGN.md` document.
4. **Assessment Design:** The 10-question quiz, mapping to understanding goals, distractor rationale. Reference `quiz.md`.
5. **Methods:** Participants, procedure, analysis plan (this document).
6. **Results:** Pre/post scores, effect sizes, per-goal gains, misconception tracking.
7. **Discussion:** Which understandings transferred, which did not, implications for ML systems curriculum design.
8. **Limitations:** Single group, self-selected ISCA population, testing effect, no long-term retention data.
9. **Future Work:** Controlled study with self-study comparison group, deployment in semester-long courses, longitudinal retention assessment at 6 months.

---

## Pre-Registration

Before the tutorial, pre-register the study on OSF (Open Science Framework):

- **URL:** https://osf.io/registries
- **Template:** AsPredicted or OSF Standard Pre-Data Collection Registration
- **Include:** Research question, hypotheses H1--H3, analysis plan (primary, secondary, tertiary), sample size justification, quiz instrument, expected effect size.

Pre-registration strengthens the publication by demonstrating that the analysis plan was not influenced by the observed data. This is increasingly expected at SIGCSE and L@S.

---

## Budget

| Item | Cost | Notes |
|------|------|-------|
| Printing (200 quiz forms) | $50 | Double-sided, B&W |
| Google Forms (digital backup) | $0 | Free with institutional account |
| Participant ID cards | $20 | Pre-printed labels |
| Statistical software | $0 | R (free) or scipy (free) |
| OSF pre-registration | $0 | Free |
| IRB filing | $0 | Typically free for exempt studies |
| **Total** | **$70** | |

---

## Contact and Responsibilities

| Role | Person | Responsibility |
|------|--------|---------------|
| PI | [TBD] | IRB filing, study design, manuscript lead |
| Tutorial lead | [TBD] | Quiz administration, data collection |
| Data analyst | [TBD] | Analysis scripts, figures, results section |
| Second author | [TBD] | Related work, discussion, editing |