I have been thinking about the design, technology, and human factors aspects of pedagogic IDEs. This article is a mix of:
a historical retrospective,
a list of criteria, and
a proposal for progress.
As you read, it’s also worth keeping in mind that “pedagogic” is not a hard criterion that admits some IDEs always and others never. The bullet items below provide several characteristics that a pedagogic IDE should have, and many IDEs have some subset of them, even ones not intentionally designed for use with students. Thus, it is perhaps more helpful to think about pedagogy as a spectrum along which IDEs fall than as a binary property.
Thanks to the many people on Twitter who gave feedback, especially Jonathan Aldrich.
We began the DrRacket project (then called DrScheme) in 1995 on the premise that we wanted to build a pedagogic IDE. Our initial goals were set by our context: students learning to program in Scheme inside Emacs. Each institution had its own set of idiosyncratic Emacs key-bindings that students were forced to memorize, while also being plunged into the complexity of an editor vastly more sophisticated than anything they had used before, where any odd key combination could drop them into a confusing state from which recovery often needed expert help. (No, this is not an Emacs rant. For the record, I still use Emacs; in fact, this article was written using Emacs.) Emacs was hardly student-friendly for all but the most adventurous (who, of course, loved it).
At the same time, this is when the US’s AP Computer Science curriculum was shoving C++ down the throats of schools. As a result, many students were entering college having seen a sophisticated graphical IDE like Visual C++. On the one hand, this set an expectation amongst students that an IDE would be a rich graphical environment that looked like a native application; its absence made education feel inauthentic. On the other hand, the sheer number of buttons and options in such IDEs was problematic. Many students felt overwhelmed; a few (probably the same ones who loved discovering the recesses of Emacs) clicked away at random, losing sight of fundamental learning.
Make a reasonable, modern, native-looking application.
Make a stand-alone application where we can control the experience, rather than an island surrounded by a sea of dragons.
Minimize distraction, confusion, and unnecessary (especially undesirable) choice.
Languages evolve by accruing complexity, which can create dangerous reefs on which students can founder. We need principled, teaching-friendly sub-languages that eliminate accidental complexity. (Even if a language is designed from scratch to avoid this problem, it can still contain different levels of complexity.)
Just as a textbook does not plunge students into a whole language at once but rather builds it up incrementally, IDEs should introduce the language incrementally, matching a book’s needs and supporting a learning progression.
Errors matter! Error messages should be carefully written to be friendly and helpful to students, whose needs may be different from those of a developer.
Sub-languages must be epistemically closed, with errors inside this closure. Concretely, error messages should never use vocabulary or concepts beyond what has been introduced up to that point in the curriculum.
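To make the closure idea concrete, here is a hypothetical sketch (not DrRacket’s actual mechanism; all names, levels, and messages are invented for illustration) in which each error message is tagged with the concepts it assumes, and a language level only emits messages whose concepts have already been introduced at that level:

```python
# Hypothetical sketch of vocabulary-closed error messages: each message
# carries the set of concepts it assumes, and a language level only
# admits messages whose concepts the curriculum has introduced so far.

LEVEL_CONCEPTS = {
    "beginner":     {"function", "argument"},
    "intermediate": {"function", "argument", "scope", "local binding"},
}

MESSAGES = {
    "arity":  ("this function expects 2 arguments, but got 3",
               {"function", "argument"}),
    "shadow": ("this local binding hides a name from an enclosing scope",
               {"scope", "local binding"}),
}

def message_for(level, error_id):
    text, concepts = MESSAGES[error_id]
    if concepts <= LEVEL_CONCEPTS[level]:
        return text
    # Out-of-closure: fall back rather than leak vocabulary the student
    # has never seen.
    return "something went wrong here that needs a later language level"

print(message_for("beginner", "arity"))
print(message_for("beginner", "shadow"))
```

The point of the sketch is only the subset check: the “beginner” level never utters “scope” or “local binding”, because those concepts lie outside its closure.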
These principles drove the development of DrRacket. Furthermore, Racket created linguistic abstractions that embodied these ideas. This enabled several textbooks to map their own learning progressions to corresponding sub-languages in DrRacket, giving their student users the same benefits that users of How to Design Programs enjoyed.
We weren’t done with error messages (and IDE feedback more generally). Starting in 2010, Guillaume Marceau, Kathi Fisler, and I began to investigate the error messages of DrRacket using the methods of HCI. What we found was deeply disturbing: the error messages often were not helping students fix their problems.
IDE feedback lives in an ecosystem consisting of the programming language, books, third-party libraries, educators, and more. All of these need to be in harmony, and require style guides to facilitate agreement.
Over the past few years, I’ve come to realize there is another critical aspect to pedagogic IDEs that truly distinguishes them from a conventional IDE designed for a developer. It’s that a pedagogic IDE has ground truth. I’ve been pursuing this under various guises for some years now, but only in the past two years have these pursuits coalesced into a clear thesis.
Here’s what I mean by ground truth. In almost every pedagogic setting, we are asking students to create something we have already done. Therefore, we not only have full knowledge of the answer, we’ve even already created instances of the solutions we are asking them to build. That is, we are trying to bring students to the same level of achievement that we already have.
In a conventional programming setting, such as in industry, programmers do not re-build perfect replicas of existing systems. If they’re re-implementing a system, often it’s with an eye to at least improving it, and usually to ultimately changing it. In the very rare cases that they want to identically reproduce a system, it’s usually for intellectual property reasons, and thus done with various safeguards against communication. In general, though, the thing they are trying to build doesn’t already exist (which is why they’re building it in the first place).
This is the opposite of what we do in education. In education, it is utterly standard for a hundred students to all be striving, independently, to perfectly reproduce something we have already done. (Indeed, great effort goes into making sure these students don’t merely pawn off as their own what other students have done.) There are solid educational reasons for this. The fact that we have done all this already, and can often point to our work as exemplary, gives us the ground truth that conventional programming lacks. My thesis is that:
This ground truth can and should be
exploited to improve student learning.
In a way, numerous educators, for decades, have been using ground truth: by giving students a test suite against which to run their programs. I find this practice odious for several reasons. (If you are going to do it anyway, at least avoid the perils we’ve identified.)
It promotes “passing the instructor’s suite” as the most important trait a program should satisfy (whereas, often, it is one of the least important).
It fails to teach students to think about correctness for themselves, treating it as an outsourced activity.
It often encourages a mentality of “keep tweaking and submitting until you pass”, which as a programming practice is both utterly inauthentic and arguably dangerous. (Adding rate limits is a symptom of a problem with the pedagogy, not a solution.)
More subtly, and perhaps worse, it robs students of important stimuli for problem understanding.
But most of all, this is primarily a means of assessment, not of instruction.
We don’t ask students to produce a broad enough range of artifacts.
Comparing artifacts is hard from a technical perspective. Relatedly, giving actionable feedback is hard both technically and from a human factors perspective.
a purpose statement
a function signature
examples of function use
a function template
the function body
tests of the function
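These are the artifacts of a design recipe in the style of How to Design Programs. As a rough sketch in Python rather than a teaching language (the function and its details are my own illustrative choices, with type hints and assertions standing in for signatures and tests), the artifacts for one small function might look like:

```python
import math

# Purpose statement: compute the area of a ring (annulus) from its
# outer and inner radii.

# Function signature (Python type hints standing in for a signature):
def ring_area(outer: float, inner: float) -> float:
    # Function body, following the template "combine two circle areas":
    return math.pi * outer ** 2 - math.pi * inner ** 2

# Examples of function use, doubling as tests of the function:
assert math.isclose(ring_area(2.0, 1.0), 3 * math.pi)
assert ring_area(1.0, 1.0) == 0.0
```

Note that each artifact is written down separately and in order; that separation is exactly what lets an IDE compare each one against the instructor’s version.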
We can give feedback early.
We can correct problems when it doesn’t cost much to fix them. (Once a student has fully implemented their system, giving feedback on failing tests will invariably only produce minor tweaks in the program, never large-scale architectural changes.)
We can give different kinds of feedback, some of which will be better appreciated by some students and others better by others.
We can make the desolate progression from a blank page to a fully-working program a richer, more textured interaction between student and (otherwise foreboding and hostile) tool.
our data definition against theirs
our function signature against theirs
our examples against theirs
our data against their data definition
our examples against their function signature
our function against their examples
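One of these comparisons is easy to sketch: checking our function against their examples. Here is a hypothetical harness (not actual DrRacket or Pyret tooling; the problem and names are invented) in which the instructor’s reference implementation adjudicates student-written examples before the student has written any code at all:

```python
# Hypothetical sketch: validate student-written examples against the
# instructor's reference implementation. If the reference disagrees
# with an example, the example itself is likely wrong -- feedback that
# arrives before the student writes a single line of their solution.

def reference_median(xs):
    # Instructor's known-correct solution (illustrative problem).
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 == 1 else (s[n // 2 - 1] + s[n // 2]) / 2

def check_examples(reference, examples):
    """examples: list of (args, expected) pairs written by the student."""
    feedback = []
    for args, expected in examples:
        actual = reference(*args)
        if actual != expected:
            feedback.append(
                f"your example says {args} should give {expected}, "
                f"but it should give {actual}")
    return feedback

student_examples = [(([3, 1, 2],), 2),      # correct
                    (([1, 2, 3, 4],), 3)]   # wrong: should be 2.5
print(check_examples(reference_median, student_examples))
```

The same shape of harness, with the roles reversed, covers checking their function against our examples.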
In fact, we can go farther. We don’t only have to implement positive instances of these artifacts: we can also implement negative instances (i.e., non-solutions). Using a combination of known-common mistakes and just random mutation, we can produce incorrect versions of all these artifacts, which have their own uses. Incorrect programs are routinely used to measure the quality of a test suite (mutation testing). We can do that, and more, comparing positive and negative instances of student artifacts with an eye towards making the programming process a conversation with a knowledgeable companion.
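The mutation idea can likewise be sketched (hypothetically; a real tool would generate such mutants automatically rather than by hand): write an incorrect variant embodying a known-common mistake, and check whether the student’s test suite can tell it apart from the correct solution:

```python
# Minimal mutation-testing sketch: a student's test suite is only as
# good as the incorrect programs ("mutants") it can reject.

def correct_max(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best

def mutant_max(xs):
    # Known-common mistake: the comparison is flipped (computes min).
    best = xs[0]
    for x in xs[1:]:
        if x < best:
            best = x
    return best

def suite_kills(suite, implementation):
    """True if some test in the suite fails on this implementation."""
    return any(implementation(args) != expected for args, expected in suite)

# A weak student suite: only degenerate inputs where max equals min...
weak_suite = [([5], 5), ([2, 2], 2)]
print(suite_kills(weak_suite, mutant_max))    # False: mutant survives

# A suite with a genuinely discriminating input kills the mutant:
strong_suite = [([1, 3, 2], 3)]
print(suite_kills(strong_suite, mutant_max))  # True: mutant caught
```

A surviving mutant is itself a stimulus for problem understanding: it tells the student which behaviors their examples have not yet pinned down.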
We run tests against programs because we know how to do it, and the output is easy to explain. An instructor may choose to hide output (e.g., only reporting the number of tests that passed or failed), but if they choose to provide more information, it’s quite easy to say “on the input 3, your function produced 8 instead of 9”, and a student would not have much trouble understanding this utterance (even if they can’t explain why their function produced 8 or why it was supposed to produce 9).
Unfortunately, that is not only the easiest case, it may be the only easy one. Comparing two purpose statements is AI-complete since it requires handling natural language; and while the language here may be relatively structured (and poorly-structured statements can be rejected summarily), ensuring very high precision and recall, and giving feedback that is understandable by a computing novice, is very hard. Other pairs sit somewhere in-between. For instance, a long line of research has studied how to compare two programs. But we need the output to be actionable by a student. A program dependency graph may be semantically meaningful, but a novice won’t be able to make sense out of it.
With several collaborators, I’ve been making progress towards realizing this
vision for several years now. For instance,
and Forge are all direct implementations of different parts of this
idea. Forge, which Tim Nelson spearheads, is a particularly good
illustration of this. It tackles a problem—
Obviously, all this only scratches the surface. The ideas are robust, but it takes a lot of careful work on semantics and human factors, and we need tooling to link the former with the latter. That said, to me there’s something oddly exciting about this line of research: I’ve been working in this space for 25 years, but each time there’s a fresh level of insight, it feels like I’m just getting started!
Even though this is written in the first person, all the ideas here grew out of collaborations with amazing people on the DrRacket, WeScheme, and Pyret teams. They include: Matthias Felleisen, Robby Findler, Matthew Flatt, Kathi Fisler, Guillaume Marceau, Danny Yoo, Emmanuel Schanzer, Ben Lerner, Joe Gibbs Politz, Jack Wrenn, and Tim Nelson.