What is a Pedagogic IDE?

    1 How to Read This Article

    2 At the Beginning

    3 More Insights

    4 More About Feedback

    5 Ground Truth

    6 How Can We Use Ground Truth?

    7 Multiple Artifacts

    8 Comparison is Hard

    9 Progress

    10 Credits

I have been thinking about the design, technology, and human factors aspects of programming environments—henceforth IDEs, for “integrated development environments”—for decades. This article records my current thinking (as of 2020), and will probably grow over time.

1 How to Read This Article

This article is three very different things that sit together a little awkwardly:
  1. a historical retrospective,

  2. a list of criteria, and

  3. a proposal for progress.

Indeed, it is written roughly in that order, starting with our early design of DrRacket, proceeding with newer lessons that all offer useful criteria, and culminating in a proposal for what really distinguishes pedagogic settings.

As you read, it’s also worth keeping in mind that “pedagogic” is not a hard criterion that admits some IDEs always and others never. The bullet items below provide several characteristics that a pedagogic IDE should have, and many IDEs have some subset of them, even ones not intentionally designed for use with students. Thus, it is perhaps more helpful to think about pedagogy as a mode (a statement about how an IDE is being configured for use at a particular moment) than as a blanket statement about it for all time.

Thanks to the many people on Twitter who gave feedback, especially Jonathan Aldrich.

2 At the Beginning

We began the DrRacket project (then called DrScheme) in 1995 on the premise that we wanted to build a pedagogic IDE. Our initial goals were set by our context: students learning to program in Scheme inside Emacs. Each institution had its own set of idiosyncratic Emacs key-bindings that students were forced to memorize, while also being plunged into the complexity of an editor vastly more sophisticated than anything they had used before, where any odd key combination could drop them into a confusing state from which recovery often needed expert help. It was hardly student-friendly, except for the most adventurous (who, of course, loved it). (No, this is not an Emacs rant. For the record, I still use Emacs. In fact, this article was written using Emacs.)

At the same time, this was when the US’s AP Computer Science curriculum was shoving C++ down the throats of schools. As a result, many students were entering college having seen a sophisticated graphical IDE like Visual C++. On the one hand, such IDEs set an expectation amongst students that an IDE would be a rich graphical environment that looked like a native application; its absence made education feel inauthentic. On the other hand, the number of buttons and options in such IDEs was problematic. Many students felt overwhelmed; a few (probably the same ones who loved discovering the recesses of Emacs) clicked away at random, losing sight of fundamental learning.

Thus, at the very outset, we had three goals:
  • Make a reasonable, modern, native-looking application.

  • Make a stand-alone application where we can control the experience, rather than an island surrounded by a sea of dragons.

  • Minimize distraction, confusion, and unnecessary (especially undesirable) choice.

We have since grown comfortable with applications being delivered through the Web, so Web-based IDEs are now a reasonable alternative to that native application, and indeed confer many benefits that the native application does not. However, these conceptual rules remain important.

3 More Insights

As we continued to build DrRacket, a few more rules became clear. We have documented them and their justification extensively elsewhere (such as this paper and this one), so I’ll just summarize them here:
  • Languages evolve by accruing complexity, which can create dangerous reefs on which students can founder. We need principled, teaching-friendly sub-languages that eliminate accidental complexity. (Even if a language is designed from scratch to avoid this problem, it can still contain different levels of complexity.)

  • Just as a textbook does not plunge students into a whole language at once but rather builds it up incrementally, IDEs should introduce the language incrementally, matching a book’s needs and supporting a learning progression.

  • Errors matter! Error messages should be carefully written to be friendly and helpful to students, whose needs may be different from those of a developer.

  • Sub-languages must be epistemically closed, with errors inside this closure. Concretely, error messages should never use vocabulary or concepts beyond what has been introduced up to that point in the curriculum.

These principles drove the development of DrRacket. Furthermore, Racket created linguistic abstractions that embodied these ideas. This enabled several textbooks to map their own learning progressions to corresponding sub-languages in DrRacket, giving their student users the same benefits that users of How to Design Programs enjoyed.
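The requirement that error messages stay inside the vocabulary of a sub-language can be checked mechanically. The following is a minimal sketch in Python, not a description of how DrRacket does it: the level names, the vocabulary sets, and the `terms_outside_level` helper are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical vocabulary introduced at each teaching level. These sets are
# illustrative and are not taken from DrRacket's actual language levels.
LEVEL_VOCABULARY = {
    "beginner": {"function", "argument", "number", "string"},
    "intermediate": {"function", "argument", "number", "string", "list", "lambda"},
}

# Hypothetical list of technical terms that an error message might mention.
TECHNICAL_TERMS = {
    "function", "argument", "number", "string", "list", "lambda",
    "closure", "continuation", "macro",
}


def terms_outside_level(message, level):
    """Return the technical terms in `message` that `level` has not introduced."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    used = words & TECHNICAL_TERMS
    return used - LEVEL_VOCABULARY[level]


# A message that leaks a concept the beginner level has not covered.
print(terms_outside_level("expected a function, but got a closure", "beginner"))
```

A check like this could run over every message a sub-language can emit. It matches exact words only, so a real version would also need to handle plurals and multi-word terms.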

4 More About Feedback

We weren’t done with error messages (and IDE feedback more generally). Starting in 2010, Guillaume Marceau, Kathi Fisler, and I began to investigate the error messages of DrRacket using the methods of HCI. What we found was deeply disturbing: that the errors often were not helping students fix their problems.

Indeed, we also found that even the vocabulary used in messages was difficult for students. For instance, we had carefully distinguished between concepts like “operator”, “constructor”, and “predicate”. But for students who had not yet picked up those distinctions, these all looked like gobbledygook; the only term they were familiar with was “function”. This necessitated removing these careful distinctions to make the messages more decipherable and hence actionable. We also noticed that instructors often used vocabulary different from that of the IDE; libraries also created their own ontology or used their own style that contradicted the IDE. In short,
  • IDE feedback lives in an ecosystem consisting of the programming language, books, third-party libraries, educators, and more. All of these need to be in harmony, and require style guides to facilitate agreement.

5 Ground Truth

Over the past few years, I’ve come to realize there is another critical aspect to pedagogic IDEs that truly distinguishes them from a conventional IDE designed for a developer. It’s that a pedagogic IDE has ground truth. I’ve been pursuing this under various guises for some years now, but only in the past two years have these pursuits coalesced into a clear thesis.

Here’s what I mean by ground truth. In almost every pedagogic setting, we are asking students to create something we have already done. Therefore, we not only have full knowledge of the answer, we’ve even already created instances of the solutions we are asking them to build. That is, we are trying to bring students to the same level of achievement that we already have.

In a conventional programming setting, such as in industry, programmers do not re-build perfect replicas of existing systems. If they’re re-implementing a system, often it’s with an eye to at least improving it, and usually to ultimately changing it. In the very rare cases that they want to identically reproduce a system, it’s usually for intellectual property reasons, and thus done with various safeguards against communication. In general, though, the thing they are trying to build doesn’t already exist (which is why they’re building it—if it already existed in good form, they would be put to work on something else).

This is the opposite of what we do in education. In education, it is utterly standard for a hundred students to all be striving, independently, to perfectly reproduce something we have already done. (Indeed, great effort goes into making sure these students don’t merely pawn off as their own what other students have done.) There are solid educational reasons for this. The fact that we have done all this already, and can often point to our work as exemplary, gives us the ground truth that conventional programming lacks. My thesis is that:

This ground truth can and should be
exploited to improve student learning.

6 How Can We Use Ground Truth?

In a way, numerous educators, for decades, have been using ground truth: by giving students a test suite against which to run their programs. I find this practice odious for several reasons. It promotes “passing the instructor’s suite” as the most important trait a program should satisfy (whereas, often, it is one of the least important). It fails to teach students to think about correctness for themselves, treating it as an outsourced activity. It often encourages a mentality of “keep tweaking and submitting until you pass”, which as a programming practice is both utterly inauthentic and arguably dangerous. (Adding rate limits is a symptom of a problem with the pedagogy, not a solution.) More subtly, and perhaps worse, it robs students of important stimuli for problem understanding. But most of all, this is primarily a means of assessment, not of instruction. (If you are going to do it anyway, at least avoid the perils we’ve identified.)

Nevertheless, we can recognize in this widely-used practice the germ of this idea, but we should turn it into a student aid, and we need to generalize it. Why has it not already been widely generalized? I would argue for two reasons, one pedagogic and one technical:
  • We don’t ask students to produce a broad enough range of artifacts.

  • Comparing artifacts is hard from a technical perspective. Relatedly, giving actionable feedback is hard both technically and from a human factors perspective.

7 Multiple Artifacts

How to Design Programs is perhaps the canonical example of a programming pedagogy that asks students to produce (a) multiple artifacts, that are (b) interrelated, and are in a (c) progression that matches progress through problem-solving. Concretely, students are asked to produce
  • data definitions

  • a purpose statement

  • a function signature

  • examples of function use

  • a function template

  • the function body

  • tests of the function

The nice thing about having multiple artifacts from different stages of development is that:
  • We can give feedback early.

  • We can correct problems when it doesn’t cost much to fix them. (Once a student has fully implemented their system, giving feedback on failing tests will invariably only produce minor tweaks in the program, never large-scale architectural changes.)

  • We can give different kinds of feedback, some of which will be better appreciated by some students and others better by others.

  • We can make the desolate progression from a blank page to a fully-working program a richer, more textured interaction between student and (otherwise forbidding and hostile) tool.

Imagine that for each of the artifacts above, we have our own ground truth versions (which presumably we have written while developing the problem). We can, in principle, compare
  • our data definition against theirs

  • our function signature against theirs

  • our examples against theirs

and so on. But we can also compare each of these artifacts against other kinds of artifacts. Running their program against our tests is conventional practice, but we have many more choices:
  • our data against their data definition

  • our examples against their function signature

  • our function against their examples

and so on and on—it is instructive to consider every ordered pair of these—each of which provides a diagnostic, many of which catch students early in the process before they commit to an incorrect path.
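Two of these ordered pairs can be sketched concretely. The Python below is an illustration under stated assumptions, not the implementation of any tool mentioned in this article: the median assignment, the `reference_median` function, the representation of an example as an `(arguments, expected_result)` pair, and both helper functions are hypothetical.

```python
# Hypothetical assignment: compute the median of a non-empty list of numbers.
# Each example is a pair (arguments, expected_result).

def reference_median(numbers):
    """Instructor's ground-truth implementation (assumed for illustration)."""
    ordered = sorted(numbers)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def invalid_examples(examples, reference):
    """Our function against their examples: list the examples the reference rejects."""
    return [(args, expected) for args, expected in examples
            if reference(*args) != expected]


def signature_mismatches(examples, param_types, return_type):
    """Our examples against their signature: list the examples it cannot describe."""
    mismatches = []
    for args, expected in examples:
        fits = (len(args) == len(param_types)
                and all(isinstance(a, t) for a, t in zip(args, param_types))
                and isinstance(expected, return_type))
        if not fits:
            mismatches.append((args, expected))
    return mismatches


# A student believes the median of an even-length list is its lower middle element.
student_examples = [(([1, 2, 3],), 2), (([1, 2, 3, 4],), 2)]
print(invalid_examples(student_examples, reference_median))

# A student claims the signature "list -> int"; our examples include a non-integer result.
our_examples = [(([1, 2, 3],), 2), (([4, 1],), 2.5)]
print(signature_mismatches(our_examples, (list,), int))
```

Both checks can run before the student has written a single line of the function body, which is the point: a misunderstanding of the problem surfaces while it is still cheap to fix.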

In fact, we can go farther. We don’t only have to implement positive instances of these artifacts: we can also implement negative instances (i.e., non-solutions). Using a combination of known-common mistakes and just random mutation, we can produce incorrect versions of all these artifacts, which have their own uses. Incorrect programs are routinely used to measure the quality of a test suite (mutation testing). We can do that, and more, comparing positive and negative instances of student artifacts with an eye towards making the programming process a conversation with a knowledgeable companion.
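One use of negative instances is to measure how thorough a student's examples are: a good set of examples should distinguish the correct behavior from each known-wrong one. The sketch below assumes a hypothetical median assignment and two hand-written wrong implementations; the function names and the `(arguments, expected_result)` representation are illustrative only.

```python
# Hypothetical negative instances for a "median of a non-empty list" assignment,
# each modeled on a plausible student mistake.

def median_forgets_to_sort(numbers):
    """Wrong: takes the middle of the list as given, without sorting it."""
    mid = len(numbers) // 2
    if len(numbers) % 2 == 1:
        return numbers[mid]
    return (numbers[mid - 1] + numbers[mid]) / 2


def median_is_mean(numbers):
    """Wrong: computes the mean instead of the median."""
    return sum(numbers) / len(numbers)


def caught_by(examples, implementation):
    """True if at least one example disagrees with the given implementation."""
    return any(implementation(*args) != expected for args, expected in examples)


def wrong_implementations_caught(examples, wrong_implementations):
    """Names of the negative instances that the examples manage to reject."""
    return [impl.__name__ for impl in wrong_implementations
            if caught_by(examples, impl)]


NEGATIVE_INSTANCES = [median_forgets_to_sort, median_is_mean]

# Valid but weak examples: every wrong implementation also satisfies them.
weak = [(([1, 2, 3],), 2), (([2, 4],), 3)]
print(wrong_implementations_caught(weak, NEGATIVE_INSTANCES))

# Adding an unsorted input and a skewed input exposes both mistakes.
stronger = weak + [(([3, 1, 2],), 2), (([1, 2, 9],), 2)]
print(wrong_implementations_caught(stronger, NEGATIVE_INSTANCES))
```

This sketch assumes the examples have already been checked for validity against a correct implementation; otherwise an incorrect example could "catch" a wrong implementation for the wrong reason.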

8 Comparison is Hard

We run tests against programs because we know how to do it, and the output is easy to explain. An instructor may choose to hide output (e.g., only reporting the number of tests that passed or failed), but if they chose to provide more information, it’s quite easy to say “on the input 3, your function produced 8 instead of 9” and a student would not have much trouble understanding this utterance (even if they can’t explain why they produced 8 or why they were supposed to produce 9).
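The two reporting styles described above are easy to sketch. In the Python below, the squaring assignment, the buggy `student_square`, and the `run_tests` and `report` helpers are all hypothetical; the sketch only shows how little machinery this kind of comparison needs.

```python
def reference_square(n):
    """Instructor's ground-truth implementation (assumed for illustration)."""
    return n * n


def student_square(n):
    """Hypothetical buggy submission that confuses n squared with 2 to the n."""
    return 2 ** n


def run_tests(student, reference, inputs):
    """Return (input, actual, expected) for every input where the two disagree."""
    failures = []
    for value in inputs:
        actual, expected = student(value), reference(value)
        if actual != expected:
            failures.append((value, actual, expected))
    return failures


def report(failures, total, detailed):
    """Report only the pass/fail counts, or the counts plus one line per failure."""
    lines = [f"{total - len(failures)} passed, {len(failures)} failed"]
    if detailed:
        lines += [
            f"on the input {value}, your function produced {actual} instead of {expected}"
            for value, actual, expected in failures
        ]
    return lines


inputs = [2, 3, 4]
failures = run_tests(student_square, reference_square, inputs)
print(report(failures, len(inputs), detailed=False))
print(report(failures, len(inputs), detailed=True))
```

Note that the buggy submission agrees with the reference on the inputs 2 and 4, so the report depends heavily on which inputs the instructor chose.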

Unfortunately, that is not only the easiest case, it may be the only easy one. Comparing two purpose statements is AI-complete since it requires handling natural language; and while the language here may be relatively structured (and poorly-structured statements can be rejected summarily), ensuring very high precision and recall, and giving feedback that is understandable by a computing novice, is very hard. Other pairs sit somewhere in-between. For instance, a long line of research has studied how to compare two programs. But we need the output to be actionable by a student. A program dependency graph may be semantically meaningful, but a novice won’t be able to make sense out of it.

9 Progress

With several collaborators, I’ve been making progress towards realizing this vision for several years now. For instance, Examplar, D4, and Forge are all direct implementations of different parts of this idea. Forge, which Tim Nelson spearheads, is a particularly good illustration of this. It tackles a problem—teaching formal methods—that may be harder than teaching programming, and yet has vastly less support. At the same time, it leverages formal methods tools to address the comparison question much better than it can be handled for conventional programs. Our brief paper to accompany a keynote talk says a little more about this (see section 3.2).

Obviously, all this only scratches the surface. The ideas are robust, but it takes a lot of careful work on semantics and human factors, and we need tooling to link the former with the latter. That said, to me there’s something oddly exciting about this line of research: I’ve been working in this space for 25 years, but each time there’s a fresh level of insight, it feels like I’m just getting started!

10 Credits

Even though this is written in the first person, all the ideas here grew out of collaborations with amazing people on the DrRacket, WeScheme, and Pyret teams. They include: Matthias Felleisen, Robby Findler, Matthew Flatt, Kathi Fisler, Guillaume Marceau, Danny Yoo, Emmanuel Schanzer, Ben Lerner, Joe Gibbs Politz, Jack Wrenn, Tim Nelson.