Test cases evaluate whether a computer program is doing what it’s supposed to do. There are various ways to generate them – automatically based on specifications, say by ensuring code coverage  or by subject matter experts (SMEs) who think through conditions based on the problem specification.
We asked ourselves whether there was something we could learn by looking at how student programs responded to test cases. Could this help us design better test cases or find flaws in them? By looking at such responses from a data-driven perspective, we wanted to know whether we could .a. design better test cases .b. understand whether there existed any clusters in the way responses on test cases were obtained and .c. whether we could discover salient concepts needed to solve a particular programming problem, which would then inform us of the right pedagogical interventions.
We built a cool tool which helped us look at statistics on over 2500 test cases spread across over fifty programming problems attempted by nearly 18,000 students and job-seekers in a span of four weeks!
We were also able to visualize how these test cases clustered for each problem, how they correlated with other cases across candidate responses and were also able to see what their item response curves looked like. Here are a couple of things we learnt in this process:
One of our problems required students to print comma-separated prime numbers starting from 2 till a given integer N. When designing test cases for this problem, our SMEs expected there to be certain edge cases (when N was less than 2) and some stress cases (when N was very large) while expecting the remainder of the cases to check the output for random values of N, without expecting them to behave any differently. Or so they thought. On clustering the responses obtained on each of the test cases for these problems (0 for failing a case and 1 for passing it), we found two very distinct clusters being formed (see figure below) besides the lone test case which checked for the edge condition. A closer look at some of the source codes helped us realize that values of N which were not prime numbers had to be handled differently – a trailing comma remained at the very end of the list and lots of students were not doing this right!
This was interesting! It showed that the problem’s hardness was not only linked to the algorithm of producing prime numbers till a given number, but also linked to the nuance of printing it in a specific form. In spite of students getting the former right, a majority of them did not get the latter right. There are several learnings from this. If the problem designer just wants to assess if students know the algorithm to generate primes till a number, s/he should drop the part to print them in a comma separated list – it adds an uncalled for impurity to the assessment objective. On the other hand, if both these skills are to be tested, our statistics is a way to confirm the existence of these two different skills – getting one right does not mean the other is doable (say, can this help us figure out dominant cognitive skills that are needed in programming?). By separating the test cases to check the trailing comma case and reporting a score on it separately, we could ideally give an assessor granular information on what the code is trying to achieve. Contrast this to when test cases were simply bundled together and it wasn’t clear what aspect the person got right.
More so, when we designed this problem, the assessment objective was to primarily check the algorithm for generating prime numbers. Unfortunately, the cases that did not handle the trailing comma went down on their test case scores in spite of having met our assessment criterion. The good news here was that our machine learning algorithm  niftily picked it up and was able to say by the virtue of their semantic features that they were doing the right job!
We also fit 3-PL models from Item Response Theory (more info) on each test case for some of our problems and have some interesting observations there on how we could relate item-parameters to test case design – more on this in a separate post!
Have ideas on how you could make use of such numbers and derive some interesting information? Write to us, or better, join our research group!
Kudos to Nishanth for putting together the neat tool to be able to visualize the clusters! Thanks to Ramakant and Bhavya for spotting this issue in their analysis.
– Shashank and Varun
 KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, Cadar, Cristian, Daniel Dunbar, and Dawson R. Engler. OSDI. Vol. 8. 2008.
 A system to grade computer programming skills using machine learning, Srikant, Shashank, and Varun Aggarwal. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.