Machine learning has helped solve many grading challenges – spoken English, essay grading, program grading and math problem grading, to cite a few examples. However, there is a big impediment to using these methods in real-world settings: one needs to build an ML model for every question/prompt. For instance, in essay grading, a model designed to grade an essay on ‘Socialism’ will be very different from one which grades essays on ‘Theatre’. These models require a large number of expert-rated samples and a fresh model-building exercise each time. A real-world practical assessment works with 100s of questions, which then translates to requiring 100s of graders and 100s of models. The approach doesn’t scale, takes too much time and, most of the time, is impractical.
In our KDD paper accepted today, we go a long way toward solving this challenge for grading computer programs. In KDD 2014, we had presented the first machine learning approach to grade computer programs, but we had to build a model per problem. We have now invented a technique where we need no expert-graded samples for a new problem and we don’t need to build any new models! As soon as we have around a few tens of ‘good’ codes for a problem (automatically identified using test case coverage and static analysis), our newly invented question-agnostic models automatically take charge. How will this help us? With this technology, our machine learning based models can scale, in an automated way, to grade 1000s of questions in multiple languages in a really short span of time. Within a couple of weeks of a new question being introduced into our question pool, the machine learning evaluation kicks in.
There were a couple of innovations which led to this work, a semi-supervised approach to model building:
We can identify a subset of the ‘good’ set automatically. In the case of programs, the ‘good set’ – codes which get a high grade – can be identified automatically using test cases. We exploit this to find other programs similar to these in a feature space that we define. To get a sense of this, think of a distance measure from the programs identified as part of the ‘good set’. Such a ‘nearness’ feature would then correlate with grades across questions, irrespective of whether it is a binary search problem or a tree traversal problem. Such features help us build generic models across questions.
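The ‘nearness’ idea above can be sketched in a few lines. This is a minimal illustration, not the paper’s actual feature: we assume each program has already been mapped to some feature vector (e.g. counts of keywords or control constructs), and measure distance to the centroid of the good set. The vectors below are made up.

```python
import numpy as np

def nearness_to_good_set(candidate_vec, good_set_vecs):
    """Distance from a candidate program to the centroid of the
    'good set' in feature space; smaller means more similar."""
    centroid = np.mean(np.asarray(good_set_vecs, dtype=float), axis=0)
    return float(np.linalg.norm(np.asarray(candidate_vec, dtype=float) - centroid))

# Illustrative feature vectors of codes already known to be good
good = [[3, 1, 0], [2, 1, 1], [3, 2, 0]]
print(nearness_to_good_set([3, 1, 0], good))  # small: similar to good codes
print(nearness_to_good_set([9, 7, 5], good))  # large: far from the good set
```

Because the distance is computed relative to whatever good set a question supplies, the same feature definition carries over unchanged from one question to the next.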
We design a number of such features which are invariant to the question and correlate with the expert grade. These features are inspired by the grammar we proposed in our earlier work. For instance, one feature is how different an unseen program is from the set of keywords present in the ‘good set’; another is the difference between the programs in the kind of computations they are doing. Using such features, we learn generic models for a set of problems using supervised learning. These generic models work super well for any new problem as soon as we get our set of good codes!
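To make the keyword-based feature concrete, here is a hedged toy version: the fraction of a program’s keywords that never appear in any good solution. The token sets and the exact definition are our own illustration, not the paper’s feature.

```python
def keyword_difference(program_tokens, good_set_tokens):
    """Fraction of the program's keywords not seen in any 'good set'
    program. 0.0 means fully covered by good-code vocabulary."""
    good_keywords = set().union(*good_set_tokens)
    prog = set(program_tokens)
    if not prog:
        return 0.0
    return len(prog - good_keywords) / len(prog)

# Keyword sets of two (illustrative) good solutions to a question
good = [{"for", "if", "return"}, {"while", "return"}]
print(keyword_difference({"for", "return"}, good))   # 0.0 -> looks like good code
print(keyword_difference({"goto", "malloc"}, good))  # 1.0 -> very different
```

Features of this shape can then be fed to any standard supervised learner, trained once over many questions rather than per question.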
Check out this illustrative and easy-to-grasp video which demonstrates our latest innovation.
The table presents a snapshot of the results presented in the paper. As shown in the last two columns, the ‘question-independent’ machine learning model (ML Model) consistently outperforms the test suite based baseline (Baseline). The claim of ‘question-independence’ is corroborated by similar and encouraging results (depicted in the last three rows) obtained on totally unseen questions, which were not used to train the model.
What does this all mean?
- We can really scale ML based grading of computer programs. We can continue to add new problems and the models will automatically start working within a couple of weeks.
- This set of innovations applies to a number of other problems where we can automatically identify a good set. For instance, in circuit-solving problems, the ones with the correct final answer could be considered a good set; the same can be applied to mathematics problems or automata design problems – problems where computer science techniques are mature enough to verify the functional correctness of a solution. Machine learning can then automatically help grade other unseen responses using this information.
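The automatic identification of a good set reduces to running each response against reference checks. A minimal sketch, assuming each response exposes a callable and we hold input–output test cases (all names here are illustrative):

```python
def passes_all_tests(fn, test_cases):
    """True if the candidate function matches the expected output
    on every reference test case."""
    return all(fn(inp) == expected for inp, expected in test_cases)

def find_good_set(candidate_fns, test_cases):
    """The 'good set': candidates that pass every test case."""
    return [fn for fn in candidate_fns if passes_all_tests(fn, test_cases)]

# Toy question: square a number
tests = [(2, 4), (3, 9), (4, 16)]
correct = lambda x: x * x
buggy = lambda x: x + x      # only right for x == 2
good_set = find_good_set([correct, buggy], tests)
print(len(good_set))  # 1
```

Any domain with a comparable automated correctness check (circuit answers, automata acceptance, math results) can supply a good set the same way.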
Hoping to see more and more ML applied to grading!
Work done with Gursimran Singh and Shashank Srikant
What makes a programming problem hard?
Why are some programming problems solved by more students than others? The varying numbers we saw got us thinking about how the human brain responds to programming problems. This was also an important question for us to answer when designing an assessment or when we wanted guidance on pedagogy. Understanding what makes a programming problem hard would enable us to put questions of a given difficulty into a programming assessment – where neither would everyone get a perfect score nor a zero – and would also help us create equivalent testforms for the test.
We tried taking a jab at it by answering it empirically. We marked 23 programming problems on four different parameters on a 2 or 3 point scale — how hard the data structure used in the implementation is (and what data structure is returned from the target function), how hard it is to conceive the algorithm, how hard it is to implement the algorithm, and how hard it is to handle edge cases for the given problem. [See the attached PDF for more details on the metrics and the rubric followed.] There was some nuance involved in choosing these metrics – for instance, the algorithm for a problem could be hard to conceive if, say, it requires thinking through a dynamic programming approach, but its implementation can be fairly easy, involving a couple of loops. On the other hand, the algorithms to sort and then merge a bunch of arrays can be simple in themselves, but implementing such a requirement could be a hassle.
For these problems, we had responses from some 8000 CS undergraduates each. Each problem was delivered to a test-taker in a randomized testform. From this, we pulled out how many people were able to write compilable code (this ranged from as low as 3.6% to as high as 74% across problems) and how many got all test cases right. We wanted to see how well we could predict this using our expert-driven difficulty metrics. (Our difficulties are relative and can change with the sample; for an absolute analysis, we could have predicted the IRT parameters of the question — wanna try?)
So, what came out? Yes, we can predict! Here are the raw correlations of the pass rate with each of the four difficulty parameters. They are negative because a harder problem has a lower correct rate.

| | Data structure | Algorithm | Implementation | Edge cases |
|---|---|---|---|---|
| Percent-pass all test cases | -0.25 | -0.42 | -0.43 | -0.05 |
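As a toy illustration of this step (with made-up numbers, not our data), the correlation between an expert rating and observed pass rates is a one-liner:

```python
import numpy as np

# Illustrative only: expert algorithm-difficulty rating per problem,
# and the fraction of test-takers who passed all test cases.
algo_difficulty = np.array([1, 1, 2, 2, 3, 3])
pass_rate = np.array([0.70, 0.65, 0.40, 0.35, 0.10, 0.05])

r = np.corrcoef(algo_difficulty, pass_rate)[0, 1]
print(round(r, 2))  # strongly negative: harder problems, lower pass rate
```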
We tried a first-cut analysis on our data by building a regression tree with some simple cross-validation. We got a really cool, intuitive tree and a prediction accuracy of 0.81! This is our ‘Tree of Program Difficulty’. So what do we learn?
The primary metric in predicting whether a good percentage of people are able to solve a problem right is the algorithmic difficulty. Problems for which the algorithm is easy to deduce (<1.5) immediately witness a high pass rate, whereas those for which it is hard (>2.5) witness a very poor pass rate. For those that are moderately hard algorithmically (between 1.5 and 2.5), the next criterion deciding the pass percentage is the difficulty in implementing the algorithm. If it’s easy to implement (<2), we see a high pass rate being predicted. For those that are moderately hard in implementation and algorithm, the difficulty of the data structures used in the problem then predicts the pass rate. If an advanced data structure is used, the rate falls to less than 6% and is around a moderate 11% otherwise.
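The tree above can be written out as plain decision rules. This is a sketch of the structure just described, not the fitted model itself; in particular, the cutoff for an ‘advanced’ data structure rating (taken as 2 here) is our assumption, since the post doesn’t state one.

```python
def predicted_pass_band(algo, implementation, data_structure):
    """Coarse pass-rate band from expert ratings, following the
    'Tree of Program Difficulty' described in the text."""
    if algo < 1.5:                 # algorithm easy to deduce
        return "high"
    if algo > 2.5:                 # algorithm hard to deduce
        return "very low"
    # moderately hard algorithm: implementation decides next
    if implementation < 2:
        return "high"
    # moderately hard algorithm and implementation: data structure decides
    # (assumed: a rating of 2+ counts as an 'advanced' data structure)
    return "~6%" if data_structure >= 2 else "~11%"

print(predicted_pass_band(1, 1, 1))  # high
print(predicted_pass_band(3, 2, 2))  # very low
print(predicted_pass_band(2, 3, 3))  # ~6%
```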
So, what nodes do your problems fall on? Does it match our result? Tell us!
Thanks Ramakant for the nifty work with data!
-Shashank and Varun