AI can help you spot the right programmer!

The industry today is on a constant look-out for good programmers. In this new age of digital services and products, it’s a premium to possess programming skills. Whenever a friend asks me to refer a good programmer for his company, I tell him – why would I refer to you, I will hire her for my team! But what does having programming skills really mean? What do we look for when we hire programmers? An ability to write functionally correct programs – those that pass test cases? Nah..

A seasoned interviewer would tell you that there is much more to writing code than passing test cases! For starters, we really care for how well a candidate understands the problem and approaches a solution than being able to write functionally correct code. “Did the person get the logic?” is generally the question discussed among interviewers. We’re also interested in seeing whether the candidate’s logic (algorithm) is efficient – with low time and space complexity. Besides this, we also care for how readable and maintainable a candidate makes her code – a very frustrating problem for the industry today is to deal with badly written code that breaks under exceptions and which is not amenable to fixes.

So then why do we all use automated programming assessments in the market which base themselves on just the number of test cases passed? It probably is because it is believed that there’s nothing better. If AI can drive cars automatically these days, can it not grade programs like humans? It can. In our KDD paper in 2014 [1], we showed that by using machine learning, we could grade programs on all these parameters as well as humans do. The machine’s agreement with human experts was as high as 0.8-0.9 correlation points!

Why is this useful? We looked at a sample of 90,000 candidates who took Automata, our programming evaluation product, in the US. These were seniors graduating with a computer science/engineering degree and were interested in IT jobs. They were scored on four metrics – percent test-cases passed, correctness of logic as detected by our ML algorithm (scored on a scale of 1-5), run-time efficiency (on a scale of 1-3) and best coding practices used (on a scale of 1-4). We find answers to all that an interviewer cares for –

  • Is the logic/thought process right?
  • Is this going to be an efficient code?
  • Will it be maintainable and scalable?

Fig.1. Distribution of the different score metrics. Around 48% of the candidates who scored 4 on the code logic metric (as detected by our ML algorithm) had passed less than 50% of their test suite.

A clever man commits no minor blunders – Von Goethe

Typically, companies look for candidates who get the code nearly right; say, those who pass 80% test-cases in a test-suite. 36% of the seniors made it through such a criteria. In a typical recruiting scenario, the remaining 64% would have been rejected and not considered for further processes. We turned to our machine learning algorithm to see how many of these “left out” had actually written logically correct programs having only silly errors. What do we find? Another good 16% of these were scored 4 or above by our system, which meant they had written codes which had ‘correct control structures and critical data-dependencies but had some silly mistakes’. These folks should have been considered for the remainder of the process! Smart algorithms are able to spot what would be totally missed by test cases but which could have been spotted by human experts.

incorrect_codes

Sample codes which fail most of their testsuites but are scored high by our ML system (click to enlarge)

We find a lot of these candidates’ codes pass less than 50% test cases (see figure 1). Why would this be happening? We sifted through some examples and found some very interesting ways in which students made errors. For the problem requiring to remove vowels in a string, a candidate had missed what even the best in the industry fall prey to at times – an incorrect usage of ORs and ANDs with the negation operator! Another had implemented the logic well but messed up on the very last line. He lacked the knowledge of converting character arrays back to strings in Java. The lack of such specific knowledge is typical of those who haven’t spent enough time with Java; but this is easy to learn and shouldn’t cost them their performance in an assessment and shouldn’t stop them from differentiating themselves from those who couldn’t think of the algorithm.

 

Fig.2. Distribution of runtime efficiency scores and program practices scores in perspective of the nature of the attempted problems

The computing scientist’s main challenge is not to get confused by the complexities of his own making – E. W. Dijkstra

We identified those who had written nearly correct code; those who could think of algorithms and put them into a programming language. However, was this just some shabbily written code which no one would want to touch? Further, had they written efficient code or would it have resulted in the all so familiar “still loading” messages we see in our apps and software. We find that roughly 38% folks thought of an efficient algorithm while writing their codes and 30% wrote code which was acceptable to reside in professional software systems. Together, roughly 20% students write programs which are both readable and efficient. Thus, AI tells us that there are 20% programmers here whom we should really talk to and hire! If we are ready to train them and run with them a little, there are various other score cuts one could use to make an informed decision.

In all, AI can not only drive cars, but find programmers who can make a driverless car. Thinking how to find data scientists  by using data science? – coming soon.

Interested in using Automata to better understand the programming talent you evaluate? Do you have a different take on this? Tell us by writing to research@aspiringminds.com

Gursimran, Shashank and Varun

References

[1] Srikant, Shashank, and Aggarwal, Varun. “A system to grade computer programming skills using machine learning.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014. Continue reading

Paper accepts at ICML and KDD!

Some more good news!

Soon after our recent acceptance of our spoken English grading work at ACL, our work on learning models for job selection and personalized feedback gets accepted at the workshop Machine Learning for Education at ICML! Some results from this paper were discussed in one of our previous posts. The tool was built five years ago and has since helped a couple of million students get personalized feedback and aided 200+ companies hire better. I shall also be giving an invited talk at this workshop.

Earlier this month, we also got a paper at KDD accepted, which builds on our previous work in spontaneous speech evaluation. We find how well we can grade spontaneous speech of natives of different countries and also analyze the benefits the industry gets with such an evaluation system.

Busy year ahead it seems – paper presentations at France, Beijing, Australia and finally New Jersey, where we’re organizing the second edition of ASSESS, our annual workshop on data mining for educational assessment and feedback. It’s being organized at ICDM 2015 this winter. July 20th is the submission deadline for the workshop. Here is a list of submissions we saw in our workshop last year, at KDD. Spread the word!

– Varun

What we learn from patterns in test case statistics of student-written computer programs

Test cases evaluate whether a computer program is doing what it’s supposed to do. There are various ways to generate them – automatically based on specifications, say by ensuring code coverage [1] or by subject matter experts (SMEs) who think through conditions based on the problem specification.

We asked ourselves whether there was something we could learn by looking at how student programs responded to test cases. Could this help us design better test cases or find flaws in them? By looking at such responses from a data-driven perspective, we wanted to know whether we could .a. design better test cases .b. understand whether there existed any clusters in the way responses on test cases were obtained and .c.  whether we could discover salient concepts needed to solve a particular programming problem, which would then inform us of the right pedagogical interventions.

A visualization which shows how our questions cluster by the average test case score received on them. More info on this in another blog :)

A visualization which shows how our questions cluster by the average test case score received on them. More on this in another post :)

We built a cool tool which helped us look at statistics on over 2500 test cases spread across over fifty programming problems attempted by nearly 18,000 students and job-seekers in a span of four weeks!

We were also able to visualize how these test cases clustered for each problem, how they correlated with other cases across candidate responses and were also able to see what their item response curves looked like. Here are a couple of things we learnt in this process:

One of our problems required students to print comma-separated prime numbers starting from 2 till a given integer N. When designing test cases for this problem, our SMEs expected there to be certain edge cases (when N was less than 2) and some stress cases (when N was very large) while expecting the remainder of the cases to check the output for random values of N, without expecting them to behave any differently. Or so they thought. :) On clustering the responses obtained on each of the test cases for these problems (0 for failing a case and 1 for passing it), we found two very distinct clusters being formed (see figure below) besides the lone test case which checked for the edge condition. A closer look at some of the source codes helped us realize that values of N which were not prime numbers had to be handled differently – a trailing comma remained at the very end of the list and lots of students were not doing this right!

A dendogram depicting test case clustering for the prime-print problem

A dendogram depicting test case clustering for the prime-print problem

This was interesting! It showed that the problem’s hardness was not only linked to the algorithm of producing prime numbers till a given number, but also linked to the nuance of printing it in a specific form. In spite of students getting the former right, a majority of them did not get the latter right. There are several learnings from this. If the problem designer just wants to assess if students know the algorithm to generate primes till a number, s/he should drop the part to print them in a comma separated list – it adds an uncalled for impurity to the assessment objective. On the other hand, if both these skills are to be tested, our statistics is a way to confirm the existence of these two different skills – getting one right does not mean the other is doable (say, can this help us figure out dominant cognitive skills that are needed in programming?). By separating the test cases to check the trailing comma case and reporting a score on it separately, we could ideally give an assessor granular information on what the code is trying to achieve. Contrast this to when test cases were simply bundled together and it wasn’t clear what aspect the person got right.

More so, when we designed this problem, the assessment objective was to primarily check the algorithm for generating prime numbers. Unfortunately, the cases that did not handle the trailing comma went down on their test case scores in spite of having met our assessment criterion. The good news here was that our machine learning algorithm [2] niftily picked it up and was able to say by the virtue of their semantic features that they were doing the right job!

We also fit 3-PL models from Item Response Theory (more info) on each test case for some of our problems and have some interesting observations there on how we could relate item-parameters to test case design – more on this in a separate post!

Have ideas on how you could make use of such numbers and derive some interesting information? Write to us, or better, join our research group! :)

Kudos to Nishanth for putting together the neat tool to be able to visualize the clusters! Thanks to Ramakant and Bhavya for spotting this issue in their analysis.

– Shashank and Varun

 References -

[1] KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, Cadar, Cristian, Daniel Dunbar, and Dawson R. Engler. OSDI. Vol. 8. 2008.

[2] A system to grade computer programming skills using machine learning, Srikant, Shashank, and Varun Aggarwal. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.