AM Research is a division of Aspiring Minds. Aspiring Minds aspires to build an assessment-driven job marketplace (a SAT/GRE for jobs) to drive accountability in higher education and meritocracy in labor markets. The products developed based on our research has impacted more than two million lives and the resulting data is a source of continuous new research.
A cocktail of assessment, HR, machine learning, data science, education, social impact with two teaspoons of common sense stirred in it.
In fall 2014, we organized ASSESS, the first workshop on data mining for educational assessment and feedback, at KDD 2014 [link]. The workshop brought together a total of 80 participants including education psychologists, computer scientists and practitioners under one roof and led to a thoughtful discussion. We have put together a white paper which captures our key discussions from the workshop. The paper primarily discusses why assessments are important, what is the state of the art and what goals should we pursue as a community. It is a brief exposition and serves as a starting point for a discussion to set the agenda for the next decade.
Why are assessments important?
Automated and semi-automated assessments are a key to scaling learning, validating pedagogical innovations, and delivering socio-economic benefits of learning.
- Practice and Feedback: Whether considering large-scale learning for vocational training or non-vocational education, automating delivery of high-quality content is not enough. We need to be able to automate or semi-automate assessments for formative purposes. Substantial evidence indicates that learning is enhanced through doing assignments and obtaining feedback on one’s attempts. In addition, the so-called “testing effect” demonstrates that repeated testing with feedback enhances students long-term retention of information. By automating assessments, students can get real-time feedback on their learning in a way that scales with the number of students. Automated assessments may become, in some sense, “automated teaching assistants”.
- Education Pedagogy: There is a great need to understand which teaching/learning/delivery models of pedagogy are better than others, especially with new emerging modes and platforms for education. To understand the impact of and compare different pedagogies, we need assessments that can summatively measure learning outcomes precisely and accurately. Without valid assessments, empirical research on learning and pedagogy becomes questionable.
- Learning to socio-economic mobility: For learners that seek vocational benefits, there need to be scalable ways of measuring and certifying learning so that they may garner socio-economic benefits from what they’ve learnt. There need to be scalable ways of measuring learning so as to predict the KSOAs (knowledge, skills and other abilities) of learners to do specific tasks. This will help both learners and employers by driving meritocracy in labor markets through reduced information asymmetries and transaction costs. Matching of people to jobs can become more efficient.
We look forward to hearing your thoughts on the paper! Do feel free to write to firstname.lastname@example.org
This is an excerpt from the white paper ‘On Assessments – State of the art and goals’, which had contributions from Varun Aggarwal, Steven Stemler, Lav Varshney and Divyanshu Vats, co-organizers, ASSESS 2014 at KDD. The full paper can be accessed here.
The industry today is on a constant look-out for good programmers. In this new age of digital services and products, it’s a premium to possess programming skills. Whenever a friend asks me to refer a good programmer for his company, I tell him – why would I refer to you, I will hire her for my team! But what does having programming skills really mean? What do we look for when we hire programmers? An ability to write functionally correct programs – those that pass test cases? Nah..
A seasoned interviewer would tell you that there is much more to writing code than passing test cases! For starters, we really care for how well a candidate understands the problem and approaches a solution than being able to write functionally correct code. “Did the person get the logic?” is generally the question discussed among interviewers. We’re also interested in seeing whether the candidate’s logic (algorithm) is efficient – with low time and space complexity. Besides this, we also care for how readable and maintainable a candidate makes her code – a very frustrating problem for the industry today is to deal with badly written code that breaks under exceptions and which is not amenable to fixes.
So then why do we all use automated programming assessments in the market which base themselves on just the number of test cases passed? It probably is because it is believed that there’s nothing better. If AI can drive cars automatically these days, can it not grade programs like humans? It can. In our KDD paper in 2014 , we showed that by using machine learning, we could grade programs on all these parameters as well as humans do. The machine’s agreement with human experts was as high as 0.8-0.9 correlation points!
Why is this useful? We looked at a sample of 90,000 candidates who took Automata, our programming evaluation product, in the US. These were seniors graduating with a computer science/engineering degree and were interested in IT jobs. They were scored on four metrics – percent test-cases passed, correctness of logic as detected by our ML algorithm (scored on a scale of 1-5), run-time efficiency (on a scale of 1-3) and best coding practices used (on a scale of 1-4). We find answers to all that an interviewer cares for –
- Is the logic/thought process right?
- Is this going to be an efficient code?
- Will it be maintainable and scalable?
A clever man commits no minor blunders – Von Goethe
Typically, companies look for candidates who get the code nearly right; say, those who pass 80% test-cases in a test-suite. 36% of the seniors made it through such a criteria. In a typical recruiting scenario, the remaining 64% would have been rejected and not considered for further processes. We turned to our machine learning algorithm to see how many of these “left out” had actually written logically correct programs having only silly errors. What do we find? Another good 16% of these were scored 4 or above by our system, which meant they had written codes which had ‘correct control structures and critical data-dependencies but had some silly mistakes’. These folks should have been considered for the remainder of the process! Smart algorithms are able to spot what would be totally missed by test cases but which could have been spotted by human experts.
We find a lot of these candidates’ codes pass less than 50% test cases (see figure 1). Why would this be happening? We sifted through some examples and found some very interesting ways in which students made errors. For the problem requiring to remove vowels in a string, a candidate had missed what even the best in the industry fall prey to at times – an incorrect usage of ORs and ANDs with the negation operator! Another had implemented the logic well but messed up on the very last line. He lacked the knowledge of converting character arrays back to strings in Java. The lack of such specific knowledge is typical of those who haven’t spent enough time with Java; but this is easy to learn and shouldn’t cost them their performance in an assessment and shouldn’t stop them from differentiating themselves from those who couldn’t think of the algorithm.
The computing scientist’s main challenge is not to get confused by the complexities of his own making – E. W. Dijkstra
We identified those who had written nearly correct code; those who could think of algorithms and put them into a programming language. However, was this just some shabbily written code which no one would want to touch? Further, had they written efficient code or would it have resulted in the all so familiar “still loading” messages we see in our apps and software. We find that roughly 38% folks thought of an efficient algorithm while writing their codes and 30% wrote code which was acceptable to reside in professional software systems. Together, roughly 20% students write programs which are both readable and efficient. Thus, AI tells us that there are 20% programmers here whom we should really talk to and hire! If we are ready to train them and run with them a little, there are various other score cuts one could use to make an informed decision.
In all, AI can not only drive cars, but find programmers who can make a driverless car. Thinking how to find data scientists by using data science? – coming soon.
Interested in using Automata to better understand the programming talent you evaluate? Do you have a different take on this? Tell us by writing to email@example.com
Gursimran, Shashank and Varun
 Srikant, Shashank, and Aggarwal, Varun. “A system to grade computer programming skills using machine learning.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.
What is your job title/role/designation? If I ask this question to everyone in the world, probably I will get a million different answers. But, there are not a million different job roles, probably 2500-3500 of them, called by various names. We want to classify these many job titles into our internal taxonomy of 1500+ job roles. Why? We want to be able to tell what skills they need for each of the many jobs listed on the web and provide this information to the labor market. We want to tell jobseekers what jobs in the open market match them based on their AMCAT scores and what skills they need to improve for particular jobs. Similarly, we want to tell companies what candidates are suitable for their open job roles. This will not only lead to more efficient matching, but illumine training needs in the labor market.
Well, this becomes a 1500+ class classification problem! For every job role title and its job description (which may not exist for every title), we need to come up with one of our 1500+ job titles or just say we cannot do it. This is a challenging problem – large number of classes means needing huge amounts of labeled data for supervised learning; getting accurate expert ratings isn’t easy. Imagine doing a naive bayes on it – we will need at least 100 points per class, meaning a total requirement of 150K labeled data points. One thus needs to go with a mix of unsupervised and supervised learning to tackle a problem such as this.
We took this challenge up six months back with the usual toolbox of unigram/bigram frequency counts, SVDs, stemming, various distance metrics and so on. Good news! We came back with an 80-80 precision recall on test set (and improving): We can tell the title for 80% of the jobs and 80% times we were correct. Did our toolbox help – yes indeed, but with a lot of other innovations. The key learning: for real world application, general machine learning techniques do work, but is usually coupled with a lot of other smart engineering and innovations to make it happen – the best algorithm is a mix of rules from human intuition and statistical techniques. One needs to understand the problem domain well – to save oneself us from the no free lunch theorem by throwing a generic ML technique. So let us look at few of these innovations:
a. Vocabulary filter: Consider the titles “Junior CAD Operator-Northeast Commercial” or “PT M-F Summer Nanny for 4mo twins in Midtown West”. A lot of these words do not tell us about the job role, definitely not ‘Northeast Commercial’ or ‘Midtown West’. So could we just filter them out and save our ML algorithm the effort to figure out the right ‘features’? Second, we could tag the words semantically in two lists: one which tells us about the job function, say “CAD operator” and the other the level, junior/senior/manager etc. We do our distances separately on the function and level lists and then combine them in creative ways to get good results. Works wonders!
b. Title vs. job description: How do we use the two pieces of information optimally – the job title and job description. Interestingly we find that job descriptions help you save yourself from large mistakes; it does a good coarse comparison and gets you to the right set of jobs to consider. On the other hand, title comparisons can go grossly wrong at times (matching an attorney to a lawyer) but they do better on pinpointing the right title. A job description match has higher accuracy if we consider whether the right title was among the top 10 predicted titles, but a title match has higher accuracy if we look at the top title we could match. This makes intuitive sense. But, we can combine these two to better either!
c. Logical match vs. distance: We get one input from our various fancy projected distances between the title that’s queried and the set of titles we want to map to. Besides this, we also pull out a simple input from logical matching – is the query title a subset, exact match (has all the same words), superset or overlaps with one or more of our internal list of titles. This is useful information. It helps get some stuff absolutely right by a simple way and doesn’t let our creative statistical distances ruin it! For instance, if it is an exact match with a single internal title, we just choose it. More importantly, it helps create a decision tree on what to do with each ‘kind’ of query title according to its logical match and use the matching titles as an input for further processing. Furthermore, it provides a guide on when to recall.
d. Crowdsourcing: Creating labeled data is a tough one here – it is subjective and needs some expert oversight. Crowd would not do a good job of selecting the matching title from a list of 1500 titles! Interestingly, we use the crowd, the expert and the ML predictor all feeding into each other creating a living system, which continuously improves itself (our previous work using the crowd innovatively). For instance, the ML predictor tells us the top 5 guesses, which we feed to the crowd. They tell us whether the title is one of these or not. If not, the expert jumps in. This helps us build a system where we continue to create new labeled data, benchmark our performance and improve the ML algorithm.
And this is the tip of the iceberg. One can solve seemingly very hard problems by smartly using machine learning techniques together with human intelligence — knowledge from the problem domain and the crowd. And this can create a lot of value – like in our case in the labor markets, if you look at USA, there are almost 4-5 million open positions and 8.5 million unemployed candidates. When one surveys jobseekers, 81% show lack of knowledge of skills needed for particular jobs and do not know the level of their skills. If we can fill this information gap credibly and automatically, there is hope! How do we map titles to skills scientifically – wait for another blog post!
(Work done together with Vinay Shashidhar and Shashank Srikant)
It is an open secret that data science is becoming pervasive. What was once the preserve of statisticians and computer scientists – deft at trudging through mountains of data – has found its tools and techniques percolating into every industry and every level. Peer into the crystal ball and you don’t need to suspend reality too much to imagine a future in which a factory manager looks at production data to predict what machine might break-down soon. A cab-operator analyzes his Uber receipts to figure out where he should drive to make the most money. A sales manager looks at what kinds of customers his sales agents are most successful with to ascertain who to deploy where. Decidedly, the future belongs to the data scientist. Where will these data scientists come from? Who is going to train them?
The very nature of the subject eschews traditional learning modes. The data scientist must have the ability to learn quickly the context of the data, build hypotheses, have the ability to use techniques to confirm his suspicions and then construct predictors or automated systems. It marries technology with knowledge; intuition with scientific rigor. Our education systems will be slow to adapt – they will have to devise new methodologies, develop syllabi and learn to simultaneously involve multiple teachers. In the meanwhile, a whole generation of students might graduate who do not have the skills that industry expects from them in a data rich environment.
At Aspiring Minds, we’re passionate about helping students reach their full potential. We plan to pursue a series of initiatives to help advance data science education in India and around the world. As a first step, we held a data science camp for elementary school students! The participants continuously surprised us – with their knowledge, their understanding and even their wit. Two things became clear quickly – a. kids seldom confront open-ended problems and it took some getting used-to the idea of there being no one correct, pre-decided answer and b. with some guidance, they learn astonishingly quickly.
Read more about our exciting and rewarding weekend here!
At the end of the camp, the participating kids blogged about their experiences and the plots/analysis that they came up with. Read about them here.
Our team got enthusiastically involved in mentoring the students through the exercise and ended up learning more about their own teaching styles in the process.
We’ve also put out the exercises and resources we used for the camp for you to replicate it in your school/university/workplace. If the thought of indulging high schoolers in data-science seems absurd to you, snap out of it! It is possible; we tried it and the kids had a fun time picking up these concepts.
Let us know what you thought of our data camp. Please do write to us if you go ahead and try this out with students around you. We’ll eagerly look forward to that!
Research Intern, Aspiring Minds
Class of 2017, Computer science, Harvard.
Some more good news!
Soon after our recent acceptance of our spoken English grading work at ACL, our work on learning models for job selection and personalized feedback gets accepted at the workshop Machine Learning for Education at ICML! Some results from this paper were discussed in one of our previous posts. The tool was built five years ago and has since helped a couple of million students get personalized feedback and aided 200+ companies hire better. I shall also be giving an invited talk at this workshop.
Earlier this month, we also got a paper at KDD accepted, which builds on our previous work in spontaneous speech evaluation. We find how well we can grade spontaneous speech of natives of different countries and also analyze the benefits the industry gets with such an evaluation system.
Busy year ahead it seems – paper presentations at France, Beijing, Australia and finally New Jersey, where we’re organizing the second edition of ASSESS, our annual workshop on data mining for educational assessment and feedback. It’s being organized at ICDM 2015 this winter. July 20th is the submission deadline for the workshop. Here is a list of submissions we saw in our workshop last year, at KDD. Spread the word!
- AI powered coding platform ‘understands’ programs that do not compile!
- Plan what NOT to do in 2017!
- World’s first automated motor skill test – exploiting the power of touch tablets
- The first interactive US Skill Demand Map- A big data approach
- Scaling up machine learning to grade computer programs for 1000s of questions in multiple languages
- assessment research
- Big Data
- Computer Program Assessments
- Data science
- decision trees
- hiring assessment
- hiring test
- item difficulty
- Kids learning
- Machine Learning
- motor skill test
- online hiring assessment
- online hiring test
- programming assessments
- programming test
- Test Cases
- testing research