80-80 precision recall with 1500+ classes!

Some crazy titles we had to deal with!

What is your job title/role/designation? If I ask this question to everyone in the world, I will probably get a million different answers. But there are not a million different job roles; there are probably 2500-3500 of them, called by various names. We want to classify these many job titles into our internal taxonomy of 1500+ job roles. Why? We want to be able to tell what skills are needed for each of the many jobs listed on the web and provide this information to the labor market. We want to tell jobseekers what jobs in the open market match them based on their AMCAT scores and what skills they need to improve for particular jobs. Similarly, we want to tell companies what candidates are suitable for their open job roles. This will not only lead to more efficient matching, but also illuminate training needs in the labor market.

Well, this becomes a 1500+ class classification problem! For every job title and its job description (which may not exist for every title), we need to come up with one of our 1500+ job titles or just say we cannot do it. This is a challenging problem – a large number of classes means needing huge amounts of labeled data for supervised learning, and getting accurate expert ratings isn’t easy. Imagine doing Naive Bayes on it – we would need at least 100 points per class, meaning a total requirement of 150K labeled data points. One thus needs to go with a mix of unsupervised and supervised learning to tackle a problem such as this.
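Since the system is allowed to say "we cannot do it", two numbers describe its quality: the share of titles it answers at all, and the share of those answers that are right. Here is a minimal sketch of that bookkeeping, following the sense in which this post uses "recall" and "precision". The function name and the toy labels are made up for illustration, and `None` stands for an abstention.

```python
from typing import Optional, Sequence, Tuple


def precision_recall_with_abstention(
    predicted: Sequence[Optional[str]],
    actual: Sequence[str],
) -> Tuple[float, float]:
    """Return (precision, recall) for a classifier that may abstain.

    recall    = fraction of queries for which a title was returned
    precision = fraction of returned titles that were correct
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    # Keep only the queries the classifier chose to answer.
    answered = [(p, a) for p, a in zip(predicted, actual) if p is not None]
    correct = sum(1 for p, a in answered if p == a)
    recall = len(answered) / len(actual) if actual else 0.0
    precision = correct / len(answered) if answered else 0.0
    return precision, recall


# Toy data: one abstention (None) and one wrong answer out of five queries.
predicted = ["nurse", None, "cad operator", "lawyer", "nanny"]
actual = ["nurse", "chef", "cad operator", "paralegal", "nanny"]
precision, recall = precision_recall_with_abstention(predicted, actual)
print(precision, recall)
```

On this toy data the classifier answers four of five queries and gets three of those four right.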

We took this challenge up six months back with the usual toolbox of unigram/bigram frequency counts, SVDs, stemming, various distance metrics and so on. Good news! We came back with an 80-80 precision recall on the test set (and improving): we can tell the title for 80% of the jobs, and 80% of the time we are correct. Did our toolbox help – yes indeed, but with a lot of other innovations. The key learning: for real world applications, general machine learning techniques do work, but they are usually coupled with a lot of other smart engineering and innovations to make it happen – the best algorithm is a mix of rules from human intuition and statistical techniques. One needs to understand the problem domain well to save oneself from the no free lunch theorem, rather than just throwing a generic ML technique at the problem. So let us look at a few of these innovations:

a. Vocabulary filter: Consider the titles “Junior CAD Operator-Northeast Commercial” or “PT M-F Summer Nanny for 4mo twins in Midtown West”. A lot of these words do not tell us about the job role, definitely not ‘Northeast Commercial’ or ‘Midtown West’. So could we just filter them out and save our ML algorithm the effort to figure out the right ‘features’? Second, we could tag the words semantically in two lists: one which tells us about the job function, say “CAD operator” and the other the level, junior/senior/manager etc. We do our distances separately on the function and level lists and then combine them in creative ways to get good results. Works wonders!
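To make the idea concrete, here is a minimal sketch of such a filter. The two word lists are tiny, hand-made stand-ins (an assumption for illustration, not our actual vocabulary), and `split_title` is a hypothetical helper name.

```python
import re
from typing import Dict, List

# Hypothetical, hand-made vocabularies; a real system would curate far larger lists.
FUNCTION_WORDS = {"cad", "operator", "nanny", "engineer", "software"}
LEVEL_WORDS = {"junior", "senior", "manager", "lead"}


def split_title(title: str) -> Dict[str, List[str]]:
    """Tag each token of a raw title as function, level, or dropped."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    result: Dict[str, List[str]] = {"function": [], "level": [], "dropped": []}
    for token in tokens:
        if token in FUNCTION_WORDS:
            result["function"].append(token)
        elif token in LEVEL_WORDS:
            result["level"].append(token)
        else:
            # Words outside both vocabularies (locations, schedules, ...) are filtered out.
            result["dropped"].append(token)
    return result


print(split_title("Junior CAD Operator-Northeast Commercial"))
print(split_title("PT M-F Summer Nanny for 4mo twins in Midtown West"))
```

Distances can then be computed separately on the `function` and `level` lists before being combined.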

b. Title vs. job description: How do we use the two pieces of information optimally – the job title and the job description? Interestingly, we find that job descriptions help you save yourself from large mistakes; they do a good coarse comparison and get you to the right set of jobs to consider. On the other hand, title comparisons can go grossly wrong at times (matching an attorney to a lawyer) but they do better on pinpointing the right title. A job description match has higher accuracy if we consider whether the right title was among the top 10 predicted titles, but a title match has higher accuracy if we look at the top title we could match. This makes intuitive sense. But we can combine these two to do better than either!
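One plausible way to combine the two signals (a sketch under our own assumptions, not necessarily the combination we ship) is to let the description similarity pick a shortlist and let the title similarity choose within it. The score dictionaries and the function name below are hypothetical.

```python
from typing import Dict, Optional


def combine_title_and_description(
    description_scores: Dict[str, float],
    title_scores: Dict[str, float],
    shortlist_size: int = 10,
) -> Optional[str]:
    """Shortlist by description similarity, then pick the best title match."""
    # Coarse step: the top-k internal titles according to the job description.
    shortlist = sorted(
        description_scores, key=lambda t: description_scores[t], reverse=True
    )[:shortlist_size]
    if not shortlist:
        return None
    # Fine step: within the shortlist, trust the title similarity.
    return max(shortlist, key=lambda t: title_scores.get(t, 0.0))


# Toy similarity scores (higher means more similar); values are made up.
description_scores = {"lawyer": 0.9, "paralegal": 0.8, "chef": 0.1}
title_scores = {"chef": 0.95, "paralegal": 0.7, "lawyer": 0.4}
best = combine_title_and_description(description_scores, title_scores, shortlist_size=2)
print(best)
```

Here the title score alone would have picked "chef", a gross mistake that the description shortlist rules out.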

c. Logical match vs. distance: We get one input from our various fancy projected distances between the title that’s queried and the set of titles we want to map to. Besides this, we also pull out a simple input from logical matching – is the query title a subset, exact match (has all the same words), superset or overlaps with one or more of our internal list of titles. This is useful information. It helps get some stuff absolutely right in a simple way and doesn’t let our creative statistical distances ruin it! For instance, if it is an exact match with a single internal title, we just choose it. More importantly, it helps create a decision tree on what to do with each ‘kind’ of query title according to its logical match and use the matching titles as an input for further processing. Furthermore, it provides a guide on when to recall.
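The logical match is easy to sketch with word sets. This is an illustrative version that assumes whitespace tokenization and case-insensitive comparison; the function names are ours for this example.

```python
from typing import List, Optional


def logical_match(query: str, internal: str) -> str:
    """Classify how the words of a query title relate to an internal title."""
    q = set(query.lower().split())
    t = set(internal.lower().split())
    if not q or not t:
        return "none"
    if q == t:
        return "exact"      # has all the same words
    if q < t:
        return "subset"     # every query word appears in the internal title
    if q > t:
        return "superset"   # the query contains the whole internal title and more
    if q & t:
        return "overlap"    # some words shared, neither contains the other
    return "none"


def pick_if_unique_exact(query: str, internal_titles: List[str]) -> Optional[str]:
    """Return the internal title if exactly one of them matches the query exactly."""
    exact = [t for t in internal_titles if logical_match(query, t) == "exact"]
    return exact[0] if len(exact) == 1 else None


titles = ["cad operator", "cad designer", "nanny"]
print(logical_match("Junior CAD Operator", "cad operator"))
print(pick_if_unique_exact("CAD Operator", titles))
```

The returned category can then serve as the branching variable of a decision tree over query titles.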

d. Crowdsourcing: Creating labeled data is a tough one here – it is subjective and needs some expert oversight. The crowd would not do a good job of selecting the matching title from a list of 1500 titles! Interestingly, we use the crowd, the expert and the ML predictor all feeding into each other, creating a living system which continuously improves itself (our previous work using the crowd innovatively). For instance, the ML predictor tells us the top 5 guesses, which we feed to the crowd. They tell us whether the title is one of these or not. If not, the expert jumps in. This helps us build a system where we continue to create new labeled data, benchmark our performance and improve the ML algorithm.
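The routing step of this loop can be sketched in a few lines. This is a simplified illustration: `route_label` is a hypothetical name, and a single crowd answer (or `None` for "none of these") stands in for whatever aggregation of crowd votes a real pipeline would use.

```python
from typing import List, Optional, Tuple


def route_label(
    top_guesses: List[str], crowd_choice: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Decide who supplies the label for one query title.

    top_guesses:  the ML predictor's top candidates shown to the crowd
    crowd_choice: the candidate the crowd picked, or None for "none of these"
    """
    if crowd_choice is not None and crowd_choice in top_guesses:
        # The crowd confirmed one of the predictor's guesses: new labeled data.
        return ("crowd", crowd_choice)
    # None of the guesses fit, so the item is escalated to an expert.
    return ("expert", None)


guesses = ["cad operator", "cad designer", "draftsman", "architect", "civil engineer"]
print(route_label(guesses, "draftsman"))
print(route_label(guesses, None))
```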

And this is the tip of the iceberg. One can solve seemingly very hard problems by smartly using machine learning techniques together with human intelligence — knowledge from the problem domain and the crowd. And this can create a lot of value – like in our case in the labor markets, if you look at USA, there are almost 4-5 million open positions and 8.5 million unemployed candidates. When one surveys jobseekers, 81% show lack of knowledge of skills needed for particular jobs and do not know the level of their skills. If we can fill this information gap credibly and automatically, there is hope! How do we map titles to skills scientifically – wait for another blog post!

-Varun
(Work done together with Vinay Shashidhar and Shashank Srikant)

Data science camp for kids!

It is an open secret that data science is becoming pervasive. What was once the preserve of statisticians and computer scientists – deft at trudging through mountains of data – has found its tools and techniques percolating into every industry and every level. Peer into the crystal ball and you don’t need to suspend reality too much to imagine a future in which a factory manager looks at production data to predict what machine might break-down soon. A cab-operator analyzes his Uber receipts to figure out where he should drive to make the most money. A sales manager looks at what kinds of customers his sales agents are most successful with to ascertain who to deploy where. Decidedly, the future belongs to the data scientist. Where will these data scientists come from? Who is going to train them?

The very nature of the subject eschews traditional learning modes. The data scientist must have the ability to quickly learn the context of the data, build hypotheses, use techniques to confirm his suspicions and then construct predictors or automated systems. It marries technology with knowledge; intuition with scientific rigor. Our education systems will be slow to adapt – they will have to devise new methodologies, develop syllabi and learn to simultaneously involve multiple teachers. In the meanwhile, a whole generation of students might graduate who do not have the skills that industry expects from them in a data rich environment.

At Aspiring Minds, we’re passionate about helping students reach their full potential. We plan to pursue a series of initiatives to help advance data science education in India and around the world. As a first step, we held a data science camp for elementary school students! The participants continuously surprised us – with their knowledge, their understanding and even their wit. Two things became clear quickly – a. kids seldom confront open-ended problems and it took them some getting used to the idea of there being no one correct, pre-decided answer and b. with some guidance, they learn astonishingly quickly.

Read more about our exciting and rewarding weekend here!

At the end of the camp, the participating kids blogged about their experiences and the plots/analysis that they came up with. Read about them here.

Our team got enthusiastically involved in mentoring the students through the exercise and ended up learning more about their own teaching styles in the process.

We’ve also put out the exercises and resources we used for the camp for you to replicate it in your school/university/workplace. If the thought of indulging high schoolers in data-science seems absurd to you, snap out of it! It is possible; we tried it and the kids had a fun time picking up these concepts.

Let us know what you thought of our data camp. Please do write to us if you go ahead and try this out with students around you. We’ll eagerly look forward to that!

Samarth Singal
Research Intern, Aspiring Minds
Class of 2017, Computer science, Harvard.

Paper accepts at ICML and KDD!

Some more good news!

Soon after the recent acceptance of our spoken English grading work at ACL, our work on learning models for job selection and personalized feedback has been accepted at the Machine Learning for Education workshop at ICML! Some results from this paper were discussed in one of our previous posts. The tool was built five years ago and has since helped a couple of million students get personalized feedback and helped 200+ companies hire better. I shall also be giving an invited talk at this workshop.

Earlier this month, we also had a paper accepted at KDD, which builds on our previous work in spontaneous speech evaluation. We examine how well we can grade the spontaneous speech of natives of different countries and also analyze the benefits the industry gets from such an evaluation system.

Busy year ahead it seems – paper presentations in France, Beijing, Australia and finally New Jersey, where we’re organizing the second edition of ASSESS, our annual workshop on data mining for educational assessment and feedback. It’s being organized at ICDM 2015 this winter. July 20th is the submission deadline for the workshop. Here is a list of submissions we saw in our workshop last year, at KDD. Spread the word!

– Varun

What we learn from patterns in test case statistics of student-written computer programs

Test cases evaluate whether a computer program is doing what it’s supposed to do. There are various ways to generate them – automatically based on specifications, say by ensuring code coverage [1] or by subject matter experts (SMEs) who think through conditions based on the problem specification.

We asked ourselves whether there was something we could learn by looking at how student programs responded to test cases. Could this help us design better test cases or find flaws in them? By looking at such responses from a data-driven perspective, we wanted to know whether we could (a) design better test cases, (b) understand whether there existed any clusters in the way responses on test cases were obtained and (c) discover salient concepts needed to solve a particular programming problem, which would then inform us of the right pedagogical interventions.

A visualization which shows how our questions cluster by the average test case score received on them. More info on this in another blog :)

We built a cool tool which helped us look at statistics on over 2500 test cases spread across over fifty programming problems attempted by nearly 18,000 students and job-seekers in a span of four weeks!

We were also able to visualize how these test cases clustered for each problem, how they correlated with other cases across candidate responses and were also able to see what their item response curves looked like. Here are a couple of things we learnt in this process:

One of our problems required students to print comma-separated prime numbers starting from 2 till a given integer N. When designing test cases for this problem, our SMEs expected there to be certain edge cases (when N was less than 2) and some stress cases (when N was very large) while expecting the remainder of the cases to check the output for random values of N, without expecting them to behave any differently. Or so they thought. :) On clustering the responses obtained on each of the test cases for this problem (0 for failing a case and 1 for passing it), we found two very distinct clusters being formed (see figure below) besides the lone test case which checked for the edge condition. A closer look at some of the source code helped us realize that values of N which were not prime numbers had to be handled differently – a trailing comma remained at the very end of the list and lots of students were not doing this right!

A dendrogram depicting test case clustering for the prime-print problem

This was interesting! It showed that the problem’s hardness was not only linked to the algorithm of producing prime numbers till a given number, but also linked to the nuance of printing them in a specific form. In spite of students getting the former right, a majority of them did not get the latter right. There are several learnings from this. If the problem designer just wants to assess if students know the algorithm to generate primes till a number, s/he should drop the part to print them in a comma separated list – it adds an uncalled for impurity to the assessment objective. On the other hand, if both these skills are to be tested, our statistics are a way to confirm the existence of these two different skills – getting one right does not mean the other is doable (say, can this help us figure out dominant cognitive skills that are needed in programming?). By separating out the test cases that check the trailing comma case and reporting a score on them separately, we could ideally give an assessor granular information on what the code is trying to achieve. Contrast this to when test cases were simply bundled together and it wasn’t clear what aspect the person got right.
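The trailing-comma effect is easy to reproduce. The sketch below pairs a correct solution with one hypothetical student mistake (adding a comma after every prime smaller than N) and records 0/1 responses over a handful of made-up test inputs. Real student programs fail in many other ways too, so this is only an illustration of why the test cases split into two groups.

```python
from math import isqrt
from typing import List


def is_prime(k: int) -> bool:
    """Trial division up to the integer square root."""
    return k >= 2 and all(k % d for d in range(2, isqrt(k) + 1))


def print_primes_correct(n: int) -> str:
    """Comma-separated primes from 2 to n, with no trailing comma."""
    return ",".join(str(k) for k in range(2, n + 1) if is_prime(k))


def print_primes_buggy(n: int) -> str:
    """Hypothetical student version: puts a comma after every prime below n."""
    out = ""
    for k in range(2, n + 1):
        if is_prime(k):
            out += str(k)
            if k < n:  # wrong check: only right when n itself is prime
                out += ","
    return out


# Made-up test inputs: one edge case, prime values of N, and non-prime values of N.
test_inputs: List[int] = [1, 7, 10, 13, 20]
responses = [
    int(print_primes_buggy(n) == print_primes_correct(n)) for n in test_inputs
]
print(responses)
```

The buggy version passes exactly the cases where N is prime (or below 2) and fails the rest, which is the kind of split a clustering of test case responses would surface.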

Moreover, when we designed this problem, the assessment objective was primarily to check the algorithm for generating prime numbers. Unfortunately, the programs that did not handle the trailing comma went down on their test case scores in spite of having met our assessment criterion. The good news here was that our machine learning algorithm [2] niftily picked it up and was able to say by virtue of their semantic features that they were doing the right job!

We also fit 3-PL models from Item Response Theory (more info) on each test case for some of our problems and have some interesting observations there on how we could relate item-parameters to test case design – more on this in a separate post!
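For readers new to Item Response Theory, the standard three-parameter logistic (3-PL) curve gives the probability of passing an item as a function of ability. Below is a minimal sketch of that curve only, not of our fitting procedure. Reading `theta` as a student's ability and `a`, `b`, `c` as a test case's discrimination, difficulty and guessing parameters is the usual interpretation, applied here to test cases as an assumption.

```python
import math


def three_pl(theta: float, a: float, b: float, c: float) -> float:
    """3-PL item response curve: P(pass | theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Illustrative parameter values, not fitted to any data.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(three_pl(theta, a=1.5, b=0.0, c=0.2), 3))
```

At `theta == b` the curve sits halfway between the guessing floor `c` and 1, and a larger `a` makes the transition around `b` steeper.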

Have ideas on how you could make use of such numbers and derive some interesting information? Write to us, or better, join our research group! :)

Kudos to Nishanth for putting together the neat tool to be able to visualize the clusters! Thanks to Ramakant and Bhavya for spotting this issue in their analysis.

– Shashank and Varun

 References -

[1] KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, Cadar, Cristian, Daniel Dunbar, and Dawson R. Engler. OSDI. Vol. 8. 2008.

[2] A system to grade computer programming skills using machine learning, Srikant, Shashank, and Varun Aggarwal. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

Work on spoken English grading gets accepted at ACL, AM-R&D going to Beijing!

Good news! Our work on using crowdsourcing and machine learning to grade spontaneous English has been accepted at ACL 2015.

  • Ours is the first semi-automated approach to grade spontaneous speech.
  • We propose a new general technique which sits between completely automated grading techniques and peer grading: We use crowd for doing the tough human intelligence task, derive features from it and use ML to build high quality models.
  • We think this is the first time anyone has used crowdsourcing to get accurate features that are then fed into ML to build great models. Correct us if we are wrong!

Figure 1: Design of our Automated Spontaneous Speech grading system.

The technique helps scale spoken English testing, which means super scale spoken English training!

Great job Vinay and Nishant.

PS: Also check out our KDD paper on programming assessment if you haven’t already.

- Varun