Written by Evan Hu
Text classification is a common task in natural language processing. The main approach centers on representing the text in a meaningful way, whether through TF-IDF, Word2Vec, or more advanced models like BERT, and training models on those representations as labeled inputs. Sometimes, however, labeling the data is impractical, or there simply isn't enough labeled data to build an effective multi-class model. In those cases, we have to fall back on unsupervised methods to accomplish the classification task.
This article will outline the process we used to build an unsupervised text classifier for the dataset of interview questions at Interview Query. This is valuable to us for several reasons.

First, we want to offer our users more insightful information about the companies they are applying to, as well as the ability to practice only certain question types. Most importantly, it enables us to "characterize" different companies by the types of questions that they ask.
Our task is to classify a given interview question into one of nine categories: machine learning, statistics, probability, Python, product management, SQL, A/B testing, algorithms, or take-home. The most practical approach was to first extract as many relevant keywords as possible from the corpus, then manually assign the resulting keywords to "bins" corresponding to our desired classifications. Finally, we would iterate through each interview question in the dataset and compare the keyword counts in each bin in order to classify it.
We also considered Latent Dirichlet Allocation, which generates topic models and retrieves keywords for each topic without manual assignment, as well as K-means clustering. Both proved harder to apply and less effective than simply counting keywords, given the wide and disparate range of our categories.
First, the data had to be cleaned and preprocessed. We used spaCy to tokenize, lemmatize, lowercase, and remove stop words from the text.
Then came the problem of selecting a method to extract keywords from the corpus. Since the corpus consisted of a massive number of small "documents," each one a different interview question, we extracted keywords from each document separately rather than combining the data and sorting the unique keywords from the resulting list by frequency.
Then, testing began. We considered various methods: TF-IDF and RAKE, as well as SGRank, YAKE, and TextRank. We were also curious enough to try Amazon Comprehend, an AutoML solution, to see how competent it was.
Unfortunately, Comprehend's results were unsatisfactory: its high level of abstraction left no way to tune it to a task this granular. In the end, after comparing the keywords produced by each method, SGRank proved the most effective, yielding the highest number of relevant keywords.
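SGRank itself is available through the `textacy` library; as a rough stand-in that illustrates the idea of scoring keywords per document, here is a plain TF-IDF sketch over preprocessed questions (the token lists and `top_n` cutoff are illustrative, not the production setup):

```python
import math
from collections import Counter

def tfidf_keywords(docs: list[list[str]], top_n: int = 2) -> list[list[str]]:
    """Score each token by TF-IDF within its document and keep the top terms.

    `docs` is a list of token lists, one per preprocessed question."""
    n = len(docs)
    # Document frequency: how many questions contain each token.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        }
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return results
```

Terms that appear in many questions (high document frequency) get discounted, so each question surfaces the words that distinguish it.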
Finally, we sorted unique keywords by frequency in order to get the most salient ones.
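With per-question keyword lists in hand, the frequency sort is a short `collections.Counter` pass (the sample lists below are made up):

```python
from collections import Counter

# Hypothetical per-question keyword lists from the extraction step.
extracted = [
    ["sql", "join", "table"],
    ["probability", "coin", "flip"],
    ["sql", "query", "table"],
]

# Flatten, count, and rank the unique keywords by frequency.
freq = Counter(kw for doc in extracted for kw in doc)
top_keywords = [kw for kw, _ in freq.most_common()]
```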
The result was around 1,900 candidate keywords, which we then manually combed through, assigning the 200 most relevant to our bins.
Lastly, with the final list of categorized keywords, we can classify each interview question into one of the nine categories by counting the appearance of keywords in each question. We then generated "personality" profiles for different companies, which are displayed on our website.
Thanks for reading! Check out our new articles on hiring data scientists and the ByteDance Data Scientist Interview and be sure to visit Interview Query to join the thousands of data scientists practicing for their next interview today.