Text classification, a part of natural language processing, is a useful way to capture insights from the vast array of unstructured online and digital text that exists on the Web. But doing this effectively can be costly and time consuming.
Expert data scientists Michael Tamir from Galvanize, a tech education company, and Daniel Hansen from Personagraph, an audience intelligence company, discussed at the recent Predictive Analytics World event in San Francisco how Google’s Word2Vec addresses this problem.
Word2Vec, which Google open sourced in 2013, is a tool that contains a couple of algorithms and uses neural networks to learn the vector representation of a word that is useful for predicting other words in a sentence.
“What we’re doing is … running Word2Vec on a very large corpus of text, which includes 100 billion words or something on that scale,” Hansen said.
“So you have a topic [vector], which will have some length and direction. What we’ve seen in practice is that if you look at other words that have vectors in this neighbourhood of your topic vector, they’ll be highly correlated. So the vector of music will be very close to the vector of song or tune or melody, both in length and direction.”
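Hansen's "neighbourhood" intuition can be sketched with plain cosine similarity. The tiny hand-made vectors below are hypothetical stand-ins for real Word2Vec embeddings, which would have hundreds of dimensions and be learned from a large corpus rather than typed in by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, ~0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d embeddings; real Word2Vec vectors have ~300 dimensions.
vectors = {
    "music":  [0.90, 0.80, 0.10],
    "melody": [0.85, 0.75, 0.15],
    "song":   [0.80, 0.90, 0.20],
    "banana": [0.10, 0.05, 0.90],
}

# Rank the other words by how close they sit to the "music" vector.
query = vectors["music"]
ranked = sorted(
    (w for w in vectors if w != "music"),
    key=lambda w: cosine(query, vectors[w]),
    reverse=True,
)
print(ranked)  # ['melody', 'song', 'banana']
```

With real embeddings the same lookup is what surfaces "song", "tune" and "melody" as neighbours of "music".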
Tamir said that working with a space of vectors and not just a bag of individual terms means you can make use of all sorts of mathematical structures, whereas you cannot necessarily do this with a set of words.
“You’re mapping from terms into a vector space. And that vector space has typically around 300-400 dimensions.
“I can subtract vectors, I can rotate the vectors, I can look at how far one vector is from another. So by embedding these words into a vector space, we can capture a lot of structure,” he said.
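The operations Tamir lists, subtraction and distance, are what make analogies like the well-known "king − man + woman ≈ queen" result possible in Word2Vec's space. The vectors below are hypothetical 2-d toys constructed to land exactly on the analogy; real embeddings only land near it.

```python
import math

def subtract(u, v):
    """Element-wise vector difference."""
    return [a - b for a, b in zip(u, v)]

def distance(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical 2-d embeddings; real ones have 300-400 dimensions.
king, man, woman, queen = [0.8, 0.2], [0.6, 0.1], [0.6, 0.9], [0.8, 1.0]

# king - man + woman should land near queen.
analogy = [a + b for a, b in zip(subtract(king, man), woman)]
print(distance(analogy, queen))  # (near) zero with these hand-made vectors
```

A bag-of-words set of tokens supports none of this: there is no "distance" between the strings "king" and "queen" until they are embedded as vectors.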
Other approaches to text classification can be expensive, Tamir said. Using an auto-encoder for feature compression when pre-training, for example, scales on the order of the number of taxonomy nodes. This means it can get expensive once the taxonomy tree used to classify each document reaches 5000 to 10,000 trained nodes, he said.
“And you are going to need a lot of training data for each one of those nodes,” he added. “That means finding a way to pay for the data, getting the data on your own, paying an intern to do it. It can get very expensive if you want to do this directly through a traditional supervised [neural] net.”
Hansen said Word2Vec has made pre-training “very easy” and is a relatively low investment entry into deep learning for text classification.
“It avoids the expense of state of the art neural net technologies. You end up with something that is much better than a standard bag of words representation for doing classification on because you get this vector compression effect for free. And this is very nice to do on a large taxonomy. You can scale up your taxonomy very easily and you don’t have all these pre-training costs for each individual topic.”
He added that the bag of words model in natural language processing performs quite poorly when there’s a small training sample.
“Whereas with Word2Vec you can see there’s substantial lift in terms of area under the curve, [a measure of classification accuracy].”
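Area under the (ROC) curve can be read as the probability that a randomly chosen positive document is scored above a randomly chosen negative one, which is why it is a fairer yardstick than raw accuracy on small samples. The scores below are made up purely for illustration.

```python
def auc(pos_scores, neg_scores):
    """AUC via the rank interpretation: the fraction of
    (positive, negative) pairs the classifier orders correctly,
    counting ties as half-correct."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# Made-up classifier scores: 1.0 would be a perfect ranking, 0.5 is chance.
print(auc([0.9, 0.8, 0.7], [0.6, 0.4, 0.75]))  # 8 of 9 pairs ranked correctly
```

"Lift" in AUC terms is simply this number moving closer to 1.0 when Word2Vec features replace bag-of-words features.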
Label imbalance is another issue that Word2Vec can help address, Hansen said.
“You have a model that is 95 per cent accurate on a balanced test set, and say you have 100 examples of ‘class struggle’ documents and 100 random documents and you can classify 95 per cent of them correctly.
“If you then use that model in the wild where ‘class struggle’ related documents are one in 10,000, your accuracy will drop significantly to less than 1 per cent. It’s a fundamental problem with imbalance; it hampers any success you can get with text classification.”
He compared Word2Vec (W2V) with bag of words (BOW) on precision and recall, combined in the F1 score, a common measure of a model’s classification performance. With 300 features, BOW received an F1 score of about 93 per cent, which dropped significantly to about 88 per cent when subsetting down to 10 features.
“As we do further feature selection we sharpen those distributions up, which is necessary to deal with the imbalance problem. However, in doing so we significantly reduce our F1 score. So that’s unfortunate, and if you’re going to work with a bag of words representation it’s an intrinsic struggle to deal with,” Hansen said.
When applying W2V with the same feature counts, the F1 score is higher overall and doesn’t drop significantly when the features are subset. With 300 features, it received an F1 score of about 96 per cent, and with 10 features, about 94 per cent.
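For reference, F1 is the harmonic mean of precision and recall, which is why it penalises a model that trades one away for the other. The values below are illustrative only, not the actual figures behind the scores reported in the talk.

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative values only -- not the figures behind the talk's results.
print(round(f1(0.96, 0.96), 2))  # 0.96: balanced precision and recall
print(round(f1(0.99, 0.50), 2))  # 0.66: high precision can't rescue low recall
```

The harmonic mean sits much closer to the weaker of the two numbers, so a high F1 certifies that both precision and recall held up.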
“The results are competitive with a highly optimised bag of words representation. And in the situation where you have a small number of training examples, it completely dominated; it beat it by a mile.”