With more organisations looking beyond structured data to see the bigger picture or fill in more pieces of the puzzle, harnessing unstructured data is becoming a top focus area for driving business value. However, nothing good is easy.
Speaking at the CeBIT conference in Sydney, CIO of the Australian Crime Commission (ACC), Dr Maria Milosavljevic, pointed out several perils in analysing unstructured textual data.
Projects in this field often fail because the analysis is not tied to particular problems, Milosavljevic said. “What you need to do is… work out what it is that you are trying to achieve. There are many things that might be interesting to find in text, but are they actually useful to you and for your business?
“You might be told that extracting entities from your text is all you need to do, but it’s not. Tailoring text analytics to your business needs is what it’s all about. If you walk away today understanding that there is no one size fits all to the problem, then I’ve done my job.”
Using an example of text from a newspaper, Milosavljevic pointed out the problems with relying on rules for analysis.
“You might say ‘I’m going to go look for capitalised words [for names of people, places and companies]’. That might work, to some extent. But first words of sentences tend to be capitalised, so are they all names? You might say ‘let’s only look for the ones that have the title before it’. But what about Mr or Mrs average? And are they a name of a person or place? There are many companies that are people’s names, so how do you know the difference?”
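The two rules Milosavljevic describes can be sketched in a few lines of Python. This is a minimal illustration only, not a system she presented: the function names, the list of titles and the sample sentence are all assumptions made for the example.

```python
import re

def capitalised_candidates(text):
    """Rule 1: treat every capitalised word as a possible name."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)

def titled_candidates(text):
    """Rule 2: keep only capitalised words that follow a title.

    The list of titles is an assumption made for this sketch.
    """
    return re.findall(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+([A-Z][a-z]+)", text)

# Hypothetical sample text, invented for illustration.
sample = "Police said Dr Jones visited Sydney. Mr Average pays more tax."

# Rule 1 also picks up the sentence-initial word "Police" and the titles.
print(capitalised_candidates(sample))

# Rule 2 drops "Police" but also drops "Sydney", and still accepts "Average".
print(titled_candidates(sample))
```

Neither rule can tell a person from a place or a company, which is the ambiguity she raises.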
Annotating text is another approach to analysing unstructured textual data. However, there’s a huge amount of work involved in this, Milosavljevic said.
“In the end it’s a lot better than maintaining a whole lot of rules. But the problem with that approach is if your system has never seen something before, it’s going to assume it looks like something else. The example I’ve got is Tim Tams - it’s not a person but the system thinks it is.”
Sentiment analysis can often prove useful when analysing unstructured textual data. However, Milosavljevic said it can also be “highly inaccurate”.
“My favourite example is a movie review that said a film was ‘wonderfully horrid’. Wonderful is a positive word, horrid is a bad word. But ‘wonderfully horrid’, that’s interesting. And a machine is not going to get all that right, but you all did because you laughed.”
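A naive word-counting scorer shows why a phrase like this trips up a machine. The sketch below is an assumption about how a very simple lexicon-based approach might work, not a description of any tool mentioned in the talk. The word lists and the function name are invented for the example.

```python
# Tiny hand-made word lists, assumed purely for illustration.
POSITIVE = {"wonderful", "wonderfully", "great"}
NEGATIVE = {"horrid", "awful"}

def lexicon_score(text):
    """Add 1 for each positive word and subtract 1 for each negative word."""
    words = [w.strip(".,!?'\"") for w in text.lower().split()]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

# The positive and negative words cancel out, so the score is 0
# and the reviewer's actual opinion is lost.
print(lexicon_score("The film was wonderfully horrid"))
```

Counting words in isolation ignores how "wonderfully" modifies "horrid", which is exactly the part a human reader picks up.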
Despite the many difficulties in harnessing unstructured data, it’s a challenge worth pursuing as it can offer much value to any organisation, Milosavljevic said. Unstructured data gives the ACC “additional pieces of the puzzle that we can’t see in other data”, and language technology is the “holy grail of artificial intelligence”, she said.
“The biggest risk, I believe, that we have is we can share but we don’t have the ability to exploit it. If something happens and we had the information available but weren’t in fact able to join the dots, then that would not be good and your safety would be at risk. That means we have to get better at finding the needles in the haystack, and unstructured data is a big part of helping with that problem.”