Most buzzwords circulating right now describe attention-grabbing products: virtual reality headsets, smart watches, internet-connected toasters. Big Data is the prime example: many firms market themselves by association with the term and its technologies while it is ‘of the moment’, but are they really innovating, or simply adding marketing hype to their existing technology? Just how ‘big’ is their Big Data?
On the face of it, one would expect semantic technology to face similar problems; in fact, the underlying technology calls for a much more subtle approach. It is at its best when it is transparent: built into the tools that analyse, categorise and retrieve content and data before anything is even displayed to the end user. While this means it may not generate as much short-term media buzz, it is profoundly changing the way we use the internet and interact with content and data.
This is much bigger than Big Data. But what is semantic technology? Broadly speaking, semantic technologies encode meaning into content and data to enable a computer system to possess human-like understanding and reasoning. There are a number of different approaches to semantic technology, but for the purposes of this article we’ll focus on ‘Linked Data’. In general terms this means creating links between the data points within documents and other data containers, rather than between the documents themselves. It is in many ways similar to what Tim Berners-Lee did in creating the standards by which we link documents, just on a more granular scale.
Existing text analysis techniques can identify entities within documents. For example, in the sentence “Haruhiko Kuroda, governor of Bank of Japan, announced 0.1 percent growth,” ‘Haruhiko Kuroda’ and ‘Bank of Japan’ are both entities, and they are ‘tagged’ as such using a specialised markup language. These tags simply highlight that the text has some significance; it is still up to the human user to understand what the tags mean.
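As a rough sketch of that first step, an off-the-shelf named-entity recogniser can produce this kind of tagging automatically. The snippet below uses the open-source spaCy library purely as an illustration; it is not a description of Ontotext’s own pipeline, and the exact labels depend on the model used.

```python
# Minimal entity-tagging sketch using spaCy (an illustrative choice of library).
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model; assumed to be installed
doc = nlp("Haruhiko Kuroda, governor of Bank of Japan, announced 0.1 percent growth.")

for ent in doc.ents:
    # Each entity gets a coarse type label such as PERSON or ORG; the label
    # itself carries no meaning the system can reason with.
    print(ent.text, ent.label_)

# Typical (model-dependent) output:
#   Haruhiko Kuroda   PERSON
#   Bank of Japan     ORG
#   0.1 percent       PERCENT
```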
Once tagged, entities can then be recognised and have information from various sources associated with them. Groundbreaking? Not really. It’s easy to tag content so that the system knows ‘Haruhiko Kuroda’ is a type of ‘person’, but this still requires human input.
Where semantics gets more interesting is in the representation and analysis of the relationships between these entities. Using the same example, the system is able to create a formal, machine-readable relationship between Haruhiko Kuroda, his role as the governor, and the Bank of Japan.
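What that might look like in practice: the relationship can be written down as machine-readable triples. The sketch below uses the Python rdflib library with an invented example.org namespace; the names and properties are illustrative, not a real Ontotext schema.

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/")  # illustrative namespace, not a published vocabulary

g = Graph()

# "Haruhiko Kuroda is governor of the Bank of Japan" becomes data rather than
# highlighted text: each fact is a subject-predicate-object triple.
g.add((EX.Haruhiko_Kuroda, RDF.type, EX.Person))
g.add((EX.Bank_of_Japan, RDF.type, EX.Bank))
g.add((EX.Haruhiko_Kuroda, EX.holdsRole, EX.Governor))
g.add((EX.Haruhiko_Kuroda, EX.governorOf, EX.Bank_of_Japan))

print(g.serialize(format="turtle"))
```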
For this to happen, the surrounding environment must first be defined. For the system to understand that ‘governor’ is a ‘job’ that exists within the entity ‘Bank of Japan’, a rule must exist that states this in the abstract. A collection of such rules is called an ontology.
Think of an ontology as the rule-book: it describes the world in which the source material exists. If semantic technology were used in the context of pharmaceuticals, the ontology would be full of information about classifications of diseases, disorders, body systems and their relationships to each other. If the same technology were used in the context of the football World Cup, the ontology would contain information about footballers, managers, teams and the relationships between those entities.
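Continuing the earlier sketch, a fragment of such a rule-book can itself be expressed as triples. Again, the class and property names below are invented for illustration rather than drawn from any real ontology Ontotext uses.

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")  # same illustrative namespace as before

ontology = Graph()

# The rule-book, part one: the kinds of things that exist in this world.
ontology.add((EX.Person, RDF.type, RDFS.Class))
ontology.add((EX.Organisation, RDF.type, RDFS.Class))
ontology.add((EX.Job, RDF.type, RDFS.Class))
ontology.add((EX.Bank, RDFS.subClassOf, EX.Organisation))
ontology.add((EX.Governor, RDFS.subClassOf, EX.Job))

# Part two: how those things may relate. 'governorOf' links a Person to a Bank.
ontology.add((EX.governorOf, RDF.type, RDF.Property))
ontology.add((EX.governorOf, RDFS.domain, EX.Person))
ontology.add((EX.governorOf, RDFS.range, EX.Bank))
```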
What happens when we put this all together? We can begin to infer relationships between entities that have never been directly linked by human action.
An example: a visitor arrives on the website of a newspaper and would like information about bank governors in Asia. Semantic technology allows the website to return a much more sophisticated set of results from the initial search query. Because the system has an understanding of the relationships defining bank governors generally (via the ontology), it is able to leverage the entire database of published text content in a more sophisticated way, capturing relationships that automated text analysis alone would have overlooked. The result is that the user is provided with content more closely aligned to what they are already reading.
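To make that concrete, here is the sort of query the newspaper’s site could run behind the scenes, reusing the illustrative graph and names from the earlier snippets and assuming an extra location fact (e.g. that the Bank of Japan is located in Asia) has been loaded.

```python
# Hypothetical query over the illustrative graph built above; ex:locatedIn facts
# are assumed to have been added alongside the governor triples.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person ?bank WHERE {
        ?person ex:governorOf ?bank .
        ?bank   ex:locatedIn  ex:Asia .
    }
""")

for person, bank in results:
    print(person, bank)
```

Any article tagged with an entity in the result set can then be surfaced alongside the one the visitor is already reading.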
Read the sentence again and answer the question: “What is a ‘Haruhiko Kuroda’?” To a human the answer is obvious. He is several things: human, male, and governor of the Bank of Japan. It is this type of analytical thought process, this ability to assign traits to entities and then use those traits to infer relationships between new entities, that has so far eluded computer systems. The technology allows the inference of relationships that are not explicitly stated within the source material: because the system knows that Haruhiko Kuroda is governor of the Bank of Japan, it is able to infer that he works with other employees of the Bank of Japan, and that he lives in Tokyo, which is in Japan, which is a set of islands in the Pacific.
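The same chain of reasoning can be sketched as a query that follows links any number of steps, using a SPARQL 1.1 property path over the illustrative graph from the earlier snippets (the location triples are assumed, as before).

```python
# Follow ex:locatedIn links transitively (the '+' path operator): bank -> Tokyo
# -> Japan -> Pacific, even though no single triple states those distant facts.
for (place,) in g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?place WHERE {
        ex:Haruhiko_Kuroda ex:governorOf ?bank .
        ?bank ex:locatedIn+ ?place .
    }
"""):
    print(place)
```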
Companies such as the BBC, which Ontotext has worked with, are sitting on more text data than they have ever had before. This is hardly unique to the publishing industry, either. According to Eric Schmidt, former Google CEO and executive chairman of Alphabet, every two days we create as much information as was generated from the dawn of civilisation up until 2003 – and he said that in 2010. Five years on, businesses of all sizes are waking up to this fact: they must invest in the infrastructure to take full advantage of their own data.

You may not be aware of it, but you are already using semantic technology every day. Take Google search as an example: when you input a search term, for example ‘Bulgaria’, two columns appear. On the left are the actual search results, and on the right are semantic search results: information about the country’s flag, capital, currency and more, pulled from various sources based on semantic inference.
Written by Jarred McGinnis, UK managing consultant at Ontotext