Semantic Search – For Reals?

Recently, a project I am involved in at Stanford participated in and won the Semantic Web Challenge at the International Semantic Web Conference. The Semantic Web Challenge is a competition for Semantic Web applications. For the uninitiated, the Semantic Web “is a group of methods and technologies to allow machines to understand the meaning – or ‘semantics’ – of information on the World Wide Web”. Essentially, the Semantic Web hopes to move web applications beyond simple syntax to actually supporting the “meaning” behind content.

In the challenge, the focus is not so much on the technical or research aspects of the tools, but more on their usefulness for target users. The challenge helps demonstrate what semantic technologies can bring to society.

Our application, NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources, is a semantic search application for biomedical researchers (check out the video below). I designed and developed the user experience and interface.

The application allows researchers in the biomedical sciences to perform “concept-based search” over 22 different biomedical resource databases. Rather than typing in keywords, a researcher types in concepts from their domain of interest. Those concepts come from ontologies, the backbone of the Semantic Web. For the sake of simplicity, an ontology is a structured terminology that describes the terms within a specific domain, the properties of those terms, and the relationships between them.
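To make that definition a bit more concrete, here is a minimal sketch of an ontology as a data structure. Everything in it is an assumption made for illustration: the `Term` class, the IDs, and the two hand-made terms are hypothetical and are not taken from any real BioPortal ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Term:
    """One concept in a toy ontology (hypothetical structure)."""
    term_id: str
    label: str
    synonyms: List[str] = field(default_factory=list)  # a property of the term
    parents: List[str] = field(default_factory=list)   # "is-a" links to broader terms


# A tiny hand-made ontology, for illustration only.
ONTOLOGY: Dict[str, Term] = {
    "T1": Term("T1", "cancer", synonyms=["malignant neoplasm"]),
    "T2": Term("T2", "breast cancer", synonyms=["breast carcinoma"], parents=["T1"]),
}


def ancestors(term_id: str, ontology: Dict[str, Term]) -> List[str]:
    """Return the IDs of every broader term reachable through parent links."""
    seen: List[str] = []
    stack = list(ontology[term_id].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(ontology[current].parents)
    return seen


print(ancestors("T2", ONTOLOGY))  # the broader terms of "breast cancer"
```

The terms, their properties (label, synonyms), and the relationships between them (parent links) are the three ingredients named in the definition above.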

In terms of user interaction and searching, the behavior is quite similar to conventional search engines like Google. The difference is that when you select a term from the auto-completion drop-down, that term comes from an ontology that was used to index elements within the various biomedical resources. I also included helpful tag clouds (see figure below) for visualizing concepts related to your current search query, and I use color intensity to highlight the resources most relevant to your current search terms.
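As a rough sketch of the auto-completion idea, the function below suggests ontology terms whose labels start with what the user has typed. It is not the application's actual implementation; the `autocomplete` function, the term IDs, and the labels are all made up for illustration.

```python
from typing import Dict, List, Tuple


def autocomplete(prefix: str, terms: Dict[str, str], limit: int = 5) -> List[Tuple[str, str]]:
    """Return (term_id, label) pairs whose label starts with the prefix, ignoring case."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    matches = [
        (term_id, label)
        for term_id, label in terms.items()
        if label.lower().startswith(needle)
    ]
    matches.sort(key=lambda pair: pair[1])  # alphabetical by label
    return matches[:limit]


# Hypothetical term IDs and labels.
TERMS = {"T1": "cancer", "T2": "breast cancer", "T3": "breast carcinoma", "T4": "melanoma"}

print(autocomplete("bre", TERMS))
```

The key point is that each suggestion carries a term ID, so the search that follows runs against a concept rather than a raw string.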


The power of the search tool comes from how the index gets constructed. The National Center for Biomedical Ontology maintains a project called BioPortal. This project is an open library of more than 200 ontologies in biomedicine. Using the terms from these ontologies as our term-base, we automatically annotate or “tag” textual descriptions of the data residing within the elements of the 22 biomedical resources. These could be things like patient records, gene expression data, research articles, clinical trials, etc.
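Here is a minimal sketch of what automatic annotation could look like, assuming a simple whole-phrase match of term labels against an element's text description. The real annotator is certainly more sophisticated; the `annotate` function, the term IDs, and the sample description are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def annotate(element_id: str, text: str, term_labels: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (term_id, element_id) pairs for every label found as a whole phrase in the text."""
    annotations = []
    for term_id, label in term_labels.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            annotations.append((term_id, element_id))
    return annotations


# Hypothetical term base and data element.
TERM_LABELS = {"T1": "cancer", "T2": "breast cancer", "T4": "melanoma"}
description = "A phase II trial of a new therapy for breast cancer."

print(annotate("trial-001", description, TERM_LABELS))
```

Each resulting pair is one annotation: a link from an ontology term to a data element.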

These annotations act as a link connecting an ontology term to a data element. The really useful part of this process, and also where the semantics begin to play a role, is that we can use the structure of the ontology to expand these annotations. So, rather than getting a simple keyword like “breast cancer” mapping to a particular clinical trial, we also know any synonyms of “breast cancer” described in the ontology. We can also use the hierarchy of the ontology to link more general or specific terms to the resource, like “cancer” or “melanoma”. Finally, we can use mappings between multiple ontologies to discover other related terms.
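The expansion step can be sketched roughly as follows, assuming the hierarchy is stored as child-to-parent links and the cross-ontology mappings as simple lookups. The function name, the toy hierarchy, and the mapped term are hypothetical, and treating a mapped term as being at the same distance as its source is my simplification for this sketch.

```python
from typing import Dict, List


def expand_annotation(
    term_id: str,
    parents: Dict[str, List[str]],
    mappings: Dict[str, List[str]],
) -> Dict[str, int]:
    """Return {term: distance} for the annotated term, its ancestors, and mapped terms."""
    expanded = {term_id: 0}
    frontier = [term_id]
    # Walk up the hierarchy breadth-first, recording how far each ancestor is.
    while frontier:
        current = frontier.pop(0)
        for parent in parents.get(current, []):
            if parent not in expanded:
                expanded[parent] = expanded[current] + 1
                frontier.append(parent)
    # Assumption: a mapped term inherits the distance of the term it maps from.
    for known in list(expanded):
        for mapped in mappings.get(known, []):
            expanded.setdefault(mapped, expanded[known])
    return expanded


# Hypothetical hierarchy and cross-ontology mapping.
PARENTS = {"breast cancer": ["cancer"], "cancer": ["disease"]}
MAPPINGS = {"breast cancer": ["OTHER:breast neoplasm"]}

print(expand_annotation("breast cancer", PARENTS, MAPPINGS))
```

With the expanded set stored in the index, a search for the broader term “cancer” can reach an element that was only ever tagged with “breast cancer”.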

This is truly where the power of ontologies can be seen and where a semantic approach to search can be very useful. For example, searching for the keywords “retroperitoneal neoplasm” within the Gene Expression Omnibus website will return zero results. However, the same search in our tool will retrieve relevant results annotated with “pheochromocytoma”, a child term of “retroperitoneal neoplasm” in the NCI Thesaurus. Results are scored based on the distance of the matching annotation from a given search concept.
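To illustrate distance-based scoring, here is a small sketch. The formula `1 / (1 + distance)` is an assumption chosen for the example, not the scoring function the tool actually uses, and the element IDs are hypothetical.

```python
from typing import Dict, List, Tuple


def score(distance: int) -> float:
    """Assumed scoring rule: a direct match scores 1.0, farther matches score less."""
    return 1.0 / (1 + distance)


def rank(hits: List[Tuple[str, int]]) -> List[Tuple[str, float]]:
    """Keep the best score per element and sort from most to least relevant."""
    best: Dict[str, float] = {}
    for element_id, distance in hits:
        candidate = score(distance)
        if candidate > best.get(element_id, 0.0):
            best[element_id] = candidate
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# (element, distance between the matching annotation and the search concept)
hits = [("dataset-A", 1), ("dataset-B", 0), ("dataset-A", 2)]

print(rank(hits))
```

Here `dataset-B` ranks first because it was annotated with the search concept itself, while `dataset-A` was reached through a term one step away in the hierarchy.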

Another big advantage of our application is that the ontologies form a “semantic bridge” between very different biomedical resources. A single search automatically gives you access to 22 different databases, allowing a researcher to explore relationships between things like gene expressions and clinical trials relevant to a specific concept. Without this, researchers are forced to open up multiple web pages and search each database independently.
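The “semantic bridge” can be pictured as a single index keyed by concept, where each entry remembers which resource a data element came from. The sketch below assumes that layout; the resource names and element IDs are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def search(concept: str, index: Dict[str, List[Tuple[str, str]]]) -> Dict[str, List[str]]:
    """Group the data elements annotated with a concept by the resource they come from."""
    grouped = defaultdict(list)
    for resource, element_id in index.get(concept, []):
        grouped[resource].append(element_id)
    return dict(grouped)


# Hypothetical index: concept -> list of (resource, element) pairs.
INDEX = {
    "breast cancer": [
        ("gene-expression-db", "dataset-17"),
        ("clinical-trials-db", "trial-001"),
        ("gene-expression-db", "dataset-42"),
    ],
}

print(search("breast cancer", INDEX))
```

One lookup returns hits from every resource at once, which is what saves the researcher from querying each database separately.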

Of course, all these annotations take a long time to index, and the result is a boatload of data. The current index is stored in a 1.5 terabyte MySQL database that contains 16.4 billion annotations, 2.4 million ontology terms, and 3.5 million data elements (stylized graphic below). Other members of the team have worked to figure out clever ways to do a lot of this indexing rather efficiently. You can read about it in Paea LePendu’s paper Optimize First, Buy Later: Analyzing Metrics to Ramp-up Very Large Knowledge Bases.
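For a sense of scale, here is some back-of-the-envelope arithmetic on those figures. It assumes a decimal terabyte (10^12 bytes) and treats the whole database as if it held nothing but annotations, so the bytes-per-annotation number is only a rough upper bound.

```python
# Figures quoted above.
annotations = 16_400_000_000    # 16.4 billion annotations
data_elements = 3_500_000       # 3.5 million data elements
database_bytes = 1.5 * 10**12   # 1.5 TB, assuming decimal terabytes

annotations_per_element = annotations / data_elements
bytes_per_annotation = database_bytes / annotations

print(f"about {annotations_per_element:,.0f} annotations per data element")
print(f"about {bytes_per_annotation:.0f} bytes of storage per annotation")
```

That works out to several thousand annotations per data element on average, which gives a sense of how much the expansion step multiplies the direct matches.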


We are working to include more resources in the index and also speed up the search support. If you’re interested in playing with the application, check it out in the BioPortal integration.

About the author

Sean Falconer

I write about programming, developer relations, technology, startup life, occasionally Survivor, and really anything that interests me.