What Search Engines Can’t Do. Holistic Entity Search on Web Data

TitleWhat Search Engines Can’t Do. Holistic Entity Search on Web Data
Publication TypeThesis
Year of Publication2015
AuthorsHomoceanu, S.
Academic DepartmentCarl-Friedrich-Gauß-Fakultät
UniversityTechnische Universität Braunschweig
Thesis TypeDoctoral Thesis

More than 50% of all Web queries are entity related. Users search either for entities or for entity information. Still, search engines do not accommodate entity-centric search very well.

Building on the concept of the semiotic triangle from cognitive psychology, which models entity types in terms of intensions and extensions, we identified three types of queries for retrieving entities: type-based queries - searching for entities of a given type, prototype-based queries - searching for entities having certain properties, and instance-based queries - searching for entities being similar to a given entity. For type-based queries we present a method that combines query expansion with a self-supervised vocabulary learning technique built on both structured and unstructured data. Our approach is able to achieve a good tradeoff between precision and recall. For prototype-based queries we propose ProSWIP, a property-based system for retrieving entities from the Web. Since the number of properties given by the users can be quite small, ProSWIP relies on direct questions and user feedback to expand the set of properties to a set that captures the user’s intentions correctly. Our experiments show that within a maximum of four questions the system achieves perfect precision of the selected entities. In the case of instance-based queries the first challenge is to establish a query form that allows for disambiguating user intentions without putting too much cognitive pressure on the user. We propose a minimalistic instance-based query comprising the example entity and intended entity type. With this query and building on the concept of family resemblance we present a practical way for retrieving entities directly from the Web. Our approach can even cope with queries which have proven problematic for benchmark tasks like related entity finding. Providing information about a given entity, entity summarization is another kind of entity-centric query. Google’s Knowledge Graph is the state of the art for this task. But relying entirely on manually curated knowledge bases, the Knowledge Graph does not include all new and less known entities. We propose to use a data-driven approach. Our experiments on real-world entities show the superiority of our method.

We are confident that mastering these four query types enables holistic entity search on Web data for the next generation of search engines.

Diss_Homoceanu_Silviu.pdf4.6 MB