Conceptual Search: I have performed most of the research in this field during my activity in the CONCEPT project.
Technologies of interest: Data mining and machine learning on Big Data. My areas of expertise include natural language processing, opinion mining, flat or hierarchical multidimensional clustering, and classification methods such as Adaptive Boosting, Bayesian classification, Support Vector Machines, and decision trees. Typical data sets I work with are the well-known ClueWeb09 and ClueWeb12 corpora, comprising about 60 TB of uncompressed documents harvested from the Web, and a locally hosted Virtuoso cache with over 5 billion triples from the DBpedia, Freebase, YAGO, LinkedMDB, New York Times, DrugBase, MusicBrainz, GeoNames, DBTropes, CiteSeer, and ACM data stores (about 10% of all linked data available on the Web). To manage this volume of data I use graph databases such as Neo4j, cloud-based, Lucene-like inverted-index technology such as Elasticsearch, and the MapReduce paradigm (Hadoop).
Conceptual Queries in Entity-Centric Search: According to reports published by search engines such as Yahoo!, about 50% of today's Web queries involve searching for entities. While simple keyword-based search can be handled very well with state-of-the-art Boolean search, searching for entities by means of concepts, for instance a city car, a gaming laptop, or a business cellphone, is not well supported by such techniques. Given a concept like city car, a person would immediately think of a small vehicle that is easy to park and has low fuel consumption, something like the Volkswagen Polo or the Mercedes Smart. But for a machine, such concepts are nothing more than keywords.
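The gap between keywords and concepts can be illustrated with a minimal sketch. It assumes a toy Boolean AND retrieval over three made-up documents. The function names, the documents, and the query are hypothetical and only serve to show why a keyword match on "city car" misses the entities a person would think of.

```python
# Toy illustration: Boolean AND retrieval over a tiny, made-up document collection.
# The documents and queries are hypothetical examples, not data from the project.

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    return {token.strip(".,;:!?") for token in text.lower().split()}

def boolean_and_search(query, documents):
    """Return the ids of all documents that contain every query term."""
    terms = tokenize(query)
    return [doc_id for doc_id, text in documents.items() if terms <= tokenize(text)]

documents = {
    "polo": "The Volkswagen Polo is a small vehicle, easy to park, with low fuel consumption.",
    "smart": "The Smart is a compact two-seater that fits into tight parking spots.",
    "blog": "Renting a car in the city centre can be expensive.",
}

# Only the blog post mentions both keywords; the two actual city cars are missed.
hits = boolean_and_search("city car", documents)
print(hits)
```

In this toy collection the query returns only the document that literally contains both words, while the descriptions of the Polo and the Smart, which match the concept, are not retrieved.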
The artificial intelligence (AI) community has invested a lot of work in building systems capable of reasoning much like a human. Cyc, for instance, is a well-known AI project attempting to assemble a comprehensive global ontology and knowledge base of common-sense knowledge. This would empower machines to understand concepts and make human-like reasoning possible. Unfortunately, 30 years later, after 350 man-years of effort invested in teaching Cyc common-sense knowledge, no real advances have been achieved. In contrast to such approaches, we believe that flexible, context-based knowledge (rather than one global ontology) is better suited to this task. Fostered by the massive amount of information available today, such knowledge could be learned directly from the Web.
The outcome of this project will provide essential insights into how the meaning of concepts can be learned from a large volume of noisy information, as is the case with data on the Web. This raises multiple research questions: What definition of a concept is most suitable for this task? Is an intensional representation of a concept (through typical properties) helpful for pinning down its meaning? How can property typicality be quantified? What about extensional concept representation? How can such representations be efficiently learned from huge volumes of heterogeneous data? What learning methods are suitable for these tasks?
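One naive way to make the question of property typicality concrete is sketched here. It assumes that typicality is simply the fraction of a concept's known entities (its extension) that exhibit a property. This measure, the function name, the third car, and all property values are illustrative assumptions, not the definition used in the project.

```python
from collections import Counter

def property_typicality(concept_entities, entity_properties):
    """Score each property by the fraction of the concept's entities that have it.

    Assumption for illustration: typicality = relative frequency within the
    concept's extension. Entities without known properties are ignored.
    """
    members = [e for e in concept_entities if e in entity_properties]
    if not members:
        return {}
    counts = Counter()
    for entity in members:
        counts.update(set(entity_properties[entity]))
    return {prop: n / len(members) for prop, n in counts.items()}

# Hypothetical toy data: property sets are made up for the example.
entity_properties = {
    "Volkswagen Polo": {"small", "easy to park", "low fuel consumption"},
    "Mercedes Smart": {"small", "easy to park", "two seats"},
    "Example Car C": {"small", "low fuel consumption"},
}
city_car = ["Volkswagen Polo", "Mercedes Smart", "Example Car C"]

scores = property_typicality(city_car, entity_properties)
for prop, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{prop}: {score:.2f}")
```

Under this assumption the extensional representation (the list of entities) is turned into an intensional one (weighted properties). A more realistic measure would probably also have to account for how common a property is outside the concept and for noise in the extracted data.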
Summary to date: 9 publications at international conferences, 11 Bachelor/Master theses, 3 software development projects (8 students per team and project) for building prototypes.
Student theses I coordinated for the CONCEPT project:
- Analyse des Transitivitätsproblems von Instance Matching Verfahren auf Linked Data (Analysis of the Transitivity Problem of Instance Matching Methods on Linked Data)
- Product Search by Means of Natural Language
- Extracting Ontologies for Supporting Implicit Feature Resolution from Product Reviews
- Turmo, Juan Jose
- Mining Semantic Related Terms for Product Features from Structured and Unstructured Data
- Opinion Mining & Sentiment Analysis in Reviews
- Einfluß von Typischen Entitäten auf die Festlegung von geeigneten Kategorien für den Entitätstyp (Influence of Typical Entities on Determining Suitable Categories for the Entity Type)
- Analyzing User's Point of View in Feature based Opinion Mining
- Auswirkungen von Datenqualität in Business Warehouse Umgebungen (Effects of Data Quality in Business Warehouse Environments)
- Analyse der Akzeptanz und Breitenverwendung der auf schema.org zur Verfügung stehenden Schemata im Web (Analysis of the Acceptance and Widespread Use on the Web of the Schemas Available on schema.org)
- Analyse von Paraphrasen für OpenIE Triple (Analysis of Paraphrases for OpenIE Triples)
- Establishing Proximity Boundaries for Concept Extraction in Product Reviews
Software development projects:
- Movie Genie is a system that can "read" queries about movies written in natural language, as they would be addressed to a human sales person in a video rental store. The system interprets the query, extracts hard facts such as the movie genre and soft features such as a "good story", and generates a ranked list. It considers user feedback in the form of 'I have seen this movie and liked it'/'I have seen this movie and disliked it': the movies marked this way are removed from the result list, and the ranking is recomputed taking into account what the user liked and did not like. This project won the first prize at TDSE 2012.
- Movie Miner is a Web service for navigating through movie data. It extracts typical movie features that users talk about in movie reviews on IMDb, such as 'acting performance', 'special effects', 'suspense', 'character depth', 'plot', and 'story'. It then analyses user opinion with respect to these features and displays the results in an intuitive polarity profile. This project won the first prize at TDSE 2011.
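The idea of a per-feature polarity profile, as described for Movie Miner, can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes a fixed feature list, a tiny hand-made opinion lexicon, and sentence-level attribution of opinion words to features. The lexicon, the function name, and the review sentences are hypothetical.

```python
from collections import defaultdict

# Feature names taken from the description above; the lexicon is a made-up toy.
FEATURES = ["acting performance", "special effects", "suspense", "plot", "story"]
POSITIVE = {"great", "brilliant", "gripping", "stunning"}
NEGATIVE = {"weak", "boring", "predictable", "cheap"}

def polarity_profile(sentences):
    """Count positive and negative opinion words per feature mentioned in a sentence."""
    profile = defaultdict(lambda: {"positive": 0, "negative": 0})
    for sentence in sentences:
        lowered = sentence.lower()
        words = {word.strip(".,;:!?") for word in lowered.split()}
        positive_hits = len(words & POSITIVE)
        negative_hits = len(words & NEGATIVE)
        for feature in FEATURES:
            # Simple substring match; a real system would need proper feature extraction.
            if feature in lowered:
                profile[feature]["positive"] += positive_hits
                profile[feature]["negative"] += negative_hits
    return {feature: dict(counts) for feature, counts in profile.items()}

reviews = [
    "The special effects are stunning.",
    "The plot is predictable and boring.",
    "Brilliant acting performance.",
    "The story is weak.",
]

profile = polarity_profile(reviews)
for feature, counts in profile.items():
    print(feature, counts)
```

Attributing every opinion word in a sentence to every feature in that sentence is deliberately crude. Sentences that praise one feature and criticise another would need finer-grained handling, for example by limiting opinion words to a window around the feature mention.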
Experimental data: Instance Matching Data
- Summer Semester 2014
- Summer Semester 2012/2013
- Summer Semester 2013
- Winter Semester 2012/2013
- Summer Semester 2012
- Winter Semester 2011/2012
- Summer Semester 2011
- Winter Semester 2010/2011
- Summer Semester 2010
- Winter Semester 2009/2010
- Lecture "Multimedia Databases" (teaching assistant)
- Summer Semester 2009
- Lecture "Data Warehousing and Data Mining Techniques" (teaching assistant)
- Summer Semester 2008
- Lab "Computer Network Administration" (student teaching assistant)