Enabling semantically rich query paradigms is one of the core challenges of current information systems research. In this context, analogy queries are of particular interest because of their importance and ubiquity in natural language. Recent developments in natural language processing and machine learning have produced some very promising algorithms relying on deep-learning-based neural word embeddings, which might contribute to finally realizing analogy queries. However, it is still unclear how well these algorithms work from a semantic point of view. One problem is that there is still no clear consensus on the intended semantics of analogy queries. Furthermore, no suitable benchmark dataset is available that respects the semantic properties of real-life analogies. Therefore, in this paper, we discuss the challenges of benchmarking the semantics of analogy query algorithms, with a special focus on neural embeddings. We also introduce the AGS analogy benchmark dataset, which rectifies many weaknesses of established datasets. Finally, our experiments evaluating state-of-the-art algorithms underline the need for further research in this promising field.