Title: Evaluating Conversational Agents

There has been a renewed focus on dialog systems, including non-task-driven conversational agents. Dialog is a challenging problem since it spans multiple conversational turns. To further complicate matters, there are many possible valid utterances, which may be semantically different from one another. This makes automatic evaluation difficult, which is why the current best practice for analyzing and comparing dialog systems relies on human judgments. This talk focuses on evaluation, presenting a theoretical framework for the systematic evaluation of open-domain conversational agents, including the use of Item Response Theory (Lord, 1968) for efficient chatbot evaluation and evaluation set creation. We introduce ChatEval (https://chateval.org), a unified framework for human evaluation of chatbots that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems.
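As background on the Item Response Theory component mentioned above, a minimal sketch is the standard two-parameter logistic (2PL) model; this is an illustration of IRT in general, not necessarily the exact formulation used in the talk:

    P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\bigl(-a_i(\theta_j - b_i)\bigr)}

where \theta_j is the latent quality of system j, and a_i and b_i are the discrimination and difficulty of evaluation item i. Under this kind of model, fitting the item parameters indicates which items are most informative, which is one way IRT can make chatbot evaluation and evaluation set creation more efficient.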