02 Jul Recent Evolution of QA Datasets and Going Forward
Since the early 1960s, scientists have worked on using computers to answer textual questions, yet the problem remains unsolved. In 2011, IBM’s Watson, a Question Answering (QA) computer system capable of answering questions posed in natural language, entered the quiz show Jeopardy! and defeated two former champions. Nevertheless, scientists were unsatisfied with the 2011 Watson technology, mainly because it lacked learning. Today, they are much more ambitious about solving QA tasks through learning.
General open-domain QA has long been a research topic, but it has gained enormous interest of late. This is easily seen by examining past and current QA datasets: more than ten have been released by various academic groups since 2015. In this post, I will summarize and expound on the evolution of recent QA datasets, with a bonus at the end.
Broadly speaking, we have learned that there are two paradigms that can be used in answering questions. One paradigm is knowledge-based QA and the other is information retrieval (IR)-based QA. These two paradigms are quite self-explanatory. Interestingly, by observing the evolution of QA datasets, one can recognize the trend moving from knowledge-based QA to IR-based QA over time.
SimpleQA [1] contains 100K QA pairs and relies on Freebase as the knowledge base to query when answering questions. Several other Knowledge-Base (KB) datasets have been proposed [2,3,4,18], yet the complexity and generalizability of their questions and answers are extremely limited.
Some of the QA datasets were built for reading comprehension research. Researchers were interested in teaching machines to comprehend the contents of documents. The obvious way to measure this is to pose different forms of questions and measure the accuracy of the answers. For example, the Children’s Book Test, CNN/Daily Mail, and WebQA [5,6,7] tasks take a fill-in-the-blank form, while MCTest and Science [8,9] are multiple-choice QA datasets. The drawback of these datasets is that the questions are not in the natural form that people would ask in the real world.
SQuAD [10] is another reading comprehension dataset; it requires answering questions from given Wikipedia passages (i.e. each of the N data points is a question, answer, document triple (q,a,d)_i). The dataset consists of 100K question-answer pairs over 536 Wikipedia articles in total. The questions are posed by crowdworkers. Most of the questions are fact-based, the answers span only a few words (probably about 1.x tokens on average), and the answer can always be found in the corresponding passage. You can see the state-of-the-art scores on their leaderboard.
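To make the (q,a,d) format concrete, here is a minimal sketch of how one such data point could be represented and how the answer span could be located in the passage. The class name, field names, and the example text are my own illustrative assumptions, not the format of the actual SQuAD release.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class QAExample:
    """One (q, a, d) data point. Field names are illustrative assumptions."""
    question: str  # q
    answer: str    # a
    document: str  # d


def find_answer_span(example: QAExample) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the answer in the document.

    Returns None when the answer string does not occur in the document.
    """
    start = example.document.find(example.answer)
    if start == -1:
        return None
    return start, start + len(example.answer)


# Toy example written for this post, not taken from the dataset.
example = QAExample(
    question="Which quiz show did Watson enter in 2011?",
    answer="Jeopardy!",
    document="In 2011, IBM's Watson entered the quiz show Jeopardy! "
             "and defeated two former champions.",
)
span = find_answer_span(example)
```

Because the answer is always a substring of the passage in this setting, a model only has to predict the two offsets in `span` rather than generate free text.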
NewsQA [11] is similar to SQuAD. It was created by Maluuba and contains 120K question-answer pairs based on the news articles gathered by DeepMind. These questions are also crowdsourced for the purpose of reading comprehension tasks. One catch is that the questions were written by looking only at the summary points of the documents rather than at the full documents.
MS MARCO [12] contains 100,000 questions and 1M passages (512,471 unique URLs). Each question comes with the top 10 contextual passages extracted from public web documents. The questions are derived from search logs, and the answers are written by crowdworkers rather than taken as spans of a passage. Note that these passages are not from search engine results. Interestingly, one of the selling points in their paper was the plan to release 1M queries (questions) and the corresponding answers in the future. You can see the state-of-the-art scores on their leaderboard.
The common factor between these reading comprehension datasets is that some of the answers must be inferred from incomplete information in the articles. Interestingly, performance has improved a lot since “Machine Comprehension Using Match-LSTM and Answer Pointer” [19] introduced the spanning mechanism (answer pointer), and many other models have adopted it. It is still questionable whether the spanning mechanism is the correct way to go, especially as the complexity of the questions increases. Additional gains have also come from using both character-level and word-level embeddings rather than just word-level embeddings. In any case, performance on these datasets has rapidly improved (see figures below). In response, the difficulty of QA tasks has been raised by moving real-world data from reading comprehension tasks to open-domain QA.
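As a rough illustration of the spanning idea, the sketch below picks the answer span from per-token start and end probabilities by maximizing their product, subject to the end not preceding the start. This is only a simple decoding step in the spirit of an answer pointer, not the model from the paper; the function name, the `max_len` cap, and the probabilities are made-up assumptions.

```python
from typing import List, Tuple


def best_span(start_probs: List[float],
              end_probs: List[float],
              max_len: int = 10) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_probs[s] * end_probs[e].

    Only spans with s <= e and at most max_len tokens are considered.
    """
    best = (0, 0, -1.0)
    for s, p_start in enumerate(start_probs):
        for e in range(s, min(s + max_len, len(end_probs))):
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical model outputs for a seven-token passage.
tokens = ["Watson", "entered", "the", "quiz", "show", "Jeopardy", "!"]
start_probs = [0.05, 0.05, 0.05, 0.10, 0.05, 0.65, 0.05]
end_probs = [0.05, 0.05, 0.05, 0.05, 0.10, 0.30, 0.40]

s, e, score = best_span(start_probs, end_probs)
answer = " ".join(tokens[s:e + 1])
```

The sketch also shows the limitation mentioned above: the output can only ever be a contiguous piece of the passage, so any answer that needs to be composed or rephrased is out of reach.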
Open Domain Question Answering
The common factor of the next two datasets is that the question-answer pairs come with search engine snippets. Also, they were released only a few months ago.
SearchQA [13] takes its questions from the quiz show Jeopardy! rather than generating them through crowdsourcing. It has 140K question-answer pairs, each with 49.6 Google search snippets on average. The answers are about 1.47 tokens long. The dataset is also split so that the training question-answer pairs come from years before the validation and test pairs.
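A split by time like this can be sketched in a few lines. The `year` field, the cutoff years, and the ordering of validation before test are assumptions made for illustration; the source only states that the training pairs come from earlier years than the validation and test pairs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatedQAPair:
    question: str
    answer: str
    year: int  # assumed field: the year the question aired


def split_by_year(pairs: List[DatedQAPair],
                  val_start: int,
                  test_start: int) -> Dict[str, List[DatedQAPair]]:
    """Put pairs before val_start in train, before test_start in val, rest in test."""
    splits: Dict[str, List[DatedQAPair]] = {"train": [], "val": [], "test": []}
    for pair in pairs:
        if pair.year < val_start:
            splits["train"].append(pair)
        elif pair.year < test_start:
            splits["val"].append(pair)
        else:
            splits["test"].append(pair)
    return splits


# Placeholder pairs and cutoff years, made up for this example.
pairs = [
    DatedQAPair("q1", "a1", 2001),
    DatedQAPair("q2", "a2", 2008),
    DatedQAPair("q3", "a3", 2010),
    DatedQAPair("q4", "a4", 2011),
]
splits = split_by_year(pairs, val_start=2010, test_start=2011)
```

Splitting by time rather than at random keeps the evaluation questions strictly newer than anything seen during training.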
TriviaQA [14] is the newest QA dataset to be released. The questions are special in the sense that they are taken from trivia games, and the answers are written by trivia enthusiasts. There are 95K question-answer pairs. One of their selling points is that the questions are naturally generated. This avoids trivial regularity in the questions and “potential bias in question style or content”.
All of these QA frameworks operate in a closed-world environment, even if the documents in the dataset were gathered from the open world. In an ideal scenario, we want to solve any of these QA pairs (from the datasets above) in an IR fashion under the open-world environment. Some interesting questions to ask are: (1) how do we measure which dataset is better or worse?; (2) which dataset contains realistic/practical questions?; (3) are they sufficient for applying to a real search engine?
This is how I foresee the future:
Previously, we tried to capture complex patterns (information) from datasets by training deep neural networks or other ML models. Nevertheless, it is hard in practice to build a dataset that embraces large amounts of information about the world, and storing that much information in a neural network is yet another challenge. Later, researchers implanted memory modules into deep neural networks. Some of the earlier works of this kind are Neural Turing Machines [15], Memory Networks [16], and NKLM [17], which give the neural network a memory so that it can hold information for a long time.
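To give a feel for what a memory module does, here is a minimal sketch of a content-based read: the network emits a query, compares it with the key of every memory slot, and reads back a softmax-weighted mix of the stored values. This is a generic simplification written for this post, not the exact mechanism of any of the cited models; all names and numbers are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [x / total for x in exps]


def read_memory(query: Vector,
                keys: List[Vector],
                values: List[Vector]) -> Tuple[Vector, List[float]]:
    """Return the attention-weighted sum of values and the attention weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * value[i] for w, value in zip(weights, values))
            for i in range(dim)]
    return read, weights


# Two memory slots; the query matches the first key much better.
keys = [[5.0, 0.0], [0.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
read, weights = read_memory([1.0, 0.0], keys, values)
```

Since every step is differentiable, the network can learn what to write into the slots and which queries to emit, instead of having to store everything in its weights.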
The following simple analogy shows why the above approach makes sense. Say you are giving an exam under two different settings. In the first setting, the student takes the exam in the standard way. In the second setting, the student gets a cheat sheet on which he can write things down. The first student has to absorb the entire material overnight, whereas the second student can write a portion of the material on the cheat sheet. Clearly, the second student has the advantage, since he does not have to expend as much effort studying for the exam as the first student. This advantage of the second student is analogous to the effect of directly implanting memory modules into the neural network.
Now, consider going one step further: being able to take all the materials into the exam, or having internet access during the exam.
Accordingly, the natural next step is to give search engine access to a neural network. By leveraging the search engine, there are numerous things that can be improved. The obvious one would be the QA task. Thus, we are talking not only about learning to understand the information in a dataset, but also learning to access the information from an external database. People sometimes describe neural networks as black box models. However, in this case, the search engine becomes the black box model for neural networks where the neural network can only observe the output based on its input. Over time, the neural network will learn to make good use of this black box (the search engine).
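The black-box setup can be sketched as a loop in which the model sends a query, observes only the returned snippets, and answers from them. Everything here is a hypothetical stand-in: `make_toy_search` ranks a tiny in-memory corpus by word overlap in place of a real search engine, and `first_snippet_reader` is a placeholder for a learned reading comprehension model.

```python
from typing import Callable, List, Set

SearchFn = Callable[[str, int], List[str]]
ReaderFn = Callable[[str, List[str]], str]


def tokenize(text: str) -> Set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def make_toy_search(corpus: List[str]) -> SearchFn:
    """Build a stand-in search engine that ranks documents by word overlap."""
    def search(query: str, k: int) -> List[str]:
        q = tokenize(query)
        scored = [(len(q & tokenize(doc)), doc) for doc in corpus]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [doc for _, doc in scored[:k]]
    return search


def first_snippet_reader(question: str, snippets: List[str]) -> str:
    """Placeholder reader: returns the top snippet as the answer."""
    return snippets[0] if snippets else ""


def answer_with_search(question: str, search: SearchFn,
                       reader: ReaderFn, k: int = 2) -> str:
    """The model sees the search engine only through its outputs."""
    snippets = search(question, k)
    return reader(question, snippets)


corpus = [
    "Watson defeated two former champions on Jeopardy! in 2011.",
    "SQuAD contains questions posed by crowdworkers on Wikipedia articles.",
    "TriviaQA questions are taken from trivia games.",
]
search = make_toy_search(corpus)
result = answer_with_search("Who posed the questions in SQuAD?",
                            search, first_snippet_reader)
```

Because `answer_with_search` depends only on the `SearchFn` signature, the toy search could be swapped for a real search API without touching the reader.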
There have been some works that hint at this. For example, “Reading Wikipedia to answer Open-Domain Questions” or “Search Engine Guided Non-Parametric Neural Machine Translation”. These papers mention leveraging a search engine to improve QA tasks or neural machine translation. Unfortunately, they do not actually use a search engine.
Hopefully, by now, you are excited to integrate a search engine into your neural network. You may also be wondering which search engine you should use. If so, you should give the AIFounded search engine a try (search.aifounded.com). It’s currently in alpha, but as we progress, we plan to customize it for ML and RL purposes. Our search API is also very easy to integrate. Our search engine covers a variety of global topics and also ensures that all passages in the QA datasets above are included. Additionally, we are currently providing a free search engine API key to university research labs. If you are interested, please e-mail firstname.lastname@example.org.
Daniel Jiwoong Im
Founder and CEO, AIFounded
Acknowledgements: I am very thankful to Seth Lim, Sungjin Ahn, Zhouhan Lin, and Carolyn Augusta for their great feedback.
[1] A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. arXiv:1506.02075, 2015.
[2] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. EMNLP 2011.
[3] J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on Freebase from question-answer pairs. EMNLP 2013.
[4] A. Bordes, S. Chopra, and J. Weston. Question answering with subgraph embeddings. EMNLP 2014.
[5] F. Hill, A. Bordes, S. Chopra, and J. Weston. The Goldilocks principle: Reading children’s books with explicit memory representations. arXiv:1511.02301, 2015.
[6] K.M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. NIPS 2015.
[7] P. Li, W. Li, Z. He, X. Wang, Y. Cao, J. Zhou, and W. Xu. Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. arXiv:1607.06275, 2016.
[8] M. Richardson, C. J.C. Burges, and E. Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. EMNLP 2013.
[9] P. Clark and O. Etzioni. My computer is an honor student but how intelligent is it? Standardized tests as a measure of AI. AI Magazine, 2016.
[10] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. EMNLP 2016.
[11] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. NewsQA: A machine comprehension dataset. arXiv:1611.09830, 2016.
[12] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng. MS MARCO: A human generated machine reading comprehension dataset. NIPS Workshop 2016.
[13] M. Dunn, L. Sagun, M. Higgins, U. Guney, V. Cirik, and K. Cho. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv:1704.05179, 2017.
[14] M. Joshi, E. Choi, D.S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv:1705.03551, 2017.
[15] A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. arXiv:1410.5401, 2014.
[16] J. Weston, S. Chopra, and A. Bordes. Memory Networks. arXiv:1410.3916, 2014.
[17] S. Ahn, H. Choi, T. Parnamaa, and Y. Bengio. A neural knowledge language model. arXiv:1608.00318, 2016.
[18] I. V. Serban, A. G. Duran, C. Gulcehre, S. Ahn, S. Chandar, A. Courville, and Y. Bengio. Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus. arXiv:1603.06807, 2016.
[19] S. Wang and J. Jiang. Machine comprehension using Match-LSTM and answer pointer. arXiv:1608.07905, 2016.
[20] C. Buck, J. Bulian, M. Ciaramita, A. Gesmundo, N. Houlsby, and W. Gajewski. Ask the right questions: Active question reformulation with reinforcement learning. arXiv:1705.07830, 2017.