Automatic question generation and answer assessment: a survey

Das, Bidyut; Majumder, Mukta; Phadikar, Santanu; Sekh, Arif Ahmed

doi:10.1186/s41039-021-00151-1

Research and Practice in Technology Enhanced Learning

Table 5 Dataset description

Dataset	Description
QGSTEC	A corpus of over 1000 questions. The questions are generated from individual sentences or a paragraph.
TabMCQ	The dataset contains a large set of crowd-sourced MCQs covering the facts in 65 hand-crafted tables.
SQuAD	The dataset consists of 100K+ samples collected from Wikipedia articles. Each sample consists of question-answer pairs with a passage. The answer is a span of text from the passage.
30MQA	The corpus consists of 30M question-answer pairs created by humans, each paired with its corresponding Freebase fact, which is represented by a triple. A triple consists of a subject, a relationship, and an object; it is converted into a question in which the object is the correct answer.
MS MARCO	The dataset covers 1,010,916 questions drawn from Bing search query logs, with human-generated answers.
RACE	The dataset consists of a large set of questions (nearly 100K), answers and associated passages generated by human experts.
NewsQA	A large-scale dataset containing over 100K human-generated question-answer pairs based on a set of over 10K news articles.
TriviaQA	The dataset contains over 650K question-answer-evidence triples. The evidence documents are collected from web search results and Wikipedia pages.
SciQ	The dataset consists of 13.7K crowdsourced multiple-choice science questions. Each MCQ has one correct answer and three distractors, plus an additional passage providing supporting evidence for the correct answer. Most instances are derived from the passages used to generate the questions.
MCQL	The dataset was crawled from the Web and contains 7.1K MCQs. Each MCQ has four fields: sentence, answer, distractors, and the number of distractors.
NarrativeQA	The dataset contains a large number of question-answer pairs drawn from a smaller collection of long documents. It is designed so that answering the questions correctly requires understanding the underlying narrative rather than just pattern matching.
LearningQ	The dataset contains 230K+ document-question pairs created by instructors and learners.