
Table 5 Dataset description

From: Automatic question generation and answer assessment: a survey

Dataset

Description

QGSTEC

A corpus of over 1000 questions. The questions are generated from individual sentences or a paragraph.

TabMCQ

The dataset contains a large set of crowd-sourced MCQs covering the facts in 65 hand-crafted tables.

SQuAD

The dataset consists of 100K+ samples collected from Wikipedia articles. Each sample consists of a question-answer pair with a passage. The answer is a segment of text from the passage.

30MQA

The corpus consists of 30M question-answer pairs created by humans, each with its corresponding Freebase fact, which is represented as a triple. A triple consists of a subject, a relationship, and an object; it is converted into a question with this subject and object, where the object is the correct answer.

MS MARCO

The dataset contains 1,010,916 questions drawn from Bing's search query logs, with human-generated answers.

RACE

The dataset consists of a large set of questions (nearly 100K), answers and associated passages generated by human experts.

NewsQA

A large-scale dataset containing over 100K human-generated question-answer pairs based on a set of over 10K news articles.

TriviaQA

The dataset contains over 650K question-answer-evidence triples. The evidence documents are collected from web search results and Wikipedia pages.

SciQ

The dataset consists of 13.7K crowdsourced multiple-choice science questions. Every MCQ has one correct answer and three distractors, plus one additional passage that provides supporting evidence for the correct answer. Most instances come with the passage used to generate the question.

MCQL

The dataset was crawled from the Web and contains 7.1K MCQs. Each MCQ has four fields: sentence, answer, distractors, and the number of distractors.

NarrativeQA

The dataset contains a large number of question-answer pairs drawn from a smaller collection of long documents. It is designed so that answering the questions correctly requires a deep understanding of the underlying narrative rather than just pattern matching.

LearningQ

The dataset contains 230K+ document-question pairs created by instructors and learners.
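
The record shapes described in the table can be sketched roughly as follows. This is only an illustrative sketch: the class and field names (SpanQA, Triple, MCQ, answer_start, and so on) are hypothetical and do not correspond to the official schemas or loaders of any of the datasets listed. It shows a SQuAD-style sample whose answer is a segment of the passage, a 30MQA-style fact triple, and an MCQL-style item with its four fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanQA:
    """Hypothetical SQuAD-style sample: the answer is a segment of the passage."""
    passage: str
    question: str
    answer: str

    def answer_start(self) -> int:
        # Character offset of the answer in the passage, or -1 if it is not a span.
        return self.passage.find(self.answer)


@dataclass
class Triple:
    """Hypothetical 30MQA-style fact: the object is the correct answer."""
    subject: str
    relationship: str
    obj: str


@dataclass
class MCQ:
    """Hypothetical MCQL-style item: sentence, answer, distractors, and their count."""
    sentence: str
    answer: str
    distractors: List[str]

    @property
    def num_distractors(self) -> int:
        # Derived from the distractor list rather than stored separately.
        return len(self.distractors)
```

A span-style sample can then be checked by confirming that answer_start() does not return -1, which is the property that separates extractive datasets such as SQuAD from free-form ones such as NarrativeQA.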