Crowdsourced dataset to study the generation and impact of text highlighting in classification tasks

Objectives: Text classification is a recurrent goal in machine learning projects and a typical task in crowdsourcing platforms. Hybrid approaches, leveraging crowdsourcing and machine learning, work better than either in isolation and help to reduce crowdsourcing costs. One way to mix crowd and machine efforts is to have algorithms highlight passages from texts and feed these to the crowd for classification. In this paper, we present a dataset to study text highlighting generation and its impact on document classification.

Data description: The dataset was created through two series of experiments. In the first phase, we asked workers to classify documents according to a relevance question and to highlight the parts of the text that supported their decision. In the second phase, we asked workers to assess document relevance with the support of text highlighting of varying quality (six human-generated and six machine-generated highlighting conditions). The dataset features documents from two application domains (systematic literature reviews and product reviews), three document sizes, and three relevance questions of different levels of difficulty. We expect this dataset of 27,711 individual judgments from 1851 workers to benefit not only this specific problem domain, but also the larger class of classification problems where crowdsourced datasets with individual judgments are scarce.


Objective
In this paper, we introduce datasets derived from multiple crowdsourcing experiments for document classification tasks. These experiments resemble a two-step pipeline that first highlights relevant passages and then classifies the documents. The datasets include the individual judgments provided by the workers for both steps of our pipeline, totaling 27,711 judgments from 1851 workers.
Research has shown the feasibility of leveraging non-expert annotators in complex NLP tasks [1]. Text classification, in particular, is a recurrent goal of machine learning (ML) projects, and a typical task in crowdsourcing platforms. Hybrid approaches, combining ML and crowd efforts, have been proposed to boost accuracy and reduce costs [2][3][4]. One possibility is to use automatic techniques to highlight relevant excerpts in the text and then ask workers to classify the documents. In doing so, workers could rely on the highlights and avoid reading parts of the text, or ignore the highlighting and read the full text. In this context, we ran crowdsourcing experiments to study the effects that text highlighting has on human performance in classification tasks [5]. In these experiments, we focused on two crowdsourcing tasks: gathering the text highlights, and classification. The highlight gathering task produced a dataset containing crowd-generated highlights that could serve, for example, researchers studying automatic techniques such as text summarizers and question-answering models. The classification datasets could benefit researchers from the human computation community working on problems such as assessing and assuring quality [6], budget optimization [7,8], and worker behavior [9], as well as further investigating highlighting support.

Open Access. BMC Research Notes. Correspondence: jorge.ramirezmedina@unitn.it, Department of Information Engineering and Computer Science, University of Trento, Via Sommarive, 9, Povo, 38123 Trento, TN, Italy.

Data description
In the following, we describe the crowdsourcing experiments that generated the dataset, as well as the dataset structure.

Task
In our experiments, we asked workers to assess whether a document is relevant to a given question (predicate), augmenting the task design found in the literature [10,11]. The documents come from two different domains: systematic literature reviews (SLR) and Amazon reviews. For the SLR domain, we considered two predicates: "Does the paper describe a study that involves older adults (60+)?" (OA), and "Does the paper describe a study that involves technology for online social interactions?" (Tech). For Amazon reviews, we asked, "Is this review written on a book?" (AMZ).
All tasks were run on the crowdsourcing platform Figure Eight (https://www.figure-eight.com/). No personal information was requested from workers; we only collected class labels and statistics related to effort.
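Because the dataset keeps individual judgments rather than a single label per document, readers will often need to aggregate them. The following is a minimal sketch of majority voting per document and predicate. The record layout, worker and document identifiers, and label values are hypothetical and do not reflect the actual schema of the published files, and the paper does not prescribe this aggregation rule.

```python
from collections import Counter, defaultdict

# Hypothetical records: (worker_id, document_id, predicate, label).
# Field names and values are illustrative, not the dataset's schema.
judgments = [
    ("w1", "d1", "OA", "yes"),
    ("w2", "d1", "OA", "yes"),
    ("w3", "d1", "OA", "no"),
    ("w1", "d2", "AMZ", "no"),
    ("w4", "d2", "AMZ", "no"),
]


def majority_vote(records):
    """Return the most frequent label for each (document, predicate) pair."""
    votes = defaultdict(Counter)
    for _worker, doc, predicate, label in records:
        votes[(doc, predicate)][label] += 1
    # Ties are not handled specially here; a real analysis would need
    # an explicit tie-breaking rule.
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


aggregated = majority_vote(judgments)
```

Majority voting is only one option; having the individual judgments also allows weighted or model-based aggregation.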

Gathering text highlights
The first step is to generate highlights. This step serves as the basis of our study on text highlighting as an aid to workers in the classification tasks. We considered crowdsourcing and ML to generate the highlighted excerpts. For crowd-generated highlights, we asked workers to classify documents and to justify their decisions by highlighting passages from the text. For machine-generated highlights, we used state-of-the-art extractive summarization and question-answering models. Two experts judged the quality of the highlights provided by the crowd and by the automatic techniques (Kappa was 0.87 for OA, 0.72 for Tech and 0.66 for AMZ). Table 1 shows the files containing the generated highlights (crowd and ML); both datasets include the individual highlights and their associated quality.

Classification with highlighting support

Experiment 1
In this experiment, we asked workers to classify documents, providing additional support in the form of highlighted passages from the text. Workers proceeded on pages of three documents each, up to six pages (3 × 6 layout). We categorized the available crowdsourced highlights according to their quality and derived six experimental conditions for our study. The baseline condition does not show any highlighted text. The 0%, 33%, 66% and 100% conditions show highlights of varying quality. For example, on a page with three documents, the 33% condition shows one high-quality highlight and two low-quality ones. Finally, the aggregation condition combines multiple highlights, similar to aggregating votes in crowdsourcing tasks.
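To make the quality conditions concrete, the sketch below composes one page of documents so that a given share of them carries a high-quality highlight, matching the example of the 33% condition on a three-document page. The function name, the document pools, and the sampling procedure are assumptions made for illustration; the source does not describe how pages were actually assembled.

```python
import random


def compose_page(high_quality, low_quality, docs_per_page, share_high, rng):
    """Pick documents for one page so that roughly `share_high` of them
    carry a high-quality highlight and the rest a low-quality one.

    `high_quality` and `low_quality` are hypothetical pools of document ids.
    """
    n_high = round(share_high * docs_per_page)
    n_low = docs_per_page - n_high
    page = [(doc, "high") for doc in rng.sample(high_quality, n_high)]
    page += [(doc, "low") for doc in rng.sample(low_quality, n_low)]
    rng.shuffle(page)  # avoid always showing high-quality highlights first
    return page


rng = random.Random(0)
page_33 = compose_page(["a", "b", "c"], ["x", "y", "z"], 3, 0.33, rng)
page_66 = compose_page(["a", "b", "c"], ["x", "y", "z"], 3, 0.66, rng)
n_high_33 = sum(1 for _doc, quality in page_33 if quality == "high")
n_high_66 = sum(1 for _doc, quality in page_66 if quality == "high")
```

With three documents per page, the 0%, 33%, 66% and 100% shares map to zero, one, two and three high-quality highlights respectively.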

Experiment 2
This experiment focused on longer documents and pages, using 6 × 6 and 3 × 12 layouts and crowd-generated highlights. We kept the baseline as one experimental condition and introduced an 83% quality condition as the other.

Experiment 3
This experiment used machine-generated highlights, with a 3 × 6 layout and six experimental conditions: BertSum, Refresh, Bert-QA, AggrML, 100%ML, and baseline. BertSum [12] and Refresh [13] are extractive summarization techniques, while Bert-QA [14] is a question-answering model. AggrML aggregates the output from the three algorithms, and 100%ML only uses machine-generated highlighting assessed by experts as being of good quality. We encourage readers to check [5] for a more in-depth explanation of the experimental settings. Table 1 overviews the available datasets derived from our experiments.
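The text does not state how the aggregation conditions combine highlights. One plausible rule, shown here purely as an illustration, is character-level voting: a character stays highlighted if at least a minimum number of sources (algorithms or workers) highlighted it. The span representation, the function name and the threshold are assumptions, not the authors' method; see [5] for the actual procedure.

```python
def aggregate_spans(spans, min_votes):
    """Merge (start, end) character spans by voting.

    Keeps every position covered by at least `min_votes` spans and returns
    the surviving positions as merged (start, end) spans. `end` is exclusive.
    """
    if not spans:
        return []
    length = max(end for _start, end in spans)
    votes = [0] * length
    for start, end in spans:
        for i in range(start, end):
            votes[i] += 1

    merged = []
    run_start = None
    for i, count in enumerate(votes):
        if count >= min_votes and run_start is None:
            run_start = i  # a highlighted run begins
        elif count < min_votes and run_start is not None:
            merged.append((run_start, i))  # the run ends before position i
            run_start = None
    if run_start is not None:
        merged.append((run_start, length))
    return merged


# Hypothetical outputs of three highlighting sources on the same document.
source_spans = [(0, 10), (5, 15), (8, 20)]
majority = aggregate_spans(source_spans, min_votes=2)
unanimous = aggregate_spans(source_spans, min_votes=3)
```

A lower threshold favours recall (the union of all highlights), while a higher one favours precision (only passages every source agrees on).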

Limitations
The dataset described in this paper features a set of dimensions that allow for an exploration of approaches, but that cannot be considered comprehensive. The dataset is still limited to two types of classification tasks, includes only the most widely used state-of-the-art algorithms for highlight generation, and relies on two task designs for crowd classification. Besides, the experiments with longer pages and documents (Experiment 2) are extensions of the first experiment and focus only on one relevance question.
These alternatives have been carefully selected, but more systematic studies will require a more in-depth investigation of each of these dimensions.

Abbreviations
ML: machine learning; SLR: systematic literature reviews; OA: relevance question "Does the paper describe a study that involves older adults (60+)?"; Tech: relevance question "Does the paper describe a study that involves technology for online social interactions?"; AMZ: relevance question "Is this review written on a book?".