Genomic, transcriptomic and epigenomic sequencing data of the B-cell leukemia cell line REH

Objectives: The aim of this data paper is to describe a collection of 33 genomic, transcriptomic and epigenomic sequencing datasets of the B-cell acute lymphoblastic leukemia (ALL) cell line REH. REH is one of the most frequently used cell lines for functional studies of pediatric ALL, and these data provide a multi-faceted characterization of its molecular features. The datasets described herein, generated with short- and long-read sequencing technologies, can both provide insights into the complex aberrant karyotype of REH, and be used as reference datasets for sequencing data quality assessment or for methods development.
Data description: This paper describes 33 datasets corresponding to 867 gigabases of raw sequencing data generated from the REH cell line. These datasets include five different approaches for whole genome sequencing (WGS) on four sequencing platforms, two RNA sequencing (RNA-seq) techniques on two different sequencing platforms, DNA methylation sequencing, and single-cell ATAC-sequencing.

Objective
Human cell lines are commonly used by researchers as accessible models of disease [1,2]. The REH cell line, derived from a fifteen-year-old female patient at relapse, is frequently used in the study of ALL, the most common cancer in children [3,4], as well as for method development [5]. At the same time, next-generation sequencing has become an invaluable tool for cancer research [6,7], while long-read technology increasingly offers novel insights into complex oncological aberrations [8,9]. Therefore, a multi-faceted dataset encompassing the genomics, transcriptomics and epigenomics of a cell line such as REH can be a valuable resource for leukemia researchers. Likewise, developers of bioinformatic analysis software stand to benefit from publicly available reference datasets [10,11].
A subset of the datasets in this project was used for downstream analysis with the purpose of cataloging the structural variants and fusion genes of the REH cell line [12]. For this project, mapping was performed to the human reference genome GRCh38. Additionally, the long-read WGS datasets were subjected to de-novo assembly.
Here, we present the raw sequencing reads as well as assemblies and mapped BAM files in order to make the data available to the research community.

Genomic datasets
The genomic data consist of short- and long-read sequencing datasets, including de-novo assemblies, providing a combination of generous coverage and contiguity that allows for in-depth analysis of the genomic variation present in this cell line. Included are FASTQ files from short-read WGS of two lanes prepared with the Illumina TruSeq DNA PCR-Free kit and sequenced on the HiSeq X sequencer with PE150 read length, as well as a BAM file of the reads mapped to the human reference genome GRCh38.
Long-read WGS datasets include FASTQ files generated from a CLR library and a HiFi library sequenced on the PacBio Sequel II, as well as six ONT libraries prepared with three different kits using DNA size-selected to varying lengths and sequenced on the PromethION 24. BAM files of the reads from both PacBio libraries and the ONT ultralong library mapped to GRCh38 are included, as are three de-novo assemblies generated from these reads using hifiasm and Flye.
Additionally, there are FASTQ files from one WGS library prepared with BGI's MGIEasy stLFR kit and sequenced on the MGISEQ-2000RS, as well as from two linked-read WGS libraries prepared using the 10x Genomics GemCode kit and sequenced on the Illumina HiSeq 2500.
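As a rough illustration of how the sequencing depth of any one of these FASTQ datasets could be checked, the following minimal sketch counts the bases in a FASTQ stream and divides by an approximate genome size. The function names, the genome size constant of 3.1 gigabases, and the demo reads are assumptions made for this example; they are not part of the published datasets or of the pipeline used in this project.

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple

# Assumption: approximate length of the human reference genome, in bases.
APPROX_GENOME_SIZE = 3.1e9


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate stray blank lines
            continue
        sequence = handle.readline().rstrip("\n")
        handle.readline()       # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:].strip(), sequence, quality


def total_bases(handle: TextIO) -> int:
    """Sum the read lengths of every record in the stream."""
    return sum(len(sequence) for _, sequence, _ in read_fastq(handle))


def estimated_depth(n_bases: float, genome_size: float = APPROX_GENOME_SIZE) -> float:
    """Mean depth, assuming reads are spread evenly over the genome."""
    return n_bases / genome_size


if __name__ == "__main__":
    # Tiny in-memory demo; a real file would be opened with open_fastq(path).
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCCA\n+\nIIIII\n")
    n = total_bases(demo)
    print(f"{n} bases, about {estimated_depth(n):.2e}x depth")
```

This estimate counts raw bases only, so it is an upper bound on the usable depth: unmapped, duplicated, or low-quality reads are not subtracted.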

Chromatin accessibility datasets
Single-cell ATAC-seq enables the selective sequencing of chromatin-accessible genomic regions, allowing chromatin accessibility profiles to be determined at the level of individual cells. A library was prepared using the Chromium Single Cell ATAC Reagent Kit from 10x Genomics and sequenced on an SP flowcell on an Illumina NovaSeq 6000 instrument. FASTQ data, plus a BAM file of these reads mapped to GRCh38, are included among the datasets.

Epigenomic datasets
Methylome analysis of the REH cell line can be performed using the epigenomic datasets, which identify 5-mC or 5-hmC modifications to DNA. Two such libraries were prepared with 10 ng and 100 ng input DNA using the NEBNext Enzymatic Methyl-seq (EM-seq) kit. The libraries were sequenced on an Illumina NovaSeq 6000 on an S4 flowcell.
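In conversion-based methylation sequencing such as EM-seq, unmodified cytosines are read as T, while 5-mC and 5-hmC are protected and still read as C. A per-site methylation level is therefore commonly estimated as the fraction of C calls among all C and T calls at that position. The sketch below shows this arithmetic only; the function names, the `min_coverage` threshold and the example counts are hypothetical and do not describe any analysis performed on these datasets.

```python
from typing import Iterable, Optional, Tuple


def methylation_level(c_reads: int, t_reads: int) -> Optional[float]:
    """Fraction of reads supporting a protected (modified) cytosine.

    Returns None when the site has no coverage.
    """
    total = c_reads + t_reads
    if total == 0:
        return None
    return c_reads / total


def mean_methylation(
    sites: Iterable[Tuple[int, int]],
    min_coverage: int = 5,  # hypothetical threshold, chosen for illustration
) -> Optional[float]:
    """Average per-site level over sites with at least min_coverage reads."""
    threshold = max(min_coverage, 1)  # never divide by zero coverage
    levels = [c / (c + t) for c, t in sites if c + t >= threshold]
    if not levels:
        return None
    return sum(levels) / len(levels)


if __name__ == "__main__":
    # (C count, T count) per cytosine position; invented numbers.
    example_sites = [(8, 2), (0, 10), (1, 1)]
    print(methylation_level(8, 2))
    print(mean_methylation(example_sites))
```

Note that this estimate cannot distinguish 5-mC from 5-hmC, since both are read as C after conversion.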

Transcriptomic datasets
The datasets include both short-read and long-read transcriptomic data, allowing insight into gene expression and aberrations such as fusion genes, as well as detailed transcript splicing information. The RNA-seq datasets include FASTQ files from the short-read sequencing of two lanes prepared with the Illumina TruSeq Stranded Total RNA kit and sequenced with PE100 read length on a NovaSeq 6000 instrument, as well as a BAM file of these reads mapped to GRCh38. The long-read RNA-seq data consist of two IsoSeq libraries, with a varying bead ratio used to generate one library with standard-length transcripts and one library with full-length transcripts. For each of the IsoSeq libraries, an additional dataset is included that contains the resulting FLNC reads and a BAM file of these reads mapped to GRCh38.

Limitations
• The 10x Genomics GemCode linked-read sequencing technology is discontinued.
• The MGISEQ WGS data was sequenced to low (~10x) sequencing depth.
• The REH cells used to generate the datasets herein were obtained from a single source. Given that cell lines may undergo alterations during proliferation, leading to genetic heterogeneity within the cell population, these data may not serve as a universal reference for all REH cultures.