- Data Note
- Open access
Genomic, transcriptomic and epigenomic sequencing data of the B-cell leukemia cell line REH
BMC Research Notes volume 16, Article number: 265 (2023)
Abstract
Objectives
The aim of this data paper is to describe a collection of 33 genomic, transcriptomic and epigenomic sequencing datasets of the B-cell acute lymphoblastic leukemia (ALL) cell line REH. REH is one of the most frequently used cell lines for functional studies of pediatric ALL, and these data provide a multi-faceted characterization of its molecular features. The datasets described herein, generated with short- and long-read sequencing technologies, can both provide insights into the complex aberrant karyotype of REH, and be used as reference datasets for sequencing data quality assessment or for methods development.
Data description
This paper describes 33 datasets corresponding to 867 gigabases of raw sequencing data generated from the REH cell line. These datasets include five different approaches for whole genome sequencing (WGS) on four sequencing platforms, two RNA sequencing (RNA-seq) techniques on two different sequencing platforms, DNA methylation sequencing, and single-cell ATAC-sequencing.
Objective
Human cell lines are commonly used by researchers as accessible models of disease [1, 2]. The REH cell line, derived from a fifteen-year-old female patient at relapse, is frequently used in the study of ALL, the most common cancer in children [3, 4], as well as for method development [5]. At the same time, next-generation sequencing has become an invaluable tool for cancer research [6, 7], while long-read technology increasingly offers novel insights into complex oncological aberrations [8, 9]. Therefore, a multi-faceted dataset encompassing the genomics, transcriptomics and epigenomics of a cell line such as REH can be a valuable resource for leukemia researchers. Likewise, developers of bioinformatic analysis software stand to benefit from publicly available reference datasets [10, 11].
A subset of the datasets in this project was used for downstream analysis to catalog the structural variants and fusion genes of the REH cell line [12]. For this project, reads were mapped to the human reference genome GRCh38. Additionally, the long-read WGS datasets were subjected to de-novo assembly.
Here, we present the raw sequencing reads as well as assemblies and mapped BAM files in order to make the data available to the research community.
Data description
The project consists of 33 sequencing datasets generated from the ALL cell line REH, which was obtained from DSMZ (ACC 22) and cultured according to the supplier’s specifications (see Supplemental Methods) [13, 14]. The cell line’s authenticity was confirmed by karyotyping [12] and STR analysis [15]. The datasets are divided into nine library types. Of the seven library types using DNA as input, five are whole genome sequencing (WGS) methods producing genomic data [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34], one is a method producing chromatin accessibility data (single-cell ATAC-seq) [35,36,37,38,39], and one is a whole genome methylome sequencing method producing epigenomic data (EM-seq) [40, 41]. The WGS methods include Illumina TruSeq DNA PCR-Free, PacBio SMRT, Oxford Nanopore (ONT), MGISEQ stLFR and linked-read WGS (10x Genomics), while RNA was used as input to two RNA-seq methods, Illumina TruSeq Stranded Total RNA and PacBio IsoSeq [42,43,44,45,46,47,48]. The datasets include raw sequencing reads in FASTA and FASTQ formats, reference-mapped BAM files, and de-novo assemblies (Table 1).
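The breakdown of library types above can be checked with a short tally. The following is a minimal sketch, not author-supplied metadata: the dictionary keys and the (input, category) labels are paraphrased from the paragraph above, and the names `LIBRARY_TYPES` and `tally` are hypothetical.

```python
from collections import Counter

# Library types named in the data description.
# The (input material, category) labels paraphrase the text; they are
# illustrative and not taken from the authors' metadata table.
LIBRARY_TYPES = {
    "Illumina TruSeq DNA PCR-Free": ("DNA", "WGS"),
    "PacBio SMRT": ("DNA", "WGS"),
    "Oxford Nanopore (ONT)": ("DNA", "WGS"),
    "MGISEQ stLFR": ("DNA", "WGS"),
    "10x Genomics linked-read": ("DNA", "WGS"),
    "10x Chromium single-cell ATAC-seq": ("DNA", "chromatin accessibility"),
    "NEBNext EM-seq": ("DNA", "methylome"),
    "Illumina TruSeq Stranded Total RNA": ("RNA", "RNA-seq"),
    "PacBio IsoSeq": ("RNA", "RNA-seq"),
}


def tally(library_types):
    """Count library types by input material and by data category."""
    by_input = Counter(inp for inp, _ in library_types.values())
    by_category = Counter(cat for _, cat in library_types.values())
    return by_input, by_category


by_input, by_category = tally(LIBRARY_TYPES)
print(len(LIBRARY_TYPES), dict(by_input), dict(by_category))
```

The counts agree with the text: nine library types in total, seven with DNA input (five of them WGS) and two with RNA input.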
Genomic datasets
The genomic data consists of short- and long-read sequencing datasets, including de-novo assemblies, providing a combination of generous coverage and contiguity that allows for the in-depth analysis of the genomic variation present in this cell line. Included are FASTQ files from the short-read WGS sequencing of two lanes prepared with the Illumina TruSeq DNA PCR-Free kit and sequenced on the HiSeq X sequencer with PE150 read-length, as well as a BAM file of the reads mapped to human reference genome GRCh38.
Long-read WGS datasets include FASTQ files generated from a CLR library and a HiFi library sequenced on the PacBio Sequel II, as well as six ONT libraries prepared with three different kits using DNA selected to varying sizes and sequenced on the PromethION 24. BAM files mapping reads generated from both PacBio libraries and the ONT ultralong library to GRCh38 are included, as are three de-novo assemblies generated from these reads using hifiasm and flye.
Additionally, there are FASTQ files from one WGS library prepared with BGI’s MGIEasy stLFR kit and sequenced on the MGISEQ-2000RS, as well as from two linked-read WGS libraries prepared using the 10x Genomics GemCode kit and sequenced on the Illumina HiSeq 2500.
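For readers who want to retrieve the genomic runs, the sketch below shows one possible way to turn a run accession into its identifiers.org link (the URL pattern used in the reference list) and into SRA Toolkit command lines. It only builds strings and downloads nothing. The helper names, the output directory `reh_fastq` and the choice of `prefetch` followed by `fasterq-dump` are assumptions for illustration; the options should be checked against the installed SRA Toolkit version.

```python
import re
import shlex

# Run accessions copied from the reference list (Illumina WGS lanes 1 and 2,
# PacBio HiFi WGS). Other runs can be added in the same way.
GENOMIC_RUNS = ["SRR10882610", "SRR10882609", "SRR19123265"]

# INSDC run accessions start with SRR, ERR or DRR followed by digits.
RUN_PATTERN = re.compile(r"^[SED]RR\d+$")


def identifiers_url(run):
    """Return the identifiers.org link for a run accession."""
    if not RUN_PATTERN.match(run):
        raise ValueError(f"not a run accession: {run!r}")
    return f"https://identifiers.org/insdc.sra:{run}"


def download_commands(run, outdir="reh_fastq"):
    """Build (but do not execute) SRA Toolkit commands for one run.

    Assumes SRA Toolkit is installed; 'outdir' is a hypothetical path.
    """
    if not RUN_PATTERN.match(run):
        raise ValueError(f"not a run accession: {run!r}")
    return [
        f"prefetch {run}",
        f"fasterq-dump {run} --split-files --outdir {shlex.quote(outdir)}",
    ]


for run in GENOMIC_RUNS:
    print(identifiers_url(run))
    for command in download_commands(run):
        print("  " + command)
```

Keeping the commands as plain strings makes it easy to review them, or to write them to a job script, before any large download is started.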
Chromatin accessibility datasets
Single-cell ATAC-seq enables the selective sequencing of chromatin-accessible genomic regions, allowing for the determination of chromatin accessibility profiles on a cellular level. A library was prepared using the Chromium Single Cell ATAC Reagent Kit from 10x Genomics and sequenced on an SP flowcell on an Illumina NovaSeq 6000 instrument. FASTQ data, plus a BAM file mapping this data to GRCh38, are included among the datasets.
Epigenomic datasets
Methylome analysis of the REH cell line can be performed using the epigenomic datasets, which identify 5-mC or 5-hmC modifications to DNA. Two such libraries were prepared with 10 ng and 100 ng input DNA using the NEBNext enzymatic methyl-seq kit (EM-seq). The libraries were sequenced on an Illumina NovaSeq 6000 on an S4 flowcell.
Transcriptomic datasets
The datasets include both short-read and long-read transcriptomic data, allowing insight into gene expression and aberrations such as fusion genes, as well as detailed transcript splicing information. The RNA-seq datasets include FASTQ files from the short-read sequencing of two lanes prepared with the Illumina TruSeq Stranded Total RNA kit and sequenced PE-100 on a NovaSeq 6000 instrument, as well as a BAM file of these reads mapped to GRCh38. The long-read RNA-seq data consists of two IsoSeq libraries, with a varying bead ratio used to generate one library with standard-length transcripts and one library with full-length transcripts. An additional dataset containing resulting FLNC reads and a BAM file mapping them to GRCh38 is included for each of the IsoSeq libraries.
Limitations
- The 10x Genomics GemCode linked-read sequencing technology is discontinued.
- The MGISEQ WGS data was sequenced to low (~10x) sequencing depth.
- The REH cells used to generate the datasets herein were obtained from a single source. Given that cell lines may undergo alterations during proliferation, leading to genetic heterogeneity within the cell population, these data may not serve as a universal reference for all REH cultures.
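The sequencing depth mentioned in the limitations can be related to data volume with the usual estimate, mean depth = total sequenced bases / genome length. The sketch below assumes a GRCh38 length of roughly 3.1 gigabases, a rounded figure that is not taken from this paper, and the function names are hypothetical.

```python
# Approximate length of the GRCh38 reference in bases.
# This is a rounded assumption, not a value reported in the data note.
GRCH38_APPROX_BASES = 3.1e9


def mean_depth(total_bases, genome_bases=GRCH38_APPROX_BASES):
    """Estimate mean sequencing depth as total bases / genome length."""
    if genome_bases <= 0:
        raise ValueError("genome_bases must be positive")
    return total_bases / genome_bases


def bases_for_depth(depth, genome_bases=GRCH38_APPROX_BASES):
    """Estimate how many sequenced bases a target mean depth requires."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return depth * genome_bases


# Under the assumed genome length, about 31 gigabases of reads
# correspond to a mean depth of about 10x.
print(round(mean_depth(31e9), 1))
print(bases_for_depth(10) / 1e9)
```

This is a genome-wide average only; it ignores unmapped reads, duplicates and uneven coverage, so realized depth at any given locus can differ.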
Data Availability
References
Gillet J-P, Varma S, Gottesman MM. The clinical relevance of Cancer Cell Lines. JNCI J Natl Cancer Inst. 2013;105:452–8.
Gazdar AF, Minna JD. Cell lines as an investigational tool for the study of biology of small cell lung cancer. Eur J Cancer Clin Oncol. 1986;22:909–11.
Rosenfeld C, Goutner A, Choquet C, Venuat AM, Kayibanda B, Pico JL, et al. Phenotypic characterisation of a unique non-T, non-B acute lymphoblastic leukaemia cell line. Nature. 1977;267:841–3.
Cortes JE, Kantarjian HM. Acute lymphoblastic leukemia. A comprehensive review with emphasis on biology and therapy. Cancer. 1995;76:2393–417.
Raine A, Manlig E, Wahlberg P, Syvänen A-C, Nordlund J. SPlinted Ligation Adapter Tagging (SPLAT), a novel library preparation method for whole genome bisulphite sequencing. Nucleic Acids Res. 2017;45:e36–6.
Shyr D, Liu Q. Next generation sequencing in cancer research and clinical application. Biol Proced Online. 2013;15:4.
LeBlanc VG, Marra MA. Next-generation sequencing approaches in Cancer: where have they brought us and where will they take us? Cancers. 2015;7:1925–58.
Sakamoto Y, Sereewattanawoot S, Suzuki A. A new era of long-read sequencing for cancer genomics. J Hum Genet. 2020;65:3–10.
Rausch T, Snajder R, Leger A, Simovic M, Giurgiu M, Villacorta L, et al. Long-read sequencing of diagnosis and post-therapy medulloblastoma reveals complex rearrangement patterns and epigenetic signatures. Cell Genomics. 2023;3:100281.
Fang LT, Zhu B, Zhao Y, Chen W, Yang Z, Kerrigan L, et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat Biotechnol. 2021;39:1151–60.
Ren L, Duan X, Dong L, Zhang R, Yang J, Gao Y, et al. Quartet DNA reference materials and datasets for comprehensively evaluating germline variants calling performance. preprint. Bioinformatics; 2022.
Lysenkova Wiklander M, Arvidsson G, Bunikis I, Lundmark A, Raine A, Marincevic-Zuniga Y, et al. A complete digital karyotype of the B-cell leukemia REH cell line resolved by long-read sequencing. preprint. Cancer Biology; 2023.
Lysenkova Wiklander M. REH Data Note - Overview of REH sequencing datasets. 2023. https://doi.org/10.6084/m9.figshare.23966340. Accessed 16 Aug 2023.
Lysenkova Wiklander M. REH Data Note - Supplemental Methods.pdf. 2023. https://doi.org/10.6084/M9.FIGSHARE.22643065. Accessed 11 May 2023.
Lysenkova Wiklander M. REH Data Note - Data File 3. Short Tandem Repeat Analysis of the REH cell line. 2023. https://doi.org/10.6084/m9.figshare.24131670
NCBI Sequence Read Archive. WGS of REH (Illumina TruSeq DNA PCR-Free) - Illumina HiSeq X - Lane 1. 2023. https://identifiers.org/insdc.sra:SRR10882610
NCBI Sequence Read Archive. WGS of REH (Illumina TruSeq DNA PCR-Free) - Illumina HiSeq X - Lane 2. 2023. https://identifiers.org/insdc.sra:SRR10882609
NCBI Sequence Read Archive. WGS of REH (Illumina TruSeq DNA PCR-Free) - mapped - hg38. 2023. https://identifiers.org/insdc.sra:SRR23704824
NCBI Sequence Read Archive. CLR WGS of REH (PacBio SMRT). 2023. https://identifiers.org/insdc.sra:SRR22805329
NCBI Sequence Read Archive. HiFi WGS of REH (PacBio SMRT). 2023. https://identifiers.org/insdc.sra:SRR19123265
NCBI Sequence Read Archive. HiFi/CLR WGS of REH (PacBio SMRT) - mapped - hg38. 2023. https://identifiers.org/insdc.sra:SRR23704823
NCBI Sequence Read Archive. De-novo REH assembly (hifiasm, PacBio HiFi and CLR WGS). 2023. https://identifiers.org/insdc.sra:SRR23704827
NCBI Sequence Read Archive. ONT WGS of REH, sheared to 10 kb. 2023. https://identifiers.org/insdc.sra:SRR22730978
NCBI Sequence Read Archive. ONT WGS of REH, sheared to 20 kb. 2023. https://identifiers.org/insdc.sra:SRR22444744
NCBI Sequence Read Archive. ONT WGS of REH, sheared to 30 kb. 2023. https://identifiers.org/insdc.sra:SRR23054498
NCBI Sequence Read Archive. ONT WGS of REH, sheared to 60 kb, size selected with Circulomics SRE. 2023. https://identifiers.org/insdc.sra:SRR22444743
NCBI Sequence Read Archive. ONT WGS of REH, size selected with Circulomics SRE. 2023. https://identifiers.org/insdc.sra:SRR22444742
NCBI Sequence Read Archive. ONT WGS of REH, Ultralong. 2023. https://identifiers.org/insdc.sra:SRR21147769
NCBI Sequence Read Archive. ONT WGS of REH, Ultralong - mapped - hg38. 2023. https://identifiers.org/insdc.sra:SRR23704822
NCBI Sequence Read Archive. De-novo REH assembly (flye/medaka, ONT Ultralong WGS). 2023. https://identifiers.org/insdc.sra:SRR23704826
NCBI Sequence Read Archive. De-novo REH assembly (flye/racon, ONT Ultralong and PacBio WGS). 2023. https://identifiers.org/insdc.sra:SRR23704825
NCBI Sequence Read Archive. MGISEQ WGS of REH (stLFR). 2023. https://identifiers.org/insdc.sra:SRR18907774
NCBI Sequence Read Archive. 10x GemCode linked-read WGS of REH (high molecular weight) - mapped - hg37. 2023. https://identifiers.org/insdc.sra:SRR10902121
NCBI Sequence Read Archive. 10x GemCode linked-read WGS of REH (standard DNA) - mapped - hg37. 2023. https://identifiers.org/insdc.sra:SRR10902122
NCBI Sequence Read Archive. Single cell ATAC sequencing of REH (10x Chromium) – 1 of 4. 2023. https://identifiers.org/insdc.sra:SRR22320001
NCBI Sequence Read Archive. Single cell ATAC sequencing of REH (10x Chromium) – 2 of 4. 2023. https://identifiers.org/insdc.sra:SRR22320000
NCBI Sequence Read Archive. Single cell ATAC sequencing of REH (10x Chromium) – 3 of 4. 2023. https://identifiers.org/insdc.sra:SRR22319999
NCBI Sequence Read Archive. Single cell ATAC sequencing of REH (10x Chromium) – 4 of 4. 2023. https://identifiers.org/insdc.sra:SRR22319998
NCBI Sequence Read Archive. Single-cell ATAC sequencing of REH - Illumina NovaSeq 6000 - mapped - hg38. 2023. https://identifiers.org/insdc.sra:SRR10907069
NCBI Sequence Read Archive. EM-seq of REH (NEBNext) – 100ng DNA. 2023. https://identifiers.org/insdc.sra:SRR23020114
NCBI Sequence Read Archive. EM-seq of REH (NEBNext) – 10ng DNA. 2023. https://identifiers.org/insdc.sra:SRR23020113
NCBI Sequence Read Archive. RNA-seq of REH (Illumina TruSeq stranded total RNA) - Illumina NovaSeq 6000 - Lane 1. 2023. https://identifiers.org/insdc.sra:SRR10882846
NCBI Sequence Read Archive. RNA-seq of REH (Illumina TruSeq stranded total RNA) - Illumina NovaSeq 6000 - Lane 2. 2023. https://identifiers.org/insdc.sra:SRR10882845
NCBI Sequence Read Archive. RNA-seq of REH (Illumina TruSeq stranded total RNA) - mapped - hg38. 2023. https://identifiers.org/insdc.sra:SRR23704830
NCBI Sequence Read Archive. HiFi RNA-seq of REH (PacBio IsoSeq) - standard-length transcripts. 2023. https://identifiers.org/insdc.sra:SRR22729869
NCBI Sequence Read Archive. HiFi RNA-seq of REH (PacBio IsoSeq) - long transcripts. 2023. https://identifiers.org/insdc.sra:SRR22729868
NCBI Sequence Read Archive. HiFi RNA-seq of REH (PacBio IsoSeq) - standard-length transcripts - FLNC and mapped - hg38. 2023. https://identifiers.org/insdc.sra:SRR23704829
NCBI Sequence Read Archive. HiFi RNA-seq of REH (PacBio IsoSeq) - long transcripts - FLNC and mapped - hg38. 2023. https://identifiers.org/insdc.sra:SRR23704828
Acknowledgements
The authors would like to acknowledge support of the National Genomics Infrastructure (NGI) unit in Uppsala for aiding in RNA/DNA extraction, library preparation, and sequencing and Susanne Reinsbach for bioinformatics support.
Funding
Open access funding provided by Uppsala University. This project was funded in part by the Swedish Research Council (#2019-01976 and #2019-0222), the Swedish Childhood Cancer Fund (#2019-0046 and #2022-0086), and the Göran Gustafsson Foundation. This project received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 824110 EASI-Genomics. Sequencing was performed at the National Genomics Infrastructure (NGI) at SciLifeLab in Uppsala. NGI is funded by SciLifeLab, the Swedish Research Council RFI, and the Knut and Alice Wallenberg Foundation. The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX), partially funded by the Swedish Research Council through grant agreements no. 2022–06725 and no. 2018–05973.
Author information
Contributions
MLW, AA, LF and JN conceived the research and constructed the experimental design. JN and LF acquired funding. MLW and JN wrote the paper. EÖ, JL, AR, ACW, JR, YMZ, HG, TM and UL prepared and sequenced short-read Illumina libraries. SE, RE and PL performed bioinformatics analysis of Illumina sequencing libraries. AP, MBM, SH, and SHK prepared and sequenced long-read libraries. IB performed bioinformatics analysis of long-read sequencing libraries.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Lysenkova Wiklander, M., Övernäs, E., Lagensjö, J. et al. Genomic, transcriptomic and epigenomic sequencing data of the B-cell leukemia cell line REH. BMC Res Notes 16, 265 (2023). https://doi.org/10.1186/s13104-023-06537-2
DOI: https://doi.org/10.1186/s13104-023-06537-2