Sequencing of E. coli strain UTI89 on multiple sequencing platforms

Objectives
The availability of matched sequencing data for the same sample across different sequencing platforms is necessary for validating and effectively comparing those platforms. A commonly sequenced sample is the lab-adapted MG1655 strain of Escherichia coli; however, this strain is not fully representative of the more complex and dynamic genomes of pathogenic E. coli strains.

Data description
We present six new sequencing data sets for another E. coli strain, UTI89, an extraintestinal pathogenic strain isolated from a patient suffering from a urinary tract infection. We provide matched whole genome sequencing data generated using the PacBio RSII, Oxford Nanopore MinION R9.4, Ion Torrent, ABI SOLiD, and Illumina NextSeq sequencers. Together with other publicly available data sets, UTI89 now has a nearly complete suite of data generated on most second- and third-generation sequencers. These data can be used as an additional validation set for new sequencing technologies and analytical methods. Beyond being simply another E. coli strain, UTI89 is pathogenic, with a 10% larger genome, additional pathogenicity islands, and a large plasmid, features that are common among other naturally occurring and disease-causing E. coli isolates. These data therefore provide a more medically relevant test set for the development of algorithms.


Objective
Control sequencing data generated across different sequencing platforms are essential for validating and effectively comparing those platforms. A commonly sequenced sample that has been used extensively for these purposes is the MG1655 strain of E. coli [1]. However, the MG1655 genome is smaller and less complex than those of some pathogenic E. coli strains [2,3]. As part of control experiments, we have sequenced UTI89, a uropathogenic E. coli (UPEC) strain originally isolated from a patient suffering from an acute bladder infection [4], using several different sequencing technologies, including ABI SOLiD, Ion Torrent, PacBio, Oxford Nanopore, and Illumina. Our new data supplement previously published sequencing data generated using the Roche 454 [4], Illumina HiSeq [5], and the original Oxford Nanopore Technologies MinION [6]. With the inclusion of these new data sets, E. coli strain UTI89 now has a nearly complete set of raw sequence data generated using most second- and third-generation sequencers. For some technologies we have multiple data sets. For PacBio, for example, the data sets span the first iteration of the RSII sequencing chemistry (XL/C2) in 2012 up to the P6-C4 chemistry (which was current in 2018); over this period the mean read length increased more than fivefold.
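The read-length comparison between two chemistries can be checked directly from the raw reads. The following is a minimal sketch, assuming the reads are available as plain four-line-per-record FASTQ; the function names `read_lengths` and `mean_length_fold_change` are hypothetical, and the toy records are illustrative only and are not UTI89 data.

```python
import io
from statistics import mean


def read_lengths(fastq_handle):
    """Yield the length of each read in a four-line-per-record FASTQ stream."""
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            yield len(line.strip())


def mean_length_fold_change(old_lengths, new_lengths):
    """Return mean(new) / mean(old) for two collections of read lengths."""
    return mean(new_lengths) / mean(old_lengths)


# Toy records standing in for an older and a newer run (not real reads).
old_run = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
new_run = io.StringIO(
    "@r1\n" + "A" * 20 + "\n+\n" + "I" * 20 + "\n"
    "@r2\n" + "C" * 30 + "\n+\n" + "I" * 30 + "\n"
)

old_lengths = list(read_lengths(old_run))
new_lengths = list(read_lengths(new_run))
fold = mean_length_fold_change(old_lengths, new_lengths)
print(f"mean read length fold change: {fold:.1f}")
```

For real data, the in-memory streams would be replaced by open file handles; multi-line FASTQ records are not handled by this sketch.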

BMC Research Notes
*Correspondence: slchen@gis.a-star.edu.sg 1 Genome Institute of Singapore, 60 Biopolis Street, Genome, #02-01, Singapore 138672, Singapore Full list of author information is available at the end of the article

Data description
The new data sets are summarized in Table 1. Details of library preparation and sequencing methods for the new datasets are presented below.
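When comparing data sets across platforms, a few per-data-set statistics are commonly computed from the read lengths. The sketch below is one way to do this, under stated assumptions: the choice of statistics is ours and does not reproduce the columns of Table 1, the names `n50` and `summarize` are hypothetical, and `genome_size` is a value the user must supply (the toy value used here is not the UTI89 genome size).

```python
def n50(lengths):
    """Return the read length at which the cumulative bases, counted from
    the longest read downwards, first reach half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no reads given")


def summarize(lengths, genome_size):
    """Summarize one data set; genome_size is supplied by the caller."""
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "total_bases": total,
        "mean_length": total / len(lengths),
        "n50": n50(lengths),
        "depth": total / genome_size,  # expected mean depth of coverage
    }


# Toy read lengths and a toy genome size, for illustration only.
toy_lengths = [2, 3, 4, 5, 6]
summary = summarize(toy_lengths, genome_size=10)
for key, value in summary.items():
    print(key, value)
```

The depth estimate is simply total bases divided by genome size, so it ignores unmapped or contaminating reads.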

ABI SOLiD Library preparation
Genomic DNA was extracted from UTI89 grown overnight in Lysogeny Broth (LB) and used to generate Long Mate Pair (LMP) libraries. LMP libraries were generated using an insert size of 3-4 kb according to the manufacturer's instructions to produce a 375 bp library.

Sequencing
A 2 × 35 bp LMP sequencing run was performed on two spots of an 8-spot slide using the Applied Biosystems SOLiD3 platform [7-9].

Ion Torrent Library preparation
Genomic DNA was extracted from UTI89 harbouring the pBAD33 plasmid [10] grown overnight in LB. Sequencing libraries were then generated using the Ion Xpress™ Plus gDNA library preparation protocol according to the manufacturer's instructions.

Sequencing
A 200 bp sequencing run was performed on the Personal Genome Machine (PGM) system using the Ion PGM™ 200 Sequencing Kit with a 316 chip [11,12].

PacBio, RSII, XL/C2 Chemistry Library preparation
Genomic DNA was extracted from SLC-66 (UTI89 with a kanamycin cassette integrated into the phage HK022 integration site) grown overnight in LB. Large insert (15 kb) native SMRTbell sequencing libraries were generated according to the manufacturer's protocols.

Illumina NextSeq Library preparation
Genomic DNA was extracted from UTI89 grown overnight in LB. Sequencing libraries were built using the Illumina TruSeq Nano DNA LT kit according to the manufacturer's instructions, with shearing to 350 bp.

Sequencing
A 2 × 150 bp sequencing run was performed using the Illumina NextSeq 500 and a NextSeq Mid Output flow cell and reagents [16,17].

Oxford Nanopore, MinION Mk1B Device, R9.4, 1D Ligation sequencing Library preparation
Genomic DNA was extracted from UTI89 grown overnight in LB. 1 μg of unsheared DNA was used to prepare sequencing libraries using the Ligation sequencing kit 1D R9 version (SQK-LSK108) according to the manufacturer's instructions.

Sequencing
The prepared sequencing library was loaded onto a FLO-MIN106 R9.4 with Spot-ON and a 24 h sequencing run was performed. Base calling was subsequently performed using Oxford Nanopore's Albacore Sequencing Pipeline Software (version 1.

PacBio, RSII, P6-C4 Chemistry Library preparation
Genomic DNA was extracted from UTI89 grown overnight in LB. Large insert (20 kb) native SMRTbell sequencing libraries were generated according to the manufacturer's instructions.

Limitations
The following are limitations of these data:
1. The data were collected over a period of several years, and the experimental steps were therefore performed by different persons.
2. Some strains contain plasmids or other markers (see details above).
3. Not every generation of sequencing machine or library preparation method was used.