Metagenomic data for Halichondria panicea from Illumina and nanopore sequencing and preliminary genome assemblies for the sponge and two microbial symbionts

Objectives: These data were collected to generate a novel reference metagenome for the sponge Halichondria panicea and its microbiome for subsequent differential expression analyses.

Data description: These data include raw sequences from four separate sequencing runs of the metagenome of a single individual of Halichondria panicea: one Illumina MiSeq (2 × 300 bp, paired-end) run and three Oxford Nanopore Technologies (ONT) long-read sequencing runs, generating 53.8 and 7.42 Gbp, respectively. Comparing the Illumina, ONT and Illumina-ONT hybrid assemblies showed the hybrid to be the ‘best’ assembly, comprising 163 Mbp in 63,555 scaffolds (N50: 3084 bp). This assembly was nevertheless highly fragmented and incomplete, containing only 52% of core metazoan genes (with 77.9% partial genes). This sponge is an emerging model species for field and laboratory work, and there is considerable interest in sequencing its genome. Although the assemblies resulting from the data presented here are suboptimal, this data note can inform future studies by providing an estimated genome size and coverage requirements for future sequencing, sharing additional data that could improve other suboptimal assemblies of this species, and outlining potential limitations and pitfalls of the combined Illumina and ONT approach to novel genome sequencing.

These data were generated to create a reference metagenome for the emerging model sponge species Halichondria panicea and its microbiome. The goal was then to use this reference to study changes in gene expression under different oxygen concentrations in order to understand how this species tolerates hypoxia [see 1]. During data collection, Knobloch et al. [2] generated a reference genome for the dominant microbial symbiont 'Candidatus Halichondribacter symbioticus', and the data presented here were not sufficient to construct a suitable reference genome for the sponge. Together, these factors limited the scope of these data for a full research paper.
Given the considerable interest in H. panicea and its widespread distribution, we think that the data provided can inform future experiments and contribute to a more complete genome in the future. Finally, by sharing suboptimal data, we aim to identify some potential pitfalls for future genome projects, particularly those on poriferans.

Sample collection and DNA extraction
To limit assembly issues caused by allelic variation, a single individual of H. panicea (approximately 1 g of tissue [wet weight]) was collected by hand (while wearing gloves) from the side of a pier within the inlet to Kerteminde Fjord in Denmark (decimal degrees: 55.449808, 10.661299) in 2018. The tissue was immediately cut on a sterile surface with a sterile scalpel, placed into sterile 1.5 mL cryovials, and flash frozen in liquid nitrogen. DNA was extracted and purified using a modified phenol-chloroform extraction (see [3] for full protocol) under sterile laboratory conditions. Microbes were not physically separated from sponge tissue before DNA extraction or sequencing. This protocol yielded the highest quality DNA and the highest concentrations of fragments longer than 15,000 bp when compared with five different extraction protocols (see supplemental material in [4]).
In total, nine micrograms of double-stranded DNA were extracted, and the Nanodrop A260/A280 and A260/A230 ratios were 1.79 and 2.17, respectively. The DNA integrity number (DIN) was 1.6, with high concentrations of DNA between 100 and 4000 base pairs (bp). A smearing pattern in gels was observed for all DNA extractions of H. panicea using various protocols (see supplementary material in [4]). This pattern could indicate high levels of degradation; however, a substantial amount of DNA was still intact and >15,000 bp long in the samples used for sequencing.
The first sequencing run using Oxford Nanopore Technologies (ONT) generated 1.26 million reads (3.4 Gbp, read N50: 2700 bp, longest read: 39,702 bp). For more details on the sequencing methods, see the supplemental material in [4].
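The read N50 quoted above can be made concrete with a short sketch. The function below is a minimal, generic implementation of the usual N50 definition, not the tool used in this study, and the toy read lengths are hypothetical. The last two lines only divide the reported yield by the reported read count to give the implied mean read length.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read (or scaffold) lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy read lengths in bp (hypothetical, not taken from the sequencing runs).
toy_reads = [100, 200, 300, 400, 500]
print(read_n50(toy_reads))  # 400

# Mean read length implied by the totals reported for the first ONT run:
# 3.4 Gbp spread over 1.26 million reads.
mean_read_length = 3.4e9 / 1.26e6
print(round(mean_read_length))  # 2698
```

The implied mean (about 2.7 kbp) sits close to the reported read N50 of 2700 bp, which is consistent with a read-length distribution dominated by short fragments.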
Because the coverage of Opisthokonta contigs (from the Illumina data) in the nanopore reads was low, two additional rounds of nanopore sequencing were performed after whole-genome amplification (WGA; see supplementary material), generating 4.021 Gbp from the amplified H. panicea DNA. A summary of the public locations of all data generated is shown in Table 1.

Genome assembly and annotation

Illumina metagenome assembly
Full details of quality control, binning, assembly and annotation of the metagenome are in the supplementary material. Three bins were produced: (1) a large Opisthokonta bin, which was labeled as the sponge bin [11]; (2) a bin for a gammaproteobacterium of the order 'HOC46' [12]; and (3) a Proteobacteria bin ([13], Table 1). The sponge bin was highly fragmented (63,555 scaffolds) and contained only 51.57% of core metazoan genes (with 77.46% partial matches, Supplemental Table 1 in [4]), as measured using BUSCO v5 [16]. More bins could potentially be extracted from these data in the future. The two bacterial genome bins were annotated using PROKKA v. 1.14 [17], and their completeness was estimated with CheckM [18] (see supplementary material in [4] for more information about these two bins).

ONT and hybrid assemblies
Two additional metagenome assemblies were made using (1) ONT data from all three sequencing runs and (2) a combination of Illumina and ONT data. The second ONT sequencing run (following WGA) had high percentages of contamination (8%) and chimerism (5-10%). These ONT data were polished and filtered to remove these errors as described in the supplementary material. A summary of the nanopore-only metagenome assembly is shown in Supplemental Table 2 [4].

Limitations
Although incorporating long-read nanopore data in the hybrid assembly slightly increased the metagenome N50 and decreased the number of scaffolds, the genome was still highly fragmented. A major limitation in sponge genomics that is often discussed but rarely written about is the difficulty in extracting high quality, high molecular weight DNA. This difficulty was likely the result of either an innate, highly efficient DNA degradation pathway in H. panicea or DNA-degrading activity from associated microorganisms or secondary metabolites. Obtaining high molecular weight DNA is paramount for successful long-read sequencing, as well as for downstream genome assembly regardless of sequencing technique. ONT sequencing can preferentially sequence smaller DNA fragments if they are present. Additionally, microbial diversity within the metagenome and potential allelic variation arising from diploidy could also have limited genome assembly.
This note represents the first attempt to sequence a sponge genome using a combination of nanopore and Illumina sequencing. Improved genomic DNA recovery might validate this combination of methods, although it is unclear how DNA recovery could be improved. In any case, at least 9,000 Mbp of long reads would need to be generated, and the coverage of ONT reads would need to be increased to ~70× to permit a better assembly. Additionally, WGA should be used with caution because of the high rates of chimerism and contamination introduced during the process. Improving coverage would also improve the assembly of prokaryotic genomes in the metagenome.
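The relationship between the long-read yield and the coverage target quoted above can be sketched as simple arithmetic. The block below assumes that the 9,000 Mbp and ~70× figures describe the same sequencing target and uses the standard relation coverage = total bases / genome size. The function names are illustrative, and the genome size it prints is only what those two figures imply, not a value stated in this section.

```python
def implied_genome_size_mbp(total_yield_mbp, coverage):
    """Genome size (Mbp) implied by a sequencing yield at a given mean coverage."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return total_yield_mbp / coverage


def required_yield_mbp(genome_size_mbp, target_coverage):
    """Total bases (Mbp) needed to reach a target mean coverage."""
    return genome_size_mbp * target_coverage


# Assumption: the 9,000 Mbp yield and the ~70x coverage refer to the same goal.
size = implied_genome_size_mbp(9000, 70)
print(f"Implied genome size: {size:.1f} Mbp")  # Implied genome size: 128.6 Mbp

# Round trip: the yield needed for 70x coverage of a genome of that size.
print(f"Required yield: {required_yield_mbp(size, 70):.0f} Mbp")  # Required yield: 9000 Mbp
```

In a metagenome, only part of the total yield maps to the host, so the raw yield needed to reach a given coverage of the sponge genome would be higher than this calculation suggests.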
Recently, a near-chromosome level scaffolded genome assembly for the sponge Ephydatia muelleri was generated using PacBio, Chicago, and Dovetail Hi-C libraries sequenced to ~1490× coverage [20]. This sequencing approach may therefore be better suited to de novo sponge genome assembly. The use of a sponge with limited microbial 'contamination' might also be critical for smooth genome assembly, although this effectively limits metagenomic projects. Finally, the use of a single haploid cell, such as a sperm or egg cell, could improve future genome assembly performance by limiting allelic variation. However, single-cell genomics could be limited by the amount and quality of DNA that can be isolated from a single cell.