A de novo transcriptome assembly for the bath sponge Spongia officinalis, adjusting for microsymbionts

Objectives: We report a transcriptome acquisition for the bath sponge Spongia officinalis, a non-model marine organism that hosts rich symbiotic microbial communities. To this end, a pipeline was developed to efficiently separate bacterially expressed genes from those of eukaryotic origin. The transcriptome was produced to support the assessment of gene expression, and thus the response of the sponge, under elevated temperatures replicating conditions currently occurring in its native habitat. Data description: We describe the assembled transcriptome along with the bioinformatic pipeline used to discriminate between signals of metazoan and prokaryotic origin. The pipeline involves standard read pre-processing steps and incorporates extra analyses to identify prokaryotic reads and filter them out of the analysis. The proposed pipeline can be followed to overcome the technical RNASeq problems characteristic of symbiont-rich metazoan organisms with low or non-existent tissue differentiation, such as sponges and cnidarians. At the same time, it can be valuable for the development of approaches for parallel transcriptomic studies of symbiotic communities and the host.


Objective
Sponges are organisms with a simple body plan, lacking true tissue differentiation [1]. Moreover, they often host rich symbiotic bacterial communities, thus creating complex holobionts [2,3]. These traits, combined with the diverse nature of the poriferan phylum and its vulnerability to global change, make sponges ideal case-study species (e.g. [4][5][6]). Although transcriptomic studies facilitated through NGS can provide sound answers to ecological questions, the lack of a reference genome makes building a de novo assembly necessary, as for all non-model organisms. This becomes more challenging in sponges, as it is often difficult to discriminate between signals of metazoan and prokaryotic origin [7,8], which introduces biases into interpretation.
Here, we constructed the transcriptome of the Mediterranean bath sponge Spongia officinalis, an organism that has suffered a substantial decline in the past decades due to the combined impact of harvesting and mass mortalities attributed to extreme climatic events [9,10]. The transcriptome was used to assess gene expression in a manipulative experiment, where individuals of the sponge were subjected to a gradient of elevated temperatures simulating extreme climatic events currently occurring during the warm season in its native habitats (see Table 1, data file 1 for the experimental design). The results of the study are published in [4] and all data files are presented in Table 1.
The assembly is the only transcriptome reference available for S. officinalis and can serve as a baseline for further studies on the species. It has already been used in studies with a different focus (see [11]), indicating its value across various fields of study. The proposed pipeline can be followed to overcome the technical RNASeq problems characteristic of symbiont-rich metazoan organisms with low or non-existent tissue differentiation, such as sponges and cnidarians.

Data description
Four S. officinalis individuals collected from natural populations from the island of Crete, Greece, were reared in closed tanks and experimentally exposed to elevated temperatures approximating an extreme climatic event that naturally occurs in the sponge's habitat during summer. The 50 m³ rearing tanks contained natural seawater collected from a pristine open-sea area, with temperature and salinity adjusted to reflect typical local conditions for the time of year (24 °C and 39 ppt, respectively). Two experimental tanks were employed, one as control (24 °C) and one as treatment with increasing temperature (up to 30 °C). Five sampling points, starting after 5 days of acclimatization in the tanks and spanning 6 days, resulted in 20 samples. RNA was extracted with TRIzol (TRIzol™ Reagent, Thermo Fisher Scientific, Cat. number 15596026) following the manufacturer's protocol. Sequencing yielded on average 12,933,232 raw paired reads per library (data set 1). Raw reads were quality controlled using several software tools in a workflow described in [12] and run through bash scripts (data files 4 and 5). The software used included scythe (version 0.994 BETA; https://github.com/vsbuffalo/scythe), sickle (version 1.33; https://github.com/najoshi/sickle), prinseq (version 0.20.4; http://prinseq.sourceforge.net/) and trimmomatic (version 0.32) [13]. The quality-controlled data were used to build an initial Trinity (v2.1.1) [14] assembly (data file 6). However, given that a large proportion of a sponge transcriptome consists of bacterial sequences, we downloaded all bacterial sequences from NCBI (data file 7) and removed all reads that mapped to them (2.2 to 17.6% of the reads of each sample) using riboPicker (ribopicker-standalone-0.4.3; https://sourceforge.net/projects/ribopicker/files/standalone/; command: ribopicker.pl -c 47 -i 75 -l 40 -z 3). Then, we built another assembly with the remaining reads (data file 8).
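The read-removal step can be illustrated with a minimal Python sketch. This is not the riboPicker implementation or the bash scripts in data files 4 and 5; it only shows the bookkeeping of dropping read pairs once a set of read identifiers has been flagged as bacterial by some mapping tool. The function names, the flagged-ID set, and the rule that a pair is dropped when either mate is flagged are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Set, Tuple

FastqRecord = Tuple[str, str, str, str]  # header, sequence, separator, quality


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group an iterable of FASTQ lines into 4-line records."""
    it = (line.rstrip("\n") for line in lines)
    # Zipping the same iterator four times yields consecutive groups of 4 lines.
    return zip(it, it, it, it)


def read_id(header: str) -> str:
    """Return the read name without '@', description, or a /1 or /2 mate suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def drop_flagged_pairs(
    r1_lines: Iterable[str], r2_lines: Iterable[str], flagged: Set[str]
) -> Tuple[List[FastqRecord], List[FastqRecord], float]:
    """Keep only pairs in which neither mate is flagged (assumed rule).

    Returns the kept R1 records, the kept R2 records, and the fraction
    of pairs removed for this sample.
    """
    kept1: List[FastqRecord] = []
    kept2: List[FastqRecord] = []
    total = 0
    for rec1, rec2 in zip(parse_fastq(r1_lines), parse_fastq(r2_lines)):
        total += 1
        if read_id(rec1[0]) in flagged or read_id(rec2[0]) in flagged:
            continue  # hypothetical bacterial pair: excluded from re-assembly
        kept1.append(rec1)
        kept2.append(rec2)
    removed_fraction = (total - len(kept1)) / total if total else 0.0
    return kept1, kept2, removed_fraction
```

Under these assumptions, the returned fraction corresponds to the per-sample share of removed reads, and the kept records would be the input to the second assembly.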
The reconstructed transcripts were then used for a similarity search with NOBLAST [15] against the Swiss-Prot database (e-value: 1.0E−5). Transcripts whose best hit was a prokaryotic sequence (17.1% of the assembly) were eliminated, leading to the final assembly (data file 9). Their corresponding reads were also removed from the BAM files (data file 10) and excluded from downstream analyses.

Table 1. Data files and data sets, with data repository and identifier (DOI or accession number). Data file 1: Figure 1.
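The best-hit filtering of transcripts can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' script: it assumes the similarity search results are available in the standard 12-column BLAST tabular layout (query, subject, ..., e-value in column 11, bit score in column 12), that the best hit is the one with the highest bit score, and that a hypothetical `subject_domain` lookup maps each Swiss-Prot subject to its taxonomic domain.

```python
import csv
from typing import Dict, Iterable, List, Set


def parse_tabular(lines: Iterable[str]) -> List[List[str]]:
    """Split tab-delimited hit lines into lists of fields."""
    return [row for row in csv.reader(lines, delimiter="\t") if row]


def best_hits(rows: Iterable[List[str]], max_evalue: float = 1e-5) -> Dict[str, str]:
    """Return {query: subject} for the top-scoring hit passing the e-value cutoff."""
    best: Dict[str, tuple] = {}
    for row in rows:
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit too weak to be considered
        # Assumption: "best hit" means the highest bit score per query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def prokaryotic_transcripts(
    hits: Dict[str, str], subject_domain: Dict[str, str]
) -> Set[str]:
    """Flag transcripts whose best hit belongs to Bacteria or Archaea.

    `subject_domain` is a hypothetical subject-to-domain lookup table.
    """
    prokaryotes = {"Bacteria", "Archaea"}
    return {q for q, s in hits.items() if subject_domain.get(s) in prokaryotes}


def filter_assembly(transcripts: Iterable[str], flagged: Set[str]) -> List[str]:
    """Keep transcripts that were not flagged as prokaryotic."""
    return [t for t in transcripts if t not in flagged]
```

Note that in this sketch transcripts without any hit below the cutoff are retained, since only a prokaryotic best hit leads to removal.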

Limitations
The proposed pipeline effectively eliminates most prokaryotic sequences from the sequenced dataset. However, it does not filter out non-sponge eukaryotic sequences, which are often present because sponges also host symbiotic eukaryotes, e.g. fungi and dinoflagellates.

Abbreviations
RNASeq: RNA sequencing, the use of next-generation sequencing to assess the presence and quantity of expressed RNA in a biological sample; NGS: next-generation sequencing.