A chromosome-level genome assembly of the Korean minipig (Sus scrofa)

DNA and RNA sequencing

A blood sample from a male Korean minipig (27 months old) was collected with approval from the National Institute of Animal Science (NIAS), and all procedures were performed according to the ARRIVE guidelines. DNA was extracted from the blood sample, and DNA libraries for long reads were prepared using a SMRTbell Template Prep Kit and sequenced on a PacBio Sequel system. For short-read data, libraries for paired-end and mate-pair reads were constructed using a TruSeq Nano DNA Kit and a Nextera Mate Pair Sample Prep Kit, respectively, and sequenced on an Illumina platform. In addition, Hi-C sequencing reads were generated using the same procedure as for the paired-end reads (Table S1).

RNA was also extracted from 26 different tissues (appendix, backfat, bone marrow, brain, colon, forelimb muscle, groin, heart, hindlimb muscle, intestine, kidney, liver, lung, lymph node, nipple, pancreas, phren, pituitary gland, rib, sirloin, spinal cord, spleen, stomach, tenderloin, testis, and thymus) using TRIzol reagent. Sequencing libraries were then prepared using a TruSeq Stranded mRNA LT Sample Prep Kit (Illumina, San Diego, CA, USA) and sequenced on an Illumina platform (Table S1).

Genome assembly

The size of the Korean minipig genome was estimated from the k-mer distribution (k = 19) calculated with Jellyfish (v2.3.0) [20]. Contigs were generated by connecting PacBio subreads using Canu (v1.9) [5], with the estimated genome size of 2.5 Gb supplied as the ‘genomeSize’ option. Only contigs supported by a minimum of 50 subreads were selected for the subsequent assembly procedure.
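
As an illustration of how a genome size can be derived from a k-mer distribution, the following minimal sketch divides the total number of k-mers by the depth of the main histogram peak. This is a commonly used heuristic and not necessarily the exact calculation used here; the function name, the `min_depth` cutoff for discarding likely error k-mers, and the toy histogram are assumptions made for the example.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: iterable of (depth, number_of_distinct_kmers) pairs, i.e. the
    two columns of a k-mer count histogram.
    min_depth: hypothetical cutoff; k-mers seen fewer times are treated as
    sequencing errors and ignored (an assumption of this sketch).
    """
    solid = [(depth, count) for depth, count in histogram if depth >= min_depth]
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Depth at which the most distinct k-mers occur (main coverage peak).
    peak_depth = max(solid, key=lambda pair: pair[1])[0]
    # Total number of k-mer occurrences among the retained k-mers.
    total_kmers = sum(depth * count for depth, count in solid)
    return total_kmers / peak_depth


# Toy histogram: low-depth error k-mers plus a coverage peak at depth 20.
toy = [(1, 1000), (2, 200), (5, 10), (19, 50), (20, 100), (21, 50)]
size = estimate_genome_size(toy)
```

In practice the error cutoff is usually chosen from the first valley of the histogram rather than fixed in advance.
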
The remaining contigs were polished using GenomicConsensus (v2.3.3; https://github.com/PacificBiosciences/GenomicConsensus) with the ‘--algorithm=arrow’ option, incorporating information from PacBio subreads mapped to the contigs using pbmm2 (v1.2.1; https://github.com/PacificBiosciences/pbmm2).

To build a chromosome-level genome assembly, contigs were scaffolded using several types of sequencing data (short reads, long reads, and Hi-C reads) as well as multiple reference genomes. First, polished contigs were assembled into longer scaffolds using an improved version of RACA [3] (manuscript in preparation), which integrates diverse sequencing read data and multiple reference genome information. To prepare input data for RACA, short- and long-read data were mapped to the polished contigs using BWA-MEM (v0.7.17-r1198) [21] and pbmm2 (v1.2.1; https://github.com/PacificBiosciences/pbmm2), respectively. In addition, reference genomes of three minipig breeds (Bama, Göttingen, and Meishan), three pig breeds (Duroc, Landrace, and Large White), cow, and goat were collected from the NCBI database [22] (Table S2). Using the genome assembly of Duroc (Sscrofa11.1) as a reference, pairwise whole-genome alignments were generated by LASTZ (v1.04.00) [23] with the ‘E=150 H=2000 K=4500 L=2200 M=254 O=600 Q=human_chimp.v2.q T=2 Y=15000’ options. Considering their divergence time from the Korean minipig, all pig breeds were used as ingroup species, while cow and goat were used as outgroup species. Second, scaffolds generated by RACA were further assembled using Hi-C data. For Hi-C scaffolding, Hi-C reads were aligned to the scaffolds using the Arima Hi-C mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline), and SALSA [24] was run with the ‘-e GATC’ option.
Lastly, misassembly correction and gap closing were performed twice with short-read data using Pilon (v1.22) [12].

Genome assembly quality assessment

To assess the contiguity of the KMP assembly, assembly statistics were calculated using assembly-stats (v1.0.0; https://github.com/sanger-pathogens/assembly-stats). The completeness of the genome assembly was calculated with BUSCO (v3.0.2) [13] using the mammalia_odb9 dataset. Assembly statistics for the pig reference genome (Sscrofa11.1) were also calculated and compared with those of the KMP assembly. To validate Hi-C mapping patterns of the KMP assembly, Hi-C reads were mapped using Juicer (v1.6) [14], and the Hi-C contact map was visualized with JuiceBox (v2.3.4) [24]. In addition, the quality value (QV) score for each chromosome was estimated with short reads using Merqury (v1.3) [15]. Additional short reads from ten Korean minipig samples (five ET-type Korean minipigs and five L-type Korean minipigs) were also mapped to the KMP and pig reference genome assemblies using BWA-MEM (v0.7.17-r1198) [21]. The numbers of mapped reads and properly mapped reads were counted using the ‘stats’ module in samtools (v1.9) [25].

Next, the quality of the KMP assembly was validated by comparing the genomic structure of the KMP assembly with that of the pig reference genome. The GMASS [16] score, which represents the structural similarity between two genome assemblies, was measured using GMASS with the ‘-r 100000,200000,300000,400000,500000 -s near’ options. Lastly, whole-genome alignment of the KMP assembly against the pig reference genome assembly was conducted using LASTZ (v1.04.00) [23] with the same options used in the ‘Genome assembly’ section. Synteny blocks were constructed at 300 Kb resolution using the synteny block detection program in InferCars [26].
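
Contiguity statistics of the kind reported by assembly-stats include the N50, the length of the sequence at which the cumulative length of sequences sorted from longest to shortest first reaches half of the total assembly length. The sketch below is a minimal re-implementation for illustration only, not the code of assembly-stats itself; the function name and the toy lengths are assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First sequence at which the cumulative length reaches half the total.
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Toy scaffold lengths: total is 30, so the half-way point is 15.
toy_lengths = [10, 8, 5, 4, 3]
toy_n50 = n50(toy_lengths)
```
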
The numbers of matched and mismatched bases in the syntenic regions were calculated using the Perl script (https://github.com/jkimlab/NCMD_study) provided by a previous study [27].

Validation for genome rearrangement in the KMP assembly

To verify the quality of the KMP assembly, physical coverage patterns of the breakpoint regions discovered through the synteny analysis were examined. Breakpoint regions were defined as non-syntenic regions adjacent to synteny blocks whose order or orientation in the KMP assembly differed from that in the pig reference genome. To measure base-level read coverage in breakpoint regions, short reads (paired-end and mate-pair) and long reads were mapped to the KMP assembly using BWA-MEM (v0.7.17-r1198) [21] and pbmm2 (v1.2.1; https://github.com/PacificBiosciences/pbmm2), respectively. Base-level coverage values were calculated using the ‘genomecov’ module in bedtools (v2.28.0) [28] with the ‘-bga’ option. Read coverage patterns in the breakpoint regions, including the ±1~200 Kb flanking regions, were visualized.

Genome annotation

For annotating protein-coding genes, RNA-seq data generated from 26 different tissues of the Korean minipig were mapped to the chromosome-level scaffolds of the KMP assembly using HISAT2 (v2.2.1) [29]. In addition, the reference genome assemblies and gene annotation data of six species (human, mouse, pig, cow, goat, and sheep) were collected from the Ensembl database [30] for homology-based gene annotation (Table S2). Using both the RNA-seq data and the collected gene annotation data, we predicted protein-coding genes in the KMP assembly by running GeMoMa (v1.9) [17] with the ‘ERE.s=FR_FIRST_STRAND m=200000 AnnotationFinalizer.r=NO GAF.f="start=='M' and stop=='*' and (isNaN(score) or score/aa>=4)"’ options. Subsequently, BUSCO (v3.0.2) [13] scores were calculated with the mammalia_odb9 dataset for protein sequences extracted using the final KMP gene annotation and the reference gene annotation.
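
For the breakpoint validation step described above, the per-interval depths written by ‘bedtools genomecov -bga’ (chromosome, 0-based start, end, depth) can be summarized over a region of interest. The following is a minimal sketch of such a summary, assuming the four-column tab-separated output has already been read into a list of lines; the function name, the region coordinates, and the toy data are hypothetical and do not come from the study.

```python
def mean_coverage(bedgraph_lines, chrom, start, end):
    """Mean read depth over [start, end) from four-column bedGraph lines."""
    if end <= start:
        raise ValueError("end must be greater than start")
    depth_sum = 0.0
    for line in bedgraph_lines:
        if not line.strip():
            continue
        c, s, e, depth = line.split()[:4]
        if c != chrom:
            continue
        # Number of bases of this interval that fall inside the region.
        overlap = min(int(e), end) - max(int(s), start)
        if overlap > 0:
            depth_sum += overlap * float(depth)
    return depth_sum / (end - start)


# Toy data: a zero-coverage gap between two covered intervals.
toy_lines = [
    "chr1\t0\t100\t10",
    "chr1\t100\t200\t0",
    "chr1\t200\t300\t30",
    "chr2\t0\t300\t99",
]
toy_mean = mean_coverage(toy_lines, "chr1", 50, 250)
```

A drop toward zero depth across a breakpoint region in all read types would suggest a misjoin, whereas continuous coverage supports the assembled structure.
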
To predict the functions of protein-coding genes in the KMP gene annotation, homologous gene information identified by GeMoMa [17] was used. When multiple gene functions were found for a single protein-coding gene, the function of the protein with the highest ‘pident’ value in the protein sequence alignment against vertebrate protein sequences was selected. BLASTP (v2.9.0) [31] was employed for the protein sequence alignment, using protein sequences of vertebrate species collected from the UniProtKB/Swiss-Prot database [18] (v2024_02).

For annotating non-coding genes, various types of non-coding RNAs, including tRNA, rRNA, snRNA, and miRNA, were annotated using the Rfam database [32] and Infernal (v1.1.3) [33] with the ‘--cut_ga --rfam --nohmmonly’ options. Additionally, tRNA and rRNA were predicted with tRNAscan-SE (v2.0.5) [34] and RNAmmer (v1.2) [35], respectively. The final annotation was generated by merging all predictions using the Perl script (https://github.com/jkimlab/NCMD_study) provided by a previous study [27].

To annotate repetitive elements, a de novo repeat library and an existing pig taxon-specific repeat library were merged as described in a previous study [27]. A de novo repeat library for the KMP assembly was built using RepeatModeler (v2.0.1) [36], and a pig taxon-specific repeat library was extracted from the RepeatMasker (v4.0.5) [19] database with the ‘queryRepeatDatabase.pl’ utility.
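
The rule of keeping, for each gene, the aligned protein with the highest ‘pident’ value can be sketched as follows. This is an illustration only: it assumes BLASTP results in the default tabular layout (query ID, subject ID, and percent identity as the first three tab-separated columns), which the text does not specify, and the function name and identifiers are hypothetical.

```python
def best_hit_per_query(blast_lines):
    """Map each query ID to its (subject ID, pident) hit with the highest pident."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        # Keep the first hit seen on ties; replace only on a strictly higher pident.
        if query not in best or pident > best[query][1]:
            best[query] = (subject, pident)
    return best


# Toy alignments for two hypothetical genes.
toy_hits = [
    "gene1\tPROT_A\t80.0",
    "gene1\tPROT_B\t95.5",
    "gene2\tPROT_C\t60.2",
    "gene1\tPROT_D\t90.1",
]
toy_best = best_hit_per_query(toy_hits)
```
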
