Genetic Diversity and Relationships of Phalaenopsis Based on the rbcL and trnL-F Markers: In Silico Approach

The in silico approach is a comprehensive and practical way to support both conservation and breeding programs of germplasm. This study aimed to analyze and determine the genetic diversity and relationships of 24 species of Phalaenopsis using two DNA barcoding markers, rbcL and trnL-F, by an in silico approach. All sequences of these markers were collected randomly from the NCBI website and analyzed using several software packages and methods: ClustalW and MultAlin for multiple sequence alignment, and MEGA-X to determine genetic diversity and relationships. Specifically, genetic diversity was determined using the nucleotide diversity index, and relationships were inferred by the Maximum Likelihood method. The results showed that Phalaenopsis has low genetic diversity, with values of 0.24 (rbcL), 0.32 (trnL-F), and 0.19 (combined sequence). The phylogenetic analysis revealed that this orchid separated into five (rbcL), six (trnL-F), and seven (combined sequence) clades. The closest relationship was shown by P. amboinensis vs. P. venosa, whereas the farthest were shown by P. gibbosa vs. P. doweryensis, P. stuartiana vs. P. micholitzii, and P. celebensis vs. P. pulchra. These results provide novel information on the diversity and relationships of Phalaenopsis based on an in silico approach. Thus, our findings might be used to support conservation and breeding programs of Phalaenopsis, both locally and globally.


INTRODUCTION
Phalaenopsis, commonly known as the moth orchid (Tsai et al., 2012), is the most popular orchid genus in the world (Chen et al., 2013b; Deng et al., 2015; Hsu et al., 2018). Its popularity is mainly related to the characteristics of its flowers, including their shape, color, scent, and long-lasting blossoms (Hsu et al., 2011). In addition, Phalaenopsis grows and flowers quickly, has a relatively short juvenile period, and is easy to control at the flowering stage. Firgiyanto et al. (2016) reported that Phalaenopsis also shows resistance and the ability to flower under unfavorable conditions. Globally, Phalaenopsis consists of about 66 endemic species that are scattered mainly across western and southeastern Asia (Hinsley et al., 2018; Liu et al., 2016), covering Sri Lanka, India, the Himalayas, China, Tibet, the Philippines, the Andaman Islands, Taiwan, Indonesia, and Papua New Guinea (Chen et al., 2013b; Deng et al., 2015; Rahayu et al., 2015), and extending to northern Australia (Tsai et al., 2010). According to Deng et al. (2015) and Tsai et al. (2010), the highest Phalaenopsis diversity is found in Indonesia and the Philippines. In Indonesia in particular, more than 20 species of Phalaenopsis are scattered across several large islands, including Sumatra, Java, Kalimantan, Nusa Tenggara, Sulawesi, Maluku, and Papua (Fatimah & Sukma, 2011; Rahayu et al., 2015).
Unfortunately, most Phalaenopsis species are now very difficult to find in the wild, and some of them are in the threatened category (Zhang et al., 2018). Deforestation, habitat destruction, overexploitation, illegal trading, and other environmental impacts are the major causes of the decline of Phalaenopsis populations in the wild (Fatimah & Sukma, 2011; Luo et al., 2014; Zahara & Win, 2019). Hence, the preservation, breeding, and analysis of the genetic diversity of Phalaenopsis orchids are urgently needed.
For decades, analysis of genetic diversity, including that of orchids, has been carried out conventionally using morphological markers (Kwon et al., 2017). However, these markers are strongly influenced by environmental factors and plant growth phases, and their use is time-consuming (Kwon et al., 2017; Nadeem et al., 2018). Several molecular markers have been used to study the genetic diversity of Phalaenopsis, namely RAPD (Goh et al., 2005; Niknejad et al., 2009), AFLP (Chang et al., 2009), and SSR (Chung et al., 2017; Fatimah & Sukma, 2011; Tsai et al., 2015b). However, these markers also have weaknesses: scoring can be very subjective, and the results of the analysis are less accurate (Lee et al., 2017).
Currently, chloroplast DNA (cpDNA) regions, known as DNA barcoding markers, can be used to determine the genetic diversity and relationships of germplasm, including orchids (Jheng et al., 2012; Tsai et al., 2012). These markers have advantages over those mentioned previously: they are faster and more accurate in determining the genetic diversity of germplasm (Lee et al., 2017; Li et al., 2015; Singh et al., 2017). The Consortium for the Barcode of Life (CBOL, 2009) has recommended several DNA barcoding markers, two of which are rbcL and trnL-F.
The rbcL is a coding region of cpDNA with a low rate of polymorphism or mutation. However, this marker yields high-quality sequence output and has highly universal primers, so it is easy to align across various plant taxa (Dong et al., 2014). The trnL-F, in contrast, is a non-coding region of cpDNA in which a number of structural mutations are found, especially insertions-deletions (indels). Hence, it can be used as a reliable genetic marker in population genetics and plant systematics (Chen et al., 2013a). This marker also has a conserved region that provides the opportunity to design universal primers for various plant taxa (Taberlet et al., 1991). The combination of these two markers (rbcL and trnL-F) has been successfully applied to the identification of NW-European ferns (de Groot et al., 2011).
This study aimed to analyze the genetic diversity and relationships of 24 species of Phalaenopsis, based on the rbcL and trnL-F markers, by an in silico approach. That is, we collected and used sequences of those markers from GenBank at the National Center for Biotechnology Information (NCBI). According to Mascher et al. (2019), this institution provides a comprehensive database of nucleotide sequences and gene descriptions that is freely accessible. Hence, such a study does not require high costs and is applicable to support germplasm conservation, breeding, and cultivation programs (Mursyidin & Makruf, 2020). In other words, our findings may serve as a reference in supporting the conservation and breeding programs of Phalaenopsis, both locally and globally.

Data collection
The rbcL and trnL-F sequences of 24 Phalaenopsis species were collected randomly from the GenBank (NCBI) website (https://www.ncbi.nlm.nih.gov). All sequences of both regions (Table 1) were then saved in FASTA or plain-text (Notepad) format.

Multiple sequence alignment
All sequence datasets of the rbcL and trnL-F of Phalaenopsis were aligned using ClustalW (Kumar et al., 2018) and MultAlin (Mitchell, 1993). Multiple alignment analyses were also conducted for the combined sequence. At this stage, the conserved regions and/or polymorphic sites can be observed in both sequences.
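As an illustration of how sequences saved in FASTA format can be read before or after alignment, the following is a minimal sketch in plain Python. The function name `parse_fasta` and the two toy records are hypothetical; they are not GenBank accessions and are not part of the workflow described in this study, which relied on ClustalW, MultAlin, and MEGA-X.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            records[header] += line.upper()
    return records


# Hypothetical toy records (not real GenBank sequences).
example = """>species_A rbcL
ATGTCACCAC
AAACAGAGAC
>species_B rbcL
ATGTCACCACAAACAGAAAC
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))

# An alignment should contain sequences of equal length.
is_aligned = len({len(seq) for seq in records.values()}) == 1
print("equal length:", is_aligned)
```

A check such as `is_aligned` is useful because the diversity statistics described in the next section assume that all sequences share the same alignment coordinates.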

Analysis of genetic diversity and their relationships
The level of genetic diversity of the 24 species of Phalaenopsis was determined by the nucleotide diversity index (π), using the following categories: 0.1 to 0.4 is low, 0.5 to 0.7 is medium, and 0.8 to 2.00 is high (Nei & Li, 1979). The phylogenetic relationships of the germplasm were analyzed using the Maximum Likelihood method and evaluated by bootstrap analysis with 1,000 replicates (Lemey et al., 2009). All analyses were conducted using MEGA-X software (Kumar et al., 2018). Other parameters, such as the number of polymorphic sites (S), the transition/transversion bias value (R), and Tajima's neutrality test statistic (D), were also determined using this software (Kumar et al., 2018).
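To make the two simplest statistics concrete, the sketch below computes the number of polymorphic sites (S) and the nucleotide diversity (π) as the average proportion of pairwise differences per site. It is an illustration only: the function names and toy alignment are hypothetical, columns containing gaps or ambiguous bases are dropped (complete deletion, an assumption), and no substitution model is applied, so values may differ from those reported by MEGA-X.

```python
from itertools import combinations
from typing import List, Tuple


def clean_columns(seqs: List[str]) -> List[Tuple[str, ...]]:
    """Keep only alignment columns in which every sequence has A, C, G, or T."""
    return [col for col in zip(*seqs) if all(base in "ACGT" for base in col)]


def polymorphic_sites(seqs: List[str]) -> int:
    """Count columns that contain more than one distinct base (S)."""
    return sum(1 for col in clean_columns(seqs) if len(set(col)) > 1)


def nucleotide_diversity(seqs: List[str]) -> float:
    """Average number of pairwise differences per site (uncorrected pi)."""
    cols = clean_columns(seqs)
    n = len(seqs)
    if n < 2 or not cols:
        raise ValueError("need at least two sequences and one usable column")
    n_pairs = n * (n - 1) / 2
    total_diffs = sum(
        1 for col in cols for a, b in combinations(col, 2) if a != b
    )
    return total_diffs / n_pairs / len(cols)


# Hypothetical toy alignment (four sequences, eight columns, one gap).
toy = ["ATGCATGC", "ATGCATGA", "ATGTATGC", "A-GCATGC"]
s_value = polymorphic_sites(toy)
pi_value = nucleotide_diversity(toy)
print(s_value, round(pi_value, 4))
```

In the toy alignment the gapped column is excluded, leaving seven usable columns, two of which are polymorphic.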

Genetic diversity and mutational events
Phalaenopsis shows distinctive characteristics in its rbcL (Figure 1) and trnL-F (Figure 2) sequences. In general, both markers contain a conserved region and some mutational events, both substitutions and insertions-deletions (indels). In Figures 1 and 2, the conserved regions of both markers are shown as bases in red, whereas mutational events are marked by rectangles: green for substitutions and orange for indels. At a glance, these two figures suggest that mutational events are relatively more frequent in trnL-F than in rbcL. Further information about the sequence characteristics of these two regions, including their mutational events and specific loci, is shown in Table 2.
Based on Table 2, the Phalaenopsis sequences differ in nucleotide length for both rbcL and trnL-F. The rbcL sequences range from 669 to 718 bp, whereas the trnL-F sequences range from 568 to 1126 bp. According to CBOL (2009), the complete rbcL sequence includes approximately 1400 nucleotides coding for the large subunit protein, but its length varies slightly among flowering plants (angiosperms). Singh and Banerjee (2018) reported that this region has an intergenic spacer of 600-800 nucleotides. Similarly, the entire trnL-F region has been reported to be approximately 1400 bp long (Quandt et al., 2004).
Furthermore, the rbcL and trnL-F regions of Phalaenopsis differ in the number of polymorphic sites (S) and in the transition/transversion bias value (R). The rbcL has a higher number of polymorphic sites (62 loci) than the trnL-F (59 loci). However, the rbcL has a slightly lower transition/transversion bias value (0.40) than the trnL-F (0.42) (Table 2). According to Stoltzfus and Norris (2015), this bias can be described as a ratio of differences, which makes the probable effect a complex function of the degree of sequence divergence. In this study, all types of mutational events, mainly substitutions (transitions and transversions) but also indels (insertions and deletions), were found in the rbcL and trnL-F regions of Phalaenopsis. According to Aloqalaa et al. (2019), transitions are found in sequences more often than transversions. In other words, a pattern in which nucleotide transitions outnumber transversions several-fold is common in molecular evolution (Stoltzfus & Norris, 2015).
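The distinction between transitions and transversions can be illustrated with a short sketch. A transition exchanges two purines (A, G) or two pyrimidines (C, T), whereas a transversion exchanges a purine for a pyrimidine or vice versa. The code below simply counts observed differences between two aligned sequences; the function names and toy sequences are hypothetical, and the raw count ratio shown here is not the same quantity as the model-based bias value (R) estimated by MEGA-X.

```python
from typing import Optional, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def classify(a: str, b: str) -> Optional[str]:
    """Return 'transition', 'transversion', or None (identical or non-ACGT)."""
    if a == b or a not in VALID or b not in VALID:
        return None
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1: str, seq2: str) -> Tuple[int, int]:
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        kind = classify(a, b)
        if kind == "transition":
            ts += 1
        elif kind == "transversion":
            tv += 1
    return ts, tv


# Hypothetical aligned toy sequences; the final column has a gap and is skipped.
ts, tv = count_ts_tv("AACCGGTT-", "AGCAGTTCA")
print("transitions:", ts, "transversions:", tv)
```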
Conceptually, mutations, both substitutions and indels, tend to cause changes in the biochemical properties of amino acids or protein products (Keller et al., 2007). According to Flint-Garcia (2013), mutations are permanent, heritable changes in the genes or nucleotide sequences (genome) of an organism, and they can affect a single nucleotide (point mutation) or several adjacent nucleotides (segmental mutation). Tajima's neutrality test revealed that Phalaenopsis has an excess of low-frequency polymorphisms relative to expectation, because all sequences have negative D values (D < 0). This indicates population size expansion (e.g., after a bottleneck or a selective sweep) and/or purifying selection (Tajima, 1989).
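For readers who want to see how the sign of D arises, the following sketch implements the standard formula of Tajima (1989), which compares the mean number of pairwise differences (k) with the number of segregating sites (S) scaled by a sample-size constant. The function name and the input values are hypothetical toy numbers, not results from this study, and MEGA-X may treat gaps and missing data differently.

```python
import math


def tajimas_d(n: int, s: int, k: float) -> float:
    """Tajima's D from sample size n, segregating sites s, and mean
    pairwise differences k (counted over the whole sequence, not per site)."""
    if n < 4 or s < 1:
        # For n < 4 the variance term vanishes, so D is undefined.
        raise ValueError("need n >= 4 sequences and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    return (k - s / a1) / math.sqrt(variance)


# Hypothetical toy values: 4 sequences, 2 segregating sites, both singletons,
# giving a mean of 1.0 pairwise differences.
d_value = tajimas_d(n=4, s=2, k=1.0)
print(round(d_value, 2))
```

When the polymorphisms are mostly rare variants, as in this toy case, k falls below S/a1 and D becomes negative, which is the pattern described above.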
Following Govindaraj et al. (2015), mutations are an initial step in establishing the primary population for natural selection and an integral part of evolution and genetic diversity. In other words, mutation is the main factor giving rise to genetic diversity (Frankham et al., 2004). Hence, mutation and genetic diversity are interrelated. In this case, based on the categories of Nei and Li (1979), Phalaenopsis shows a low level of genetic diversity for the rbcL (0.24), the trnL-F (0.32), and the combined sequence (0.19) (Table 2). According to Acquaah (2012), information on this diversity is valuable for future breeding and conservation programs, particularly in developing new superior cultivars.

Phylogenetic relationships
The Maximum Likelihood analysis shows that Phalaenopsis has a complicated relationship. This complexity can be seen from the clades generated by each sequence used. Based on the rbcL region, this orchid was separated into five main clades (Figure 3). The closest relationships were shown by three pairs of Phalaenopsis, namely P. philippinensis vs. P. stuartiana, P. amboinensis vs. P. venosa, and P. sumatrana vs. P. inscriptiosinensis, with a similarity coefficient of 99.71. The most distant relationship was shown by P. gibbosa vs. P. doweryensis, at a similarity of 91.73 (Supplementary Table 1).
Based on the trnL-F, this orchid was separated into six main clades (Figure 4). The closest relationships were shown by P. venosa vs. P. amboinensis and P. parishii vs. P. gibbosa (similarity of 99.99), and the most distant (85.82) by P. stuartiana vs. P. micholitzii (Supplementary Table 2). Furthermore, the combined sequence of both regions separated Phalaenopsis into seven main clades (Figure 5), where P. venosa and P. amboinensis have the closest relationship, with a similarity coefficient of 99.84, whereas the farthest is shown by P. celebensis and P. pulchra (90.12) (Supplementary Table 3).
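The text does not state how the similarity coefficients were calculated. As one plausible reading, the sketch below computes a simple percent identity between aligned sequences (matching positions divided by compared positions, ignoring gaps) and then picks the closest and farthest pairs. The function name, the species labels, and the toy sequences are all hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def percent_similarity(seq1: str, seq2: str) -> float:
    """Percent identity over aligned positions where neither sequence has a gap."""
    compared = matches = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue  # pairwise deletion of gapped positions
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Hypothetical toy alignment with placeholder labels.
toy: Dict[str, str] = {
    "sp_A": "ATGCATGCAT",
    "sp_B": "ATGCATGCAA",
    "sp_C": "TTGCATGGAA",
}

scores: Dict[Tuple[str, str], float] = {
    (x, y): percent_similarity(toy[x], toy[y]) for x, y in combinations(toy, 2)
}
closest = max(scores, key=scores.get)
farthest = min(scores, key=scores.get)
print("closest:", closest, scores[closest])
print("farthest:", farthest, scores[farthest])
```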
Based on the rbcL and trnL-F markers, as well as the combined sequence, most of the Phalaenopsis species are grouped into relatively similar clades. For example, P. celebensis, P. amabilis, P. aphrodite, P. equestris, P. philippinensis, and P. stuartiana are placed in the same large group based on these three sequences (Table 3). However, there is an exception: P. lowii groups with P. braceana and P. wilsonii based on the rbcL and the combined sequence, but based on the trnL-F it separates from these two species and joins P. chibae, P. gibbosa, and P. parishii (Table 3).

Following the bootstrap analysis, the trnL-F has a higher phylogenetic tree resolution (82.35%) than the rbcL (60.00%). The combined sequence produces a similarly high resolution (80.00%). According to Nelson (2008), bootstrapping is a numerical method for generating confidence intervals that uses either resampled or simulated data to estimate the sampling distribution of the maximum likelihood parameter probabilities. Hence, the trnL-F and the combined sequence can be useful to identify or differentiate Phalaenopsis, particularly at the genus level.
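The resampling step behind a phylogenetic bootstrap can be sketched as follows: alignment columns are drawn with replacement to build pseudo-replicate alignments of the same length as the original. This is an illustration only, with a hypothetical function name and toy alignment; the tree-building step that would follow for each replicate (performed by MEGA-X in this study) is omitted.

```python
import random
from typing import List


def bootstrap_replicate(seqs: List[str], rng: random.Random) -> List[str]:
    """Resample alignment columns with replacement to build one pseudo-replicate."""
    length = len(seqs[0])
    picks = [rng.randrange(length) for _ in range(length)]
    # The same column picks are applied to every sequence, so columns stay intact.
    return ["".join(seq[i] for i in picks) for seq in seqs]


# Hypothetical toy alignment; the seed is fixed only to make the run repeatable.
toy_alignment = ["ATGCATGC", "ATGCATGA", "ATGTATGC"]
rng = random.Random(42)
replicates = [bootstrap_replicate(toy_alignment, rng) for _ in range(1000)]
print(len(replicates), replicates[0])
```

In a full analysis, a tree would be inferred from each of the 1,000 replicates, and the support for a clade would be the percentage of replicate trees in which that clade appears.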
In general, this grouping corresponds to the morphological or other characteristics of each species. For example, P. amabilis and P. aphrodite belong to the same group based on all sequences (Table 3), presumably because they have almost the same flower morphology (Tsai et al., 2015). Tsai et al. (2015) even placed the two in one subgenus, namely the P. amabilis complex.
Finally, although such studies have been carried out comprehensively by several researchers, especially Tsai et al. (2010) and Zhou (2015), we combined the data from both, extended the analysis by determining the genetic diversity and the mutations that occur in these sequences, and reconstructed the relationships in a simpler manner. Therefore, this information has useful implications and is essential for species conservation and plant breeding programs in the future (Flint-Garcia, 2013). In other words, the results of our study are beneficial, particularly for the development of new Phalaenopsis orchids with desirable traits.

Note. *inconsistent in grouping; **above the value of 50

CONCLUSION
Based on the rbcL, the trnL-F, and their combined sequence, Phalaenopsis has low genetic (nucleotide) diversity. However, this germplasm shows complex relationships. Phalaenopsis separated into five, six, and seven clades for the rbcL, the trnL-F, and the combined sequence, respectively. The bootstrap analysis revealed that the trnL-F and the combined sequence provide high resolution of the phylogenetic trees. P. amboinensis vs. P. venosa is the closest pair, and three other pairs (P. gibbosa vs. P. doweryensis; P. stuartiana vs. P. micholitzii; and P. celebensis vs. P. pulchra) are the farthest. Hence, the trnL-F and the combined sequence can be applied to identify or differentiate Phalaenopsis, particularly at the genus level. This information is essential in supporting the conservation and breeding programs of Phalaenopsis, both locally and globally.