Application of AI-driven PCA-random forest model in intelligent identification of mulberry varieties in Guangxi

LIU Dan; QIU Chang-yu; HUANG Sheng; QING Jun; MO Rong-li; LU Xiao-mei; ZENG Yan-rong; HE Guo-ling; ZHANG Chao-hua; LIN Qiang

doi:10.3969/j.issn.2095-1191.2026.02.013

Citation: LIU Dan, QIU Chang-yu, HUANG Sheng, QING Jun, MO Rong-li, LU Xiao-mei, ZENG Yan-rong, HE Guo-ling, ZHANG Chao-hua, LIN Qiang. 2026. Application of AI-driven PCA-random forest model in intelligent identification of mulberry varieties in Guangxi. Journal of Southern Agriculture, 57(2): 451-461. DOI: 10.3969/j.issn.2095-1191.2026.02.013

Abstract

【Objective】This study investigated the application of an artificial intelligence (AI)-driven principal component analysis (PCA)-random forest model to the intelligent identification of mulberry varieties in Guangxi, to provide a scientific basis for clarifying the genetic relationships among local mulberry germplasm resources in Guangxi and for their conservation and utilization.【Method】Plump seeds of six elite mulberry varieties were used as research materials. Genomic DNA libraries were constructed, and whole-genome resequencing was performed on the six varieties using an Illumina NovaSeq 6000 sequencer. To address single nucleotide polymorphism (SNP) redundancy in the PCA-random forest model, a two-round screening strategy was adopted. Candidate SNP loci were reduced in dimension by PCA, and the score of each SNP locus on the first several principal components was obtained. Bayesian optimization was used to search for the optimal model hyperparameters, and the model was trained with 5-fold cross-validation to prevent overfitting, yielding an importance value for each SNP. Core SNP loci were then screened to establish a molecular marker library for mulberry variety identification, and the key SNP loci extracted from the mulberry samples to be identified were compared with those in the library to verify the identification accuracy of the optimal model.【Result】The mapping rates of the six mulberry varieties ranged from 92.87% to 97.34%, indicating high-quality sequencing data. After strict quality control and alignment, a total of 1,163,291 high-quality SNP loci were obtained, distributed in upstream regions, exons, introns and intergenic regions in proportions of 6.27%, 10.74%, 25.18% and 48.17%, respectively. Transitions and transversions accounted for 64.66% and 35.34% of the SNPs, respectively, and the average fixation index was 0.63. SNP density was unevenly distributed across chromosomes, with the density on Chr01 significantly higher than on the other chromosomes. Based on the scores of each SNP locus on the first three principal components, the top 10,000 SNP loci were selected for subsequent machine learning training and locus screening. According to the importance value of each SNP, 225 core SNP loci were finally screened out. The four key evaluation indicators of the optimal model (F1 score, precision, recall and accuracy) all reached 100%. Comparative verification showed that, for each mulberry sample, the variety number in the constructed SNP library was completely consistent with the predicted variety name.【Conclusion】The AI-driven PCA-random forest model successfully screened 225 core SNP loci from six mulberry varieties, and the method can be effectively applied to mulberry variety identification.
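
The two-round screening described in the abstract can be illustrated with a minimal Python sketch using scikit-learn. This is not the authors' pipeline: the genotype matrix is synthetic toy data, the function names (`make_toy_genotypes`, `rank_snps_by_pca`, `select_core_snps`) are hypothetical, and the SNP score is assumed here to be the summed absolute loading on the first three principal components, since the abstract does not state how the scores were combined. Fixed hyperparameters stand in for the Bayesian optimization step, and the sample counts and SNP counts are scaled down for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def make_toy_genotypes(n_varieties=6, n_per_variety=10, n_snps=500,
                       n_informative=60, seed=0):
    """Build a synthetic samples x SNPs matrix coded 0/1/2 (toy data only)."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_varieties), n_per_variety)
    # Background SNPs: low-frequency alleles unrelated to variety.
    X = rng.binomial(2, 0.1, size=(labels.size, n_snps)).astype(float)
    # Informative SNPs: every sample of a variety shares the same genotype.
    variety_genotypes = rng.integers(0, 3, size=(n_varieties, n_informative))
    X[:, :n_informative] = variety_genotypes[labels]
    return X, labels


def rank_snps_by_pca(X, n_components=3, top_k=100):
    """Round 1: keep the SNPs with the largest loadings on the first PCs.

    Assumption: score = sum of absolute loadings over the retained components.
    """
    pca = PCA(n_components=n_components).fit(X)
    scores = np.abs(pca.components_).sum(axis=0)
    return np.argsort(scores)[::-1][:top_k]


def select_core_snps(X, y, candidate_idx, n_core=20, seed=0):
    """Round 2: rank candidate SNPs by random forest importance.

    Fixed hyperparameters are used here in place of Bayesian optimization.
    """
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Mean accuracy over 5 stratified folds, as a check against overfitting.
    cv_accuracy = cross_val_score(model, X[:, candidate_idx], y, cv=cv).mean()
    model.fit(X[:, candidate_idx], y)
    order = np.argsort(model.feature_importances_)[::-1][:n_core]
    return candidate_idx[order], cv_accuracy


X, y = make_toy_genotypes()
candidates = rank_snps_by_pca(X)
core, acc = select_core_snps(X, y, candidates)
print(f"{len(candidates)} candidate SNPs, {len(core)} core SNPs, "
      f"5-fold CV accuracy {acc:.2f}")
```

In the study itself the corresponding numbers are 10,000 candidate loci after the PCA round and 225 core loci after the random forest round; the smaller values above only keep the toy example fast.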
