Application of AI-driven PCA-random forest model in intelligent identification of mulberry varieties in Guangxi

刘丹; 邱长玉; 黄胜; 卿军; 莫荣利; 陆晓媚; 曾燕蓉; 何国玲; 张朝华; 林强

doi:10.3969/j.issn.2095-1191.2026.02.013

Application of AI-driven PCA-random forest model in intelligent identification of mulberry varieties in Guangxi

Abstract

【Objective】This study investigated the application of an artificial intelligence (AI)-driven model combining principal component analysis (PCA) with the random forest algorithm (PCA-random forest) to the intelligent identification of mulberry varieties in Guangxi, to provide a scientific basis for clarifying the genetic relationships among local mulberry germplasm resources in Guangxi and for their conservation and utilization.

【Method】Plump seeds of six elite mulberry varieties were used as research materials. Whole-genome resequencing of the six varieties was performed on an Illumina NovaSeq 6000 sequencer, with mulberry genomic DNA libraries constructed for sequencing. The PCA-random forest method was used to address single nucleotide polymorphism (SNP) redundancy through a two-round screening strategy. PCA was used to reduce the dimensionality of the candidate SNP loci, and each SNP locus was scored under the first several principal components. Bayesian optimization was used to find the optimal hyperparameters and build the optimal model, which was trained with 5-fold cross-validation to prevent overfitting, yielding an importance value for each SNP locus. Core SNP loci were then screened to build a molecular marker library for mulberry variety identification, and the key SNP loci extracted from the mulberry samples to be identified were compared against this library to verify the identification accuracy of the optimal model.

【Result】The mapping rates of the six mulberry varieties ranged from 92.87% to 97.34%, indicating good sequencing data quality. After strict quality control and alignment, 1,163,291 high-quality SNP loci were obtained, of which 6.27%, 10.74%, 25.18%, and 48.17% were located in upstream regions, exons, introns, and intergenic regions, respectively. Transitions and transversions accounted for 64.66% and 35.34% of the SNP loci, respectively, and the average fixation index was 0.63. SNP density was unevenly distributed across chromosomes, with Chr01 showing a markedly higher SNP density than the other chromosomes. Based on the score of each SNP locus under the first three principal components, the top 10,000 SNP loci were selected for subsequent machine learning training and locus screening, and 225 core SNP loci were finally selected according to the importance value of each SNP locus. All four key evaluation metrics of the optimal model (F1 score, precision, recall, and accuracy) reached 100%. Comparative verification showed that, for every mulberry sample, the variety number in the constructed SNP library matched the predicted variety name.

【Conclusion】Using the AI-driven PCA-random forest algorithm, 225 core SNP loci were screened from six mulberry varieties, and the method can be used for mulberry variety identification.
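
The two-round screening described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the two rounds are PCA-based ranking followed by random forest importance ranking, uses synthetic 0/1/2 genotypes instead of the resequencing data, scores each SNP by its absolute loadings on the first three principal components weighted by explained variance (an assumed scoring rule), and fixes the random forest hyperparameters rather than tuning them with Bayesian optimization. The function names, sample counts, and the values of `top_k` and `n_core` are hypothetical and scaled down from the 10,000 candidates and 225 core loci reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def simulate_genotypes(n_varieties=6, n_per_variety=10, n_snps=2000,
                       n_informative=200, seed=0):
    """Synthetic 0/1/2 genotypes; only the first n_informative SNPs differ by variety."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_varieties), n_per_variety)
    # Background SNPs share one allele frequency across all varieties.
    freqs = np.tile(rng.uniform(0.2, 0.8, size=n_snps), (n_varieties, 1))
    # Informative SNPs get variety-specific, near-fixed allele frequencies.
    freqs[:, :n_informative] = rng.choice([0.05, 0.95],
                                          size=(n_varieties, n_informative))
    genotypes = rng.binomial(2, freqs[labels])
    return genotypes.astype(float), labels


def pca_snp_scores(genotypes, n_components=3):
    """Score each SNP by its absolute PC loadings, weighted by explained variance.

    The weighting is an assumption; the abstract does not give the formula.
    """
    pca = PCA(n_components=n_components).fit(genotypes)
    return np.abs(pca.components_).T @ pca.explained_variance_ratio_


def select_core_snps(genotypes, labels, top_k=500, n_core=25, seed=0):
    """Keep the top_k SNPs by PCA score, then the n_core SNPs by forest importance."""
    scores = pca_snp_scores(genotypes)
    candidates = np.argsort(scores)[::-1][:top_k]
    X = genotypes[:, candidates]
    # Fixed hyperparameters stand in for the Bayesian optimization step.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    cv_accuracy = cross_val_score(forest, X, labels, cv=cv).mean()
    # Refit on all samples to obtain one importance value per candidate SNP.
    forest.fit(X, labels)
    ranked = np.argsort(forest.feature_importances_)[::-1][:n_core]
    return candidates[ranked], cv_accuracy


genotypes, labels = simulate_genotypes()
core, cv_accuracy = select_core_snps(genotypes, labels)
print(len(core), round(cv_accuracy, 3))
```

In this sketch the candidate ranking uses all samples before cross-validation, so the reported accuracy describes the toy data only. A real pipeline would replace `simulate_genotypes` with genotypes parsed from the called variants and tune the forest with a Bayesian optimization library.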
