Accelerating the discovery of acceptor materials for organic solar cells by deep learning

The DeepAcceptor framework

DeepAcceptor is a deep learning-based framework comprising data collection and management, a PCE predictor based on abcBERT, and material design and discovery. As shown in Fig. 1a, computational data and experimental data are used as unlabeled and labeled data, respectively. The computational dataset contains 51,256 NFAs. The PCEs from the computational data lack the desired accuracy, and the structures are simpler than those of experimental NFAs. Nevertheless, the similarity of these substructures to the experimental data is advantageous for the model in learning the rules of molecular formation.

Fig. 1: Overview of DeepAcceptor. a Computational and experimental data collected from the literature served as unlabeled and labeled molecules. b The abcBERT model was pre-trained by predicting masked atoms of unlabeled molecules with the assistance of bond lengths and connections. c The pre-trained model was fine-tuned on the experimental dataset. d A molecular generation and screening process was built to find high-performance acceptor candidates.

The experimental dataset includes 1027 small-molecule NFAs and their PCEs collected from the literature. The statistical distribution of the experimental dataset is shown in Fig. 2a–c. The median PCE is 9.43%, and the average is 9.01%. As shown in Fig. 2b, the bandgaps of the n-type acceptors (Eg,N, LUMO(A)-HOMO(A)) lie between 1 and 3 eV. The SAscore distribution of the high-performance molecules is shown in Fig. 2c; the SAscore of all molecules is less than 8.

Fig. 2: Analysis of the database and overview of the DeepAcceptor interface. a The PCE distribution of the experimental dataset. The dataset was randomly split 7:2:1 into training, test, and validation sets, and a stratified sampling algorithm ensured that all three sets covered every distribution interval. b The Eg,N of NFAs in the experimental dataset; the values lie between 1 and 3 eV. c The SAscore of high-performance NFAs in the experimental dataset; all values are less than 8. d The editable NFA database in the DeepAcceptor online interface. e The molecular designer and PCE predictors in the DeepAcceptor interface. The PCEs of designed molecules predicted by abcBERT and RF are displayed in real time.

The graph representation learning of GNNs is integrated into the powerful BERT model to predict PCE (Fig. 3). The unlabeled computational data are used to pre-train the abcBERT model (Fig. 1b), taking molecules with masked atoms and bond information as input; the model is pre-trained by predicting the masked atoms. The curated labeled data are then used to fine-tune the model with a prediction head (Fig. 1c). abcBERT contains an embedding layer, Transformer encoder layers, and task-related output layers (Fig. 3). In the embedding layer, the atom tokens are obtained, encoded, and embedded as the input of the Transformer encoder layers. In the Transformer encoder layers, information is exchanged between tokens through the attention mechanism, and the attention calculation fully accounts for bond lengths and connection information. As shown in Fig. 2d, e, a user-friendly interface of DeepAcceptor was built to make the model easier to use. The usage of the interface is described in Supplementary Note 1. The PCEs of devices based on new molecules can be predicted and displayed in real time to assist acceptor design.
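To illustrate how bond lengths and connections can enter the attention calculation, the sketch below adds them as additive biases to the attention logits of a single head. This additive-bias scheme and the NumPy function attention_with_structure_bias are assumptions for illustration, not the exact abcBERT formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_structure_bias(Q, K, V, bond_length_bias, connection_bias):
    """Scaled dot-product attention for one head over the atoms of a molecule,
    with additive biases carrying bond-length and connectivity information.

    Q, K, V          : (n_atoms, d) query/key/value matrices.
    bond_length_bias : (n_atoms, n_atoms) bias derived from the encoded bond
                       length between bonded atom pairs (0 elsewhere).
    connection_bias  : (n_atoms, n_atoms) bias derived from the adjacency
                       (connection) information of the molecular graph.
    How abcBERT injects these terms is not spelled out here; additive logit
    biases are an assumption for illustration.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                   # standard attention scores
    logits = logits + bond_length_bias + connection_bias
    weights = softmax(logits, axis=-1)              # attention distribution per atom
    return weights @ V                              # updated atom representations

# Tiny usage example: 3 atoms, 4-dimensional head.
rng = np.random.default_rng(0)
n, d = 3, 4
Q = K = V = rng.normal(size=(n, d))
bond_bias = rng.normal(scale=0.1, size=(n, n))
conn_bias = rng.normal(scale=0.1, size=(n, n))
out = attention_with_structure_bias(Q, K, V, bond_bias, conn_bias)
```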
The interface of DeepAcceptor is available at https://huggingface.co/spaces/jinysun/DeepAcceptor.

Fig. 3: The detailed architecture of abcBERT in DeepAcceptor. The model integrates the graph representation learning of GNNs into the original BERT model. The atoms, bond lengths, and connection information are used to encode and embed molecules.

To demonstrate the performance of abcBERT, a molecular generation and screening process was established to discover high-performance molecules. As shown in Fig. 1d, the breaking of retrosynthetically interesting chemical substructures (BRICS)35 algorithm and a variational autoencoder (VAE)36 are used to generate molecules. The Gen database was built by combining the molecules generated by BRICS and VAE. It was then screened with basic properties such as molecular weight, LogP, the number of H-bond acceptors and donors, and the number of rotatable bonds, which are associated with the solubility, synthesis difficulty, and performance of OSC materials9,21. After that, a graph neural network (molecularGNN)37 was trained to predict the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) of all the molecules, and the molecules were screened to match the donor PM6 according to their predicted HOMO and LUMO. The HOMO offset (ΔHOMO, HOMO(D)-HOMO(A)) and LUMO offset (ΔLUMO, LUMO(D)-LUMO(A)) between the donor and NFA affect hole-electron separation at the D/A interface9,38. Besides, Eg,N is also essential for the devices to reach excellent performance22,39. ΔHOMO, ΔLUMO, and Eg,N were therefore used to screen the candidates. SAscore40 was used to evaluate the synthetic accessibility of these candidates. Properties related to molecular polarity and charge distribution, which may affect the exciton binding energy9,21, were calculated by RDKit and used to screen the molecules further. The fine-tuned abcBERT model was used to predict the PCE of the screened molecules. Candidates with high predicted performance and promising structures were further selected according to experimental synthesis experience. Finally, the selected candidates were validated by experiments.

Hyperparameter optimization and model architecture selection

To choose a better architecture for predicting PCE values, six different abcBERT architectures were built and tested. The specific hyperparameters and results are shown in Supplementary Table 1. Model 3 achieves the best performance on the validation set. The performance of Models 1 and 2 was limited by the number of layers and parameters, while Models 4, 5, and 6 performed worse than Model 3, likely because larger models with too many trainable parameters risk overfitting. Hence, the architecture of Model 3 was adopted to screen NFAs. Specifically, the number of heads in the multi-head attention was 8, the number of layers was 8, and the embedding size was set to 256. The pre-training and fine-tuning tasks share the same embedding and Transformer encoder layers, while the output layers differ according to the downstream task: the pre-training model has 3 output layers, the fine-tuning model has 4, and the dimensions of the output hidden layers are both 256. During hyperparameter optimization, the model was continuously trained.
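A minimal Keras-style sketch of the selected Model 3 configuration is given below, assuming a vanilla Transformer encoder with a shared embedding and two task heads. It omits the bond-length and connection biases and compresses the 3/4 output layers described above into single dense heads, so it is an illustrative stand-in rather than the released abcBERT code; the vocabulary size of 120 is also an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hyperparameters of the selected architecture (Model 3).
D_MODEL, N_LAYERS, N_HEADS, DROPOUT = 256, 8, 8, 0.1
VOCAB = 120  # assumed: chemical elements plus special/mask tokens

def transformer_encoder(x):
    """Stack of standard Transformer encoder layers (illustrative stand-in;
    abcBERT additionally injects bond-length and connection information)."""
    for _ in range(N_LAYERS):
        attn = layers.MultiHeadAttention(num_heads=N_HEADS,
                                         key_dim=D_MODEL // N_HEADS,
                                         dropout=DROPOUT)(x, x)
        x = layers.LayerNormalization()(x + attn)
        ffn = layers.Dense(4 * D_MODEL, activation="gelu")(x)
        ffn = layers.Dense(D_MODEL)(ffn)
        x = layers.LayerNormalization()(x + layers.Dropout(DROPOUT)(ffn))
    return x

# Shared embedding + encoder, with task-specific heads for pre-training
# (masked-atom prediction) and fine-tuning (PCE regression).
atom_ids = layers.Input(shape=(None,), dtype="int32")            # atom tokens
h = layers.Embedding(input_dim=VOCAB, output_dim=D_MODEL)(atom_ids)
h = transformer_encoder(h)

mlm_head = layers.Dense(VOCAB, name="masked_atom_logits")(h)      # pre-training head
pce_head = layers.Dense(1, name="pce")(layers.GlobalAveragePooling1D()(h))  # fine-tuning head

pretrain_model = tf.keras.Model(atom_ids, mlm_head)
finetune_model = tf.keras.Model(atom_ids, pce_head)   # shares encoder weights with pretrain_model
```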
The loss and learning curves of abcBERT at the pre-training and fine-tuning stages are shown in Supplementary Fig. 1. The sparse categorical cross-entropy and MSE loss functions were used to pre-train and fine-tune the model, respectively. An early stopping strategy was used during training, and the hyperparameters (such as dropout rate, batch size, and learning rate) were optimized until the best performance was reached on the validation set. The learning rates at the pre-training and fine-tuning stages were both set to 0.0001, and the dropout rates were set to 0.1 at both stages. The Adam41 optimizer was used to update the model parameters based on the computed gradients during training.

Evaluation and comparison

The performance of abcBERT was compared with state-of-the-art (SOTA) models: random forest (RF)42, dilated convolutional neural network (dilated CNN)17, MolCLR43, molecularGNN, MG-BERT30, ATMOL44, graph attention network (GAT)45, and graph convolutional network (GCN)43. All models were trained, validated, and tested on the same dataset and assessed by mean absolute error (MAE), mean squared error (MSE), coefficient of determination (R2), and Pearson correlation coefficient (r). The detailed results of all models on the test set are shown in Fig. 4. abcBERT outperforms the other SOTA methods with MAE = 1.78, MSE = 5.53, r = 0.82, and R2 = 0.67 on the test set. The dimension of the pooled features was reduced by uniform manifold approximation and projection (UMAP)46. The UMAP low-dimensional embeddings of the training and test sets are visualized in Supplementary Fig. 2a, and the distribution of absolute errors on the test set is shown in Supplementary Fig. 2b. All the molecules in the test set lie within the distribution of the training set and are predicted accurately, with absolute errors of less than 2%. These prediction errors are very close to the errors caused by factors such as experimental conditions2. It can be deduced that abcBERT is a reliable and promising tool for predicting PCE, and the encouraging results demonstrate the potential of deep learning and molecular graphs for PCE prediction.

Fig. 4: Prediction results of different models. The correlation between the experimental and predicted PCE from abcBERT (a), RF (b), dilated CNN (c), MolCLR (d), molecularGNN (e), MG-BERT (f), ATMOL (g), GAT (h), and GCN (i) on the test set.

Ablation study

To test the utility of the pre-training, hydrogen, bond length, and connection information of the molecules, an ablation study was performed to quantify their contributions to model performance on the test set, evaluated by MAE, MSE, and R2. As shown in Fig. 5, pre-training reduces MAE by 0.26 and MSE by 1.14 and improves R2 by 0.06, demonstrating the effectiveness of pre-training for PCE prediction when limited data are available. The MAE, MSE, and R2 results further indicate that adding hydrogens, bond length, and connection information affects model performance, underlining the importance of incorporating more molecular information. The detailed results of the ablation study are shown in Supplementary Table 2.

Fig. 5: Results of the ablation study on the test set. The MAE (a), MSE (b), and R2 (c) of the ablation study; d the utility of hydrogen atoms in converting molecules to graphs. Benzene and cyclohexane are converted into the same graph without added hydrogens and bond information; if hydrogen atoms are added, they are converted into two different graphs.
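For readers who want to reproduce the reported metrics, the snippet below collects the optimizer and early-stopping settings described above and computes the four test-set metrics (MAE, MSE, R2, Pearson r) with scikit-learn and SciPy. The patience value and the dummy PCE arrays are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Optimizer and early-stopping configuration described in the text
# (learning rate 1e-4 at both stages; the patience value is an assumption).
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                              restore_best_weights=True)

def regression_metrics(y_true, y_pred):
    """The four metrics reported for the test set: MAE, MSE, R2, and Pearson r."""
    r, _ = pearsonr(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mean_squared_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
        "r": r,
    }

# Example with dummy PCE values (%); real use would pass experimental vs. predicted PCEs.
y_true = np.array([9.4, 12.1, 6.8, 15.2, 10.5])
y_pred = np.array([10.0, 11.5, 8.0, 14.0, 9.8])
print(regression_metrics(y_true, y_pred))
```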
Improving the accuracy of predictions is difficult due to the lack of sufficient data, and the acquisition of material data requires high experimental costs. Employing more effective machine learning strategies to extract information from limited data is therefore crucial for improving prediction accuracy. As shown above, pre-training enhances the accuracy of the model when the number of labeled molecules is quite limited. The MLM task helps the model learn the chemical rules of molecular formation, and the pre-trained model transfers these rules to downstream tasks by providing a nontrivial neural network initialization. Masked-atom tasks, however, may ignore bond-type information, and molecules can become indistinguishable without considering the type of chemical bonds. As shown in Fig. 5d, benzene and cyclohexane are converted into the same graph without hydrogen atoms and bond information, whereas adding hydrogen atoms converts them into two different graphs30. The bond length encoding contributes significantly to the prediction of PCE, likely because material properties (especially electronic properties) are highly sensitive to structural features such as bond lengths, bond angles, and local geometric distortions. Bond lengths reflect the strength of the interaction forces between atoms. To better encode edge features into the attention layers, bond length encoding is used in abcBERT: it reflects the relative distance of any two connected atoms, and the attention mechanism considers these distances for each connected atom pair. As a result, bond length information improves the performance of the model. The connection information helps the model learn the local spatial information of molecular graphs and exchange information through chemical bonds30,47. These results indicate that molecular representations carrying more chemical information improve model performance; altogether, adding more chemical information to the input helps models learn better representations for downstream tasks.

Discovery of high-performance NFAs

As shown in Fig. 1d, a large-scale screening process for NFA materials was built to demonstrate the performance of DeepAcceptor. First, a large-scale database was constructed by BRICS and VAE. The fragments were generated from experimental molecules. Specifically, BRICS was used to decompose molecules into constituent fragments, and these substructures were classified into terminals (T) (one group to be attached), cores (C) (two groups to be attached and 300 < MolWt < 000), and spacers (S) (two groups to be attached and MolWt < 300). New molecules were then generated by BRICS, recomposing the fragments in the order T-C-T and T-S-C-S-T. All SMILES in the BRICS dataset were converted to SELFIES, a 100% robust and efficient molecular representation for molecular generation48, and the SELFIES were used as the input of the VAE. The VAE was used to generate more diverse structures and discover efficient fragments; its architecture is described in Supplementary Note 2.
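To make the fragment-and-recompose step concrete, below is a minimal sketch assuming RDKit's BRICS module and the selfies package. The placeholder SMILES, the function name classify_fragment, and the omission of the core upper molecular-weight bound are illustrative choices, not the authors' code.

```python
from rdkit import Chem
from rdkit.Chem import BRICS, Descriptors
import selfies as sf

def classify_fragment(frag_smiles):
    """Classify a BRICS fragment as terminal (T), spacer (S), or core (C)
    by its number of attachment points ('*' dummy atoms) and molecular weight.
    Illustrative only; the core upper MolWt bound from the paper is not reproduced."""
    mol = Chem.MolFromSmiles(frag_smiles)
    n_attach = sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() == "*")
    molwt = Descriptors.MolWt(mol)
    if n_attach == 1:
        return "T"
    if n_attach == 2:
        return "S" if molwt < 300 else "C"
    return "other"

# Decompose a (placeholder) experimental molecule into BRICS fragments and label them.
mol = Chem.MolFromSmiles("CCOC(=O)c1ccc(-c2ccc(N(C)C)cc2)cc1")  # placeholder SMILES
fragments = sorted(BRICS.BRICSDecompose(mol))
labeled = {frag: classify_fragment(frag) for frag in fragments}

# Convert a generated molecule's SMILES to SELFIES for the VAE input.
generated_smiles = "c1ccc2c(c1)sc1ccccc12"   # placeholder for a BRICS-recomposed molecule
selfies_string = sf.encoder(generated_smiles)
round_trip = sf.decoder(selfies_string)       # SELFIES always decodes back to a valid SMILES
```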
To demonstrate the effectiveness of the VAE, the validity (the fraction of generated molecules that are valid), uniqueness (the fraction of valid generated molecules that are unique), and novelty (the fraction of valid, unique generated molecules that are not in the training set) were used to evaluate the model. As shown in Fig. 6a, the validity, uniqueness, and novelty of the VAE-generated SMILES are 100%, 87.1%, and 100%, respectively. The generated database (Gen database) includes 4.8 million molecules generated by BRICS and VAE. The Gen database was first screened with basic properties such as molecular weight, LogP, the number of H-bond acceptors and donors, the number of rotatable bonds, the number of rings, and the number of nitrogen and oxygen atoms, all calculated by RDKit. The thresholds were set according to the properties of the high-performance acceptors (PCE > 10%) in the experimental dataset; the detailed descriptors and thresholds are shown in Supplementary Table 3. In this way, 3.6 million molecules were obtained.

Fig. 6: Results of the molecular generation and screening. a The metrics of VAE-generated molecules. b Energy diagram of PM6:NFA candidates. c The correlation between the computational and predicted HOMO. d The correlation between the computational and predicted LUMO. e The distribution of results screened by SAscore and f predicted by abcBERT.

Here, PM6 was chosen as the donor, and the candidates were screened to match the HOMO and LUMO of PM6 (Fig. 6b). According to ref. 49, the HOMO and LUMO of PM6 are −5.45 and −3.65 eV. To predict the HOMO and LUMO of the NFA candidates, molecularGNN was trained on an NFA dataset (51,000 NFAs) with HOMO and LUMO computed by DFT; its architecture is described in Supplementary Note 3. This dataset was split randomly with a ratio of 8:1:1. The prediction results on the test set are shown in Fig. 6c, d: the MAE and R2 of the predicted HOMO are 0.057 and 0.970, and those of the predicted LUMO are 0.064 and 0.967, respectively. The predictors were compared with the Tartarus benchmarking platform50, which provides the HOMO and LUMO calculated by GFN2-xTB and the power conversion efficiency (PCE) computed with the Scharber model. As shown in Supplementary Table 4 and Supplementary Fig. 3, the HOMO and LUMO predictors outperform Tartarus on the test set at much higher speed. The trained GNN model was used to predict the HOMO and LUMO of the candidates, which were further screened to satisfy ΔHOMO > 0, ΔLUMO > 0, and 1 eV < Eg,N < 3 eV. In this way, the number of molecules was reduced to 104,295.

After that, SAscore was used to evaluate the synthesizability of the molecules. As mentioned above, SAscore < 8 was chosen as the screening criterion; as shown in Fig. 6e, 47,653 molecules were obtained. Next, properties related to molecular polarity and charge distribution, such as TPSA, were calculated, with thresholds again set according to the statistical properties of the high-performance acceptors (PCE > 10%) in the experimental dataset (detailed descriptors and thresholds are shown in Supplementary Table 3). The number of molecules was reduced to 23,029. Then, abcBERT was used to predict their PCE values. As shown in Fig. 6f, 74 candidates with PCE > 14% were obtained; their structures and predicted PCEs are shown in Supplementary Table 5. With the above screening procedures, the large-scale database was reduced to a limited size.
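The two screening stages described above can be sketched as follows. The RDKit descriptor calls are real, but the numeric thresholds in passes_basic_filters are placeholders (the actual values are listed in Supplementary Table 3), and matches_pm6 simply encodes the ΔHOMO > 0, ΔLUMO > 0, and 1 eV < Eg,N < 3 eV criteria against the PM6 levels quoted in the text.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

# Energy levels of the donor PM6 (eV) cited in the text (ref. 49).
HOMO_PM6, LUMO_PM6 = -5.45, -3.65

def passes_basic_filters(smiles):
    """First screening stage: coarse physicochemical filters computed with RDKit.
    Threshold values here are placeholders; the actual thresholds were derived
    from high-performance acceptors (PCE > 10%) in the experimental dataset."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Descriptors.MolWt(mol) < 2000 and           # placeholder threshold
            Descriptors.MolLogP(mol) < 15 and            # placeholder threshold
            Lipinski.NumHAcceptors(mol) <= 20 and        # placeholder threshold
            Lipinski.NumHDonors(mol) <= 5 and            # placeholder threshold
            Descriptors.NumRotatableBonds(mol) <= 40)    # placeholder threshold

def matches_pm6(homo_a, lumo_a):
    """Second screening stage: energy-level alignment with PM6, using the
    GNN-predicted HOMO/LUMO of the acceptor (in eV)."""
    d_homo = HOMO_PM6 - homo_a      # ΔHOMO = HOMO(D) - HOMO(A)
    d_lumo = LUMO_PM6 - lumo_a      # ΔLUMO = LUMO(D) - LUMO(A)
    eg_n = lumo_a - homo_a          # Eg,N = LUMO(A) - HOMO(A)
    return d_homo > 0 and d_lumo > 0 and 1.0 < eg_n < 3.0

# Usage example with hypothetical acceptor levels.
print(passes_basic_filters("c1ccc2c(c1)sc1ccccc12"), matches_pm6(-5.70, -3.90))
```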
Next, according to our experimental synthesis experience, manual selection was conducted by considering the difficulty of actual synthesis, conjugation, and solubility; the details are given in Supplementary Note 4. Among materials for high-performance devices, Y6-series NFAs with A-DA′D-A type structures and alkyl chains at the terminals of the DA′D central cores have shown extremely promising performance, since they exhibit a strong intramolecular charge transfer effect and superior energy level tunability2,51,52,53. Finally, three candidates with predicted PCE > 15% were selected by considering the above factors.

The structures of these candidates are shown in Fig. 7, and their predicted properties in Supplementary Table 6. These molecules were then synthesized and characterized in our laboratory to further demonstrate the validity of the model. Detailed information on materials, synthetic procedures, device fabrication, and characterization is given in Supplementary Note 5. Photophysical and electrochemical parameters are shown in Table 1. The acceptor candidates were each combined with PM6 to form the active layers of organic photovoltaic devices, and the photovoltaic parameters of the devices are shown in Table 2, together with the PCEs predicted by abcBERT and by the Scharber model. abcBERT gives more accurate predictions: the experimental PCEs of all candidate-based devices exceed 12%, and the average MAE between the experimental and predicted PCE values is 1.96%. In particular, the PCE of the candidate 1-based device reached 14.61%. These results reflect the reliability and effectiveness of DeepAcceptor and show that it is a promising framework for screening potential NFA candidates.

Fig. 7: The structures of selected candidates.

Table 1 Photophysical and electrochemical parameters of the candidates

Table 2 The predicted and experimental results of device data for PM6:candidate devices
