PatCID: an open-access dataset of chemical structures in patent documents

PatCID is a chemical-structure dataset automatically created from images in patent documents. Figure 1 illustrates its principal usage for document and molecule retrieval. A molecule can be searched with as-drawn, similarity, or substructure search, and a list of patents referencing the molecule is retrieved. Conversely, the molecules contained in a specific document can be extracted and then used to browse and explore that document. PatCID allows practitioners in the intellectual property domain to carry out prior-art search or landscape analysis [1], and practitioners in organic chemistry to review patent literature in fields such as drug discovery, pharmaceutical chemistry, or material science [15].

Fig. 1: PatCID usage for document and molecule retrieval. PatCID can (A) retrieve patent documents referring to a query molecule, via similarity, substructure, or as-drawn search. It can also (B) retrieve molecules contained in a document and be used to interactively explore a document collection.

To perform a comprehensive evaluation of PatCID, we compare it with state-of-the-art patent databases. We evaluate the high-level statistics in terms of molecule and document coverage, the molecular-structure search performance, and the ability to extract molecules from different sections of documents. Additionally, we evaluate each component of the processing pipeline used to build PatCID.

Data statistics

PatCID covers documents from five major patent offices: the United States (USPTO [16]), Europe (EPO [17]), Japan (JPO [18]), Korea (KIPO [19]), and China (CNIPA [20]). The selected documents are associated with the field of organic chemistry by mentioning the term ‘alkyl’.
For an exemplary time window of the years 2010–2019, these five patent offices cover 1.06M patent families in the field of organic chemistry, while all 107 patent offices worldwide [21] cover 1.16M, i.e., the offices covered in PatCID represent 90% of published patent documents in the field of organic chemistry. (Here, a patent family refers to the set of patent documents that disclose the same invention, possibly published in different countries.) In total, PatCID indexes 80.7M molecule images, resulting in 13.8M unique chemical structures. This extensive coverage allows the use of PatCID for applications related to various domains of organic chemistry. Additional details related to the collection selection and statistics are provided in Supplementary Note 1.

Table 1 compares key characteristics of state-of-the-art chemical patent databases. It shows the number of patent documents, molecules, and unique molecules covered by each database, as well as which offices are covered and since when. PatCID contains documents that are not manually annotated in Reaxys: the documents published between 1978 and 2001 by the offices in the U.S. and Europe, between 2004 and 2015 in Japan, and between 1998 and 2015 in Korea. PatCID contains 80.7M molecules, which is substantially more than Google Patents (39.8M) and SureChEMBL (48.8M). PatCID also contains 13.8M unique molecules, which is more than Google Patents (13.2M) and SureChEMBL (11.6M). Here, molecules (respectively unique molecules) are counted as the number of non-distinct (respectively distinct) canonical Simplified Molecular-Input Line-Entry System (SMILES) [22] strings indexed. Additionally, covering Asian Pacific offices is a clear advantage over SureChEMBL, as about 70% of patent documents from Asian Pacific offices are not extended to the United States (see Supplementary Note 1). Further information on obtaining the database characteristics can be found in the Methods section.
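As a minimal sketch of the counting convention above, the snippet below tallies non-distinct and distinct entries in a list of SMILES strings. It assumes the strings are already canonical (for example, produced beforehand by a cheminformatics toolkit); the toy records and the `count_molecules` helper are hypothetical and are not part of PatCID.

```python
from collections import Counter


def count_molecules(canonical_smiles):
    """Return (molecules, unique_molecules) for a list of canonical SMILES.

    Assumption: every string is already canonical, so string equality
    stands in for structure equality.
    """
    counts = Counter(canonical_smiles)
    molecules = sum(counts.values())   # non-distinct entries
    unique_molecules = len(counts)     # distinct entries
    return molecules, unique_molecules


# Toy input (illustrative only): the same structure is indexed twice.
indexed = ["CCO", "c1ccccc1", "CCO", "CC(=O)O"]
total, unique = count_molecules(indexed)
print(total, unique)
```

Under this convention, a structure that appears in several images or documents adds to the molecule count each time but to the unique-molecule count only once.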
For PatCID, detailed statistics by office are also available in Table 2.

Table 1 Patent databases statistics

Table 2 PatCID detailed characteristics

Document ingestion pipeline

PatCID leverages state-of-the-art document understanding models to ingest documents. As illustrated in Fig. 2, the ingestion pipeline uses three components: the document segmentation (DECIMER-Segmentation [23]), the image classification (MolClassifier), and the chemical structure recognition (MolGrapher [24]). The document segmentation module locates the position of chemical images in documents. Chemical images comprise molecular-structure images and Markush-structure [25] images. (Markush structures are sets of molecules defined using positional and frequency variation indications.) To distinguish molecular-structure images from Markush-structure images, we use an image classification module with three output classes: ‘Molecular Structure’, ‘Markush Structure’, and ‘Background’. This also makes it possible to filter out some outliers from the segmentation step, as segmentation errors fall into the ‘Background’ class. Finally, molecular-structure images are converted to molecular graphs using MolGrapher, without stereo-chemistry, and stored as SMILES.

Fig. 2: PatCID ingestion pipeline. The creation of PatCID relies on three key steps: (1) the document page segmentation to extract chemical images, (2) the image classification to identify molecular-structure images, and (3) the molecular recognition to obtain final chemical structures. Blue marks molecular structure images. Red marks Markush structure images.

As PatCID is one of the first document-to-molecular-structures pipelines, there is no benchmark for simultaneously evaluating the document segmentation, image classification, and molecule recognition steps. There are not even benchmarks for independently evaluating the document segmentation (with annotated bounding boxes) and the image classification.
For this reason, we introduce two benchmark datasets: D2C-RND (Document to Chemical Structures, Random) and D2C-UNI (Document to Chemical Structures, Uniform). Each of these datasets contains three subsets: a first set for evaluating the document segmentation, a second set for the image classification, and a third set for the molecule recognition module. Molecules in the recognition subset are sampled from images in the classification subset, which are in turn taken from the pages in the segmentation subset. This strategy allows us to precisely assess the impact of each module on the overall data quality of the database. D2C-RND is sampled using a random distribution on chemical images, resulting in a higher abundance of recent patents and patents from the U.S. office. This test set can evaluate the average quality of databases. On the other hand, D2C-UNI covers a uniform distribution with respect to the year of publication and publishing office in order to assess databases in challenging scenarios. Specifically, molecule images from older patents and from non-U.S. offices can be of lower quality and use a less standard display style. An example illustrating the diversity of display styles for the same patented molecule in different countries is shown in Supplementary Fig. 6. As the first benchmarks for end-to-end document-to-chemical-structures conversion, these benchmarks will benefit future research in this area [14]. In total, they contain 700 manually-annotated pages, 753 manually-annotated chemical images, and 364 precisely annotated molecular graphs (MOL files [26]). More details can be found in the “Methods” section below.

Table 3 presents the performance of these three key ingestion steps. It shows the precision and recall of the page segmentation, the image classification, and the chemical-structure recognition, and, for DECIMER-Segmentation and MolGrapher, a comparison with state-of-the-art models.
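Precision and recall for each step follow the usual definitions. The sketch below computes both from sets of predicted and ground-truth items. The identifiers are hypothetical, and how a prediction is matched to a ground-truth item (for instance, bounding-box overlap for segmentation) is an assumption left outside the snippet.

```python
def precision_recall(predicted, ground_truth):
    """Precision and recall for two collections of hashable item identifiers."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    # Guard against empty inputs instead of dividing by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical example: 4 predicted images, 5 annotated images, 3 in common.
pred = {"img1", "img2", "img3", "img9"}
truth = {"img1", "img2", "img3", "img4", "img5"}
print(precision_recall(pred, truth))  # (0.75, 0.6)
```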
For the recognition module, the precision is computed using InChIKey [27] equality, ignoring stereo-chemistry. The evaluation is performed for the random benchmark D2C-RND and the uniform benchmark D2C-UNI. Further details are available in the “Methods” section.

Table 3 Pipeline comparison

The segmentation and classification modules achieve high precision and recall of more than 80% on both datasets. The segmentation module outperforms YoDe-Segmentation [28] in terms of recall and precision by more than 40% on both benchmarks. The recognition module correctly recognizes 63.0% of randomly selected molecule images in PatCID. This is substantially higher than OSRA (45.6%), which is currently used in the pipelines of automatically-created databases. On this dataset, DECIMER achieves 67.2% and MolScribe achieves 75.9%. MolScribe was not available at the time PatCID was created. It can also be noted that some images from our benchmarks are part of MolScribe’s training data. MolGrapher was preferred over DECIMER for its performance on standard benchmarks (see ref. 24) and its runtime performance advantage, allowing it to be run using CPU only (see Supplementary Table 1). More details on the computational considerations can be found in the Methods section below. For all components, models perform better on the random set D2C-RND than on the uniform set D2C-UNI, confirming that documents published recently and in the United States are easier to process automatically. The PatCID ingestion pipeline includes basic filtering steps, such as verifying that the predicted molecular structures contain only one fragment. Based on the filtered precision of MolGrapher, the precision of the complete PatCID processing pipeline is 54.5% on D2C-RND and 41.3% on D2C-UNI. The recall of the complete pipeline is 46.0% on D2C-RND and 44.5% on D2C-UNI. Qualitative examples of the ingestion pipeline predictions are shown in Supplementary Figs. 1 and 2.

Search evaluation

In this section, we compare the molecule and document retrieval performance of PatCID with state-of-the-art databases.

Each benchmark dataset contains pairs of molecules and patent documents, from which the molecules have been extracted. By searching for documents in various databases, we compute the document retrieval performance, defined as the percentage of documents retrieved with chemical annotation attached, and we compute the molecule retrieval performance, defined as the percentage of molecules retrieved from the correct reference documents. A query molecule is retrieved if the annotation and ground-truth have identical InChIKeys, ignoring stereo-chemistry. The complete querying process for each database is explained in the Methods section. A comparison with automatically-created databases is presented first, followed by a comparison with manually-created databases.

Table 4 compares the recall of molecules and annotated documents of state-of-the-art automatically-created databases on benchmarks D2C-RND and D2C-UNI. For the random set D2C-RND, PatCID achieves a molecule recall of 56.0%, which is higher than Google Patents with visual annotations (36.5%) and higher than Google Patents and Reaxys with visual plus textual annotations (41.5%). For the challenging set D2C-UNI, PatCID achieves a molecule recall of 47.6% and substantially outperforms SureChEMBL with visual annotations (4.9%) and Google Patents with visual annotations (9.8%). It also surpasses Reaxys with textual and visual annotations by more than 10%. In terms of data quality, PatCID outperforms all automatically-created databases by a substantial margin. For D2C-RND, the annotated document recall is 100%, compared to 68.2% in Google Patents, and for D2C-UNI, 98.2%, compared to 67.0% in Google Patents. PatCID has substantially better document coverage. It can be noted that the low document coverage of SureChEMBL is due to the missing coverage of Asian Pacific patent offices.
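The two recall figures defined in this section can be sketched as follows. This is a minimal illustration, not the evaluation code used for PatCID: the `database` mapping, the placeholder keys, and the document identifiers are hypothetical, and comparing only the first InChIKey block is an assumed shortcut for ignoring stereochemistry (it also ignores isotopic labels).

```python
def skeleton(inchikey):
    """First InChIKey block (connectivity hash).

    Assumption: dropping the second block, which encodes stereochemistry
    and isotopic information, approximates 'ignoring stereo-chemistry'.
    """
    return inchikey.split("-")[0]


def search_recall(benchmark, database):
    """Return (molecule_recall, document_recall).

    benchmark: non-empty list of (inchikey, document_id) ground-truth pairs.
    database:  dict mapping document_id -> set of annotated InChIKeys.
    """
    documents = {doc for _, doc in benchmark}
    # A document counts as retrieved if it carries at least one annotation.
    annotated = {doc for doc in documents if database.get(doc)}
    # A molecule counts as retrieved if it is annotated in its own document.
    hits = sum(
        skeleton(key) in {skeleton(k) for k in database.get(doc, ())}
        for key, doc in benchmark
    )
    return hits / len(benchmark), len(annotated) / len(documents)


# Placeholder keys, not real InChIKeys.
benchmark = [
    ("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "DOC-1"),
    ("CCCCCCCCCCCCCC-DDDDDDDDSA-N", "DOC-1"),
    ("EEEEEEEEEEEEEE-FFFFFFFFSA-N", "DOC-2"),
    ("GGGGGGGGGGGGGG-HHHHHHHHSA-N", "DOC-3"),
]
database = {
    "DOC-1": {"AAAAAAAAAAAAAA-XXXXXXXXSA-N"},  # same skeleton, different stereo block
    "DOC-2": {"ZZZZZZZZZZZZZZ-XXXXXXXXSA-N"},  # annotated, but a different molecule
}
molecule_recall, document_recall = search_recall(benchmark, database)
print(molecule_recall, document_recall)
```

In this toy case only one of four molecules is found in its reference document, while two of three documents carry annotations, which shows how document recall can be high even when molecule recall is low.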
While the PatCID ingestion pipeline only covers visual representations of molecules, its quality and robustness still enable it to surpass state-of-the-art automatically-created databases. SureChEMBL and Google Patents also rely on textual data and on molecular-structure information (MOL files) directly provided by the USPTO. Supplementary Table 2 reports the overlap between textual and visual annotations for molecules in Google Patents and SureChEMBL.

Table 4 Search comparison for automatically-created databases

Table 5 compares the recall of molecules and annotated documents of state-of-the-art manually- and automatically-created patent databases. PatCID molecule recall outperforms manual annotations of SciFinder for both D2C-RND (56.0% against 49.5%) and D2C-UNI (47.6% against 47.0%). Also, the PatCID annotated document recall is higher than Reaxys with manual and automatic annotations for D2C-RND (100% against 68.8%) and D2C-UNI (98.2% against 67.0%). This advantage in document coverage allows PatCID to compete with Reaxys. Indeed, for D2C-RND, PatCID achieves better molecule retrieval performance than Reaxys, even though Reaxys combines manual and automatic annotations retrieved from images as well as text. Additionally, Reaxys and SciFinder benefit from exploiting the patent-family grouping. For example, when searching for a molecule in a Korean patent, SciFinder and Reaxys are allowed to retrieve the query molecule from any patent in its family, for instance a patent from the U.S. patent office. This is an advantage because the Korean patent depiction style is typically more challenging to process automatically than that of U.S. patent documents, in which the style is more standardized (see Supplementary Fig. 6).
For D2C-UNI, Reaxys has a molecule recall of 51.2%, which is better than the 47.6% molecule recall in PatCID.

Table 5 Search comparison for manually- and automatically-created databases

Figure 3 illustrates the proportions of molecules covered in the PatCID and Reaxys databases for the random (D2C-RND) and uniform (D2C-UNI) benchmarks, and their subsets restricted to documents annotated in Reaxys. For the D2C-UNI benchmark, Reaxys, with its manual and automatic annotations, covers 51.2% of molecules, while PatCID covers 47.2%, but together they cover a total of 67.1%. Even though Reaxys performs better on average, some of the molecules correctly found in PatCID are not found in Reaxys. Even when the evaluation is restricted to documents annotated in Reaxys, PatCID covers 8.7% of molecules from D2C-RND and 5.5% of molecules from D2C-UNI that are not covered in Reaxys. To complement this analysis, a comparison of the number of molecules annotated per patent in PatCID and Reaxys for the random (D2C-RND) benchmark is shown in Supplementary Fig. 5. PatCID bridges the gap between automatically- and manually-created databases, and stands out as a complementary tool to manually-curated databases.

Fig. 3: Search comparison between PatCID and Reaxys. Proportions of molecules covered in the PatCID and Reaxys databases for the random (D2C-RND) and uniform (D2C-UNI) benchmarks, and their subsets restricted to documents annotated in Reaxys. For the top bar, 33.5% of molecules in D2C-RND are covered with both Reaxys (manual annotations) and PatCID, 22.5% in PatCID only, 8.0% in Reaxys (manual annotations) only, and 36.0% in none of the databases.

Document coverage evaluation

Patent documents in the field of organic chemistry are typically written following two different styles. In the first case, a patent begins by enumerating a large number of molecular structures, and thereafter, for selected key molecules, a detailed description and synthetic routes are presented.
In the second case, a patent is structured such that, from the start, a limited number of molecules is described in detail. Molecules in the description (before examples) section refer to molecules that are displayed and for which no synthesis or properties are provided. Molecules in the description (examples) section refer to molecules that are displayed and for which a synthesis or properties are provided.

This section presents an evaluation of the coverage of different document sections using two documents that are typical examples of the two writing styles. US20220127225 has a very large number of molecules overall, and the synthesis is described in the examples for only a few of them. US9096558 has few molecules overall, and the synthesis is described in the examples for all of them. In these two documents, the positions of all chemical structures were manually annotated. In each section, 50 images (if available) were randomly selected and their molecular structures were precisely annotated. In total, this test set was created by manually annotating the position of 1822 molecule images in 235 pages, as well as 141 molecular graphs (MOL files).

Table 6 shows the percentage of correctly retrieved molecules from different patent sections in different chemical-structure databases. PatCID’s fully automated process allows it to cover entire documents, including the abstract, the drawings, the description (including examples), and the claims sections. On the other hand, because of their limited workforce, manually-curated databases choose to restrict themselves to the molecules in the examples and in the claims section. As a result, some key patented compounds can be missed. For instance, as illustrated in Table 6, the patent US20220127225 contains mainly molecules before the examples subsection, with virtually none found in Reaxys or SciFinder, whereas PatCID retrieves 78% of them.
An example of a page containing only molecules missed in SciFinder and Reaxys, and almost all found in PatCID, is shown in Supplementary Fig. 3. These molecules illustrated before the examples section can be all the more valuable as some of them are not found in any entry of the entire Reaxys and SciFinder databases. A qualitative example of such molecules is shown in Supplementary Fig. 4. For these reasons, PatCID has a clear advantage over SciFinder and Reaxys with respect to the coverage of sections within documents.

Table 6 Document coverage comparison

Interactive document exploration

Figure 4 illustrates an example of document exploration with PatCID. Unlike SureChEMBL, Google Patents, and Reaxys, given a query molecule, PatCID not only finds the documents referencing this molecule but also keeps the provenance of its explicit location within documents. For patent documents that can span hundreds of pages and contain thousands of similar molecules, this feature is very useful. It allows users to interactively explore documents and easily refer to the content neighbouring the query molecule. This content may show related molecular structures or, as depicted in Fig. 4, the synthesis of the molecule.

Fig. 4: PatCID interactive document exploration. Given a (1) query molecule, PatCID (2) retrieves, as other databases do, the reference documents containing the molecule, but additionally (3) retrieves the explicit location of the molecule within documents.

By providing a dataset of annotated chemical structures embedded in documents, PatCID can also serve as a foundation for building multi-modal document understanding methods [29].
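One way to picture the location provenance described in this section is an index that maps each structure to every place it appears. The sketch below is hypothetical: the `Occurrence` fields, the placeholder keys, the document identifiers, and the coordinates are illustrative assumptions and do not describe the actual PatCID schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """Where a molecule image was found (hypothetical schema)."""
    document_id: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def build_index(records):
    """Map a molecule key to the list of all its occurrences."""
    index = defaultdict(list)
    for key, occurrence in records:
        index[key].append(occurrence)
    return index


def neighbours(index, occurrence):
    """Keys of other molecule images on the same page of the same document."""
    return sorted(
        key
        for key, occurrences in index.items()
        for other in occurrences
        if other != occurrence
        and (other.document_id, other.page)
        == (occurrence.document_id, occurrence.page)
    )


# Placeholder records (illustrative only).
records = [
    ("KEY-A", Occurrence("DOC-1", 12, (50, 80, 210, 240))),
    ("KEY-B", Occurrence("DOC-1", 12, (230, 80, 390, 240))),
    ("KEY-A", Occurrence("DOC-2", 3, (40, 400, 200, 560))),
]
index = build_index(records)
hit = index["KEY-A"][0]
print(hit.document_id, hit.page, neighbours(index, hit))
```

Keeping the page and bounding box alongside each structure is what lets a query return both the documents that contain the molecule and the content that surrounds it on the page.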
