Molecules of the year 2023 – part 2. A FAIR data comment on a Strontium Metallocene.

I will approach this example of a molecule-of-the-year candidate – in fact the eventual winner in the reader poll – from the point of view of data. It's a metallocene arranged in the form of a ring comprising 18 sub-units,[cite]10.1038/s41586-023-06192-4[/cite] big enough to deserve a 3D model rather than the static images you almost invariably get in journals (and C&EN). So how does one go to the journal and acquire the coordinates for such a model?

Well, nowadays most reputable journals include a “data availability” statement, which in this case points to the supporting information using a URL-style identifier. This means, by the way, that the identifier may not be persistent, since the path to the document in the string https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-023-06192-4/MediaObjects/41586_2023_6192_MOESM1_ESM.pdf may change in the future according to the publisher's production workflows. The Acrobat (PDF) file contains the required coordinates, of which I give a small sample here:
18‐ring
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
1386
Energy = ‐29312.63737385 dispersion contribution = ‐2.415738946
C 5.1700172 1.6243489 ‐11.0779621
C 5.6857216 1.5855492 ‐12.4187559
C 6.0496599 0.6048512 ‐13.3969079
C 6.1219344 ‐0.8254711 ‐13.5066237

I selected the molecule coordinates within the PDF, pasted them into a text editor and then spent a few minutes removing the extraneous blank lines caused by the page breaks in the PDF document (a paginated document format is NOT a good vehicle for data!). I then added further lines (topped and tailed it) to make it viewable in a molecular editor such as Gaussview, only to get the following error.
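The manual clean-up just described can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the function name `to_xyz` is hypothetical, an atom line is assumed to be an element symbol followed by exactly three fields, and the output is the common XYZ layout (atom count, comment line, then atoms) rather than whatever input format Gaussview was actually given. The coordinate text itself is deliberately left untouched.

```python
import re

# Assumption: an atom line starts with an element symbol such as "C" or "Sr".
ELEMENT = re.compile(r"^[A-Z][a-z]?$")


def to_xyz(pasted: str, comment: str = "") -> str:
    """Keep only atom lines from text pasted out of a PDF and wrap them
    in an XYZ-style header (atom count, then a comment line)."""
    atoms = []
    for raw in pasted.splitlines():
        fields = raw.split()
        # Blank lines, titles, rulers and the energy line all fail this test.
        if len(fields) == 4 and ELEMENT.match(fields[0]):
            atoms.append("  ".join(fields))
    return "\n".join([str(len(atoms)), comment, *atoms]) + "\n"


# A fragment resembling the pasted sample, including a stray blank line.
# \u2010 is the character actually present in the supporting information.
sample = (
    "18\u2010ring\n"
    "1386\n"
    "Energy = \u201029312.63737385 dispersion contribution = \u20102.415738946\n"
    "C 5.1700172 1.6243489 \u201011.0779621\n"
    "\n"
    "C 5.6857216 1.5855492 \u201012.4187559\n"
)
print(to_xyz(sample, "18-ring fragment"))
```

Counting the atoms that survive the filter also gives a quick check against the number stated in the source (1386 here) that nothing was lost at a page break.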

A bit of research leads to, e.g., the following page: The difference between a dash and a minus sign. There you find four different glyphs, any of which could look like a minus sign, and there could in fact be more. Next, the following resource, https://www.fontspace.com/unicode/analyzer#e=4oCQ, tells us that the “‐” found in the supporting information is in fact a “hyphen” (U+2010). Typed from a keyboard as a “-”, one learns this is a “hyphen-minus” (U+002D). There is also “−”, which emerges as a “minus sign” (U+2212), whilst a “–” emerges as an “en dash” (U+2013). Confused yet? Well, it all does rather depend on whether the creator of the molecular viewing program you are about to use has included all these variations in their program code. In this case clearly not, since a hyphen is not recognised. Once you get to this stage, around 30 minutes of occasional head scratching have elapsed, and you have further had to figure out how to do a global find and replace of a hyphen by a minus using your preferred software.
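Both the detective work and the global find and replace can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `describe` and `normalise_minus` are hypothetical, and it assumes that the character the viewer expects is the keyboard hyphen-minus (U+002D), which is what most numeric parsers accept.

```python
import unicodedata

# Look-alikes mentioned in this post, all mapped to the keyboard hyphen-minus.
DASH_LIKE = {
    "\u2010": "-",  # HYPHEN (the one found in the supporting information)
    "\u2212": "-",  # MINUS SIGN
    "\u2013": "-",  # EN DASH
}
TO_HYPHEN_MINUS = str.maketrans(DASH_LIKE)


def describe(text: str) -> list[tuple[str, str]]:
    """List (code point, Unicode name) for every non-ASCII character in text."""
    unusual = sorted({ch for ch in text if ord(ch) > 127})
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in unusual]


def normalise_minus(text: str) -> str:
    """Replace the dash look-alikes above with a plain hyphen-minus."""
    return text.translate(TO_HYPHEN_MINUS)


line = "C 5.1700172 1.6243489 \u201011.0779621"
print(describe(line))          # names the offending character
fixed = normalise_minus(line)
x, y, z = (float(v) for v in fixed.split()[1:])  # now parses as numbers
print(x, y, z)
```

Running `describe` on a pasted file before loading it anywhere shows immediately whether any character outside plain ASCII has crept in.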
What does all this have to do with FAIR? This means Findable, Accessible, Interoperable and Reusable. And those actions have to be possible not only by a human but by an autonomous and probably unsupervised system gathering data for machine learning or artificial intelligence. The Finding was facilitated by the “data availability” statement using the article DOI (a fully persistent identifier), but probably only a human could actually cope with the diversity of presentations for data found across multiple publishers (thus, to be technical, the access location of supporting data is rarely if ever actually declared in the metadata record associated with the DOI, which is what a machine would need to access the data). The Access in this case means resolving the URL above, but only if it does not change in the future! But the next bit, the Interoperability, is more of a challenge. Like myself, many a human might also take 30 minutes, or indeed just give up, in coping with the challenge of recognising that a hyphen is not a minus! So although we are grateful for that “data availability” statement, I dream of the day when that will in fact become a “FAIR data availability” statement!‡ Not many signs of that happening yet. I guess the AI-algorithms will in fact get smarter faster than people for coping with such issues.
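To make the point about metadata concrete, here is a hedged sketch of how a machine might ask for the metadata record behind a DOI, using the content negotiation offered by the doi.org resolver. The function names are hypothetical, the CSL JSON media type is one of several that registration agencies support, and nothing here implies that the returned record declares where the supporting data lives.

```python
import json
import urllib.request

# One media type commonly honoured by DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a request for a DOI's metadata record."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": CSL_JSON},
    )


def fetch_metadata(doi: str, timeout: float = 30.0) -> dict:
    """Resolve the DOI and return the metadata as a dictionary.
    Needs network access; the fields returned depend on the registration agency."""
    with urllib.request.urlopen(metadata_request(doi), timeout=timeout) as resp:
        return json.load(resp)


req = metadata_request("10.1038/s41586-023-06192-4")
print(req.full_url, req.get_header("Accept"))
```

Calling `fetch_metadata` on the article DOI and then searching the result for any pointer to the supporting information is a quick way to see how much, or how little, a machine is told.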
Anyway, you now have a 3D model of the 18-metallocene as this year’s selected molecule of the year! Click on the image above to load it.

‡For example, the data for this post is available at a FAIR repository, with the persistent DOI identifier: https://doi.org/10.14469/hpc/13536. This contains the optimised coordinates using the PM7 method. These are very little different from the coordinates from the article, which were obtained using the PBE0/Def2-TZVP method, a remarkable calculation given it uses 21618 basis functions!

This entry was posted on Friday, December 29th, 2023 at 3:59 pm and is filed under Uncategorised.
