A dataset of alternately located segments in protein crystal structures

Raw data collection

Protein structures are collected from the Protein Data Bank (PDB) through a structured query against the polymer entity data API [13,14]. We queried for all entities in structures meeting the following criteria: (i) Method: X-Ray Diffraction; (ii) X-Ray Resolution ≤ 3.5 Å; (iii) Rfree ≤ 0.33; (iv) Number of chains ≤ 20.

The query result is a list of PDB IDs, each with an entity number (e.g. 1ABC:1), matching the query criteria. Each entity within a PDB structure corresponds to one or more identical polypeptide chains which exist in the structure. Note that a structure may have more than one unique entity (e.g., 1ABC:1 and 1ABC:2), in which case we obtain both. For each unique entity ID, we then obtain its associated chains (e.g., 1ABC:A and 1ABC:C) and include all of them in the dataset. In cases where the chain IDs in the PDB files (author chains) do not match the canonical chain IDs assigned by the PDB, we map between the author and PDB chains, so that our dataset contains only the canonical IDs.

Altloc collection

We use BioPython [15] to parse the PDB structure files and extract the residues and atom locations from the collected chains. For each atom in the structure, we parse all available alternate locations (altlocs) from the file. The altlocs are usually labelled with capital letters starting from ‘A’. In cases where a structure has a non-standard altloc labelling, we sort the labels lexicographically and relabel them starting from ‘A’. In such cases, the dataset denotes these altlocs as e.g. ‘A(Z)’ in the altloc name columns, meaning that the altloc with original label Z is denoted by A in the dataset’s other columns. This relabelling keeps the column names consistent across different structures.

Aligning to Uniprot sequences

We align the amino acid sequence of each chain to the Uniprot [16] record sequence to provide a Uniprot index for each residue in the chain.
We query the PDB’s entry data API [14] and examine the metadata to construct a mapping from the specific chain to a list of Uniprot IDs. Whilst most chains map to a single Uniprot ID, there are cases of synthetic proteins which have no associated Uniprot ID, and other cases where a chain is chimeric, i.e. contains sections from multiple different proteins. We discard such cases and keep only chains which map to a unique Uniprot ID. The alignment is performed using BioPython’s default pairwise alignment algorithm, with BLOSUM80 as the substitution matrix, a gap-opening penalty of −10 and a gap-extension penalty of −0.5.

Backbone locations and dihedral angles per altloc

For each PDB chain, we calculate the backbone angles \((\varphi, \psi)\) per altloc. To calculate a dihedral angle at altloc X, we take the X-altloc coordinates of all atoms participating in the calculation (from the current and previous/next residue). If the atoms required from the previous or next residue are not modelled under altloc X, we use the single set of coordinates modelled for them when calculating the dihedral angles of every altloc of the current residue. The dataset also always includes the dihedral angles calculated with all atoms at their default positions, i.e. ignoring altlocs. For each of the backbone atoms in each residue, we also collect its XYZ coordinates under each of the altlocs which exist for it. Finally, we use DSSP to assign a secondary structure per residue.

B-factors, location standard deviations and distances between altlocs

We calculate the B-factor per residue by averaging the B-factors of the N, CA and C backbone atoms. This is performed using the default atom positions, and additionally for each altloc which is defined for all three atoms.

For the CA atom, we also calculate, per altloc, the standard deviation in its location and its distance from the other altlocs, in Angstroms.
The standard deviation is obtained from the B-factor \(B_X\) of altloc X by \(\sigma_X = \sqrt{B_X / (8\pi^2)}\). For each pair of altlocs X and Y, we then calculate the distance in Angstroms between the alpha-carbons, \(d_{X,Y} = \Vert \boldsymbol{p}_{CA,X} - \boldsymbol{p}_{CA,Y} \Vert\), where \(\boldsymbol{p}_{CA,X}\) is the location of the alpha-carbon under altloc X. We also calculate this distance in units of the standard deviation, which is given by \(\tilde{d}_{X,Y} = \Vert \boldsymbol{p}_{CA,X} - \boldsymbol{p}_{CA,Y} \Vert / \sqrt{\sigma_X \sigma_Y}\).

Finally, we calculate the peptide bond length between adjacent residues, for each pair of altlocs of the current residue’s carbon and the next residue’s nitrogen.

Contacts

We calculate the contacts between all atoms of a residue, under all its altlocs, and all other atoms in the PDB structure, also under all possible altlocs. The per-atom contacts are then aggregated to the residue level for inclusion in the dataset.

First, we collect the set of locations of all atoms in the structure, under all altlocs. Next, we iterate over each residue in the chain, each atom within it, and each altloc defined for that atom. Taking the location of this altloc atom as a source, we calculate the distance to each target atom in the set of all locations. Two atoms are defined as being in contact when their distance is below a threshold of 5 Angstroms. Each detected contact is then classified into one of three types, depending on the identity and chain of the target atom: regular AA contact, out-of-chain (OOC) contact or ligand contact. Contacts from all atoms of the current residue are collected into one of three lists of contacts for that residue, based on this classification. The distance reported for a pair of source and target residues is the minimum over all atom pairs belonging to them.
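The contact procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `AtomLoc` record, the function names, the decision to skip atoms of the source residue itself, and the precedence of the ligand class over the chain test are all assumptions made for illustration. Only the 5 Angstrom threshold, the three contact types and the minimum-distance aggregation come from the text.

```python
import math
from dataclasses import dataclass

CONTACT_CUTOFF = 5.0  # Angstroms, the threshold stated in the text


@dataclass(frozen=True)
class AtomLoc:
    """One atom position under one altloc (hypothetical record type)."""
    chain: str
    res_id: int
    res_name: str
    atom_name: str
    altloc: str            # "" when the atom has a single modelled position
    xyz: tuple
    is_ligand: bool = False


def classify(source, target):
    """Assign a contact to one of the three contact types.

    Assumption: a ligand target is labelled "ligand" regardless of its chain.
    """
    if target.is_ligand:
        return "ligand"
    if target.chain != source.chain:
        return "ooc"  # out-of-chain contact
    return "aa"       # regular amino-acid contact


def residue_contacts(source_atoms, all_atoms, cutoff=CONTACT_CUTOFF):
    """Aggregate atom-level contacts of one residue to the residue level.

    source_atoms: every atom of the residue under every altloc.
    all_atoms:    every atom location in the structure under every altloc.
    Returns {contact_type: {(target_chain, target_res_id): min_distance}}.
    """
    contacts = {"aa": {}, "ooc": {}, "ligand": {}}
    for src in source_atoms:
        for tgt in all_atoms:
            # Assumption: atoms of the source residue itself are skipped.
            if (tgt.chain, tgt.res_id) == (src.chain, src.res_id):
                continue
            d = math.dist(src.xyz, tgt.xyz)
            if d >= cutoff:
                continue
            kind = classify(src, tgt)
            key = (tgt.chain, tgt.res_id)
            best = contacts[kind].get(key)
            # Keep the minimum distance per source/target residue pair.
            if best is None or d < best:
                contacts[kind][key] = d
    return contacts
```

The brute-force double loop is quadratic in the number of atom locations; a spatial index such as a k-d tree would be the usual choice at scale, but the loop keeps the classification and aggregation steps easy to follow.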
Hydrogen atoms and water molecules are always excluded, even if they are modelled in the structure.

Codon assignment

Since the exact genetic sequence of the protein is not annotated in the PDB, we assigned codons from the native sequence following the procedure described in our prior work [17]. Given a PDB chain, we obtain its unique Uniprot ID from the previous step. We query Uniprot to obtain all cross-referenced IDs to the European Nucleotide Archive (ENA). From the ENA database, we obtain all available genetic sequences for the specific protein, translate each genetic sequence to an amino-acid sequence using the standard genetic code table, and perform pairwise sequence alignment between the PDB chain’s amino-acid sequence and the translated genetic sequences. The alignment is performed using BioPython with the same options as described above for the Uniprot alignment.

Following the pairwise alignment of the amino acid sequence to all translated genetic sequences, we obtain the aligned codons from each sequence and assign them to the corresponding residues of the PDB chain. This process yields zero or more assigned codons per residue in the PDB chain. In cases where there is more than one codon (i.e., different genetic sequences contributed different codons), we choose the most common one and reflect the ambiguity by assigning a codon score, which is the proportion of genetic sequences that contributed the assigned codon.

Removal of low-quality structures

We used the R-factor to remove structures with a potentially poor fit to the electron density. A structure was defined as admissible only if it satisfied both of the following criteria: (1) Rwork ≤ 0.98 Rfree; and (2) Rfree ≤ min{0.3, max{0.2, resolution-dependent cut-off}}. The resolution-dependent cut-off was fitted as a monotone polynomial to the 90th percentile of Rfree estimated in 12 equiprobable resolution bins ranging from 0.5 to 3.5 Å (Fig. 1).

Fig. 1 Example of broken altloc chain shown on the crystal structure 1VYO (Avidin at 1.48 Å resolution).
Altloc B (red) is not modelled between residues 37 and 41 and is therefore counted as two separate segments in our data set.

Non-redundant cluster assignment

To account for redundancy among the collected chains and the proteins they originate from, we clustered the chains into non-redundant clusters using the amino acid sequence data. Clustering was performed using MMseqs2 [18] with a minimum sequence identity threshold of 0.5 and a target coverage of 0.8. Cluster identities were recorded in the metadata alongside the chain identities.

Segmentation of contiguous altlocs

For each of the collected chains containing altlocs, we grouped all altlocs with contiguous residue numbers into numbered segments. A segment requires the altloc to be modelled at every location within it. This means that when an altloc is broken by missing residues, the section is counted as two segments (Fig. 1).
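The segmentation rule can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the input layout (a mapping from altloc label to the residue numbers at which that altloc is modelled) are hypothetical. The sketch also assumes that segments are formed separately for each altloc label and that residue numbers are plain integers without insertion codes.

```python
def contiguous_segments(residue_numbers):
    """Split residue numbers into runs of consecutive integers."""
    segments = []
    for num in sorted(set(residue_numbers)):
        if segments and num == segments[-1][-1] + 1:
            segments[-1].append(num)   # extends the current run
        else:
            segments.append([num])     # a gap starts a new segment
    return segments


def number_segments(altloc_residues):
    """Number the segments of one chain.

    altloc_residues: {altloc_label: iterable of residue numbers}
    Returns a list of (segment_id, altloc_label, first_residue, last_residue).
    """
    numbered = []
    segment_id = 0
    for altloc in sorted(altloc_residues):
        for run in contiguous_segments(altloc_residues[altloc]):
            segment_id += 1
            numbered.append((segment_id, altloc, run[0], run[-1]))
    return numbered


# Made-up residue numbers: altloc B is missing at residues 13 to 17,
# so it is split into two segments while altloc A forms a single one.
example = {
    "A": range(10, 20),
    "B": [10, 11, 12, 18, 19],
}
for row in number_segments(example):
    print(row)
```

With this toy input, altloc A yields one segment covering residues 10 to 19 and altloc B yields two, mirroring the broken-chain situation illustrated in Fig. 1.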
