Automatic ploidy prediction and quality assessment of human blastocysts using time-lapse imaging

Characteristics of datasets

The study was performed in accordance with relevant guidelines and regulations. The study was approved by the Institutional Review Board at Weill Cornell Medicine (numbers 1401014735 and 19–06020306) and by the IVI Valencia Institutional Review Board (number 1709-VLC-094-MM). The IRB determined that this research meets the exemption requirements at HHS 45 CFR 46.104(d) and is secondary research for which consent is not required. A waiver of informed consent was granted by the IRB, as the images were de-identified for this retrospective review of clinical data. The embryo imaging was performed as part of the standard care procedure during the preimplantation and IVF cycle. No discarded embryos were used. In this study, information, which may include information about biospecimens, is recorded by the investigator in such a manner that the identity of the human subjects cannot readily be ascertained directly or through identifiers linked to the subjects. Moreover, the investigators do not contact the subjects and will not re-identify them. As such, informed consent was not obtained and participants did not receive compensation for the study.

The research utilized multiple datasets for training and validation of the machine learning models. The first dataset, known as the WCM-Embryoscope data, was collected from the Center for Reproductive Medicine at Weill Cornell Medicine between 2018 and 2019. It comprises time-lapse images and PGT-A results for 1998 embryos, including 494 single aneuploid (SA), 588 complex aneuploid (CxA), and 916 euploid (EUP) embryos. A total of 498 patients were included in the WCM-Embryoscope data, with an average of four biopsied embryos each. We treated each sample independently, irrespective of parental origin.
Accompanying the time-lapse sequences were clinical data such as the embryologist-derived blastocyst score (BS), morphokinetic parameters, and maternal age at the time of oocyte retrieval. The blastocyst score is the sum of a set of scores converted from the expansion, inner cell mass (ICM), and trophectoderm (TE) grades and the day of blastocyst formation14. The blastocyst score ranges from 3 to 14, with a lower number indicating a higher-quality embryo. The images were captured using the Embryoscope® imaging instrument.

To validate the models’ generalizability, we used a second dataset, referred to as the WCM-Embryoscope+ data, which was also collected from the Center for Reproductive Medicine. These data, however, were gathered between 2019 and 2020 using a newer Embryoscope+® instrument and included a total of 841 embryos (170 SA, 261 CxA, and 410 EUP). Like the first dataset, this one also contained BS, morphokinetic parameters, and maternal age for each embryo.

Two external datasets were employed for further validation. The first, referred to as the Spain dataset, came from IVI Valencia and contained 543 embryos (309 ANU and 234 EUP) with time-lapse sequences, morphokinetic parameters, and maternal age. These images were also captured using the Embryoscope® instrument. The second external dataset, referred to as the Florida dataset, was collected from IVF Florida and included 869 embryos (202 SA, 222 CxA, and 445 EUP) with maternal age and blastocyst score for each embryo. These images were captured using the Embryoscope+® instrument.

Preimplantation genetic testing

Embryos from Weill Cornell were biopsied on day 5 or day 6, depending on when they reached the blastocyst stage. Biopsied cells were analyzed using next-generation sequencing (NGS) technology at the Ronald O. Perelman and Claudia Cohen Center for Reproductive Medicine (CRM). CRM uses VeriSeq technology from Illumina.
The VeriSeq kit utilizes targeted DNA sequencing to detect chromosomal anomalies in embryo biopsies. Samples prepared with the VeriSeq PGS kit are sequenced with the standard Illumina MiSeq system. Details about the VeriSeq kit and MiSeq system can be found on the Illumina platform19,20. Analyses for the Spain dataset were done by Igenomix Spain. Embryos were subjected to assisted hatching on day 3, after cell counting, with the Hamilton-Thorne Lykos® laser. After the embryos reached the blastocyst stage, 5–6 trophectodermal cells were biopsied and their ploidy was assessed by Thermo Fisher Scientific’s NGS technology. Embryos from IVF Florida were also analyzed by Igenomix using Thermo Fisher Scientific’s NGS technology. More details about PGT-A protocols can be found in García-Pascual et al.21.

Temporal and spatial processing

Extracted time-lapse image sequences were highly variable in length, frame rate, and start and end points. Because of this variability, numerous embryos were missing information from particular time periods, and a lack of proper annotation could lead to bias in model training. To mitigate these biases, the following protocol was developed to clean and standardize all time-lapse sequences:

1. Standardized time points are designated at 30-min intervals from 0 to 150 hours post-insemination (hpi) (i.e., 0 hpi, 0.5 hpi, … 149.5 hpi, 150.0 hpi).

2. For each embryo, the time-lapse image taken closest to each standardized time point is assigned to that time point. If there is no image close enough (within 2 h) to the standardized time point, a blank frame is assigned to it. We chose a 2-h boundary as the ‘close enough’ range for several reasons. First, our observations indicated that significant changes in the embryos typically occurred at intervals greater than 2 h. As a result, a 2-h window provided a balance between accurately capturing significant changes and allowing for reasonable data standardization. This timeframe was also influenced by the overall rate of data acquisition, which sometimes varied but was generally frequent enough to capture changes within this 2-h window. However, we recognize the potential for variability, and further studies may explore the impact of different time boundaries. We also note that the rest of our analysis can be replicated with a different time window and, hence, can be modified on a case-by-case basis. At this point, each standardized time-lapse sequence has 301 frames, with each frame corresponding to a standardized time point between 0 and 150 hpi.

3. After the construction of standardized time-lapse sequences, frames can be extracted for video classification model development using three parameters: start hour, end hour, and interval. For example, a model trained on day 2 embryo development would use these parameters: start hour = 24.0 hpi, end hour = 48.0 hpi, and interval = 2 h. This results in 13 frames.

4. For image classification tasks, a time point of focus can be ascertained, and the frame assigned to that time point can be extracted.
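The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `standardize` and `extract`, the `(hpi, frame_id)` input format, the use of `None` for a blank frame, and treating the 2-h boundary as inclusive are all hypothetical choices made here.

```python
from typing import List, Optional, Tuple

STEP_H = 0.5       # spacing of standardized time points (30 min)
MAX_HPI = 150.0    # last standardized time point
TOLERANCE_H = 2.0  # "close enough" window from step 2 (assumed inclusive)


def standardize(frames: List[Tuple[float, str]]) -> List[Optional[str]]:
    """Map (hpi, frame_id) pairs onto the 0..150 hpi grid; None marks a blank frame."""
    n_points = int(MAX_HPI / STEP_H) + 1  # 301 standardized time points
    grid: List[Optional[str]] = []
    for i in range(n_points):
        target = i * STEP_H
        # Nearest acquired frame to this standardized time point, if any exist.
        best = min(frames, key=lambda f: abs(f[0] - target), default=None)
        if best is not None and abs(best[0] - target) <= TOLERANCE_H:
            grid.append(best[1])
        else:
            grid.append(None)
    return grid


def extract(grid: List[Optional[str]], start_h: float, end_h: float,
            interval_h: float) -> List[Optional[str]]:
    """Pick frames from start_h to end_h (inclusive) every interval_h hours."""
    stride = round(interval_h / STEP_H)
    first = round(start_h / STEP_H)
    last = round(end_h / STEP_H)
    return grid[first:last + 1:stride]


# Hypothetical acquisition times; real sequences contain many more frames.
raw = [(10.2, "a"), (24.1, "b"), (30.0, "c")]
grid = standardize(raw)
day2 = extract(grid, start_h=24.0, end_h=48.0, interval_h=2.0)
print(len(grid), len(day2))  # 301 13
```

With the day 2 parameters from step 3, the slice covers grid indices 48 to 96 in steps of 4, which gives the 13 frames stated in the text. For step 4, a single frame is simply `grid[round(t / STEP_H)]` for the time point `t` of interest.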

We standardized the lengths, start points, and end points of all time-lapse videos using set time points and intervals. Adjacent frames were utilized to impute missing time points. Sequences rendered unusable for a given prediction task after standardization were excluded from the analysis. The exclusion criteria covered cases where the embryo was absent from the Petri dish, the embryo was less than half visible, or the image was too dim to discern the embryo. We resized each frame from 800 × 800 to 224 × 224 pixels. To curtail background bias during model training, we implemented a circle Hough Transform for embryo segmentation in each video frame. This processing was uniformly applied across the WCM-Embryoscope, WCM-Embryoscope+, Spain, and Florida datasets.

To bolster the diversity and robustness of our training data, we incorporated video augmentation techniques, including random horizontal flipping and rotations. The former yielded mirror images of original frames, effectively doubling our data and fostering diverse pattern learning. Random rotations enhanced the model’s adaptability to varied embryo orientations, thereby simulating real-world scenarios. We opted for these techniques because they represent plausible real-world variations and thus fortify our model’s robustness.

General study architecture

Two binary prediction tasks were modeled among euploid (EUP), aneuploid (ANU), and complex aneuploid (CxA) embryos: EUP versus ANU and EUP versus CxA. Spatial features for each frame were extracted from the cleaned time-lapse images of the embryos using an ImageNet pre-trained VGG16 convolutional neural network (CNN). Time-lapse image frames from 96 hpi to 112 hpi (day 5) were processed according to the “Temporal and spatial processing” section. The features extracted from these frames were input to a multitask BiLSTM regression model (video regression task), which was primarily trained to predict embryologist-derived blastocyst scores.
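The background masking and horizontal flipping described under “Temporal and spatial processing” can be illustrated with a minimal pure-Python sketch. It assumes the embryo circle (center and radius) has already been found, for example by a circle Hough Transform as in the text; the detection itself is not shown. The function names, the nested-list frame format, and the toy 5 × 5 frame are hypothetical and chosen only for illustration.

```python
from typing import List

Frame = List[List[int]]  # grayscale frame as rows of pixel intensities


def mask_outside_circle(frame: Frame, cx: float, cy: float, r: float) -> Frame:
    """Return a copy of the frame with every pixel outside the circle set to 0."""
    return [
        [px if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2 else 0
         for x, px in enumerate(row)]
        for y, row in enumerate(frame)
    ]


def horizontal_flip(frame: Frame) -> Frame:
    """Mirror the frame left to right (one of the augmentations named in the text)."""
    return [row[::-1] for row in frame]


# Toy 5x5 frame with uniform intensity; the circle is assumed to be pre-detected.
toy = [[9] * 5 for _ in range(5)]
masked = mask_outside_circle(toy, cx=2, cy=2, r=1)
flipped = horizontal_flip([[1, 2, 3], [4, 5, 6]])
```

In practice these operations would run on image arrays rather than nested lists, and the same flip or rotation would be applied to every frame of a sequence so that the video stays temporally consistent; that last point is an assumption about sensible practice, not a detail given in the source.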
We investigated various dataset combinations for training the BELA models (Supplementary Note 4), ultimately using only WCM-Embryoscope data for the final models. To prevent data leakage, the WCM-Embryoscope dataset was split 70/30 for training/testing. This process exclusively utilized embryos that passed our exclusion criteria, reducing the dataset from 1998 to 1684 embryos. The BiLSTM regression model was trained using only the training split of the dataset. Four-fold cross-validation was employed when training the BiLSTM regression models, setting aside data for monitoring validation loss. The blastocyst scores predicted by the BiLSTM regression model for the training-split embryos, along with maternal age, were used to train a logistic regression model to predict embryo ploidy. A logistic regression model was trained on each of the cross-validated BiLSTM regression models, and the performance metrics of the logistic regression models were averaged. Model performance was measured using accuracy, area under the receiver operating characteristic curve (AUC), precision, and recall.

Feature extraction

To extract spatial features from each frame of the time-lapse images, an ImageNet pre-trained model from TensorFlow 2.7 was utilized. After experimenting with various pre-trained feature weights and extractors, we selected a VGG16 CNN architecture to extract spatial features from images. The VGG16 architecture performed significantly better than ResNet50 and DenseNet201 (p < 0.05) (Supplementary Fig. 14). Although VGG16 did not perform significantly better than the InceptionV3 architecture, it was faster, which further warranted its use. VGG16 architectures have been used successfully as feature extractors for other tasks pertaining to time-lapse images in IVF22,23,24.
Furthermore, a survey of developments in medical image deep learning revealed that VGG16 was among the three predominantly utilized CNN architectures, attributed to its fewer hidden layers and reduced propensity for overfitting on smaller datasets25. The final layer of the pre-trained architecture performed average pooling, which resulted in 512-dimensional feature vectors for each frame of each embryo.

BELA prediction models

A BiLSTM network was employed for blastocyst score regression, leveraging its capabilities in sequential data pattern recognition to process temporal information from time-lapse images26. BiLSTM architectures have been employed in video classification and regression tasks across healthcare and broader domains27,28. Given that time-lapse images represent sequences of frames in which data order is pivotal, the bidirectional attributes of the architecture become essential for discerning events with distinct phases. Merging feature extraction processes, which identify spatial patterns in time-lapse images, with a BiLSTM architecture adept at interpreting temporal context facilitates optimal utilization of the time-lapse data. Our architecture comprises a bidirectional LSTM layer and three dense layers. The BiLSTM received 512-dimensional feature vectors extracted per frame for each embryo. While attention mechanisms and multiple bidirectional LSTM layers were explored, they failed to enhance performance significantly (p > 0.05) across all tasks. We modified the BiLSTM architecture to perform multitasking, wherein, in addition to the blastocyst score, the model was trained to predict the expansion score, ICM score, and TE score. Multitasking has been used in previous studies to increase performance in scenarios where predicting related targets together may be advantageous to individual task performance.
Similar tasks may overlap in the model weights required to reach accurate predictions, so each task provides additional information for performing the others29,30. Because expansion, ICM, and TE scores make up the overall blastocyst score, we believe that multitasking can be used to improve blastocyst score prediction. The BiLSTM architecture consists of one bidirectional LSTM layer followed by two multi-unit dense layers. For each prediction task, a 1-unit dense layer is added to the model. Since all tasks of the multitask model are regression-based, we used logcosh as the loss function and Adam as the optimizer. Loss weights for each prediction task within the multitask environment were equal. Maternal age was included as a feature in the BiLSTM regression model to predict blastocyst score. Early stopping with patience = 5 was used to limit overfitting to the training data by monitoring the validation loss on the held-out cross-validation data.

The performance of the first component of BELA was evaluated using the mean absolute error (MAE) of the predicted blastocyst score (MDBS). Multitask BELA demonstrated a lower MAE (1.855 ± 0.03) compared with a non-multitask BELA (1.877 ± 0.027) on the WCM-Embryoscope test set, supporting the use of multitasking. The second part of BELA, the logistic regression model, was fed the predicted blastocyst score, sometimes in combination with maternal age, and performed a binary classification task. The logistic regression model used cross-entropy loss.

Computational resources and time requirements

Model training and inference were conducted using an Apple M1 Mac with TensorFlow Metal. Logistic regression models demonstrated an average training time of 2.5 ± 1.2 s, whereas BiLSTM models required 30.3 ± 11 min. The BELA model on the STORK-V platform was trained on a high-performance BioHPC computing cluster at Cornell, Ithaca, utilizing an NVIDIA A40 GPU and achieving a training time of 5.23 min.
Inference for a single embryo on the STORK-V platform took 30 ± 5 s. The efficient use of consumer-grade hardware highlights the practicality of our models for assisted reproductive technology applications.

Statistics and reproducibility

Where relevant, we used the Student’s t-test to compare the means between two groups. This statistical test was selected because it is well-suited for comparing the means of two samples when the data are approximately normally distributed and the variances of the two groups are similar, as is the case with our data. In addition, where appropriate, all experiments were adjusted for multiple testing using Bonferroni correction to control the increased chance of a spurious statistically significant result. Sample sizes for datasets were determined based on the maximum usable subset available after all exclusion criteria were applied to embryos. These exclusion criteria removed embryos with a mosaic PGT-A status and embryos with missing information such as blastocyst score, ploidy status, or maternal age. Randomization was introduced into experimentation through four-fold cross-validation in all relevant comparisons. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
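As an illustration of the comparison described under “Statistics and reproducibility”, the sketch below computes the equal-variance (Student's) two-sample t statistic and applies a Bonferroni adjustment to a list of p-values. It is a hedged example, not the authors' analysis code: the function names and the sample numbers are hypothetical, and converting the t statistic into a p-value (which requires the t distribution, for example from a statistics library) is left out.

```python
from math import sqrt
from statistics import mean, variance
from typing import List, Sequence


def pooled_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Student's t statistic for two independent samples, assuming equal variances."""
    na, nb = len(a), len(b)
    # Pooled sample variance, weighted by each group's degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-fold scores for two models and hypothetical raw p-values.
t = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
adjusted = bonferroni([0.01, 0.04, 0.5])
```

The statistic has `len(a) + len(b) - 2` degrees of freedom. An adjusted p-value below the chosen significance level (0.05 in the comparisons reported above) would then be read as significant after correction.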
