Study design

In this cross-sectional study, individuals with AD or MCI due to AD (PwA) were recruited from among visitors to the outpatient clinic of the Neurology or Psychiatry Department of Juntendo University Hospital, Juntendo Koshigaya Hospital, or Takeda General Hospital from June 1, 2022, to December 31, 2022. The diagnoses of AD and MCI due to AD were made according to the National Institute on Aging and Alzheimer’s Association criteria33. HCs (i.e., individuals without any neurological diseases) were recruited by a recruitment company (https://3h-ct.co.jp/) during the same study period and matched to patients by age and sex: HCs were exactly matched to PwA by sex, and the age difference between matched PwA and HCs was restricted to within 5 years. The inclusion criteria were as follows: (1) aged over 65 years; (2) native Japanese speaker; and (3) ability to use the application. The exclusion criteria were as follows: (1) speech problems so severe that speech was undetectable by a tablet microphone; and (2) inability to use the application. General information such as age, sex, and past medical history was obtained using the application.

This study consisted of two sessions: an application session, which comprised a cognitive test and a conversation with a chatbot, and a cognitive assessment session, which comprised the Mini-Mental State Examination-Japanese (MMSE-J), the Hasegawa Dementia Scale-Revised (HDS-R), the Japanese version of the Montreal Cognitive Assessment (MoCA-J), the Trail Making Test (TMT), the Neuropsychiatric Inventory (NPI), and the Geriatric Depression Scale (GDS). The tests in the cognitive assessment session were administered by licensed psychologists.

The order of the application and cognitive assessment sessions was randomized to control for potential order effects. Both sessions were conducted in a quiet, brightly lit room of the hospital during one visit, except in a few cases in which the patient was unable to complete the sessions in one day. Both PwA and HCs were tested in the same environment. The application session was completed by the participants themselves, who followed the staff’s instructions. Sample photos from the experiment are shown in Supplementary Fig. 1. All participants used the application for the first time and were required to finish all of the tasks in one session. Participants who left midway through the experiment were excluded.

Equipment and online application

An iOS mobile application was developed to collect data during the testing session. The application was built on Amazon Web Services (AWS, Seattle, WA, USA), including AWS Lambda and the AWS Application Programming Interface (API) Gateway. Unstructured data (i.e., video, image, and sound) were stored in the Amazon Simple Storage Service (S3) using the WebRTC protocol (an open-source project for real-time communication in applications; https://webrtc.org/), and structured data (i.e., text answers) were stored in the AWS Relational Database Service. We used a facial expression recognition API supported by GLORY Ltd. (Tokyo, Japan).
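The backend implementation is not described in further detail; as a rough, hedged illustration of the architecture just outlined (API Gateway in front of Lambda, with media uploads going to S3 and text answers to the relational database), a minimal Lambda handler might look like the sketch below. The bucket name, field names, and response format are assumptions for illustration, not the authors' code.

# Minimal sketch of a Lambda function behind API Gateway (proxy integration).
# Bucket name, table layout, and field names are hypothetical.
import json
import boto3

s3 = boto3.client("s3")
MEDIA_BUCKET = "app-media-uploads"  # hypothetical S3 bucket

def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    participant_id = body["participant_id"]
    question_id = body["question_id"]

    # Structured (text) answers would be written to the relational database
    # (e.g., via an RDS connection); omitted here for brevity.

    # For unstructured data (video/audio), return a time-limited presigned
    # URL so the iOS client can upload the recording directly to S3.
    key = f"{participant_id}/{question_id}.mp4"
    upload_url = s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": MEDIA_BUCKET, "Key": key, "ContentType": "video/mp4"},
        ExpiresIn=900,  # 15 minutes
    )
    return {"statusCode": 200, "body": json.dumps({"upload_url": upload_url, "key": key})}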
The iOS application was installed on an iPad device (Apple, Cupertino, CA, USA). The application consisted of three parts: a smile test session, a chatbot talk session, and a cognitive test session. Assuming that older adults may possess less technical knowledge and fewer digital skills, we implemented visually simple and intuitive designs with voice guidance. We also introduced elements such as speech-bubble text displays to support further actions, waveform displays during voice recording, and tutorials for all measurements, enhancing the visual and auditory aspects of the interface to make it easy to understand. Tasks appeared one at a time: the next task was displayed only after the previous one was finished. Participants were asked to remove the surgical masks they were wearing to prevent COVID-19 infection so that their faces and voices could be properly assessed during the application session. Sample images of the user interface are shown in Supplementary Fig. 2.

In the smile test session, the participant was given audio and visual instructions to smile and then return to a neutral face. For each instruction, facial expressions were captured for 15 s (smile test video). The smile test session was performed twice for each participant. If there were problems with the facial data (e.g., face out of frame), the smile test session was repeated once.

The application cognitive test included 35 questions. One additional external question, based on the Cookie Theft picture from the Boston Diagnostic Aphasia Examination, Third Edition (BDAE-3), was also used. Among the 35 cognitive questions, 11 had multiple-choice answers, which accounted for 15 points; 18 required audio responses, totaling 30 points; and 6 involved on-screen interactions, such as drawing and moving objects, which accounted for 6 points. These questions were formulated on the basis of the MMSE, MoCA-J, and TMT and spanned several cognitive domains: memory, 4 questions (12 points); language, 11 questions (11 points); attention, 4 questions (8 points); calculation, 1 question (1 point); executive function, 5 questions (9 points); and orientation, 10 questions (10 points). Of the 35 cognitive questions, 11 (15 points) were scored automatically, and the remaining 24 were scored manually by a psychologist. Detailed information on the 35 cognitive questions is provided in Supplementary Table 3.

The additional external question, which used the Cookie Theft picture from the BDAE-3, was designed to evaluate speech fluency and executive function. In this task, a drawing of a complicated kitchen scene is shown, in which a woman is wiping dishes and two children are trying to take cookies from a cupboard. Audio data from image-description tasks using this picture have been used in previous studies to discriminate patients with AD from HCs13,19,25,28,29,34, partly because an accessible databank containing these data is available. In the present study, this stimulus appeared on the participants’ tablets, and they were asked to describe it orally. Responses were recorded for up to 5 min and transformed into audio features, which were compared with the features generated from the chatbot questions. Details of the external question are provided in Supplementary Table 4. At the end of the application session, the total score and standard deviation were provided as feedback to the participants.
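As a purely illustrative sketch of how the automatically scored items and per-domain subtotals described above could be aggregated, the snippet below sums awarded points by cognitive domain; the item metadata is a hypothetical stand-in for the actual item definitions in Supplementary Table 3.

# Illustrative aggregation of application cognitive test scores.
# The item metadata below is a hypothetical stand-in for Supplementary Table 3.
from collections import defaultdict

# Each item: (question_id, domain, max_points, auto_scored)
ITEMS = [
    ("q01", "orientation", 1, True),
    ("q02", "memory", 3, False),
    ("q03", "language", 1, True),
    # ... remaining items up to the 35 questions (51 points in total)
]

def aggregate_scores(awarded_points):
    """awarded_points: dict mapping question_id -> points given
    (automatic for multiple-choice items, manual otherwise)."""
    domain_totals = defaultdict(int)
    total = 0
    for qid, domain, max_points, _auto in ITEMS:
        points = min(awarded_points.get(qid, 0), max_points)
        domain_totals[domain] += points
        total += points
    return total, dict(domain_totals)

total, by_domain = aggregate_scores({"q01": 1, "q02": 2, "q03": 1})
print(total, by_domain)  # 4 {'orientation': 1, 'memory': 2, 'language': 1}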
The chatbot session included seven questions that were inserted before and after the cognitive testing session. The chatbot asked general questions, for example, about the participants’ hobbies, favorite foods, and impressions of the cognitive test. All questions in the chatbot session were answered orally, and both audio and facial expression data were recorded (chatbot talk video). The maximum answer time (response time), which was controlled by the app, was 5 min for each chatbot question. After answering a question, the participants pressed the “next” button to move to the next one; psychologists reminded the participants to press the button. The questions used in the cognitive test and chatbot sessions are listed in Supplementary Tables 3–5. The data collection, processing, and analysis steps are shown in Supplementary Fig. 3.

Video data and facial expression features

As described in the previous section, two types of videos were recorded: a smile test video and a chatbot talk video. The recording conditions were as follows: resolution, 1,280 × 960 pixels; frame rate, 30 fps; and compression format, H.264 (approximately 2,300 kbps).

We used the smile index35 to evaluate the facial features in the smile test video. The computation of the smile index consists of two steps. First, the system extracts the facial area from the image using a deep learning-based model trained on a general facial image dataset, primarily depicting healthy people. Next, the system calculates the smile index from the extracted facial image using another deep learning-based model trained on a dataset containing thousands of smiling and neutral faces, mostly of healthy people. Specifically, the latter model is trained to output the degree of smiling corresponding to the input face image, and the resulting smile index takes scalar values ranging from 0 (neutral) to 100 (smiling).

A smiling section of video was defined as a section with a relatively high smile index. The boundaries of the smiling section were determined using a method that identifies regions with a significant change in the smile state, as indicated by the difference between local smile indexes.

Sections in which participants were instructed to smile were designated as instructed smiling sections, and sections in which participants were instructed to make a straight face were designated as instructed straight-face sections (Fig. 1). Participants with fewer than two adequate-quality videos were excluded from the analysis.

Fig. 1. Composition of the sections considered when extracting facial features from the smile test.

From the two videos per participant, we calculated four types of smile-related features from the smile index: (1) the duration of the smiling face in the smiling section; (2) the time taken to smile following the instruction in the smiling section, indicated by the rising angle of the smile index; (3) the amplitude and fluctuation of the smile index in the smiling and instructed smiling sections, calculated from the average, maximum, minimum, and standard deviation values of the smile index; and (4) the difference in the strength of the smile index between the smiling and non-smiling sections and between the instructed smiling and instructed straight-face sections.

To evaluate the facial features in the chatbot talk video, we used the smile index as detailed above, as well as the orientation index, which indicates the angles (elevation, azimuth, and rotation) of the face; the eye-opening index, which indicates the degree of eye opening; and the blink index, which denotes blink behavior estimated from the eye-opening index. A total of 48 facial features were extracted from the video data for further analyses according to previous reports and the experience of neurologists.
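Because the smile index itself is produced by a proprietary API, the sketch below only illustrates, under stated assumptions, how per-section summary features of the kind listed above could be computed from an already-extracted smile index time series (one value per frame at 30 fps). The fixed threshold is a simplification of the local-difference boundary detection used in the study, and the section frame ranges are assumed to be known from the test protocol.

# Illustrative smile-related features from a smile-index time series (0-100).
# The threshold-based smiling-section detection is a simplification; frame
# ranges for the instructed sections are assumed to be known from the protocol.
import numpy as np

FPS = 30  # video frame rate

def smile_features(smile, instructed_smile, instructed_straight, threshold=50.0):
    """smile: 1-D array of smile-index values per frame.
    instructed_smile / instructed_straight: (start, end) frame ranges."""
    s0, s1 = instructed_smile
    n0, n1 = instructed_straight
    instructed = smile[s0:s1]

    smiling_mask = instructed >= threshold  # simplified smiling-section detection
    duration_s = smiling_mask.sum() / FPS   # (1) duration of smiling face

    onset_frame = np.argmax(smiling_mask) if smiling_mask.any() else np.nan
    latency_s = onset_frame / FPS           # (2) time taken to smile after the instruction

    return {
        "smile_duration_s": duration_s,
        "smile_latency_s": latency_s,
        # (3) amplitude and fluctuation within the instructed smiling section
        "smile_mean": float(instructed.mean()),
        "smile_max": float(instructed.max()),
        "smile_min": float(instructed.min()),
        "smile_std": float(instructed.std()),
        # (4) contrast between instructed smiling and instructed straight-face sections
        "smile_contrast": float(instructed.mean() - smile[n0:n1].mean()),
    }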
Audio data and sound features

The audio data were recorded at a 12-kHz sampling rate in MPEG-4 audio format. To reduce the effects of background noise, we applied sound overlap processing to eliminate the noise. First, all audio files were manually checked, and the start and end times of the overlapping sections were tagged. Second, the overlapping sound sections were removed according to the tagged times, and the result was saved as a new sound file. Finally, all sound features were extracted from the new audio data.

We previously defined three types of sound features: linguistic, prosodic, and acoustic features35. Linguistic features are related to the lexical content of speech and include statistics of filled pauses, parts of speech, and vocabulary richness. Prosodic features are related to pitch, speech rate, and phonation time. Acoustic features are related to statistics of the fundamental frequency and formants (f0, f1, and f2), shimmer (variation in sound wave amplitude), jitter (variation in sound wave frequency), and mel-frequency cepstral coefficients (MFCCs). In this study, audio answer data were saved for individual questions. To average the responses in the chatbot session, we combined the audio answer data from the entire chatbot session. All sound features were calculated from these combined data and processed using open-source library packages (Librosa 0.8.1, https://github.com/librosa/librosa; Pydub 0.25.1, https://github.com/jiaaro/pydub; Parselmouth 0.4.3, https://github.com/YannickJadoul/Parselmouth; and spaCy 3.4.4, https://github.com/explosion/spaCy). In total, 110 sound features, comprising 53 linguistic features, 6 prosodic features, and 51 acoustic features, were extracted from the audio data, as suggested by previous reports and on an empirical basis according to neurology specialists (H. T-A., G.O., N.H.).

Statistical analysis

We applied the Mann-Whitney U test to evaluate the differences in sound and facial expression features between PwA and HCs. Significant (P ≤ 0.01) features were then used for classification model building. We applied Pearson’s correlation test to examine the relationships between the features and the application cognitive test score. Features with a correlation coefficient of 0.1 or higher were used in the regression model building. The statistical analysis and model building were performed using open-source library packages (SciPy version 1.7.1 and scikit-learn version 1.0.1) in Python version 3.8.
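The exact feature definitions and model-building code are not reproduced here; the sketch below is only a hedged illustration of the flavor of this pipeline, assuming a cleaned, combined audio file per participant: a handful of acoustic/prosodic features computed with Librosa and Parselmouth, followed by the two screening steps described above (Mann-Whitney U test at P ≤ 0.01 and a Pearson correlation cut-off of 0.1, interpreted here as an absolute value). The file path, feature names, and data structures are assumptions, not the study code.

# Illustrative pipeline: a small subset of acoustic/prosodic features plus the
# feature-screening steps described above. File paths, feature names, and the
# use of |r| for the correlation cut-off are assumptions for this sketch.
import librosa
import parselmouth
from scipy.stats import mannwhitneyu, pearsonr

def extract_audio_features(wav_path):
    """Return a few example features from one cleaned, combined audio file."""
    y, sr = librosa.load(wav_path, sr=None)           # keep the original 12-kHz rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]                                   # voiced frames only

    feats = {f"mfcc_{i}_mean": float(v) for i, v in enumerate(mfcc)}
    feats.update({"f0_mean": float(f0.mean()), "f0_std": float(f0.std())})
    return feats

def screen_features(pwa, hc, features_all, scores_all):
    """pwa, hc: dicts of feature_name -> per-group value arrays.
    features_all / scores_all: feature values and cognitive test scores for
    all participants (aligned), used for the correlation-based selection."""
    for_classification, for_regression = [], []
    for name in pwa:
        _, p = mannwhitneyu(pwa[name], hc[name], alternative="two-sided")
        if p <= 0.01:                                  # group-difference screening
            for_classification.append(name)
        r, _ = pearsonr(features_all[name], scores_all)
        if abs(r) >= 0.1:                              # correlation screening
            for_regression.append(name)
    return for_classification, for_regression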