Unobtrusive measurement of cognitive load and physiological signals in uncontrolled environments

Ethics Statement

Ethical approval was obtained from the Institutional Review Board (IRB) of the University of Potsdam (application number 36/2022). Study information sheets were sent to potential participants weeks before they participated in the study, and participants had sufficient opportunities to ask questions or raise any concerns before, during, and after the study. Written consent was obtained for both the participation in this study and the publication of anonymized data. Participants were informed that they could drop out of the study at any time and without negative consequences, which one participant chose to do while collecting data in the self-chosen environments.

Participants & Demographics Information

Data collection was conducted during the summer, autumn, and winter of 2022 as well as the spring of 2023, after advertisements were sent out via mailing lists. Inclusion criteria required the participants to be aged 18 to 68, be fluent in English, have normal or corrected-to-normal vision, know how to use a smartphone, and regularly perform work that is performance-evaluated (e.g. students or employees). Potential participants were excluded if they were retired or needed to regularly take medication to support the treatment of a neurological disease such as depression, brain damage, or similar. In total, 24 participants agreed and were eligible to participate. One participant dropped out due to personal reasons during the data collection in the self-chosen environments, and two additional participants recorded incomplete datasets, as the participant-controlled data collection failed. However, the laboratory data for these three participants is mostly complete. The data is nearly balanced across biological sex (11 female participants).
The participants were highly educated: every participant had at least a bachelor’s degree, and the majority (seventeen participants) held the equivalent of a master’s degree or higher. Participants were aged 24 to 61, with a mean age of 29.5 years (± 8.2 years). All participants were right-handed. The distribution of the countries of origin is depicted in Fig. 2a; the participants were from Germany (10), Brazil (3), Bangladesh (2), Chile (2), India (2), Ecuador (1), Egypt (1), Mexico (1), Peru (1), and the USA (1).

Fig. 2. Metadata on (a) the countries of origin of the participants, and (b) the duration of data collected for each participant in the controlled environment (in ), and in the uncontrolled environments (in ). As can be seen in (a), the majority of participants identified Europe, South America, or Asia as their continent of origin, while a minority identified North America or Africa. As depicted in (b), approximately 120 hours of recordings in the controlled environment and approximately 194 hours of recordings in the uncontrolled environments were collected in total for both the Muse S headband and the Empatica E4 watch.

Experimental Design

To overcome the limitations of the existing body of research, an experimental design was developed that includes data collection on varying days, at varying times of day, and at varying locations.
As a result, the data set provides, for each participant, data recorded on different days and at different times of day, allowing the investigation of how the physiological signals of that individual fluctuate in unnatural (controlled laboratory) as well as natural (realistic work-from-home) environments.

Controlled Environment

The laboratory recordings took place in designated rooms at the chair for Digital Health – Connected Healthcare at the Hasso Plattner Institute, the Digital Engineering Faculty of the University of Potsdam, where participants were instructed to follow the study design depicted in sub-figure A of Fig. 1. Each task was explained to the participants before they followed the instructions given by the study platform. Participants sat comfortably in a temperature-controlled room with floor-to-ceiling windows facing a green backyard garden, at a distance of 80 cm from a Full HD PC monitor. The laboratory rooms for the controlled recordings had been chosen in a rarely frequented wing, and the study directors had temporarily posted sticky notes on all the doors of that wing, reminding passersby (a) to talk, if at all, at a low volume, and (b) not to enter the recording rooms. Due to the sticky notes and the sound-absorbing properties of the doors, few distractions reached the participants while they performed the tasks. If someone entered the room by accident, the experimenter present in the recording room got up quietly and immediately escorted the person outside, explaining that the room could not be used for the time being and keeping the interruption quiet and short. Interruptions during the recordings in the controlled environment were thereby kept to a minimum (approximately five interruptions noticed by the participants) and were noted down in the experimental notes.
The experimenters sat at another workplace in the same room, far enough away to ensure the participant’s comfort and give a sense of privacy, while still allowing questions to be raised if necessary. To avoid participant-driven breaks, participants were reminded to use the bathroom before the recordings, but were encouraged to notify the experimenter and take a bathroom break during the recording if necessary. Participants were given drinks of their choice (water, different teas, or coffee), and the choice was noted down in the session notes if it deviated from water. Subsequently, participants were instructed on how to use the devices on their own, while the study director ensured proper device fit and started the PsychoPy application (v2022.2.1, developed in Python 3.8).

During the study, the participants were mainly guided by the PsychoPy application. For only two activities, namely the synchronization of the PsychoPy application with the sensors at the beginning and at the end of the experiment, did the platform ask the participants to call the study directors for assistance. This synchronization was performed by fast-paced tapping on the spacebar with the hand on which the Empatica E4 watch was worn, i.e. the non-dominant (left) hand. Before the experiment, participants had been shown how to perform the fast-paced spacebar tapping by raising the hand with the Empatica E4 watch high above their head and quickly dropping it onto the spacebar, thereby generating high acceleration values. This process was to be repeated four times for each of the synchronizations at the start and end of the experiment. During the instruction, all participants appeared to understand the procedure sufficiently well and tapped the training keyboard properly. However, during the actual experiment, some participants (in approximately ten individual sessions in total) were too careful with the recording equipment.
It was later confirmed in personal discussions that some participants were afraid of destroying the keyboard. As a result, some participants tapped slowly and with a small range of motion, resulting in difficult-to-detect patterns within the acceleration data. Whenever this occurred, the experimenters noted it in the recording notes. The study director sat at the desk opposite the participant, hidden behind two monitors, and took notes on excessive body movements or other anomalies that might have led to artefacts during the recordings, such as timestamps of drops in the Bluetooth connection or of moments when the participant drank something.

The study protocol for the controlled environment is illustrated in sub-figure A of Fig. 1. At the beginning, after the sensors had been synchronized with each other and with the platform, the participants watched a relaxation video of scenic shots from the national park Torres Del Paine, Chile, with relaxing music (https://www.youtube.com/watch?v=jXl1GbK5ZO8&t=3s). Next, a set of questionnaires and an eye-closing session were performed to assess the baseline affective state and physiological signals of the participants. For the remainder of the recording, the participants performed four different tasks at two difficulty levels (easy and hard), each lasting ten minutes and presented in random order.

These four tasks encompassed (i) Mental Arithmetic, (ii) Sudoku, (iii) N-back Task, and (iv) Stroop Task. (i) The Mental Arithmetic Task is one of the well-known techniques for inducing cognitive load [12]. Here, participants are required to solve mathematical problems mentally, without support from writing instruments or calculators. The easy level consisted of simple addition and subtraction, potentially involving carry numbers. The hard level required complex multi-digit multiplication and division.
Operands ranged from -100 to 100, and the specific tasks, answers, and reaction times were logged during each experiment. (ii) A digital version of the game Sudoku [13], which can sometimes be found pre-installed on Linux and Windows systems, was utilized in this study. Two distinct difficulty levels, easy and hard, were chosen among the four default difficulty levels provided by the application (https://wiki.gnome.org/Apps/Sudoku). Participants filled a 9 × 9 grid with the numbers 1 to 9 such that each column, row, and 3 × 3 subsection contained every number exactly once. If the participants solved the game before the time was up, they were instructed to play further rounds of the same difficulty level until the time limit of ten minutes was reached and the application terminated automatically. (iii) The N-back task [14] requires memory-sequencing of coloured rectangular blocks shown on the screen. The participant had to match the colour of the current stimulus with the colour of the stimulus n elements earlier. The application was configured to depict six distinct colours (, , , , white, ), with n = 1 and n = 2 defining the easy and hard difficulty levels, respectively. The participants had two seconds to give their answer before inactivity was rated as a missed trial. Colours, answers, and reaction times were logged during each experiment. (iv) To stress the control processing of the working memory, the Stroop task [15] was chosen. In this task, a sequence of single words appears on the screen, each stating a colour (e.g. ). However, the word is rendered in the same or a different colour (e.g. ). The participants had to recognize the font colour, ignoring the written word, and type the first letter of the name of the font colour (e.g. y for ). In total, four colours were utilized, namely , , , and .
On the easy level, the participants had a maximum of 5.5 seconds to answer, whereas on the hard level, the answer had to be given in under 1.5 seconds. Colours, answers, and reaction times were logged during each experiment.

Other than Sudoku, each task was preceded by a trial session of 45 seconds. Objective labels and data were logged by the PsychoPy application (e.g. task difficulty, start and end timestamps, and task-specific information). Subjective labels were provided by the participants as answers to NASA-TLX, PANAS, affective slider, and Likert scale questionnaires after the relaxation video and at the end of all of the tasks. Due to time constraints, in between the individual tasks the participants solely answered the pair-wise NASA-TLX questionnaire [16] and the affective sliders [17], and reported their mental workload and stress level during the previous task on a Likert scale [18]. The pair-wise NASA-TLX questionnaire quantifies the subjective mental workload of a given task on six continuous sub-scales, each ranging from 0 to 100, which cover mental demand, frustration, physical demand, temporal demand, performance, and effort. For these sub-scales, additional weights were derived from the pair-wise comparisons of the questions. The affective sliders measure the subjective ratings of pleasure and arousal on two separate sliders. The sliders depict bipolar affective states visually through emoticons [19] to rate the current emotion and did not need a written explanation; they were nevertheless explained before the experiment in a video-guided explanation of the whole study paradigm.
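To make the weighting step concrete, the following is a minimal sketch of the conventional weighted NASA-TLX score. It assumes the standard procedure, in which each of the 15 pair-wise comparisons adds one point to the chosen sub-scale and the weighted ratings are divided by the total weight of 15. The function names and the example answers are hypothetical and are not taken from the study platform.

```python
from itertools import combinations

SUBSCALES = ("mental demand", "physical demand", "temporal demand",
             "performance", "effort", "frustration")


def tlx_weights(winners):
    """Count how often each sub-scale was chosen in the 15 pair-wise comparisons.

    `winners` holds one chosen sub-scale per pair, in the order produced by
    itertools.combinations(SUBSCALES, 2) (an assumption of this sketch).
    """
    pairs = list(combinations(SUBSCALES, 2))
    if len(winners) != len(pairs):
        raise ValueError("expected one winner per pair (15 in total)")
    weights = {name: 0 for name in SUBSCALES}
    for pair, winner in zip(pairs, winners):
        if winner not in pair:
            raise ValueError(f"{winner!r} is not part of pair {pair}")
        weights[winner] += 1
    return weights


def weighted_tlx(ratings, weights):
    """Weighted workload score on the same 0-100 scale as the sub-scale ratings."""
    total_weight = sum(weights.values())  # 15 for a complete questionnaire
    return sum(ratings[name] * weights[name] for name in SUBSCALES) / total_weight


# Hypothetical answers: the first sub-scale of every pair is always chosen.
example_winners = [pair[0] for pair in combinations(SUBSCALES, 2)]
example_ratings = {"mental demand": 80, "physical demand": 10,
                   "temporal demand": 60, "performance": 40,
                   "effort": 70, "frustration": 30}
example_weights = tlx_weights(example_winners)
example_score = weighted_tlx(example_ratings, example_weights)
```

A sub-scale that is never chosen receives a weight of zero and therefore does not contribute to the score, which is the usual behaviour of the weighted variant.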
The participants also rated their subjective mental workload and stress levels on two 5-point Likert scales with the options: “very very low”, “low”, “nor low nor high”, “high”, and “very very high”.

After the recording, the information on how to use the devices in the uncontrolled environments was repeated, the participants received a printed handout illustrating proper sensor fit and detailing the steps required for data acquisition, and an appointment for the second recording in the controlled environment was made.

Uncontrolled Environment

For the recordings in the uncontrolled environment, participants were free to choose where, when, and during which tasks they wanted to perform a recording. The only strict requirement was that participants had to aim for a balanced distribution between tasks they would consider low load and tasks they would consider high load. A printout with the steps to be followed, handed out to each participant, served as a guideline to ensure sufficient data quality of the recordings. It instructed the participants to perform the synchronization shaking protocol described in ‘Data Synchronization’, to perform sensor fit checks as illustrated, to start with an eye-closing session of one minute, and to record data for a variable time of 15 to 45 minutes at a stretch. Most of the participants followed the instructions on the shaking of the devices (some, however, at a rather low intensity and velocity), on the eye-closing protocol (more than 90%), and on the distribution of low-vs-high load tasks. The most prominent tasks were reading or searching for information (i.e. documentation or research papers; ≈ 26.75%), coding (various difficulties and programming languages; ≈ 16.74%), and processing data (e.g. performing data analysis, project planning based on data, doing mathematical calculations, etc.; ≈ 16.74%), while the least prominent tasks were relaxation (e.g. meditation; ≈ 0.5%), preparing a presentation (≈ 1.9%), and attending a meeting (≈ 2.8%).
Other tasks performed by the participants encompassed playing a game (≈ 7.8%), responding to emails (≈ 8.6%), watching a video (e.g. learning a new skill; ≈ 8.6%), and typing (e.g. summarizing publications or writing a new manuscript; ≈ 9.7%).

Recording Devices

Two wearable devices capable of recording physiological signals were used in this study, the Muse S headband and the Empatica E4 watch, shown in Fig. 3. The Muse S headband is capable of measuring Electroencephalography (EEG), Accelerometer (ACC), Gyroscope (GYRO), and Photoplethysmography (PPG) data. EEG data was recorded at 256 Hz at the five sensor locations AF7, AF8, TP9, TP10, and FpZ, according to the international 10-20 system and depicted in detail in Fig. 3b. ACC and GYRO data were sampled at 50 Hz, and PPG data could have been sampled at 64 Hz. However, to preserve battery life and the stability of the Bluetooth connection, PPG data was not recorded from the headband. As a consequence, PPG data was recorded only by the Empatica E4 watch. The recording for the Muse S headband was started, and its data collected, via the third-party app Mind Monitor. Mind Monitor provides many additional features such as band-power values and the Horse Shoe Indicator (HSI) for each electrode location, which reflects three states of electrode connection quality (0.0 means no connection, 1.0 means good connection, and 2.0 means poor connection). In future studies, the HSI could be correlated with the signal quality indices computed in this work. Subsequent pre-processing steps were performed on the raw data; the band-power features computed by the third-party app were excluded from the subsequent analysis but are made publicly available as well.
The Empatica E4 is capable of measuring Temperature (TEMP), Photoplethysmography (PPG), Electrodermal Activity (EDA), and Acceleration (ACC) data using the accompanying E4 realtime app provided by the manufacturer for iOS and Android devices or the E4 streaming server for Windows. TEMP and EDA data were collected at a sampling rate of 4 Hz, PPG data at 64 Hz, and ACC data at 32 Hz.

Fig. 3. (a) A study director wearing the Muse S headband and the Empatica E4 watch. (b) Electrode positions of the Muse S headband dry-electrode sensors according to the international 10-20 system, adapted with permission from [38].

Two further devices were utilized during the study: a Google Pixel phone (Pixel 3a; Android 12), on which the data collection apps had been installed, and a personal computer (PC), which displayed the study platform PsychoPy. After data collection in the controlled environment, the respective Google Pixel phone was handed out to the participant alongside the Muse S headband and the Empatica E4 watch used, for subsequent data collection in the self-chosen environments. The PC ran Ubuntu 20.04.5 and had 4 cores and 8 GB of RAM. An extensive logging functionality had been implemented, which logged a multitude of information, including synchronisation taps, questionnaire answers (NASA-TLX, PANAS, affective sliders, and Likert scales), eye-closing timestamps, and task-specific information such as correct answers, response times, and the operators in the specific task. For reasons of redundancy, the logging was directed to two files: a .csv file and a .log file. In two recordings, this redundancy was needed, as one of the two files had not been written properly due to an application error, while the other file had been stored correctly on disk.
As a consequence, for two of the recordings in the controlled environment, the log files had to be utilized to derive the task labels, and for two further recordings, the physiological signals had to be interpolated at the end of the recordings due to Bluetooth connection problems with the sensors.

Data Synchronization

The internal clocks of devices can run at different speeds than those of reference systems, a phenomenon called clock drift, and wearable sensors are no exception. This situation is worsened by the fact that wearable sensors can have different time zone settings and that computations on timestamps, which are stored as floating-point numbers, are performed at varying degrees of accuracy. For these reasons, the internal clocks of devices need to be resynchronized. Multiple solutions exist to ensure the synchronicity of data recorded from different devices and platforms, such as the Lab Streaming Layer (LSL; https://github.com/sccn/labstreaminglayer). However, this solution depends on the availability of a platform that receives the streamed data, which is not guaranteed in the uncontrolled environments chosen by participants. For these circumstances, a shake-based protocol was developed, and participants were asked to follow it closely. When starting a session in the self-chosen environment, participants had to start the recordings, place the wearable sensors flat on a surface and wait for six to twelve seconds, then take the Muse S headband and the Empatica E4 watch together and shake them vigorously for about twelve seconds, and finally place the devices on a flat surface again and wait for another six to twelve seconds before starting the actual task. This procedure was to be repeated at the end of the recording. By resting both devices on a flat surface twice and shaking them simultaneously in between, very clear and similar patterns of acceleration and gyroscope data were collected.
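The idea behind the shake protocol can be illustrated with a small, self-contained sketch: because both devices experience the same shake, the time offset between their clocks shows up as the lag that best aligns the two acceleration magnitudes. The code below is an illustration under stated assumptions and not the procedure implemented in the synchronization package used for the published data. It assumes uniformly sampled signals, brings them to a common rate by linear interpolation, and searches the lag by brute-force cross-correlation. All function names and the synthetic shake are hypothetical.

```python
import math


def magnitude(samples):
    """Euclidean norm of each (x, y, z) acceleration sample."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


def resample(signal, src_hz, dst_hz):
    """Linearly interpolate a uniformly sampled signal to a new sampling rate."""
    duration_s = (len(signal) - 1) / src_hz
    n_out = int(duration_s * dst_hz) + 1
    out = []
    for i in range(n_out):
        pos = i / dst_hz * src_hz            # fractional index in the source
        lo = min(int(pos), len(signal) - 2)  # keep lo + 1 inside the signal
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[lo + 1] * frac)
    return out


def best_lag(reference, other, max_lag):
    """Lag (in samples) at which `other` best matches `reference`.

    A positive result means the shared event appears later in `other`.
    """
    ref_mean = sum(reference) / len(reference)
    oth_mean = sum(other) / len(other)
    ref = [v - ref_mean for v in reference]
    oth = [v - oth_mean for v in other]
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(oth):
                score += r * oth[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def synthetic_shake(n, start, length):
    """Resting magnitude of 1.0 with a burst of activity (illustrative data only)."""
    return [1.0 + (3.0 * abs(math.sin(0.9 * (i - start)))
                   if start <= i < start + length else 0.0)
            for i in range(n)]


COMMON_HZ = 32  # e.g. the Empatica E4 ACC rate; the 50 Hz headband data would be resampled
watch = synthetic_shake(300, start=100, length=40)
headband = synthetic_shake(300, start=115, length=40)
lag_samples = best_lag(watch, headband, max_lag=60)
offset_s = lag_samples / COMMON_HZ
```

In practice the estimated offset would be subtracted from the timestamps of the lagging device. Repeating the estimate for the shake at the end of the recording gives a second offset, and the difference between the two indicates how much the clocks drifted during the session.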
Figure 4 provides an overview of the resulting accelerometer data. After loading the Muse S and Empatica E4 data with devicely in version 1.1.1 (https://pypi.org/project/devicely/1.0.2/), subsequent peak detection allowed the time series to be synchronized by aligning them and correcting for clock drift, using the Python-based synchronization package Jointly [20] in version 1.0.4. However, a few participants omitted this step during the recordings in the self-chosen environment, and a few of the accelerometer recordings in the uncontrolled environments were difficult to align. For this reason, and to ensure the same pre-processing steps across the published data, the data recorded in the uncontrolled environments was synchronized based on the timestamps given by the wearables, while the data recorded in the controlled environment was synchronized using Jointly.

Fig. 4. Overview of the normalized acceleration magnitude calculated from the Muse S headband and the Empatica E4 watch, utilized for data synchronization between the wearable devices. The shake start and end times are clearly visible as abrupt changes in acceleration magnitude, and the shaking lasts about 12 seconds for both sensors. To form the pre-processed and labeled physiological data from the controlled sessions, the time series were aligned between the shakes using the Python package Jointly [20]. For the data recorded in the uncontrolled sessions, the labeled data in the Labeled folders of each participant were extracted using the same timestamps from both wearable devices rather than the sensor synchronization using Jointly.
This decision was made because some participants had forgotten to shake the devices at the beginning, at the end, or at both times of the respective recording in the uncontrolled environments.

Data Processing

Two different formats of unprocessed data are available: (i) the raw data as recorded from the individual wearable devices and the study platform, and (ii) the data organized by tasks, which is the result of splitting the original (raw) time series data into the respective tasks performed by the participants. Task extraction was based either on a simple splitting technique using the timestamps from the devices in the uncontrolled environment (ii.1) or on a more sophisticated shake-detection protocol performed by the experimenters for each recording in the controlled environment (ii.2). Additionally, a simple data pre-processing pipeline was implemented for use in qualitative and quantitative evaluations, resulting in three data sets: (i-pre), (ii.1-pre), and (ii.2-pre). Pre-processing of the EEG data included a Butterworth filter with a pass band of 0.5 – 50 Hz, implemented with the MNE-Python package (https://mne.tools/stable/generated/mne.filter.filter_data.html), to remove high-frequency noise from muscle activation of the scalp and low-frequency disturbances such as heartbeats. An additional Butterworth filter at 50 Hz removed the power-line interference from the signal. Furthermore, a movement filter was applied: the accelerometer data from the headband was filtered in the range of 0.5 – 20 Hz, and the magnitude of the acceleration was computed for each participant and each controlled or uncontrolled session. Using a binary search, a participant-wise threshold was computed that flags 3 to 5.25% of the data as high acceleration; the corresponding EEG data was then interpolated. Additionally, the EEG data from the controlled session was normalized by min-max normalization and by removing the baseline obtained from the eye-closing session.
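The binary search for the movement threshold could look roughly as follows. This is a sketch under assumptions rather than the authors' implementation: the search bounds (minimum and maximum magnitude), the strict "greater than" comparison, the iteration limit, and the function name are all hypothetical, since the text only specifies the target range of 3 to 5.25%.

```python
def find_motion_threshold(magnitudes, low_frac=0.03, high_frac=0.0525, max_iter=60):
    """Binary-search a threshold so that the share of samples above it lies
    within [low_frac, high_frac]. Returns (threshold, flagged_fraction)."""
    lo, hi = min(magnitudes), max(magnitudes)
    n = len(magnitudes)
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        frac = sum(m > mid for m in magnitudes) / n
        if frac > high_frac:
            lo = mid    # too many samples flagged: raise the threshold
        elif frac < low_frac:
            hi = mid    # too few samples flagged: lower the threshold
        else:
            return mid, frac
    raise ValueError("no threshold puts the flagged share into the target range")


# Illustrative data: 1000 distinct magnitudes between 0.0 and 0.999.
example_magnitudes = [(i * 37 % 1000) / 1000 for i in range(1000)]
threshold, flagged_fraction = find_motion_threshold(example_magnitudes)

# Boolean mask of the samples whose EEG counterpart would be interpolated.
motion_mask = [m > threshold for m in example_magnitudes]
```

Because the accelerometer and the EEG are sampled at different rates (50 Hz and 256 Hz), the resulting mask would still have to be mapped onto the EEG time axis, for example via the timestamps, before the affected EEG samples are interpolated.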
In contrast, the data from the uncontrolled session underwent only min-max normalization, as not all participants performed an eye-closing session as instructed. The data from both sessions were average-referenced. Due to the lack of dedicated recording channels for ocular (electrooculogram, EOG), muscular (electromyogram, EMG), or cardiac (electrocardiogram, ECG) activity, the widely applied [21] steps of EOG, EMG, and ECG removal were not performed in more detail, and only other obvious artefacts (such as the loss of contact or of the Bluetooth connection) were interpolated using the mean of the neighbouring values. The raw blood volume pulse (BVP) data extracted from the PPG sensors underwent the same normalization as described for the EEG data. Subsequently, for further data cleaning, a Savitzky-Golay filter with filter order 4 and window length 31 was applied to the BVP data using the SciPy Python package (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html). The TEMP and EDA data were preprocessed by interpolating the missing data. The preprocessing steps are summarized in Fig. 5.

Fig. 5. The preprocessing steps performed to clean the raw EEG data by applying a Butterworth filter, a notch filter, and baseline normalization. The BVP data was preprocessed by applying a Savitzky-Golay filter and baseline normalization.

For extracting features from the preprocessed data of the four modalities, a 60-second window with 80% overlap was used. As described in the literature [22], the power spectral density (PSD) of five commonly used frequency bands, namely δ (0.5 − 4 Hz), θ (4 − 7 Hz), α (8 − 12 Hz), β (12 − 30 Hz), and γ (30 − 50 Hz), was computed from each of the four channels of EEG data using Welch’s method with a Hann window function. The average of each band power over the four channels, denoted as mean − (δ, θ, α, β, γ), and the ratio θ/α were used as features.
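As a small worked example of the windowing described above, a 60-second window with 80% overlap advances by 12 seconds per step. The sketch below computes the sample indices of such windows for a given sampling rate. It assumes that windows start at the first sample and that an incomplete trailing window is dropped, neither of which is stated in the text. The function name is hypothetical.

```python
def window_bounds(n_samples, sampling_rate_hz, window_s=60.0, overlap=0.8):
    """Return (start, stop) sample indices of overlapping analysis windows.

    `stop` is exclusive, so signal[start:stop] selects one window.
    """
    window = int(round(window_s * sampling_rate_hz))
    step = int(round(window * (1.0 - overlap)))
    if window < 1 or step < 1:
        raise ValueError("window and step must cover at least one sample")
    return [(start, start + window)
            for start in range(0, n_samples - window + 1, step)]


# A ten-minute task recorded with the EEG (256 Hz) and the EDA sensor (4 Hz).
eeg_windows = window_bounds(n_samples=10 * 60 * 256, sampling_rate_hz=256)
eda_windows = window_bounds(n_samples=10 * 60 * 4, sampling_rate_hz=4)
```

Under these assumptions both modalities yield the same number of windows for a task of the same duration, so features from different sensors can be paired window by window.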
Additionally, asymmetry features were calculated for each power band by subtracting the log-transformed spectral power of the right hemisphere from that of the left hemisphere; they are denoted as frontal − α − asy and (δ, θ, α, β, γ) − asy. Features were extracted from the cleaned BVP data using the NeuroKit2 Python package (https://neuropsychology.github.io/NeuroKit/). The time-domain features include the mean of the RR intervals (HRV – MeanNN), the standard deviation of the RR intervals (HRV – SDNN), and the root mean square of successive RR-interval differences (HRV – RMSSD). The frequency-domain features include the normalized low-frequency (0.04 to 0.15 Hz) power (HRV – LFn), the normalized high-frequency (0.15 to 0.4 Hz) power (HRV – HFn), and the ratio between the two (HRV – LFn/HRV – HFn). From the EDA signal, the number of skin conductance response peaks (SCR – Peaks – N) and the mean amplitude of these peaks (SCR – Peaks – Amplitude – Mean) were extracted. Furthermore, the mean (mean – temp) and the standard deviation (std – temp) of the temperature were extracted from the cleaned TEMP signal.
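For readers unfamiliar with the time-domain HRV features named above, the following sketch shows how they follow from a list of detected pulse-peak times. It is a plain-Python illustration of the textbook definitions, not the NeuroKit2 code used for the published features. It assumes that peak times are given in seconds and that SDNN is the sample standard deviation (divisor n − 1). The function name and the example peaks are hypothetical.

```python
import math
import statistics


def hrv_time_domain(peak_times_s):
    """Compute MeanNN, SDNN and RMSSD (all in milliseconds) from peak times."""
    # Intervals between successive peaks, converted to milliseconds.
    nn = [(b - a) * 1000.0 for a, b in zip(peak_times_s, peak_times_s[1:])]
    if len(nn) < 2:
        raise ValueError("at least three peaks are needed")
    # Differences between successive intervals.
    diffs = [b - a for a, b in zip(nn, nn[1:])]
    return {
        "MeanNN": statistics.fmean(nn),
        "SDNN": statistics.stdev(nn),  # sample standard deviation (assumption)
        "RMSSD": math.sqrt(statistics.fmean([d * d for d in diffs])),
    }


# Illustrative peaks with alternating intervals of 800 ms and 900 ms.
example_peaks_s = [0.0, 0.8, 1.7, 2.5, 3.4]
example_features = hrv_time_domain(example_peaks_s)
```

SDNN reflects the overall variability within the window, whereas RMSSD responds only to beat-to-beat changes, which is why the two can diverge for slowly drifting heart rates.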
