The DataReader
To ensure that the submitted prediction algorithms are applied in a prospective manner, the program ‘datareader’ serves as the interface through which the prediction algorithms access pieces of the EEG data consecutively. It is supplied to the contest participants together with the training data so that the optimization can be performed using the same infrastructure.
Compilation and Configuration
The ‘datareader’ is supplied as a compact program written in C++, which is platform independent and can be compiled and used on various platforms. It consists of the source files ‘datareader.cpp’ and ‘datareader.h’. It can be compiled using the GNU C++-Compiler with the command
g++ datareader.cpp -o datareader
Using the command line compiler of Microsoft Visual Studio, it can be compiled by
cl datareader.cpp
after setting the appropriate environment variables (e.g. by using vcvars32.bat of Visual Studio).
The data sets of each patient have to be stored in separate folders, each accompanied by a file ‘patient.txt’ which contains basic information about the patient (sampling rate, number and names of EEG channels, time of the start of the first recording). The raw data is supplied in a binary format split into blocks of one hour duration. Information about each data block is given in text files with the file extension ‘.info’, containing the start and end time of the data block and a list of events which occurred during the block interval. The time is specified as ‘sample number since start of recording’.
In addition to the file ‘patient.txt’, a file ‘datafiles.txt’ contains the names of all available data blocks, and the file ‘lastposition.txt’ is used to store the number of the last sample read, and the number of the last data block. To initiate a new analysis both have to be set to zero.
Please note that the datareader program takes care of going through all the data blocks for one patient. ‘datafiles.txt’ and ‘lastposition.txt’ should only be edited when one wants to reset the program or to evaluate a specific block for some reason. During the evaluation, neither file may be changed.
Using the ‘datareader’
When started without any command line options, the program ‘datareader’ prints its version number and an error message as part of its standard output including information about the correct way to use it. The following five arguments have to be specified:
datareader OutputFile PatientNo AlarmEventSample PredHorBegin PredHorEnd
- OutputFile: the name of the file in which the datareader stores the data of each segment of EEG data (see below); please use a slash (‘/’) as the folder separator, e.g. c:/contest/out.txt
- PatientNo: the patient number of the current patient
- AlarmEventSample: if a prediction was triggered during the last time window, the sample number of the event time has to be given, or zero (‘0’) otherwise
- PredHorBegin: if a prediction was triggered during the last time window, the sample number of the start of the time interval for which the seizure is predicted has to be given, or zero (‘0’) otherwise
- PredHorEnd: if a prediction was triggered during the last time window, the sample number of the end of the time interval for which the seizure is predicted has to be given, or zero (‘0’) otherwise
A file 'datapath.txt' has to exist in the directory the datareader is started in, and must contain the path to the patient data. E.g. if the patient data is located at /home/feldwisch/prediction_contest/data/pat1 etc., the path "/home/feldwisch/prediction_contest/data" has to be written to 'datapath.txt' - excluding the patient directories. Relative paths can be used, too. This file will be edited for the evaluation phase to adjust it to the situation on the evaluation infrastructure.
When started correctly, the ‘datareader’ stores the requested section of EEG data in the text file ‘OutputFile’. It consists of n+2 columns of integer numbers which are separated by tabulator characters (‘\t’). The first column contains the sample number, the second column contains the flags for the events which occurred during this section, and the remaining n columns store the recordings of all channels (which are specified in ‘patient.txt’). Rows are separated by the newline character (‘\n’).
The number of samples which shall be provided as one chunk of data has to be set by the option ‘SamplesPerWindow’ in ‘patient.txt’ for each patient.
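The column layout described above can be read back with a few lines of C++. The following is only a minimal sketch: the names SampleRow and readWindow are hypothetical and not part of the supplied ‘datareader’ sources, and it assumes that every value fits into a 64-bit integer and that the event flags fit into 32 bits.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One row of the OutputFile (hypothetical helper type, not part of datareader).
struct SampleRow {
    long long sample = 0;             // column 1: sample number
    std::uint32_t flags = 0;          // column 2: event flags
    std::vector<long long> channels;  // columns 3..n+2: one value per EEG channel
};

// Reads all rows from a stream holding the tab-separated OutputFile contents.
// Lines that do not start with two integers are skipped.
std::vector<SampleRow> readWindow(std::istream& in) {
    std::vector<SampleRow> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);  // operator>> skips the tab separators
        SampleRow row;
        long long flags = 0;
        if (!(fields >> row.sample >> flags)) continue;
        row.flags = static_cast<std::uint32_t>(flags);
        long long value = 0;
        while (fields >> value) row.channels.push_back(value);
        rows.push_back(row);
    }
    return rows;
}
```

To read an actual file, an std::ifstream opened on the path passed as ‘OutputFile’ can be handed to readWindow. The number of rows returned should then correspond to ‘SamplesPerWindow’, except possibly at the end of the recording.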
Imagine that at some point you read the window with samples 1000-2000, you analyze it and trigger a prediction on sample 1500, i.e. AlarmEventSample=1500. If furthermore your algorithm tells you that the predicted seizure may start anytime between samples 4000 and 6000, you call the datareader for the next read with AlarmEventSample=1500, PredHorBegin=4000, and PredHorEnd=6000.
The AlarmEventSample thereby denotes the sample at which something particular happens that causes the prediction algorithm to predict a seizure. The ‘alarm’, for instance the warning of the patient, is raised immediately after the chunk of data. In the above example, the AlarmEventSample can be 1500 but the alarm itself is considered to be raised at sample 2001. If the PredHorBegin is also set to 2001, we will consider this alarm to be an early seizure detection rather than a prediction. The AlarmEventSample is used to find possible common effects that cause algorithms to raise alarms. The AlarmEventSample is not used for the evaluation of the seizure prediction performance.
Please note that there must be a minimum time interval of 10 seconds between the alarm, i.e. the first sample of the next data-block and the PredHorBegin if the alarm should be considered as a prediction. If this interval is less than 10 seconds it will be considered as an early detection, since the uncertainty in the determination of the seizure onset is in the order of a few seconds. Moreover the PredHorBegin is not allowed to start within the current sample interval. For the above example, a PredHorBegin=1800 would lead to an immediate termination of the datareader with an error message.
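The rules above can be checked on the caller's side before the next call is issued. The sketch below is an illustration only: the names AlarmKind, classifyAlarm and buildCommand are hypothetical, the sampling rate has to be taken from ‘patient.txt’, and the executable is assumed to be reachable as ‘datareader’. The text only states that a PredHorBegin inside the current window is rejected; treating earlier sample numbers the same way is an assumption.

```cpp
#include <sstream>
#include <string>

enum class AlarmKind { None, Invalid, EarlyDetection, Prediction };

// windowLast:   last sample of the window that was just analysed
// predHorBegin: intended PredHorBegin argument (0 = no alarm)
// samplingRate: samples per second, as given in patient.txt
AlarmKind classifyAlarm(long long windowLast, long long predHorBegin,
                        double samplingRate) {
    if (predHorBegin == 0) return AlarmKind::None;
    if (samplingRate <= 0.0) return AlarmKind::Invalid;
    // A horizon starting inside the current window terminates the datareader;
    // earlier samples are assumed to be rejected as well.
    if (predHorBegin <= windowLast) return AlarmKind::Invalid;
    const long long alarmSample = windowLast + 1;  // alarm is raised after the chunk
    const double leadSeconds = (predHorBegin - alarmSample) / samplingRate;
    return leadSeconds >= 10.0 ? AlarmKind::Prediction : AlarmKind::EarlyDetection;
}

// Assembles the command line with the five required arguments.
std::string buildCommand(const std::string& outputFile, int patientNo,
                         long long alarmEventSample, long long predHorBegin,
                         long long predHorEnd) {
    std::ostringstream cmd;
    cmd << "datareader " << outputFile << ' ' << patientNo << ' '
        << alarmEventSample << ' ' << predHorBegin << ' ' << predHorEnd;
    return cmd.str();
}
```

For the example window 1000-2000, and with a purely hypothetical sampling rate of 100 samples per second, classifyAlarm(2000, 4000, 100.0) yields a prediction, classifyAlarm(2000, 2001, 100.0) an early detection, and classifyAlarm(2000, 1800, 100.0) an invalid horizon. Whether a given PredHorBegin satisfies the 10 second rule therefore depends on the actual sampling rate of the patient.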
The electrographical and clinical events which occurred during data recording are written as an 'event flag' to the second column of the output file. The following list specifies the events which can occur, and the event type numbers they are assigned to:
ESO (Electrographic seizure onset): Type 1
EST (Electrographic seizure termination): Type 3
CSO (Clinical Seizure Onset): Type 5
CSO NA (Clinical Seizure Onset not available): Type 7
CST (Clinical Seizure Termination): Type 8
CST NA (Clinical Seizure Termination not available): Type 10
SSO (Subclinical Seizure Onset): Type 11
SST (Subclinical Seizure Termination): Type 14
STS (Start of Stimulation Interval): Type 17
STE (End of Stimulation Interval): Type 18
ART (Artefact): Type 19
MRX (Measurement Range Exceeded): Type 21
EBD (Electrode Box Disconnected): Type 24
EBR (Electrode Box Reconnected): Type 25
No Data (Gap in the Recording): Type 26
The event types of each sample are combined by a bitwise logical “or” operation using the C expression
flags = flags | (1 << (type-1));
for each event type. For a given event type ‘type’, the following expression can be used to test whether the corresponding flag is set:
if(flags & (1 << (type-1))) { ... }
In the case of a gap in the recording, either the flag for EBD / EBR is set (see above), or the ‘No Data’ flag is set for each sample for which no data is provided.
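The bit test shown above can be wrapped into small helpers that decode the flag column of one sample. This is a minimal sketch with hypothetical names (hasEvent, eventTypes, kMaxEventType); it assumes that the flags are held in an unsigned 32-bit integer, which is sufficient because the highest event type listed is 26.

```cpp
#include <cstdint>
#include <vector>

// Highest event type number in the list above ('No Data').
constexpr int kMaxEventType = 26;

// True if the flag for the given event type (1..26) is set.
bool hasEvent(std::uint32_t flags, int type) {
    if (type < 1 || type > kMaxEventType) return false;
    return (flags & (std::uint32_t{1} << (type - 1))) != 0;
}

// Returns all event type numbers whose flag is set, in ascending order.
std::vector<int> eventTypes(std::uint32_t flags) {
    std::vector<int> types;
    for (int type = 1; type <= kMaxEventType; ++type) {
        if (hasEvent(flags, type)) types.push_back(type);
    }
    return types;
}
```

For instance, a sample at which both ESO (type 1) and CSO (type 5) are flagged carries the value (1 << 0) | (1 << 4) = 17 in the second column, and eventTypes(17) returns the types 1 and 5.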
Please note that for the testing data only the information when the seizure terminated is provided and not the information when seizures start. This information may or may not be used.
Protocol output of the ‘datareader’
To enable an automatic assessment of the prediction results, a protocol entry is written for each execution of the program ‘datareader’. The file of the protocol can be set by using the option ‘ProtocolFileName’ in ‘patient.txt’, which has to be set to a separate file for each patient. The values of the following variables are written to this file (separated by tabulator characters):
PatientNo: The number of the current patient
LastHBlock: The number of the last data block
FirstSample: The first sample number of the written data block
LastSample: The last sample number of the written data block
WindowLength: The requested number of samples
AlarmEventSample: The sample an alarm is triggered for (or 0 otherwise)
HorizonStartSample: The start sample of the interval the seizure is expected for
HorizonEndSample: The end sample of the interval the seizure is expected for
TotalEventCount: The total number of events of this block (without ‘No Data’)
IsAtEnd: Is set to ‘1’ if the last data interval of this patient was written
ExecutionString: The execution string of the program
Error: An error string if an error occurred, or ‘none’ otherwise
The same information is written to stdout each time the program finished execution, which for instance can be parsed to evaluate whether an error occurred.
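A protocol entry can be split at the tabulator characters and checked for errors as sketched below. The sketch assumes that one entry occupies a single line and that the twelve values appear in the order listed above; neither is guaranteed by this description. The names ProtocolEntry, splitTabs and parseProtocolEntry are hypothetical.

```cpp
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Splits one line at tabulator characters.
std::vector<std::string> splitTabs(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) fields.push_back(field);
    return fields;
}

// A subset of the protocol variables (hypothetical helper type).
struct ProtocolEntry {
    long long firstSample = 0;
    long long lastSample = 0;
    bool isAtEnd = false;
    std::string error;
    bool hasError() const { return error != "none"; }
};

// Returns false if the line does not have twelve fields or a number is malformed.
bool parseProtocolEntry(const std::string& line, ProtocolEntry& entry) {
    const std::vector<std::string> f = splitTabs(line);
    if (f.size() != 12) return false;
    try {
        entry.firstSample = std::stoll(f[2]);  // FirstSample
        entry.lastSample = std::stoll(f[3]);   // LastSample
        entry.isAtEnd = (f[9] == "1");         // IsAtEnd
    } catch (const std::exception&) {
        return false;
    }
    entry.error = f[11];                       // Error
    return true;
}
```

The same function could be applied to the line printed to stdout, provided that it uses the same layout.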
If ‘datareader’ has just read the last data of the current patient and exits successfully, it returns 3 as its exit value. If an error occurs, it returns 1. After a successful run with further data blocks still to follow, it returns 0.
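The three exit values suggest a simple driver loop around the ‘datareader’. The following sketch keeps the actual program call abstract: runReaderOnce is a hypothetical callback that starts the ‘datareader’ and returns its exit value (how that value is obtained, e.g. from std::system, is platform dependent), and analyseWindow stands for the prediction algorithm. All names are assumptions made for illustration.

```cpp
#include <functional>

enum class ReaderStatus { MoreData, Error, LastBlockDone, Unknown };

// Maps the exit value of the datareader to a status.
ReaderStatus interpretExitCode(int code) {
    switch (code) {
        case 0: return ReaderStatus::MoreData;       // success, more data follows
        case 1: return ReaderStatus::Error;          // an error occurred
        case 3: return ReaderStatus::LastBlockDone;  // success, last data was read
        default: return ReaderStatus::Unknown;       // not described in this text
    }
}

// Calls the reader until the last data was read or an error occurred.
// Returns the number of windows that were handed to the analysis.
int processAllWindows(const std::function<int()>& runReaderOnce,
                      const std::function<void()>& analyseWindow) {
    int windows = 0;
    for (;;) {
        const ReaderStatus status = interpretExitCode(runReaderOnce());
        if (status == ReaderStatus::Error || status == ReaderStatus::Unknown) break;
        analyseWindow();  // the OutputFile now holds a valid chunk of data
        ++windows;
        if (status == ReaderStatus::LastBlockDone) break;
    }
    return windows;
}
```

Passing the callbacks as std::function objects keeps the loop independent of the way the program is started, so that it can be exercised with simulated exit values.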
Additional information regarding the data
Together with the raw data, additional information about the recordings is provided for the training data set. Most importantly, the start and end times of the clinical and subclinical seizures which occurred during the recording are given. These were classified by certified epileptologists using the standard protocol of the Epilepsy Center of the University Medical Center, Freiburg, Germany. For clinical seizures, both the onset / termination of epileptic activity in the EEG (ESO, Electrographic Seizure Onset / EST) and the onset / termination of clinical symptoms (CSO, Clinical Seizure Onset / CST) are marked. If it was not possible to determine the onset / termination of clinical symptoms, this was marked using the markers “CSO NA” and “CST NA”.
As the recordings were performed during presurgical evaluation, electrical stimulations were carried out. These are marked by the markers “STS” and “STE”. Only a very small and unrepresentative fraction of all artefacts was marked by hand using the marker “ART”. Artefacts lasting for several bins are also usually marked only once. Common artefacts which are not marked occur due to clipping of the data, resulting in constant plateaus of measurement values. Disconnection and reconnection of the recording box are marked using “EBD” and “EBR”.
If a gap in the recording occurred, all intermediate samples are marked as containing ‘No Data’ (Event type number 26, see above).