With this challenge, we aim to establish a first large, comprehensive database using data obtained from 6 different data centres: John Radcliffe Hospital, Oxford, UK; ICL Cancer Institute, Nancy, France; Ambroise Paré Hospital of Boulogne-Billancourt, Paris, France; Istituto Oncologico Veneto, Padova, Italy; University Hospital Vaudois, Lausanne, Switzerland; and Botkin Clinical City Hospital, Moscow, Russia. This dataset will be unique in representing multi-tissue (gastroscopy, cystoscopy, gastro-oesophageal, colonoscopy), multi-modal (white light, fluorescence, and narrow band imaging), inter-patient and multi-population (UK, France, Russia, and Switzerland) endoscopic video frames. Videos were collected from patients on a first-come-first-served basis at Oxford, randomized sampling was used at the French centres, and only cancer patients were selected at the Moscow centre. Videos at these centres were acquired with standard imaging protocols using endoscopes from different manufacturers such as Olympus, Biospec, and Karl Storz. While building our dataset, we randomly mixed these data with no exclusion criteria.
All the data used in the training and testing of the challenge will be published without any restrictions and made openly available through a challenge website after the challenge.
Our clinical and program committee will ensure that appropriate ethics approval is in place for all the data. Approval for public release of the data for this challenge is under way and pending. Note that we already have ethical clearance for some of the data that will be used in this challenge.
Training & Testing Data
Each contributing institute will aim to balance label proportions in both the train and test datasets. Stratified sampling by contributing institute and type of underlying tissue will be used to create a combined, well-balanced multi-institute train-test dataset. For the generalisation challenge, only label proportions will be balanced.
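As an illustration of the stratified sampling idea, the sketch below splits a list of frames so that every (centre, tissue) group contributes roughly the same share to the test set. It is a minimal sketch and not the organisers' actual procedure: the function name `stratified_split`, the record keys `centre` and `tissue`, the 20% test fraction and the toy data are all assumptions made for this example, and label balancing is not modelled.

```python
import random
from collections import defaultdict


def stratified_split(frames, test_fraction=0.2, seed=0):
    """Split frames so each (centre, tissue) stratum keeps about the same test share.

    `frames` is a list of dicts with hypothetical keys "centre" and "tissue".
    """
    # Group frames by stratum.
    strata = defaultdict(list)
    for frame in frames:
        strata[(frame["centre"], frame["tissue"])].append(frame)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for key in sorted(strata):
        group = list(strata[key])
        rng.shuffle(group)
        # Take the same fraction from every stratum.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data: 2 hypothetical centres x 2 tissue types, 10 frames per stratum.
frames = [
    {"id": i, "centre": centre, "tissue": tissue}
    for i, (centre, tissue) in enumerate(
        [("A", "colon"), ("A", "stomach"), ("B", "colon"), ("B", "stomach")] * 10
    )
]
train, test = stratified_split(frames, test_fraction=0.2, seed=0)
```

Sampling within each stratum, rather than from the pooled frames, is what keeps small centres or rare tissue types from being under-represented in either split.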
Please see below for more information on the train-test data.
Training data: ~2000 mixed resolution, multi-tissue, multi-modality, mixed population video frames from 3-4 different data centres with corresponding bounding box labels. Each frame can contain multiple artefact classes. For the semantic segmentation challenge, ~500 of the video frames will also have masks for the different artefact classes.
Testing data: ~500 mixed resolution video frames from 3 data centres (same institutions as in the training data) will be provided without any ground-truth labels. Class proportions match those of the provided train set. 100 of the 500 provided test frames will be used by participants for the semantic segmentation challenge.
Testing data: ~200 mixed resolution video frames from a 5th or 6th data centre not present in the balanced or train dataset. These will not be provided during training time.
In addition, participants may use their own training data but must provide detailed information about the data used. Instructions will be provided on the website soon.
First, the clinical relevance of the challenge problem was established. During this step, 2 expert clinicians suggested 6 different artefact types and performed bounding box labelling of these artefacts on a small dataset (~100 frames). These frames were used as a reference by 3 experienced postdoctoral fellows to produce bounding box annotations for the remaining train-test dataset. Finally, 2 experts (clinical endoscopists) carried out further validation to assure the reference standard. The ground-truth labels were randomly sampled (1 per 20 frames) during this process.
To keep annotation labels consistent, we defined a few rules to minimise the variance in annotations between annotators. Additionally, the final score down-weights IoU (intersection over union), whose variance between annotators is high and unavoidable for bounding box annotations in the dataset.
For the same region, multiple boxes were annotated if the region belonged to more than 1 class
Minimal box sizes were used to describe the artefact region, e.g. if an image contained many specular reflections, multiple small boxes were used instead of one large box to capture the natural size of the artefact
Each artefact type was determined to be distinctive and general across endoscopy datasets
Variance in bounding box annotations is accounted for by weighting the final score in multi-class artefact detection (0.2*IoU + 0.8*mAP), as IoU (intersection over union) is likely to vary much more than mAP (mean average precision)
Variance in class labels of masks for semantic segmentation was not significant
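To make the weighted detection score concrete, the sketch below computes the IoU of two bounding boxes and combines an IoU value with an mAP value using the 0.2 and 0.8 weights stated above. It is only a sketch: the box format `(x_min, y_min, x_max, y_max)`, the function names `box_iou` and `detection_score`, and the example numbers are assumptions for illustration, and how IoU and mAP are aggregated over classes and frames in the official evaluation is not specified here.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_score(iou, mean_ap, w_iou=0.2, w_map=0.8):
    """Weighted detection score: 0.2 * IoU + 0.8 * mAP by default."""
    return w_iou * iou + w_map * mean_ap


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
iou_example = box_iou((0, 0, 2, 2), (1, 1, 3, 3))

# Illustrative values only, not challenge results.
score_example = detection_score(iou=0.5, mean_ap=0.25)
```

Because IoU carries only a 0.2 weight, a shift in box placement between annotators changes the score far less than a missed or misclassified artefact does.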
Data Release Dates
|Training data I||Released|
|Training data II (2nd release)||Released|
|Test data||16th February|
*Please note that release dates might change. We are doing our best to make the data available as soon as we can.