Evaluation criteria

The challenge problems fall into three distinct categories. For each, well-defined evaluation metrics already exist and are widely used by the imaging community; we use these for evaluation here.

The three categories are: 

1) Multi-class artefact detection - proposed evaluation metrics: 

  • mAP – mean average precision of detected artefacts.  

  • IoU – intersection over union  

Participants will be ranked on a final score, a weighted combination of mAP and IoU (0.6*mAP + 0.4*IoU). 

Please note that your IoU should be in proportion to your mAP: if your IoU is high but your mAP is low, the panel may decide to pick a different winner with a higher mAP.
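
As a rough illustration of the ranking formula above, here is a minimal sketch that combines pre-computed mAP and IoU values; the function name and example numbers are assumptions, and the official scores are produced by the challenge evaluation server.

```python
# Minimal sketch (not the official scorer): combine pre-computed
# mAP and IoU into the detection ranking score 0.6*mAP + 0.4*IoU.

def detection_score(mean_ap: float, mean_iou: float) -> float:
    """Weighted detection score; both inputs are expected in [0, 1]."""
    return 0.6 * mean_ap + 0.4 * mean_iou

# Example (hypothetical numbers): mAP = 0.45, IoU = 0.30 -> 0.39
print(detection_score(0.45, 0.30))
```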

2) Semantic Segmentation - proposed evaluation metrics: 

  • DICE coefficient 

  • Jaccard Index (for scientific completeness) 

  • F2-score

Participants will be ranked on a final score, a weighted combination of the overlap metrics (DICE and Jaccard) and the F2-score: 0.75*((DICE+Jaccard)/2) + 0.25*F2-score. 

Note: for semantic segmentation we will evaluate only the classes in categoryList = ['Instrument', 'Specularity', 'Artefact', 'Bubbles', 'Saturation']
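
For orientation only, the sketch below shows one way the per-channel DICE, Jaccard, and F2 terms could be combined for binary masks shaped (5, H, W); it is an assumed illustration, not the official evaluation code.

```python
import numpy as np

def semantic_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Sketch of 0.75*((DICE+Jaccard)/2) + 0.25*F2 for binary masks shaped
    (5, H, W): instrument, specularity, artefact, bubbles, saturation."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    dice, jaccard, f2 = [], [], []
    for p, g in zip(pred, gt):                      # loop over the 5 channels
        tp = np.logical_and(p, g).sum()
        fp = np.logical_and(p, ~g).sum()
        fn = np.logical_and(~p, g).sum()
        dice.append(2 * tp / (2 * tp + fp + fn + eps))
        jaccard.append(tp / (tp + fp + fn + eps))
        f2.append(5 * tp / (5 * tp + 4 * fn + fp + eps))  # F-beta with beta = 2
    overlap = (np.mean(dice) + np.mean(jaccard)) / 2
    return 0.75 * overlap + 0.25 * np.mean(f2)
```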

3) Data Generalization - (only for the multi-class artefact detection task; task-3 mAP will be estimated on data from a 6th organisation not included in the training data) 

  • Score gap: Deviation score based on task-1 mAP and task-3 mAP 

Support software for these evaluation metrics is available online on GitHub. We will evaluate all participants' submissions through a web server on this website.  

EAD2019 Leaderboard

Check here: readme_leaderboard

Submission style
- ead2019_testSubmission.zip
    - detection_bbox
    - generalization_bbox
    - semantic_masks
  • detection_bbox / generalization_bbox - VOC format in .txt, one detection per line:

          class_name confidence x1 y1 x2 y2
    

    Tips:

    - If your annotations are in YOLO format (.txt), please convert them to VOC format - check our "yolo2voc_converter.py" (a conversion and file-writing sketch also follows this list)

  • semantic masks

    • .tif file with 5 channels 

      ch1: instrument, ch2: specularity, ch3: artefact, ch4: bubbles, ch5: saturation

    • semantic bbox detection criterion has been removed. **Participants will now be scored only on their semantic segmentation.**
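
To make the expected submission layout concrete, here is a minimal sketch that writes one VOC-format detection .txt (including a YOLO-to-VOC coordinate conversion) and one 5-channel semantic .tif. The file names, image size, and the use of the `tifffile` package are assumptions; the provided "yolo2voc_converter.py" remains the reference converter.

```python
import os
import numpy as np
import tifffile  # assumed here for writing multi-channel .tif masks

def yolo_to_voc(xc, yc, w, h, img_w, img_h):
    """Convert normalised YOLO (xc, yc, w, h) to absolute VOC (x1, y1, x2, y2)."""
    x1 = int(round((xc - w / 2) * img_w))
    y1 = int(round((yc - h / 2) * img_h))
    x2 = int(round((xc + w / 2) * img_w))
    y2 = int(round((yc + h / 2) * img_h))
    return x1, y1, x2, y2

os.makedirs("detection_bbox", exist_ok=True)
os.makedirs("semantic_masks", exist_ok=True)

# One line per detection: "class_name confidence x1 y1 x2 y2"
detections = [("specularity", 0.91, *yolo_to_voc(0.5, 0.5, 0.2, 0.1, 1280, 720))]
with open("detection_bbox/frame_00001.txt", "w") as f:   # hypothetical file name
    for cls, conf, x1, y1, x2, y2 in detections:
        f.write(f"{cls} {conf:.4f} {x1} {y1} {x2} {y2}\n")

# Semantic mask: 5 binary channels in the fixed order
# ch1 instrument, ch2 specularity, ch3 artefact, ch4 bubbles, ch5 saturation
mask = np.zeros((5, 720, 1280), dtype=np.uint8)
tifffile.imwrite("semantic_masks/frame_00001.tif", mask)
```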

Evaluation Scoring
  1. Detection problem - Final score: 0.6 * mAP + 0.4 * IoU

  2. Generalization problem

    - Per-class deviations above or below the tolerance band (+/-10%) will be reported (see the sketch after this list)
    
    *For example: if your algorithm achieves an mAP/class of 30% on the
    detection task, then your generalization mAP/class should be within
    the tolerance range, i.e., 27% <= mAP/class <= 33%; in this scenario
    your deviation will be zero. However, anything below or above will be
    penalized. For instance, if your algorithm scores 25% on the
    generalization data, your deviation will be 2%, which will be reported.*
    
  3. Semantic problem

    - Final score: 0.75 * overlap + 0.25 * F2-score (Type-II error), where overlap = (DICE + Jaccard)/2
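
As a small illustration of the generalization deviation described in item 2 above, the sketch below reproduces the worked example with a relative +/-10% tolerance band; the function itself is an assumption about how the reported deviation is computed, not the official scorer.

```python
def deviation_per_class(detection_map: float, generalization_map: float,
                        tolerance: float = 0.10) -> float:
    """Deviation outside a relative +/-10% band around the detection mAP (per class)."""
    lower = detection_map * (1 - tolerance)
    upper = detection_map * (1 + tolerance)
    if lower <= generalization_map <= upper:
        return 0.0
    # distance to the nearest edge of the tolerance band
    return min(abs(generalization_map - lower), abs(generalization_map - upper))

# Example from the text: detection mAP/class = 30%, generalization = 25%
print(round(deviation_per_class(0.30, 0.25), 4))  # 0.02, i.e. a reported deviation of 2%
```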