Evaluation criteria

The challenge problems fall into three distinct categories. For each, well-defined evaluation metrics already exist in the wider imaging community, and we use these for evaluation here.

The three categories are: 

1) Multi-class artefact detection - proposed evaluation metrics: 

  • mAP – mean average precision of detected artefacts.  

  • IoU – intersection over union  

    Participants will be ranked on a final mean score, a weighted combination of mAP and IoU (0.8*mAP + 0.2*IoU); a code sketch of the individual metrics and final score combinations follows this list. 

2) Instance Segmentation - proposed evaluation metrics: 

  • DICE coefficient 

  • Jaccard Index (for scientific completeness) 

  • Mean average precision (mAP)  

    Participants will be ranked on a final mean score, a weighted average of DICE, Jaccard, and mAP: 0.75*((DICE + Jaccard)/2) + 0.25*mAP. 

    Note: for semantic segmentation we will evaluate only categoryList = ['Instrument', 'Specularity', 'Artefact', 'Bubbles', 'Saturation']

3) Data generalization (only for multi-class artefact detection) - proposed evaluation metric: 

  • Score gap: final mean detection score on dataset #2 minus final mean detection score on dataset #1 (where #n denotes data from a different institution or imaging modality) 
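
The following is a minimal sketch of the per-metric computations referenced above, assuming binary NumPy masks for the segmentation metrics and [x1, y1, x2, y2] bounding boxes for detection IoU; function names and signatures here are illustrative and not those of the official support software.

```python
import numpy as np

def box_iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dice_coefficient(pred, target, eps=1e-7):
    """DICE coefficient between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def jaccard_index(pred, target, eps=1e-7):
    """Jaccard index (mask IoU) between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)
```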

Support software for these evaluation metrics is made available online on GitHub. All participants' submissions will be evaluated through a web server on this website.
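
As a complement to the support software, the sketch below shows how the final leaderboard scores stated above could be combined from the per-metric values; the function names and the example metric values are purely illustrative.

```python
def detection_score(mAP, mIoU):
    """Final mean score for multi-class artefact detection: 0.8*mAP + 0.2*IoU."""
    return 0.8 * mAP + 0.2 * mIoU

def segmentation_score(dice, jaccard, mAP):
    """Final mean score for segmentation: 0.75*((DICE + Jaccard)/2) + 0.25*mAP."""
    return 0.75 * ((dice + jaccard) / 2.0) + 0.25 * mAP

def generalization_gap(detection_score_2, detection_score_1):
    """Score gap: detection score on dataset #2 minus detection score on dataset #1."""
    return detection_score_2 - detection_score_1

# Example with illustrative (not real) metric values:
d1 = detection_score(mAP=0.45, mIoU=0.30)                   # 0.42
seg = segmentation_score(dice=0.70, jaccard=0.55, mAP=0.40)  # 0.56875
gap = generalization_gap(detection_score_2=0.38, detection_score_1=d1)
print(d1, seg, gap)
```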