SMM22 – Poster FAQ

Is there a downloadable version of your poster anywhere?

Yes! You can find a copy of my poster here.

How was the fin detector trained?

The fin detector was trained by manually labelling all of the data provided by Sharpe et al. [1]. This data consisted of images of Tursiops aduncus, collected during 2015 fieldwork surveys off Zanzibar, Tanzania.

To label the data, the images were loaded into the VGG Image Annotator [2] and any fin was manually outlined to produce a mask for each fin. This resulted in 616 masks, which were then split into a train set (n = 493) and a test set (n = 123).

To test generalisability, the model was also evaluated on the full above-water dataset from the Northumberland Dolphin Dataset (NDD, n = 2900) [3]. This dataset contains images of both T. truncatus and Lagenorhynchus albirostris, collected during 2019 fieldwork surveys off Northumberland, UK.

How do we evaluate the fin detector and what results does it achieve?

mAP@IOU is computer vision shorthand for mean average precision at intersection over union. This metric measures the mean pixel overlap between the labelled ground truth and the model’s predicted detection over all of the test set images. As no detector is perfect, we utilise a threshold (the IOU) – if more than N% of the pixels overlap, we consider the detection correct. For example, mAP@IOU[0.5] = 0.91 means that 91% of the detections made by the detector overlapped with the ground truth by at least 50% of the pixels.
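The correctness check at a given IOU threshold can be sketched as follows (a minimal NumPy illustration using boolean masks; the function and variable names are my own, not from the pipeline):

```python
import numpy as np

def mask_iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over union of two boolean pixel masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, truth).sum() / union)

# A detection counts as correct at threshold t if IoU >= t.
pred = np.zeros((4, 4), dtype=bool); pred[0:2, 0:4] = True    # 8 pixels
truth = np.zeros((4, 4), dtype=bool); truth[0:3, 0:4] = True  # 12 pixels
iou = mask_iou(pred, truth)   # intersection 8, union 12 -> ~0.67
correct_at_50 = iou >= 0.5    # True
correct_at_75 = iou >= 0.75   # False
```

Averaging this correctness over all test images (and over detection confidence levels) gives the mAP figures reported below.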

On the data provided by Sharpe et al. [1], the detector achieved mAP@IOU[0.5, 0.75] = [0.91, 0.79].

On the data provided in NDD [3], the detector achieved mAP@IOU[0.5, 0.75] = [0.96, 0.83].

Why do we detect each pixel rather than a bounding box?

Whilst predicting a bounding box for the fin would be more computationally efficient, the point of this stage is to reduce the amount of background noise which could affect embedding generation. By predicting a bounding box, we would remove far less noise than by predicting a pixel-wise mask.
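The difference is easy to see with a toy example (illustrative only: a diagonal "fin" whose bounding box happens to span the whole image):

```python
import numpy as np

# Toy 6x6 image with a diagonal "fin" of foreground pixels.
mask = np.eye(6, dtype=bool)       # pixel-wise detection mask (6 fin pixels)
image = np.random.rand(6, 6, 3)

# The bounding box of this mask spans the whole image,
# so cropping to it removes no background at all.
ys, xs = np.where(mask)
bbox_crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

# The pixel-wise mask instead zeroes everything outside the fin.
masked = image * mask[..., None]

background_kept_by_bbox = bbox_crop.size // 3 - mask.sum()  # 30 pixels
background_kept_by_mask = 0                                  # none
```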

What model do you use for the detector?

The detector utilises a Mask R-CNN architecture [4].

Is the detector capable of detecting multiple fins in the same image?

Yes! The detector is able to detect multiple fins in the same image simultaneously. If this occurs, each fin is passed downstream on its own for identification.

An example image showing multiple detections in the same image.

How does the clean up work?

Based on a priori knowledge of cetaceans, it can be deduced that large holes in a detection mask are likely to be unintentional and a product of surrounding noise. If this has occurred, it is likely the detector has failed to capture all of the fin in the mask. As such, any holes present in masks are filled using a combination of dilation and erosion morphological transformations, ensuring no potentially identifiable information is lost as a result of an incomplete detection.

Note that any holes present in the dorsal fin from natural or anthropogenic activity such as from sting ray barbs are not transformed in this process to retain identifying information.

Left: An example detection mask with a hole in. Pixels detected as dolphin are shown in white, background in black. Right: The same mask after morphological transformation. The hole in the mask is now filled.
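The hole-filling step can be sketched with a hand-rolled morphological closing (a minimal NumPy illustration; the real pipeline's exact transformations, and its preservation of natural fin holes, are not modelled here):

```python
import numpy as np

def dilate(mask: np.ndarray) -> np.ndarray:
    """Binary dilation with a 3x3 square structuring element."""
    padded = np.pad(mask, 1)  # pad border with False
    out = np.zeros_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= padded[1 + dy:1 + dy + mask.shape[0],
                          1 + dx:1 + dx + mask.shape[1]]
    return out

def erode(mask: np.ndarray) -> np.ndarray:
    """Binary erosion, expressed as dilation of the complement."""
    return ~dilate(~mask)

def close_holes(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Morphological closing: dilate then erode, filling small holes."""
    out = mask.copy()
    for _ in range(iterations):
        out = dilate(out)
    for _ in range(iterations):
        out = erode(out)
    return out

# A 5x5 blob with a one-pixel hole at its centre, as in the figure.
mask = np.ones((5, 5), dtype=bool)
mask[2, 2] = False
filled = close_holes(mask)  # the hole is now filled
```

In practice a library routine such as OpenCV's closing operation would typically be used rather than hand-written loops; the sketch above only shows the dilation-then-erosion principle.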

How do you perform colour thresholding?

Each component in a detection has its colour composition checked during post-processing. This helps to ensure that the detection is likely to be a cetacean rather than misclassified background noise.

Histograms of the RGB colour channel pixel intensities were recorded for each object classification (fin and background) in the Zanzibar data [1], giving a total of six histograms per image (three channels × two classes). The histogram groups were then combined to give six global pixel intensity distributions.

Global pixel intensity histograms used to determine colour threshold values.

Using the global distribution histograms, a colour threshold was determined. For all correct detections, 90% of the RGB pixels are below intensities (148, 148, 159). In contrast, only 50.2%, 38.2%, and 36.4% of the background pixels for the R, G, and B channels respectively are below this threshold. As detected noise is often an area of water or splash, such components are much lighter in composition than cetaceans, and can thus be removed with confidence.

An example of colour thresholding removing an erroneous component from a detection.
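A minimal sketch of the thresholding check, using the (148, 148, 159) values reported above (the exact decision rule and the 90% cut-off applied here are illustrative assumptions):

```python
import numpy as np

# Per-channel intensity thresholds reported in the text (R, G, B).
THRESH = np.array([148, 148, 159])

def fraction_below(pixels: np.ndarray) -> np.ndarray:
    """Per-channel fraction of a component's pixels below the threshold.

    pixels: (N, 3) array of RGB values for one detected component.
    """
    return (pixels < THRESH).mean(axis=0)

def looks_like_cetacean(pixels: np.ndarray, min_frac: float = 0.9) -> bool:
    """Keep the component only if enough of its pixels are dark enough."""
    return bool((fraction_below(pixels) >= min_frac).all())

dark = np.full((100, 3), 60)     # dark grey, fin-like component -> kept
splash = np.full((100, 3), 220)  # bright, splash-like noise -> removed
```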

In the event multiple components in the detection pass noise removal and colour thresholding, each component is treated as a distinct detection when processing downstream. If a detection only contains one component, then colour thresholding is not applied. This ensures no detection is completely discarded by post-processing: a fin comprising a single component which sits above the threshold (for example, in the event of extreme over-exposure) is still retained.

What is a Siamese Neural Network (SNN)?

An SNN is a type of neural network which provides a measure of similarity rather than a classification. It consists of multiple identical Convolutional Neural Networks (CNNs) placed in parallel. During training, each CNN branch takes as input a fin and generates an embedding. At test time, we only use one branch as we only input one image. These embeddings are used to facilitate most likely catalogue matching.

For more information on SNNs, see my guide to SNNs here.

What is an Anchor, a Positive, and a Negative?

These are the names given to the images seen by the SNN during training. The Anchor is an image of the class we want to train the network on at that step (e.g. an example image of individual 10). The Positive is another example of that class (e.g. another example of individual 10). The Negative is an example of another class (e.g. an example of individual 8). By showing the network two images of the same class and one image of another class, the network learns to generate embeddings such that those for the same individual are close to one another numerically, and those of different classes are far apart from each other.
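This anchor/positive/negative setup is commonly trained with a triplet loss, which can be sketched as follows (illustrative; the margin value is an assumption):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: pull the anchor towards the positive embedding and
    push it away from the negative one, up to a margin."""
    d_pos = np.linalg.norm(anchor - positive)  # same individual
    d_neg = np.linalg.norm(anchor - negative)  # different individual
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same individual: close to the anchor
n = np.array([3.0, 0.0])   # different individual: far away
loss = triplet_loss(a, p, n)  # 0.1 - 3.0 + 1.0 < 0, so loss = 0.0
```

A loss of zero means the embeddings are already separated by at least the margin; a positive loss pushes the network to separate them further.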

What architecture does the SNN use?

We follow the architecture as outlined by Vetrova et al. [5].

How is the embedding used to generate a list of most likely catalogue matches?

We take the embedding and plot it into an X-dimensional hyperspace. The dimensionality of the space is determined during training by the network through hyperparameter optimisation, and is equal to the length of the embedding (e.g. an embedding of length 5, such as [1, 2, 3, 4, 5], would be plotted into a 5-dimensional hyperspace).

By plotting the embeddings into the hyperspace and comparing distances between them using Euclidean distance, we can determine how similar an input image is to all the other embeddings seen previously. The smaller the distance, the more likely a match the input is.
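A minimal sketch of this nearest-neighbour ranking (the catalogue IDs and embeddings below are invented for illustration):

```python
import numpy as np

# Hypothetical catalogue: one stored embedding per known individual.
catalogue = {
    "individual_08": np.array([0.0, 1.0, 0.0]),
    "individual_10": np.array([1.0, 0.0, 0.0]),
    "individual_12": np.array([0.0, 0.0, 1.0]),
}

def rank_matches(query: np.ndarray) -> list:
    """Return catalogue IDs sorted by Euclidean distance to the query,
    most likely match first."""
    dists = {cid: float(np.linalg.norm(query - emb))
             for cid, emb in catalogue.items()}
    return sorted(dists, key=dists.get)

query = np.array([0.9, 0.1, 0.0])  # embedding of a new photograph
ranked = rank_matches(query)       # "individual_10" ranks first
```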

How does this allow for flagging of potentially previously unseen individuals?

If the SNN receives as input the fin of an individual not currently in the catalogue, the embedding generated for that fin is highly likely to be significantly different to those generated previously. When plotted into the hyperspace, this new fin would be far from all other embeddings. If the distance is large enough, we can flag this to the user as it is likely to be an individual the network has not seen before, and thus is not in the catalogue.
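As a sketch, the flagging step reduces to a distance check against a tunable threshold (both values below are illustrative):

```python
import numpy as np

# Euclidean distances from a query embedding to every catalogue embedding
# (the values here are invented for illustration).
distances = np.array([4.2, 5.1, 4.8])

# If even the nearest catalogue entry is far away, flag the fin as a
# potentially unseen individual. The threshold is a tunable assumption.
NOVELTY_THRESHOLD = 2.0
is_potentially_new = bool(distances.min() > NOVELTY_THRESHOLD)
```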

In the RESULTS section, why are the Catalogue Matcher (SNN) results given as a # of classes, rather than a # of individuals?

As the detector is not perfect, it sometimes detects what it thinks is a fin but actually is not. We call these erroneous detections noise. Whilst post-processing is performed on all detections to try and catch and remove this noise, some still slips through. As a result, the SNN catalogue matcher is also trained to recognise and classify this noise. This adds a class to the training set. To obtain the number of individuals in the catalogue used to train the SNN, subtract 1 from the # of classes.

What accuracy does the SNN achieve for most likely catalogue matching?

As the model outputs a list of likely matches, we evaluate the SNN’s effectiveness using the top-1, top-5, and top-10 accuracy metrics. Using these, the model outputs a list of length 1, 5, or 10 respectively. For each test image, if the correct class is contained within the list, we consider the model to be correct.
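Top-k accuracy can be computed as follows (a small illustration with invented ranked lists and class IDs):

```python
def top_k_accuracy(ranked_lists, true_labels, k):
    """Fraction of test images whose true class appears in the model's
    top-k list of most likely matches."""
    hits = sum(true in ranked[:k]
               for ranked, true in zip(ranked_lists, true_labels))
    return hits / len(true_labels)

# Three test images; each inner list is the model's ranked class predictions.
ranked = [[3, 1, 5], [2, 4, 7], [8, 9, 1]]
true = [3, 7, 0]

top1 = top_k_accuracy(ranked, true, 1)  # only image 1 is a top-1 hit -> 1/3
top3 = top_k_accuracy(ranked, true, 3)  # images 1 and 2 hit -> 2/3
```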

When trained and evaluated on the above water set of the NDD dataset [3] the SNN achieves 40.9% top-1, 68.9% top-5, and 83.1% top-10 accuracies.

When trained and evaluated on data collected by Tyson Moore et al. [6] in Naples, FL, USA the SNN achieves 63.75% top-1, 88.75% top-5, and 97.5% top-10 accuracies.

What are the limitations of the approach?

One limitation of the system currently is the need to re-train the SNN for each photo-id catalogue. As a result, initial manual curation of a photo-id catalogue must be performed before the methodology can be applied. The feasibility of a more general SNN capable of catalogue-agnostic photo-id should be examined in future work. This limitation does not, however, apply to the Mask R-CNN, which has been found not to require re-training when applied to a new photo-id catalogue.

Further, whilst the system has been shown to be robust enough to deal with multiple cetacean species, these have all been dolphins. It is not yet clear how well the pipeline would perform with cetacean species such as whales or porpoises, or with body parts like flukes instead of dorsal fins. Further studies with photo-id catalogues of other species should be explored.

How can I contact you?

You can contact me in person at the conference during the designated poster sessions! If you’re unable to do so, I’m available either through the SMM virtual conference portal or via email at

Please reach out with any questions you may have about this work not covered in the FAQ, if you need clarity about something that is, or anything else!

I’m also currently in the process of finishing off my PhD thesis, and I’m looking for post-docs or permanent jobs in the area of conservation tech – if you’d like to chat about this please do reach out and I can provide a CV.


[1] Sharpe, M. and Berggren, P., 2019. Indian Ocean humpback dolphin in the Menai Bay off the south coast of Zanzibar, East Africa is Critically Endangered. Aquatic Conservation: Marine and Freshwater Ecosystems, 29(12), pp.2133-2146.

[2] Dutta, A., Gupta, A. and Zisserman, A., 2016. VGG image annotator (VIA). URL: http://www.robots.ox.ac.uk/vgg/software/via.

[3] Trotter, C., Atkinson, G., Sharpe, M., Richardson, K., McGough, A.S., Wright, N., Burville, B. and Berggren, P., 2020. NDD20: A large-scale few-shot dolphin dataset for coarse and fine-grained categorisation. arXiv preprint arXiv:2005.13359.

[4] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

[5] Vetrova, V., Coup, S., Frank, E. and Cree, M.J., 2018, November. Hidden features: Experiments with feature transfer for fine-grained multi-class and one-class image categorization. In 2018 International Conference on Image and Vision Computing New Zealand (IVCNZ) (pp. 1-6). IEEE.

[6] Tyson Moore, R.B., Barleycorn, A., Cush, C., Honaker, A., McBride, S.M., Toms, C. and Wells, R. (2020) Final Report: Abundance and distribution of common bottlenose dolphins (Tursiops truncatus) near Naples and Marco Island, Florida, USA, 2018-2019. The Batchelor Foundation, p. 21.