Drowning Detection, YOLO-based

Shahar Gino
8 min read · Jan 25, 2020

--

Drowning is the third leading cause of unintentional injury death worldwide. About 1.2 million people around the world die or get injured by drowning every year, more than two people per minute[1].

Supervised beaches are among the most challenging environments for lifeguards, due to the influence of external factors such as weather, currents, tides, waves, and poor visibility. Unsupervised beaches, on the other hand, are death traps, since there is no one to educate and manage the crowd or to assist and alert in case of an emergency.

We’ve developed a drowning-detection algorithm, which can be treated as an input/output black box. It repeatedly reads new frames and generates corresponding “inceptions” (inference results) accordingly.

The algorithm consists of three core steps: pre-processing, object detection and drowning alerts, and is illustrated by the following figure:

Drowning Detection Algorithm

Each incoming frame first passes a Region-Of-Interest (ROI) extraction phase, according to a user parameter (roi). This phase is skipped in case the ROI equals the whole frame.

The user may additionally set a minimal margin gap from the ROI boundaries (margin_gap), within which detections will be ignored. This feature provides a degree of immunity against noisy effects near the ROI boundaries, e.g. objects that are only partially bounded within the ROI.

The extracted ROI is then pre-processed[2] and eventually gets packed into a 4D blob format. The pre-processing comprises resizing (according to user parameters inpWidth and inpHeight), mean subtraction (0,0,0), scaling (1/255) and R/B channel swapping. The 4D blob is simply a tuple of (num_images, num_channels, height, width), where current usage always applies num_images=1 and num_channels=3.
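As an illustration, a minimal pre-processing sketch using OpenCV’s dnn module (assuming OpenCV as the underlying framework; roi, inpWidth and inpHeight follow the article’s parameter names, the rest is illustrative):

```python
import cv2

def preprocess(frame, roi, inp_width=416, inp_height=416):
    # ROI extraction (skipped when the ROI equals the whole frame)
    x, y, w, h = roi
    roi_img = frame[y:y + h, x:x + w]
    # Resize, scale by 1/255, mean-subtraction (0,0,0) and R/B channel swap,
    # packed into a 4D blob of shape (1, 3, inp_height, inp_width)
    blob = cv2.dnn.blobFromImage(roi_img, scalefactor=1 / 255.0,
                                 size=(inp_width, inp_height),
                                 mean=(0, 0, 0), swapRB=True, crop=False)
    return blob
```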

The pre-processed, blobbed image next enters an object-detection phase, which involves a pass through a state-of-the-art, real-time object-detection system termed YOLOv3[3] (“You Only Look Once”, version 3). This is a neural network capable of detecting what is in an image and where it is, in a single pass. It gives the bounding boxes around the detected objects, and it can detect multiple objects at a time. YOLOv3 is trained over Darknet[4], which is a framework for training neural networks. It is open source, written in C/CUDA, and serves as the basis for YOLO, meaning it defines the architecture of the network. The original YOLO network structure resembles a normal CNN, with convolutional and max-pooling layers followed by two fully connected layers at the end.
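One possible way to run the YOLOv3 forward pass, again assuming OpenCV’s dnn module and the standard Darknet cfg/weights files (file names here are illustrative):

```python
import cv2
import numpy as np

# Load the Darknet-trained YOLOv3 network and find its output layers
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getLayerNames()
out_layers = [layer_names[i - 1]
              for i in np.asarray(net.getUnconnectedOutLayers()).flatten()]

# The 4D blob comes from the pre-processing step sketched above
blob = cv2.dnn.blobFromImage(cv2.imread("frame.jpg"), 1 / 255.0, (416, 416),
                             (0, 0, 0), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(out_layers)   # raw detections from the YOLO output layers
```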

The input image is divided into an S x S grid of cells. For each object present in the image, one grid cell is said to be “responsible” for predicting it: the cell into which the center of the object falls. Each grid cell predicts B bounding boxes as well as C class probabilities. Each bounding-box prediction has 5 components: (x, y, w, h, confidence). The (x, y) coordinates represent the center of the box, relative to the grid-cell location (if the center of the box does not fall inside the grid cell, then this cell is not responsible for it). These coordinates are normalized to fall between 0 and 1. The (w, h) box dimensions are also normalized to [0, 1], relative to the image size. Finally, the confidence reflects the presence or absence of an object of any class. If no object exists in that cell, the confidence score is zero; otherwise the confidence score equals the intersection over union (IOU) between the predicted box and the ground truth. The final output of the object-detection module is thereby an S x S x (B * 5 + C) tensor.
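For clarity, a toy sketch of how a single grid-cell prediction may be decoded under the S x S x (B * 5 + C) layout described above (shapes and values are made up for illustration):

```python
import numpy as np

S, B, C = 13, 3, 80                       # grid size, boxes per cell, number of classes
out = np.random.rand(S, S, B * 5 + C)     # stand-in for the network output tensor

row, col = 7, 5                           # an arbitrary "responsible" grid cell
cell = out[row, col]
boxes = cell[:B * 5].reshape(B, 5)        # each box: (x, y, w, h, confidence)
class_probs = cell[B * 5:]                # the cell's C class probabilities

for x, y, w, h, conf in boxes:
    # (x, y) are relative to the cell; (w, h) are relative to the whole image
    cx = (col + x) / S                    # normalized box center over the full image
    cy = (row + y) / S
    score = conf * class_probs.max()      # class-specific confidence
```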

All of the detected objects are forwarded into a Drowning-Alerts phase for further analysis. This phase comes in two versions, termed “DL Core” and “Classic ML”. The former hasn’t been implemented yet, and stands for a forward pass through an RNN, designed and trained for this specific task. The latter enfolds classic machine-learning strategies, and involves an explicit feature-engineering path.

The “Classic ML” module contains 4 major steps:

  1. Detections Filtering
  2. Detections Matching
  3. Persons Update
  4. Results Packaging

The Detections Filtering module goes over each of the object-detection results, and only keeps the most prominent ones. It takes into account user parameters for filtering according to a minimal confidence threshold (conf_thr) and a maximal size threshold (size_thr). It also allows filtering out all non-person objects (by setting detect_all to False). The results are converted into frame-related coordinates, and pass through a Non-Maximum Suppression (NMS) procedure[5] to eliminate redundant overlapping boxes with lower confidence. The user may tweak this procedure by setting minimum thresholds for filtering boxes by score (conf_thr) and by overlapping amount (nms_thr). NMS suppresses overlapping bounding boxes and only retains the bounding box with the maximum object-detection probability associated with it.
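A hedged sketch of this filtering step (conf_thr, size_thr and nms_thr follow the article’s parameter names; treating size_thr as a fraction of the frame area is an assumption):

```python
import cv2
import numpy as np

def filter_detections(boxes, scores, frame_area, conf_thr=0.5, size_thr=0.25, nms_thr=0.4):
    kept_boxes, kept_scores = [], []
    for (x, y, w, h), score in zip(boxes, scores):
        # keep only confident detections that are not too large
        if score >= conf_thr and (w * h) / frame_area <= size_thr:
            kept_boxes.append([int(x), int(y), int(w), int(h)])
            kept_scores.append(float(score))
    if not kept_boxes:
        return []
    # NMS removes redundant overlapping boxes with lower confidence
    idxs = cv2.dnn.NMSBoxes(kept_boxes, kept_scores, conf_thr, nms_thr)
    return [kept_boxes[i] for i in np.asarray(idxs).flatten()]
```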

The Detections Matching module iterates over the filtered detections, seeks matching persons, and updates each matched person accordingly. For each bounding box (‘detection’), it goes over all of the already-registered persons, and tries to find a person match, i.e. the person with the highest IOU score[6], aka Jaccard index[7]. The user may tune this procedure by setting a minimal IOU score (iou_thr); scores below that threshold are filtered out. If there is no match, the detection is registered as a new person. Otherwise, the corresponding person is updated with the detection findings. It shall be noted that each person carries a footprint tail of its recent tracking marks, tweaked by the user parameter max_history. Seeking a person match involves going through each of those history marks, meaning a match may be obtained with a recent previous detection of a given person. This feature allows covering sporadic ‘holes’ in the object-detection phase, and thereby allows smoother temporal footprints of the registered persons.
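The IOU (Jaccard index) between two (x, y, w, h) boxes is the standard formula; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-Union between two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(ax, bx)
    iy = max(ay, by)
    iw = max(0, min(ax + aw, bx + bw) - ix)
    ih = max(0, min(ay + ah, by + bh) - iy)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```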

Each person that gets registered into the system is monitored and represented with the following attributes:

Classic ML — Registered Person encoding

Each person maintains a life-cycle state, evolving from a ‘Newbie’ until either getting alerted or discarded, which affects the overall safety score. In addition, each person maintains an ‘age’ counter, which is initialised to zero upon registration and increments up to a parametrised saturation value (age_max) while the person is in either the ‘Newbie’ or the ‘Active’ state. The person’s age may get decremented upon any miss-detection that occurs while in the ‘Newbie’ state. This feature allows filtering out noisy inputs from the object-detection module at a relatively early stage, while continuing to monitor ‘real’ persons, which reach a sufficient level of maturity.
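As a rough illustration, the registered-person encoding and the age bookkeeping could look like the sketch below (class, field and state names are assumptions based on the attributes and states mentioned in this article, not the actual implementation):

```python
from collections import deque
from enum import Enum

class State(Enum):
    NEWBIE = 1
    ACTIVE = 2
    FROZEN = 3
    RIP = 4
    SOS = 5

class Person:
    def __init__(self, box, max_history=10, age_max=100):
        self.state = State.NEWBIE
        self.age = 0                                      # saturates at age_max
        self.age_max = age_max
        self.history = deque([box], maxlen=max_history)   # recent tracking marks

    def update(self, box=None):
        if box is not None:                               # matched detection
            self.history.append(box)
            if self.state in (State.NEWBIE, State.ACTIVE):
                self.age = min(self.age + 1, self.age_max)
        elif self.state == State.NEWBIE:                  # miss-detection while a Newbie
            self.age = max(self.age - 1, 0)
```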

The following figure illustrates the person life-cycle FSM:

Classic ML — Person Life-Cycle FSM

The state counters implement hysteresis in the context of toggling between the ‘Active’ and ‘Frozen’ states, e.g. a person may unfreeze faster than it gets frozen. In addition, the FSM allows a ‘memory’ behaviour, as it aggregates significant person events along its evolution.
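A minimal sketch of such hysteresis, with illustrative thresholds (the actual counters and values are not specified in the article):

```python
FREEZE_AFTER_MISSES = 5    # assumed: a long streak of misses is required to freeze
UNFREEZE_AFTER_HITS = 2    # assumed: a short streak of hits is enough to unfreeze

def toggle_state(state, hit_cnt, miss_cnt, matched):
    """Return the updated (state, hit_cnt, miss_cnt) after a single frame."""
    if matched:
        hit_cnt, miss_cnt = hit_cnt + 1, 0
        if state == "Frozen" and hit_cnt >= UNFREEZE_AFTER_HITS:
            state = "Active"            # unfreezing happens relatively fast
    else:
        hit_cnt, miss_cnt = 0, miss_cnt + 1
        if state == "Active" and miss_cnt >= FREEZE_AFTER_MISSES:
            state = "Frozen"            # freezing requires a longer streak of misses
    return state, hit_cnt, miss_cnt
```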

All ‘Frozen’ persons are automatically cleared every parametrised period (epochs), allowing clean and fresh tracking and preventing noisy ‘scars’ in the registration table.

The following figure demonstrates a transition from the ‘Active’ state up to the ‘S.O.S’ state:

Person[2] escalates from Active up to S.O.S state

Each person is characterised by designated features, which affect the overall safety score. The features are calculated upon every person-update event (mean features are calculated in an online fashion, as sketched after the list below):

  • Age - Person’s age
  • Mean Confidence - Mean confidence level, as reflected by the object-detector
  • Mean Size - Mean size (in pixels)
  • Mean #Neighbours - Mean amount of adjacent neighbours
  • Mean Drift - Mean drift from registration location (in pixels)
  • ‘RIP’ State - Person is in ‘RIP’ state
  • Zero Neighbours - Person has no neighbours around
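The online mean update mentioned above is the standard incremental-mean formula; a minimal sketch:

```python
def update_mean(prev_mean, new_value, n):
    """Incremental (online) mean after observing the n-th sample (n >= 1)."""
    return prev_mean + (new_value - prev_mean) / n

# e.g. updating a person's mean confidence with a new detection (names illustrative):
# person.mean_conf = update_mean(person.mean_conf, det_conf, person.num_updates)
```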

The adjacent neighbours are counted within an adaptive radius (‘R’), according to the following equation, governed by a user parameter (nbrs_rad) and the ratio between the person’s y-coordinate and the frame’s height:

neighbours radius, adaptively set

Assuming that the frame was captured from a relatively frontal view, deeper water is characterised by top pixels (low y-coordinate). Deep-water persons are thereby inspected with a smaller neighbouring radius, adaptively scaled with the person’s size.
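The exact equation is given in the figure above; the sketch below is only a hedged approximation consistent with the description (a radius that grows with nbrs_rad, the y/height ratio and the person’s size; the precise form is an assumption):

```python
def neighbour_radius(person_y, person_size, frame_height, nbrs_rad=2.0):
    # depth_ratio is small for "deep water" persons near the top of the frame
    depth_ratio = person_y / float(frame_height)
    return nbrs_rad * depth_ratio * person_size   # smaller radius for deeper persons
```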

The safety_score is determined by the safety_score_thr and safety_score_wht parameters, and is calculated according to the following equation:

safety score (dynamic) calculation

The danger_score is typically simply set equal to the dynamic score above. However, if a person is outbound, i.e. outside the ROI boundaries or overlapping its margin gaps, then danger_score is set to zero. Otherwise, if a person overlaps with a pre-defined danger zone, then danger_score is calculated as a parametrised weighted average between the dynamic score and the a-priori danger-zone risk factor:

danger score calculation

The safety_score is then extracted as the complement of the danger_score:

safety score (final) calculation
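A hedged sketch of the scoring flow described above (the exact equations appear only in the article’s figures; the weighted forms, zone_weight and the alert rule below are assumptions that follow the text):

```python
def dynamic_score(features, safety_score_wht):
    # weighted combination of the per-person features (age, mean confidence, ...)
    return sum(w * f for w, f in zip(safety_score_wht, features))

def danger_score(dyn_score, outbound, in_danger_zone, zone_risk, zone_weight=0.5):
    if outbound:                        # outside the ROI, or overlapping its margin gap
        return 0.0
    if in_danger_zone:                  # parametrised weighted average with the zone risk
        return zone_weight * dyn_score + (1.0 - zone_weight) * zone_risk
    return dyn_score                    # typically equal to the dynamic score

def final_safety_score(dng_score):
    return 1.0 - dng_score              # the complement of the danger score

# a person would typically be alerted when its safety score drops below safety_score_thr
```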

Finally, the “Classic ML” core outputs are packed into a standard “AI Inceptions” format, which enfolds the registered persons’ information, and specifically highlights the overall number of active persons and the overall number of alerts. This information may further be used by the hosting software, and/or for an offline algorithmic analysis.
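For illustration, such an inceptions record could look roughly like the following (field names are hypothetical; the article only states that it holds the registered persons’ information, the number of active persons and the number of alerts):

```python
inceptions = {
    "frame_id": 1234,
    "persons": [
        {"id": 2, "state": "S.O.S", "bbox": [410, 220, 38, 52], "safety_score": 0.15},
        {"id": 5, "state": "Active", "bbox": [128, 305, 41, 60], "safety_score": 0.92},
    ],
    "num_active": 2,   # overall number of active persons
    "num_alerts": 1,   # overall number of alerts
}
```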

The various parameters and hyper-parameters shall be tuned per system setup, for achieving the best separation between true and false examples, corresponding to persons at risk and not at risk, respectively. That holds for any applied Drowning-Alerts version (DL/ML).

Following are several preliminary results, which seem promising:

Few preliminary results

Following is a brief demo clip, which was captured from an authentic scene:

Demo from an authentic scene
