Building Extraction with YOLT2 and SpaceNet Data

阿新 • Published: 2019-01-12

The first SpaceNet Challenge to identify building outlines from satellite imagery demonstrated that the field of computer vision as applied to satellite imagery remains relatively nascent. In many computer vision tasks (e.g. ImageNet), accuracies >95% are common, even expected. The winning SpaceNet Challenge score of F1 = 0.26 underscores the challenges of extracting building footprints from satellite imagery in diverse and often very crowded scenes.

The majority of submissions to the SpaceNet Challenge utilized an image segmentation approach, where each pixel in an image is labeled as belonging to one of several classes (in this case: building or background). Since the goal of the SpaceNet Challenge was to provide exact building outlines, this approach makes sense: if every pixel were classified correctly, every building would be precisely delineated.

In this post we detail a different approach: object detection with the YOLT2 pipeline. Recall that YOLT outputs bounding rectangles around objects of interest. As such, this approach will never achieve perfect building footprint detection. Nevertheless, we demonstrate that this approach proves competitive for the challenge evaluation metric of assigning a true positive to any proposal with a Jaccard index ≥ 0.5 compared to ground truth.
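To make the evaluation metric concrete, here is a minimal sketch of how an F1 score can be computed from that true positive rule. It is an illustration, not the official challenge scoring code: footprints are represented as sets of (x, y) pixels, the function names are hypothetical, and the rule that each ground truth footprint can be matched at most once is an assumption.

```python
def jaccard(a, b):
    """Jaccard index of two footprints given as sets of (x, y) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def f1_score(proposals, truths, threshold=0.5):
    """Score proposals against ground truth footprints.

    A proposal counts as a true positive if its best still-unmatched
    ground truth footprint has a Jaccard index >= threshold
    (assumed matching rule).
    """
    unmatched = list(truths)
    true_pos = 0
    for prop in proposals:
        scores = [jaccard(prop, t) for t in unmatched]
        if scores and max(scores) >= threshold:
            unmatched.pop(scores.index(max(scores)))
            true_pos += 1
    false_pos = len(proposals) - true_pos
    false_neg = len(unmatched)
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


def rect(x0, y0, x1, y1):
    """Pixel set for the half-open rectangle [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Toy example: two buildings, three proposals.
truths = [rect(0, 0, 10, 10), rect(20, 0, 30, 10)]
proposals = [
    rect(0, 0, 10, 8),      # Jaccard 80/100 = 0.8, a true positive
    rect(24, 0, 34, 10),    # Jaccard 60/140, below the threshold
    rect(50, 50, 60, 60),   # no overlap with any building
]
score = f1_score(proposals, truths)
print(f"F1 = {score:.2f}")
```

With one true positive, two false positives, and one missed building, the toy example gives F1 = 2 / (2 + 2 + 1) = 0.4.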

1. YOLT2

Recall that YOLO (upon which YOLT is based) is an object detection framework that uses a 7x7 final grid, meaning that each object is assigned to one of 49 grid cells. YOLO version 2 incorporates a number of improvements over the original, such as batch normalization, finer grained features, multi-scale training, and a denser 13x13 final grid. These enhancements improve the accuracy to state-of-the-art (see Table 3 in the YOLO version 2 paper), while maintaining a speed advantage over other options such as Faster R-CNN and SSD. Many of these improvements were independently implemented in the version of YOLT discussed in previous blogs, and the remaining improvements have been incorporated into YOLT version 2.

2. Training Data

We utilize data from the first SpaceNet challenge, obtainable from AWS. In the previous post we discussed methods for transforming the GeoJSON label files into formats more conducive to machine learning. Recall that YOLT2 requires cardinally oriented rectangles to label ground truth. In this post we utilize the NumPy arrays of building pixel coordinates to infer bounding boxes around buildings.

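In most computer vision object detection paradigms, bounding boxes fully encompass objects of interest. Our goal, however, is to achieve a Jaccard index ≥ 0.5 against the true building footprint, and a box that fully encloses a non-cardinally oriented building can fall below that threshold. The ground truth bounding boxes used for YOLT2 therefore do not fully enclose the buildings, as illustrated in Figure 1.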

Figure 1. Proposed bounding boxes for YOLT2 training. Left: Ground truth building outline shown in red. Middle: Bounding box (white) that fully encompasses the red building; given the non-cardinal orientation of this building the Jaccard index is below the threshold of 0.5. Right: bounding box extending only 80% of the length and width of the building, which gives a Jaccard index of 0.51, greater than the threshold for a true positive detection. For labeling purposes we therefore use the partial box depicted on the right.

To construct training data we utilize the geojson_to_pixel_arr.py script and, as in Figure 1, construct a bounding box extending 80% of the length and width of each building in the training dataset. Examples are shown in Figure 2.
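The effect shown in Figure 1 can be checked numerically. The sketch below is not the geojson_to_pixel_arr.py script; it is a minimal pure-Python illustration in which the function names, the 20 x 10 meter building rotated by 30 degrees, and the grid-sampling estimate of the Jaccard index are all assumptions made for the example.

```python
import math


def shrunken_bbox(points, fraction=0.8):
    """Axis-aligned box around `points`, scaled about its center.

    fraction=1.0 gives the full enclosing box; 0.8 keeps 80% of the
    width and height.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    half_w = fraction * (max(xs) - min(xs)) / 2
    half_h = fraction * (max(ys) - min(ys)) / 2
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h


def rotated_rectangle(length, width, angle_deg):
    """Corner points of a hypothetical building centered on the origin."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    corners = [(length / 2, width / 2), (length / 2, -width / 2),
               (-length / 2, -width / 2), (-length / 2, width / 2)]
    return [(u * c - v * s, u * s + v * c) for u, v in corners]


def jaccard_on_grid(length, width, angle_deg, box, step=0.25):
    """Approximate Jaccard index of building vs. box by grid sampling."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    extent = length + width  # half-size of a window covering both shapes
    n = int(2 * extent / step)
    both = either = 0
    for i in range(n):
        for j in range(n):
            x = -extent + (i + 0.5) * step
            y = -extent + (j + 0.5) * step
            # Rotate the sample point into the building's own frame.
            u, v = x * c + y * s, -x * s + y * c
            in_building = abs(u) <= length / 2 and abs(v) <= width / 2
            in_box = box[0] <= x <= box[2] and box[1] <= y <= box[3]
            both += in_building and in_box
            either += in_building or in_box
    return both / either


# Hypothetical 20 m x 10 m building rotated 30 degrees.
building = rotated_rectangle(20.0, 10.0, 30.0)
full_box = shrunken_bbox(building, fraction=1.0)
label_box = shrunken_bbox(building, fraction=0.8)
full_iou = jaccard_on_grid(20.0, 10.0, 30.0, full_box)
label_iou = jaccard_on_grid(20.0, 10.0, 30.0, label_box)
print(f"full box: {full_iou:.2f}, 80% box: {label_iou:.2f}")
```

For this hypothetical building the fully enclosing box scores just under 0.5, while the 80% box scores comfortably above it. The exact values depend on each building's shape and orientation, so they will not match the 0.51 quoted for the building in Figure 1.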

Figure 2. Examples of YOLT2 SpaceNet training labels. The left column depicts ground truth labels overlaid in yellow on the image cutouts, whereas the right column shows YOLT2 bounding box labels in red. In dense areas the bounding boxes often overlap, complicating efforts to disentangle nearby buildings.

3. Model Training

We train on 90% of the SpaceNet training dataset, discarding images without any buildings present; the remaining 10% is withheld for internal testing purposes. This leaves 3926 labeled 200 x 200 meter images for training purposes. Image cutouts for the pan-sharpened 3-band imagery are 438–439 pixels in width and 406–407 pixels in height. We craft a new network architecture with a denser 26 x 26 final grid to accurately localize buildings in the highly concentrated regions of central Rio de Janeiro. Training occurs for seven days on a single NVIDIA Titan X GPU.
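A quick back-of-the-envelope calculation shows why a denser grid helps in crowded scenes. The sketch below assumes that an entire 200 meter chip is passed to the network in one piece, which the text does not state explicitly; the constant and function names are hypothetical.

```python
CHIP_SIZE_M = 200.0    # ground extent of one chip, from the text
CHIP_WIDTH_PX = 438    # approximate chip width in pixels, from the text


def cell_size(grid):
    """Ground extent (m) and pixel extent of one final-grid cell.

    Assumes the whole chip is the network input, so each cell spans
    1/grid of the chip.
    """
    return CHIP_SIZE_M / grid, CHIP_WIDTH_PX / grid


for grid in (7, 13, 26):
    meters, pixels = cell_size(grid)
    print(f"{grid:2d} x {grid:<2d} grid: {meters:5.1f} m, {pixels:5.1f} px per cell")
```

Under that assumption a 13 x 13 grid cell covers about 15.4 meters on a side, whereas a 26 x 26 grid cell covers about 7.7 meters, which is closer to the scale of tightly packed individual buildings.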

4. Model Evaluation

The YOLT2 SpaceNet model is evaluated on the entirety of the SpaceNet test dataset from the SpaceNet Challenge. For the 200 x 200 meter test chips, inference with the YOLT2 pipeline proceeds at a rate of 45 frames per second. Post-processing is minimal, consisting simply of non-max suppression. We achieve an F1 score of 0.21 over the test set; this score is certainly far from ideal, though it would have been moderately competitive in the challenge results (reported scores are F1 * 1,000,000). Example outputs are shown below.
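For readers unfamiliar with the non-max suppression step mentioned above, the following is a minimal sketch of the standard greedy algorithm, not the YOLT2 implementation itself. The detection tuple layout, the function names, and the 0.5 overlap threshold are assumptions made for the example.

```python
def box_iou(a, b):
    """Jaccard index of two axis-aligned boxes (x0, y0, x1, y1, ...)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over detections given as (x0, y0, x1, y1, score).

    Detections are visited from highest to lowest score; one is kept
    only if it does not overlap an already kept detection by more
    than iou_threshold (an assumed value).
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(box_iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    (10, 10, 50, 50, 0.9),
    (12, 12, 52, 52, 0.6),      # near-duplicate of the first box
    (100, 100, 140, 130, 0.8),  # a separate building
]
kept = non_max_suppression(detections)
print(kept)
```

In this toy example the second detection overlaps the first with a Jaccard index of about 0.82 and is suppressed, leaving one box per building.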
