A Methodology for Automation of Training Sampling within Supervised Classification based on Isolation Forest Algorithm
Silva, Joel¹; Bacao, Fernando¹; Foody, Giles²; Caetano, Mario¹
¹ISEGI, Universidade Nova de Lisboa, PORTUGAL; ²School of Geography, University of Nottingham, UNITED KINGDOM

Land Use Land Cover (LULC) mapping is one of the most important applications of Earth Observation (EO) data. The most widely used approach to LULC mapping relies on supervised classification algorithms. Traditionally, the training stage of a supervised classifier is based on the manual selection of sampling units for each class with the help of ancillary data, e.g. outdated LULC maps, aerial orthophotos, ground truth data, or LULC maps at smaller scales or with larger minimum mapping units (MMU). Generating an adequate training sample database is time consuming and labour intensive. The final accuracy of a LULC map depends directly on the size and representativeness of the training sample, i.e. a good training sample should be large enough to cover the spectral variability of each land cover class.

The increasing democratisation of Earth Observation and the upcoming Copernicus/GMES satellites will require a paradigm shift in image processing. If one wants to take advantage of these opportunities and use large amounts of EO data in LULC mapping, the dimensionality of the classification process will increase significantly. As a consequence, and to avoid the problem known as the Hughes effect, the size of the training sample also has to increase significantly. Methodologies are therefore needed to automate the training stage, so that the size and representativeness of the training sample can grow without a corresponding increase in human and budgetary resources.

This paper presents a methodology to automatically select the sampling units to be used in supervised classification of EO images. The method relies on an existing LULC map of the area and entails two steps: (1) generation of a stratified random sample, and (2) identification and removal of anomalies. Where no local map is available, a global land cover map such as GLOBCOVER can serve as the existing LULC map, which means that the methodology can be applied anywhere on the globe. In the first step, a set of sample units is automatically generated by stratified random sampling, with the classes of the existing LULC map defining the strata. This step gives the training sample the necessary size but, at the same time, introduces anomalies caused by scale and temporal differences between the map and the image. These anomalies are identified in the second step by an ensemble method called Isolation Forest (iForest). The iForest is based on the random forest concept and is capable of identifying anomalies without making any assumptions about the data, in particular about the statistical distribution of each land cover class. An iForest is a set of Isolation Trees (iTrees). iTrees are built by recursively partitioning the sampling space until all instances are isolated. An isolation score based on the average path length obtained across the iTrees is assigned to each sample unit. Sample units with isolation scores higher than a threshold defined by iForest theory are regarded as anomalies and removed from the training sample.
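
The two steps can be illustrated with a minimal sketch in Python (standard library only). It is not the implementation used in the study: the function names, the subsample size of 256, the 100 trees, the per-class sample size and the 0.5 score threshold are all assumptions chosen for illustration, and the score normalisation follows the usual iForest formulation s = 2^(-E[h(x)] / c(n)). The sketch assumes that `labels` holds the base-map class of each pixel and `spectra` holds the corresponding band values.

```python
import math
import random
from collections import defaultdict

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_itree(points, depth, max_depth, rng):
    """Split on a random feature at a random cut value until points are isolated."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # chosen feature is constant here: stop and keep a leaf
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_itree(left, depth + 1, max_depth, rng),
            "right": build_itree(right, depth + 1, max_depth, rng)}

def path_length(tree, x, depth=0):
    """Depth at which x reaches a leaf, plus a correction for unsplit leaves."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if x[tree["dim"]] < tree["cut"] else "right"
    return path_length(tree[branch], x, depth + 1)

def anomaly_scores(points, n_trees=100, subsample=256, seed=0):
    """Isolation score in (0, 1) per point; higher means easier to isolate."""
    rng = random.Random(seed)
    psi = min(subsample, len(points))
    if psi < 2:  # too few points to score: flag nothing
        return [0.0] * len(points)
    max_depth = math.ceil(math.log2(psi))
    trees = [build_itree(rng.sample(points, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(psi)
    return [2.0 ** (-sum(path_length(t, x) for t in trees) / n_trees / norm)
            for x in points]

def stratified_sample(labels, per_class, rng):
    """Step 1: draw up to per_class random pixel indices from each base-map class."""
    strata = defaultdict(list)
    for idx, lab in enumerate(labels):
        strata[lab].append(idx)
    return {lab: rng.sample(idxs, min(per_class, len(idxs)))
            for lab, idxs in strata.items()}

def clean_training_sample(labels, spectra, per_class=200, threshold=0.5, seed=0):
    """Step 2: score each class separately and drop units above the threshold."""
    rng = random.Random(seed)
    kept = {}
    for lab, idxs in stratified_sample(labels, per_class, rng).items():
        scores = anomaly_scores([spectra[i] for i in idxs], seed=seed)
        kept[lab] = [i for i, s in zip(idxs, scores) if s <= threshold]
    return kept
```

Scoring each class separately reflects the idea that an anomaly is a sample unit whose spectral signature is atypical for the class assigned to it by the base map. In practice the threshold would need to be tuned, since a lower value removes more borderline units along with the clear anomalies.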

The proposed methodology is applied in a case study to derive a LULC map from MERIS data (September 2005) for an area in Continental Portugal. The CORINE Land Cover 2000 map is used as the base map, and classification is performed with the maximum likelihood classifier. The accuracy indices of the LULC map generated with the training sample derived by the proposed methodology are compared with those of the LULC map generated with a training sample produced by visual inspection of ancillary data (i.e. the traditional approach). Results show that the proposed method can automatically produce a training sample that yields LULC maps with satisfactory accuracy.
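
The abstract does not state which accuracy indices are compared. As an illustration only, the sketch below computes two indices commonly derived from a confusion matrix, overall accuracy and Cohen's kappa. The function names and the choice of indices are assumptions, not details taken from the study.

```python
def overall_accuracy(cm):
    """Share of reference units on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def cohen_kappa(cm):
    """Agreement corrected for the agreement expected by chance."""
    total = sum(sum(row) for row in cm)
    observed = overall_accuracy(cm)
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(sum(cm[i]) * sum(row[i] for row in cm)
                   for i in range(len(cm))) / total ** 2
    return (observed - expected) / (1 - expected)
```

Applying both functions to the confusion matrix of each map, with rows and columns in the same class order, gives directly comparable figures for the automatically and manually derived training samples.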