Flood Inventory: Creation of a multi-source national geospatial database to facilitate comprehensive flood research

Floods are one of the most devastating natural hazards across the world, and India is one of the worst affected countries in terms of fatalities and economic damage. In-depth research is required to understand the complex hydrometeorological and geomorphic factors at play and to design solutions that minimize the impact of floods. A historical inventory of floods is imperative for such research. Though a few global inventories exist, they lack the spatio-temporal fidelity needed for computational research because they concentrate exclusively on large floods, have limited temporal scope, use non-standard data formats, and so on. There is therefore an urgent need for a new database that combines data from global and hitherto underutilized local datasets using an extensible, common schema. This paper describes the ongoing effort to build the India Flood Inventory (IFI), the first freely available, analysis-ready geospatial dataset over the region with detailed qualitative and quantitative information regarding floods, including spatial extents. The paper outlines the methodology adopted as well as some preliminary findings from the data contained in this inventory. This dataset is expected to advance the understanding of flood processes in one of the worst affected regions of the world.


Introduction
Floods continue to be one of the most devastating natural disasters across the world, accounting for one-third of all global geophysical hazards (Smith and Ward, 1998). In India alone, between 2010 and 2016, more than 10,000 people lost their lives and floods caused total damages of around 16,500 crores, according to the Central Water Commission (CWC, 2018). According to an Asian Development Bank report, floods have caused $50 billion of economic damage since 1990 (Patankar, 2019). Observatory (DFO). The Global Flood Inventory (GFI) was one of the earliest efforts to synthesize information from multiple sources and databases to create a continuous flooding record (Adhikari et al. 2010). However, GFI has several limitations, such as a limited time span (1998-2008) and point-only locational information on floods with acknowledged uncertainty. The global databases were also found to be of limited fidelity in describing the spatial extents of flooding impact as well as in temporal coverage. A larger motivation behind the compilation of the India Flood Inventory (IFI) is the large amount of valuable information currently locked in printed documents published by various government departments in India, which has never been used to further research because it is not available as an easily accessible database. This data is ground-validated and can be ascribed higher trustworthiness in ascertaining damages, fatalities, and spatial extents. The IFI has been designed from the ground up with careful consideration given to keeping it open, standardized, and extensible, with data recorded in a way that is useful for quantitative disaster modeling and analysis.
This is a non-peer reviewed preprint that has been submitted to Natural Hazards.
The paper describes in detail the spatial and temporal coverage of the India Flood Inventory, the augmentations made to existing datasets, incorporation of new sources of information, and a summary of preliminary insights gained from this new dataset.

Existing flood databases and their biases
Several multi-hazard databases catalogue flooding events with varying scope and intended function. The Emergency Events Database (EM-DAT, http://www.emdat.be/) is a well-known international database administered by the Centre for Research on the Epidemiology of Disasters (CRED) that collates natural and man-made disasters from 1900 to present. An event is included if 10 or more people are killed, 100 or more people are affected, a state of emergency is declared, or a call for international assistance is made. This is the longest readily available international record of disasters. However, since the inclusion criteria are impact-based, the data may be biased towards population centers such as urban areas.
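The impact-based inclusion rule described above can be written as a simple predicate. This is a minimal sketch, assuming the four criteria are alternatives (any one is sufficient), as the wording "or" suggests; the function and parameter names are hypothetical and are not part of EM-DAT itself.

```python
def meets_inclusion_criteria(killed: int, affected: int,
                             emergency_declared: bool,
                             international_appeal: bool) -> bool:
    """Return True if at least one of the four impact-based criteria holds.

    Assumption: the criteria are alternatives, so a single one is enough.
    """
    return (
        killed >= 10
        or affected >= 100
        or emergency_declared
        or international_appeal
    )


# A small event below every threshold is left out of such a database,
# which is the source of the bias towards larger, more populated areas.
print(meets_inclusion_criteria(3, 40, False, False))    # False
print(meets_inclusion_criteria(3, 250, False, False))   # True
```

Events that fall below all four thresholds never enter a database built this way, which is why locally compiled records can capture many more small floods.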
The Dartmouth Flood Observatory (DFO, http://floodobservatory.colorado.edu/) is a more comprehensive database exclusively focused on floods from 1985 to present. It is a simple Excel spreadsheet titled Global Archive of Large Flood Events, with data sourced from news, government sources, and satellite imagery. The data is richer than EM-DAT because it includes flood start and end dates, country, details of affected locations, flooded river, number of fatalities, damages, and spatial extent of flooding. The database also provides both static images and analysis-ready imagery showing the flood-affected regions. Though it has fairly good global coverage and higher data fidelity than EM-DAT, the database is not as comprehensive as other databases. Its georeferenced record of flood event locations is also available only since 2006, which limits its viability for verifying long-term hydrologic simulations, our primary objective behind the creation of IFI.
A few other databases have been mentioned in the literature, such as ReliefWeb and IFNET. ReliefWeb provides long-form information about real-time events as they unfold but does not provide a historical database, while IFNET does not provide enough useful information over a long enough period to serve as a historical dataset. As such, both these global databases were excluded from the creation of IFI.
The mainstay of IFI is the hitherto under-explored "Disastrous Weather Events" (DWE) database compiled by the India Meteorological Department (IMD). This printed publication has been issued by IMD since 1979 and is extremely hard to access because it is not readily available online. The publication covers a wide gamut of natural hazards such as snowfall, cold wave, heat wave, squall, gale, dust storm, lightning, thunderstorm, hailstorm, floods and heavy rains, and cyclonic storm. The database has rarely been used in scientific research. For example, De et al. (2005) used a small subset of this archive along with other databases to provide broad highlights of extreme weather events in India over 100 years. Another, more focused study on floods used data from 1978-2006 to highlight flood events, fatalities, and damages (2013). But the data remains underutilized because no publicly available, geospatial-analysis-ready version exists, and creating one involves tremendous amounts of manual work and careful verification. The present study has embarked upon this effort, the details of which are explained in the sections that follow. While designing the IFI, we have been motivated by our desire to create a schema and database suitable for use in big data modeling studies in the future.

Sources of Information
The IFI currently incorporates information from the following sources, which then undergoes multiple levels of augmentation: a. An annual printed publication named "Disastrous Weather Events" (DWE) by the

Description
The flood inventory has been structured into 2 parts: textual attributes and a spatial database.
In order to capture the qualitative and quantitative aspects of floods, we have defined several terms for the database:

a. Unique Event Identifier (UEI)
Each flood event is assigned a unique identifier in an extensible format such as UEI-IMD-FL-2015-0001, where IMD is the source dataset name, FL stands for flood, 2015 is the year, and 0001 is the serial event number within that year. This schema is flexible enough to incorporate different disaster databases within a common framework. It will also facilitate the incorporation of other geospatial disasters in the future and maintain interoperability, which may facilitate research into compound disasters such as floods and landslides.

b. Start date
This is the start date of the flooding event. The IMD DWE contains more granular information about the start and end of the event, while databases like DFO and EM-DAT often only indicate the months. In order to maintain interoperability between various formats, all dates conform to ISO 8601 (YYYY-MM-DD), which is the international standard for the representation of dates and times. Often the timing provided is generic, such as the 3rd week of the month; such entries were transformed to exact calendar dates.

c. End date
This is the end date of the flooding event, which also conforms to the ISO 8601 standard.

d. Duration
The number of days that have elapsed between the estimated start and end date of the event.

e. Main Cause
The primary cause of the flooding event, as recorded in the databases.

f. Location
This is only available for information incorporated from IMD. It indicates the names of districts, states, and regions.

g. Districts
This information had to undergo extensive standardization and quality control, as many district names are wrongly entered in the original databases.

h. State
A substantial amount of data did not come with state information and only with region or district information. These had to be entered manually after consulting national geospatial databases. A few states have undergone changes in their official names, which have also been corrected.
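The attributes defined above can be pictured as one record per event. The following is a minimal sketch, not the inventory's actual implementation: the class and function names are hypothetical, the identifier pattern is inferred from the single example UEI-IMD-FL-2015-0001, the duration is taken as elapsed days between the two dates, and the example event values are purely illustrative. The three state renamings listed are real official changes, used here only as examples of the kind of correction described.

```python
import re
from dataclasses import dataclass
from datetime import date

# Pattern inferred from the example UEI-IMD-FL-2015-0001 (assumption:
# upper-case source and hazard codes, 4-digit year, 4-digit serial).
UEI_PATTERN = re.compile(
    r"^UEI-(?P<source>[A-Z]+)-(?P<hazard>[A-Z]+)-(?P<year>\d{4})-(?P<serial>\d{4})$"
)

# Examples of official state renamings; a full table would be longer.
STATE_RENAMES = {
    "Orissa": "Odisha",
    "Uttaranchal": "Uttarakhand",
    "Pondicherry": "Puducherry",
}


@dataclass
class FloodRecord:
    uei: str
    start_date: date
    end_date: date
    main_cause: str
    districts: list
    state: str

    @property
    def duration_days(self) -> int:
        # Elapsed days between start and end (a same-day event gives 0).
        return (self.end_date - self.start_date).days


def parse_uei(uei: str) -> dict:
    """Split an identifier into its parts, or raise ValueError if malformed."""
    match = UEI_PATTERN.match(uei)
    if match is None:
        raise ValueError(f"not a valid UEI: {uei!r}")
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    parts["serial"] = int(parts["serial"])
    return parts


def make_record(uei, start, end, main_cause, districts, state) -> FloodRecord:
    """Validate the identifier, parse ISO 8601 dates, and normalize names."""
    parse_uei(uei)  # raises if the identifier does not follow the pattern
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if end_date < start_date:
        raise ValueError("end date precedes start date")
    clean_state = STATE_RENAMES.get(state.strip(), state.strip())
    clean_districts = [d.strip() for d in districts]
    return FloodRecord(uei, start_date, end_date, main_cause,
                       clean_districts, clean_state)


# Illustrative values only; this is not a real inventory entry.
record = make_record("UEI-IMD-FL-2015-0001", "2015-07-15", "2015-07-21",
                     "Heavy rains", ["District A "], "Orissa")
print(record.duration_days, record.state)  # 6 Odisha
```

Note that a source name containing a hyphen, such as EM-DAT, would need a hyphen-free code (for example a hypothetical EMDAT) for this particular pattern to parse it unambiguously.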

Methodology
A systematic methodology was adopted to build the India Flood Inventory with the goal of conforming to modern interoperable standards and promoting computational hydrology research and applications. Different challenges were encountered with different datasets.
EM-DAT and DFO were the two global datasets that were incorporated. DFO is a simple Excel spreadsheet; its attribute names were standardized to match our dataset and each event was given a Unique Event Identifier (UEI). For EM-DAT, the same operation was performed after accessing the global database.
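The standardization step for a global dataset can be sketched as a column renaming followed by identifier assignment. This is only an illustration under stated assumptions: the source column names in `COLUMN_MAP`, the common-schema field names, the choice to number events within a year in order of start date, and the sample rows are all hypothetical and are not taken from the actual DFO file or the IFI schema.

```python
from collections import defaultdict

# Hypothetical mapping from source column names to common-schema names.
COLUMN_MAP = {
    "DFO": {
        "Began": "start_date",
        "Ended": "end_date",
        "Dead": "fatalities",
        "MainCause": "main_cause",
    },
}


def standardize(rows, source):
    """Rename known columns, drop the rest, and attach a UEI to each event.

    Assumptions: start dates are already ISO 8601 strings, and serial
    numbers restart at 0001 each year in order of start date.
    """
    mapping = COLUMN_MAP[source]
    records = [
        {mapping[key]: value for key, value in row.items() if key in mapping}
        for row in rows
    ]
    records.sort(key=lambda rec: rec["start_date"])
    serial_by_year = defaultdict(int)
    for rec in records:
        year = rec["start_date"][:4]
        serial_by_year[year] += 1
        rec["uei"] = f"UEI-{source}-FL-{year}-{serial_by_year[year]:04d}"
    return records


# Made-up rows, for illustration only.
sample_rows = [
    {"Began": "2007-08-03", "Ended": "2007-08-10", "Dead": "12",
     "MainCause": "Monsoonal rain", "Notes": "dropped column"},
    {"Began": "2006-06-20", "Ended": "2006-06-25", "Dead": "4",
     "MainCause": "Heavy rain"},
    {"Began": "2007-06-01", "Ended": "2007-06-02", "Dead": "0",
     "MainCause": "Heavy rain"},
]
for rec in standardize(sample_rows, "DFO"):
    print(rec["uei"], rec["start_date"])
```

Keeping the mapping in one dictionary per source means a new database can be added by supplying another mapping rather than new code.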
However, the majority of the work involved digitizing and processing the IMD Disastrous Weather Events, which are available only as paper publications. The IMD DWE dataset is the most detailed official dataset of flooding in India, but its records are in a format not readily amenable to computational work or to a geospatial database (see Figure 1).
The dates were conformed to the ISO 8601 standard, and the human and animal casualty and injury numbers were extracted into separate columns. The most valuable part of this dataset was the information regarding the districts that were affected. In order to generate GIS-friendly spatial extents of flood-affected areas, these district names were reverse-matched with a national district shapefile database (http://projects.datameet.org/maps/districts/) and a consolidated shapefile was generated for each event. From this event-based shapefile, the latitude and longitude of the centroid were extracted and recorded. As with the global databases, each event was assigned a unique identifier.
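The district matching and centroid extraction described above can be sketched with the standard library alone. This is a simplified illustration, not the workflow actually used: the district names and square outlines in `REFERENCE` are hypothetical placeholders for the shapefile, approximate string matching via `difflib` is an assumed technique for handling misspelled names, and the centroid is a planar, area-weighted average in longitude/latitude that assumes non-overlapping district polygons. A GIS library would normally handle the real geometry.

```python
from difflib import get_close_matches

# Placeholder reference layer: hypothetical district names with square
# outlines given as (lon, lat) vertices. Real boundaries would come from
# the national district shapefile.
REFERENCE = {
    "Alphapur": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "Betagarh": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}


def match_district(raw_name, cutoff=0.75):
    """Return the closest reference name, or None if nothing is similar enough."""
    lookup = {name.lower(): name for name in REFERENCE}
    hits = get_close_matches(raw_name.strip().lower(), list(lookup),
                             n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


def ring_area_centroid(ring):
    """Area and centroid of a simple polygon ring via the shoelace formula."""
    signed_area = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        signed_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    signed_area *= 0.5
    return abs(signed_area), cx / (6.0 * signed_area), cy / (6.0 * signed_area)


def event_centroid(raw_names):
    """Area-weighted centroid of all matched districts, plus unmatched names."""
    matched, unmatched = set(), []
    for raw in raw_names:
        name = match_district(raw)
        if name is None:
            unmatched.append(raw)  # left for manual review
        else:
            matched.add(name)
    if not matched:
        raise ValueError("no district could be matched")
    total_area = sum_x = sum_y = 0.0
    for name in matched:
        area, cx, cy = ring_area_centroid(REFERENCE[name])
        total_area += area
        sum_x += area * cx
        sum_y += area * cy
    return (sum_x / total_area, sum_y / total_area), unmatched


centroid, needs_review = event_centroid(["Alfapur", " Betagar", "Unknownville"])
print(centroid, needs_review)  # (2.0, 1.0) ['Unknownville']
```

Returning the unmatched names separately keeps the manual verification step explicit, since a silently dropped district would shift the recorded centroid.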

Uncertainty and limitations of the database
Compiling a hazard database of this nature is crucial for future hydrologic studies but requires painstaking work that is both scientifically and logistically challenging. The data itself is inconsistent because different agencies record it in different ways, and until it is brought into a common usable format it remains a source of information rather than a resource that promotes further research. The obvious bias in global databases such as EM-DAT and DFO is their concentration on events with large impacts that are covered by international media, while smaller events and more granular information are better recorded in local databases such as IMD DWE. Administrative factors may also affect the information in these databases, for example over-reporting when flood assistance from the federal government is tied to the damage reported by local disaster management offices. Under-reporting may occur for locations that have experienced fewer damages or casualties, or that are located far from the bigger cities. Reporting bias is especially pronounced in developing countries such as India, where data collection is constrained by budgets. This bias can reasonably be expected to have reduced over the years, and hence an apparent increase in the number of flooding events may simply be due to better observational capabilities.
The other main source of uncertainty is the locational information. For example, the IMD DWE dataset is often inconsistent in what it records as the location, using districts, states, and regions interchangeably. The geographic centroid has been painstakingly recorded by building shapefiles for every event, but it is likely biased by the insufficient granularity of the original database. But since no such dataset is currently available for India, this information is expected to provide a certain bound for understanding these natural hazards.

Preliminary analysis of hazards, fatalities, and damages
National and regional patterns
After the digitization, standardization, and augmentation, the India Flood Inventory was analyzed for spatio-temporal patterns to understand the frequency and severity of the events, the human and animal fatalities caused, and the causative factors. The IMD DWE dataset yielded the largest number of events (4176) with the highest spatio-temporal data fidelity. Collected manually from government records, it can also be regarded as the best available lower bound of ground reality. EM-DAT contains 276 events, but since its inclusion criteria are impact-based (such as 10 or more fatalities or 100 or more people affected), it is inherently biased towards larger flood events. DFO contains 262 events, but for a much shorter period. The summary of these databases is provided in Table 1: the global databases, EM-DAT and DFO, contribute 6% and 5% of the IFI respectively, while the national IMD DWE database contributes 89%, which substantially increases the sample size and consequently the robustness of studies based on this dataset.

Figure 2 shows the evolution of the number of flood events in India for different time periods since 1926, as available in the 3 data sources. An increasing trend is clearly visible in all three, though some of it may be attributed to better data collection over the years. Overall, there is a definite increasing trend in flood fatalities across India. A similar increase has been observed in Europe and attributed to factors that have appeared or gained influence over the years, with higher sensitivity to smaller disasters and a consequent increase in the reporting of such disasters (Hoyois and Guha Sapir, 2011). In a study of disasters globally, Jonkman (2005) found that Asian rivers are the most significant in terms of number of people killed and affected, with flash floods resulting in the highest average mortality per event. Other studies report a threefold rise in widespread extreme rain events over Central India (Roxy et al., 2017), increasing spatial variability in observed Indian rainfall extremes (Ghosh et al., 2012), and an increasing frequency of heavy rainfall events in peninsular, east and north east India that is correlated with flood risk (Guhathakurta et al., 2011).
For the sake of simplicity, all administrative entities have been referred to as states here. The number of flooding events for each state and database type is shown in Figure 4. Since DFO only records the latitude and longitude, state-wise statistics were not reported for it. The top 5 states are reported in Table 2.

A substantial number of cloudbursts have been recorded in IMD DWE, which is a cause of major concern due to the short but devastating nature of their impact. It is to be noted that the causative factors in IMD DWE are not encoded systematically, and often no causative factor is recorded at all. Hence, they need to be approached with caution. The monsoon season from June to September records the majority of the flood events, at 79% of the total. It also accounts for 83% of the total fatalities year-round.
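Seasonal shares of this kind can be computed directly from the standardized records. The sketch below is an assumption-laden illustration: events are assigned to a month by their start date, the field names are hypothetical, and the four toy records are invented to exercise the function, so the printed shares are not the inventory's 79% and 83%.

```python
from datetime import date

MONSOON_MONTHS = {6, 7, 8, 9}  # June to September


def monsoon_shares(records):
    """Fraction of events and of fatalities that fall in the monsoon season.

    Assumption: an event belongs to the month of its ISO 8601 start date.
    """
    if not records:
        raise ValueError("no records supplied")
    monsoon = [
        rec for rec in records
        if date.fromisoformat(rec["start_date"]).month in MONSOON_MONTHS
    ]
    total_deaths = sum(rec["fatalities"] for rec in records)
    event_share = len(monsoon) / len(records)
    death_share = (
        sum(rec["fatalities"] for rec in monsoon) / total_deaths
        if total_deaths else 0.0
    )
    return event_share, death_share


# Toy records for illustration only; not taken from the inventory.
toy_records = [
    {"start_date": "2001-07-10", "fatalities": 20},
    {"start_date": "2001-08-02", "fatalities": 30},
    {"start_date": "2002-09-15", "fatalities": 40},
    {"start_date": "2002-11-05", "fatalities": 10},
]
event_share, death_share = monsoon_shares(toy_records)
print(f"{event_share:.0%} of events, {death_share:.0%} of fatalities")
# 75% of events, 90% of fatalities
```

Events that span a month boundary would need an explicit rule, for example assigning them by start date as done here or splitting them by duration.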

Digitization and possible Applications
Another distinctive feature of this dataset is the availability of flooding extents in modern formats such as Shapefile (.shp), GeoPackage (.gpkg), and KML (.kml). These extents have been calculated for each event by matching the district- and state-level information available in the source datasets. Since these extents come with temporal information, remote sensing data such as Landsat, Sentinel, or MODIS could be used to develop inundation imagery for specific flooding events. This would be very helpful in validating hydrologic modeling simulations in various locations.

Conclusions and Future Work
The India Flood Inventory (IFI) is India's most comprehensive database of flooding events that is a) multi-source, b) standardized to international data specifications, and c) freely available in modern geospatial formats. Currently, IFI includes 49 years of flood data digitized from the IMD Disastrous Weather Events. It also includes 34 years of data from the Dartmouth Flood Observatory (DFO) and 93 years of data from the International Disaster Database (EM-DAT). Every effort has been made to augment and standardize them to a common schema, which makes IFI an analysis-ready dataset for a wide variety of applications related to flood hazard, risk, and exposure.
The majority of floods in the country happen in the monsoon season, which accounts for 79% of the yearly total, with a peak in July. The number of flood fatalities during the same period is 83% of the yearly total, with a peak in August. The seasonality of flooding is therefore well marked, which can guide flood management and disaster reduction efforts in the country. States with large flood plains, such as Uttar Pradesh, Assam, Maharashtra, Bihar, and West Bengal, experience the highest number of floods and fatalities, while hill states such as Uttarakhand have experienced catastrophic events, with some of the highest per capita death rates in the country.
This study has only begun a preliminary investigation into the spatio-temporal variations of flooding in India. Further investigation into the causative factors will be necessary to determine the structural and non-structural flood mitigation measures that may be necessary.
This dataset is expected to contribute towards encouraging such diagnostic and prognostic efforts. One of the goals of this study was to propose a standard specification for recording natural disaster information, which will aid future data collection efforts. The extensible framework proposed for the India Flood Inventory can be used to integrate data from a large number of disparate databases for any number of natural hazards. An ongoing upgrade to the inventory uses a cloud-based platform to derive the spatial inundation extents for the events from satellite imagery. This compilation is designed to be a massive ongoing effort as we digitize and incorporate sources of information from other federal and state disaster management agencies, most of which maintain independent datasets that are expected to be of even higher fidelity.