Decadal Evaluation of the AIRPACT Regional Air Quality Forecast System in the Pacific Northwest from 2009-2018

The Air Indicator Report for Public Awareness and Community Tracking (AIRPACT) is a comprehensive, automated air quality forecast system that provides 48-hr-in-advance air quality forecasts over the Pacific Northwest region (http://lar.wsu.edu/airpact/). Since 2001, the AIRPACT forecasting system has been successfully operated by Washington State University, with financial support from the Northwest International Air Quality and Environmental Science consortium (NW-AIRQUEST). AIRPACT consists of the Sparse Matrix Operator Kernel Emissions (SMOKE) model to provide temporally and spatially resolved emissions, the Community Multiscale Air Quality (CMAQ) model to simulate hourly ozone, particulate matter, and related precursor concentrations over the Pacific Northwest region, and the Weather Research and Forecasting (WRF) model to simulate the meteorological fields that are inputs for CMAQ; WRF is run by the University of Washington and its output is transferred to Washington State University. AIRPACT is one of the longest-running operational regional air quality forecast systems in the US that is based on chemical transport modeling. In this paper, we have evaluated AIRPACT forecasts for the last ten years (2009-2018) against quality-controlled EPA Air Quality System observations, with particular focus on examining how overall air quality forecast skill has changed as the AIRPACT system has evolved. During this period, AIRPACT was intermittently updated with improved physical and chemical processes as well as newer emissions and higher-resolution model domains. Our evaluation results show that AIRPACT's skill at forecasting ozone (O3) has improved over time. However, fine particulate matter (PM2.5) forecast performance has decreased over time. This is a non peer-reviewed preprint submitted to EarthArxiv.
The PM2.5 forecasts in the most recent version of AIRPACT were underpredicted to a larger degree than in previous versions, partly because elevated PM2.5 concentrations during the wildfire seasons of 2015 and 2018 were underestimated. In order to improve overall air quality forecast accuracy, our future efforts should focus on building a more reliable forecast system to handle extreme air quality events, in combination with new techniques for data assimilation, ensemble forecasting, and statistical post-processing.


Introduction
Ambient air pollution is responsible for almost 3 million deaths each year globally, making it a major concern for public health (WHO, 2016). In recent years, air pollution has received public attention, as many cities, particularly in developing countries, have experienced dangerous levels of air pollution that caused a serious health burden. In the US, air pollution has greatly improved over time due to the Clean Air Act (CAA) implemented by the U.S. Environmental Protection Agency (U.S. EPA) (U.S. EPA, 2015a). Regulatory policy has been controlling outdoor air pollution effectively; however, it is not realistic to eliminate air pollution entirely, and pollutants are transported in the ambient air for days or weeks (depending on the species), which makes air pollution a global problem. Some pollutants can be harmful even at ambient concentrations, especially for sensitive groups such as children (Neidell, 2004). Therefore, to protect public health from outdoor air pollution more effectively, proactive action, such as advising sensitive population groups about upcoming air quality, may be necessary.
Ozone and particulate matter with a diameter less than 2.5 µm (PM2.5) are criteria pollutants that are regulated under the CAA because of their adverse health impacts on the public. Since 2001, Washington State University has successfully operated the AIRPACT (Air Indicator Report for Public Awareness and Community Tracking) air quality forecast system for the Pacific Northwest (PNW) (Mass et al., 2003; Vaughan et al., 2004; Mahmud, 2005; Chen et al., 2008). Currently, AIRPACT predicts several air pollutants, including surface ozone and PM2.5, to 1) assist state and local air quality management agencies in making short-term and long-term plans to improve air quality in their jurisdictions and 2) forewarn the public, especially during extreme air pollution events such as wildfires, so that they can make informed decisions about their activities. Along with surface air pollution levels, AIRPACT also reports hourly air pollutant emissions, the chemical boundary conditions used in AIRPACT, and observations in handy visualizations to enable the public to understand the information easily. All our AIRPACT products are freely available via our website (http://lar.wsu.edu/airpact/).
The PNW region experiences various air quality events, including stratospheric ozone intrusions, primarily in spring, prescribed agricultural burning in spring and fall, wildfires in summer and fall, and residential wood burning in winter. Wildfire smoke causes notoriously poor air quality during summers in the region. The PNW region is also influenced by long-range transport of air pollutants from Asia (Jaffe et al., 1999). Extreme air pollution events such as stratospheric ozone intrusions and wildfires make air quality forecasting challenging. As AIRPACT is based on a 3-D gridded air quality model, our forecasts are also subject to uncertainties in input datasets (e.g., emissions, chemical boundary conditions, and meteorology) and from parameterizations of sub-grid-scale and complex physical and chemical processes. Thus, our group has constantly evaluated our forecasts against observations and has modified the AIRPACT system to provide more accurate forecasts to the public. As a routine process, AIRPACT is evaluated daily against AIRNOW (pre-quality-control) observations, and performance statistics are published online. AIRPACT has also undergone more thorough evaluations against surface observation networks and satellite products; for example, Chen et al. (2008) evaluated AIRPACT-3 during August and September of 2004. In this paper, we have evaluated AIRPACT ozone and PM2.5 forecasts from the last 10 years (2009 to 2018) against the EPA's AQS observations, with a primary focus on how AIRPACT forecast skill has progressed as our modeling system went through major updates. Our archived data are limited to hourly ozone, PM2.5, and basic meteorology data at EPA AQS sites, because we were not able to save all AIRPACT forecast products due to data storage costs; each day of forecast data takes many gigabytes of space. The meteorology evaluation is provided in the supplementary materials, as it is limited by large gaps in the meteorological output during 2009-2012.
During the 2009-2018 period, AIRPACT underwent two major updates: from version 3 (hereafter, AP-3) to version 4 (hereafter, AP-4) and then to version 5 (hereafter, AP-5). Table 1 provides a summary of each AIRPACT version. Note that we provide only the model version numbers for WRF, CMAQ, and SMOKE; please refer to the relevant developer group's documentation for the details of the updates made in each model version. Given that each major AIRPACT update involved many minor updates, and that we have not maintained access to older AIRPACT models, our analysis is mainly focused on describing the changes in forecast skill among the AIRPACT versions and, when possible, we provide a potential cause for such changes.

The AIRPACT Air Quality Forecast System
Our AIRPACT system simulates hourly O3 and PM2.5 and related precursor levels over the PNW region, and consists of WRF, SMOKE, and CMAQ. The WRF forecasts used in AIRPACT are generated daily by the University of Washington (UW); the specific details of the WRF model setup are available at this website (https://atmos.washington.edu/wrfrt/info.html).
With the completion of the WRF forecast at the UW, the MCIP (Meteorology-Chemistry Interface Processor) preprocessor is run to extract WRF output fields for transfer to WSU, where the MCIP meteorology files are used as input both for SMOKE emissions processing and for the CMAQ chemical transport model, which produces the air quality forecast. As supplementary information, we have provided a list of publications related to AIRPACT from the Laboratory for Atmospheric Research, Washington State University.
The current version uses SMOKE v3.5.1, CMAQ v5.0.2, and WRF v3.7.1 over a domain that includes the entirety of Washington, Oregon, and Idaho, the adjoining parts of Canada and western Montana, and small northern sections of California, Nevada, and Utah.
The AP-5 modeling system is depicted in S-Fig. 1. The model horizontal grid spans 285 columns west to east and 258 rows south to north, with grid cells of 4 km x 4 km and 37 vertical layers, the lowest of which is ~40 meters deep. The AP-4 system used CMAQ v4.7.1, SMOKE v2.7 and v3.5, and WRF v3.4.1 and v3.5. AP-4 used the same domain and horizontal grid as AP-5, but with only 21 vertical layers. The AP-3 system used CMAQ v4.6, SMOKE v2.1, and WRF v3.1.1 with 12 km x 12 km grid cells and 21 vertical layers. AP-3 had a slightly larger domain that extended further north (see S-Fig. 2). The number of vertical layers in AP-5 was increased from 21 to 37 in order to better resolve the tropopause and to better capture stratospheric ozone intrusion events. Note that the WRF meteorology from UW provided 37 vertical layers for AP-3 and AP-4 as well, but layer collapsing in MCIP condensed those to 21 layers to control CMAQ computing time.
Throughout the last three versions of AIRPACT, the CMAQ model has been updated whenever a newer version became available (see Table 1). One of the significant updates occurred in AP-5, which uses 1) CMAQ v5.0.2, which improves particulate matter (PM) speciation (i.e., it separates the old "PMother" term into 12 additional PM categories), and 2) the carbon bond gas-phase mechanism (CB05) instead of SAPRC99. The latter change was based on two main findings from Luecken et al. (2008): CB05 is faster than SAPRC99, which helps reduce CMAQ computing time, and it tends to predict lower ozone concentrations than SAPRC99 on average. AIRPACT with SAPRC99 tended to overpredict ozone values, so the switch to CB05 was in part an attempt to reduce ozone overprediction.
The AIRPACT system includes a comprehensive set of emissions, including mobile, non-mobile, biogenic, and fire sources, that accounts for spatial and temporal variation (see the details in Table 1). Different emissions types and inputs are combined and assigned to grid cells using SMOKE. To handle mobile emissions, AIRPACT used the MOBILE6 model in AP-3 and in AP-4, but switched partway through AP-4 to MOVES. MOVES was developed in response to concerns from the National Research Council that the MOBILE6 model was insufficient, and was designed to be more adaptive and easier to use (Koupal, Cumberworth, Michaels, Beardsley, & Brzezinski, 2003); the EPA no longer uses MOBILE6 and no longer accepts the use of that model for regulatory analysis. MOVES emissions are based on modes, a method within MOVES to characterize local emissions on a finer scale than the MOBILE6 vehicle emissions, which are based solely on regional patterns. Thus, these modes in MOVES allow a finer definition of emissions (Beardsley, Warila, Dolce, & Koupal, 2009).
MOVES is the currently maintained model and is updated with newer emissions and activity data (U.S. EPA, 2016a). Non-mobile emissions for AIRPACT are gathered from state emissions inventory reports and the National Emissions Inventory (NEI). The NEI is a product produced by the EPA that contains an estimate of criteria air pollutant emissions; a new NEI version is released every three years (U.S. EPA, 2015b). To keep emissions up to date, the NEI inventory used in AIRPACT has been updated whenever a newer NEI was released. States release their own emissions inventories, which allow partial updates to emissions. AIRPACT currently uses the 2014 NEIv2 with some local modifications or updates provided by NW-AIRQUEST member agencies.
For biogenic emissions, the Biogenic Emissions Inventory System (BEIS3) (Vukovich & Pierce, 2002) was used in AP-3, but it was replaced with the Model of Emissions of Gases and Aerosols from Nature (MEGAN) (Guenther et al., 2006) starting with AP-4. The dataset in the BEIS3 model used a 1-km grid and was normalized by season (Chen et al., 2008). The emissions factors used in BEIS3 were based on a land use cover database for North America.
Although MEGAN is designed to be used as a global emissions model for terrestrial aerosols and gases, we run MEGAN at 1-km resolution like BEIS3. According to Hogrefe et al. (2011), MEGAN can result in higher ozone concentrations because it predicts higher isoprene concentrations than BEIS3.
For wildfire emissions, AIRPACT has used the USDA Forest Service BlueSky system.
The original process depended on the USDA Forest Service BlueSky forecasts, which provided wildfire emissions and plume rise directly to AIRPACT. Several years later, the BlueSky Framework was installed and operated independently, allowing multiple options for how emissions would be generated; the Framework offers options for each of the various steps that determine fuels, consumption, timing, and finally emissions from fires (Larkin, 2016), resulting in CO, PM2.5, coarse PM, and heat flux projections. AIRPACT now uses the BlueSky Framework to acquire fire sizes and locations from SMARTFIRE, while emissions processing is streamlined through customized lookup tables and plume rise is calculated using the DEASCO3 method, which generates improved plume characterization.

BlueSky depends on SMARTFIRE (Satellite Mapping Automatic Reanalysis Tool for Fire Incident Reconciliation) to characterize the fires to be modeled. SMARTFIRE gathers fire information from NOAA's Hazard Mapping System (HMS) and fire perimeters from GeoMAC (Geospatial Multi-Agency Coordination, https://www.geomac.gov) and merges them into a compatible format (Larkin, 2016). Originally, SMARTFIRE also used wildfire Incident Status Report (ICS209) fire areas; however, the electronic accessibility of ICS209 reports for wildfires was transferred to an incompatible system (IRWIN). This has degraded the daily accuracy of SMARTFIRE results, especially when HMS misses fires in cloudy conditions. As a forecast system, AIRPACT is constrained to using detected fires and thus must make assumptions. Detected fires are assumed to persist at their reported size for the two-day forecast, an assumption we refer to as 'persistence'; our inability to reflect fire suppression, fire growth, or extinguishment by weather or fuel shortage is a limitation.

Results
We have evaluated the AIRPACT surface ozone levels and surface PM2.5 concentrations at AQS sites during 2009-2018. We have also evaluated temperature, specific humidity, and wind speed and direction from the WRF simulations for 2009-2018, but we present those results in S-Table 1 in the supplementary materials because of the large gap in the archived meteorology data. We used daily maximum 8-hour average (DM8A) ozone levels and daily 24-hr average PM2.5 concentrations in all the evaluations for this paper. For the ozone and PM2.5 evaluations, we categorized the AQS sites by location type (i.e., rural, suburban, or urban) in order to better capture any systematic forecast issues. We again note that this AIRPACT performance evaluation is limited to the AQS sites common to all versions for each species or variable; if observations for a site are missing during any AIRPACT version, we excluded that site. This reduces the number of monitoring sites, but it allows us to make a fair comparison among the different AIRPACT versions: a total of 26 AQS sites for O3 and a total of 89 sites for PM2.5 (see S-Table 2 for the details of the AQS sites used in this paper).
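As an illustration of the DM8A metric, the following sketch computes it from one day of hourly values. The `dm8a` helper is hypothetical (not part of the AIRPACT codebase), and operational EPA implementations additionally apply data-completeness rules and borrow early-morning hours of the following day for late-starting windows, both of which are omitted here.

```python
def dm8a(hourly_ozone):
    """Daily maximum 8-hour average from a day's hourly ozone values.

    Sketch: compute the mean of every complete running 8-hour window
    that starts within the day and return the largest such mean.
    Incomplete trailing windows are simply dropped.
    """
    assert len(hourly_ozone) >= 24, "expect at least 24 hourly values"
    window_means = []
    for start in range(24):                     # windows starting 00:00 .. 23:00
        window = hourly_ozone[start:start + 8]
        if len(window) == 8:                    # keep only complete 8-hr windows
            window_means.append(sum(window) / 8.0)
    return max(window_means)
```

For example, a flat 30 ppbv day yields a DM8A of 30 ppbv, while a midday plateau raises the DM8A to the plateau value.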
To evaluate how the AIRPACT system's forecast accuracy has progressed over this time span, we used several statistical measures, including mean bias (MB), mean error (ME), root mean square error (RMSE), normalized mean bias (NMB), normalized mean error (NME), fractional bias (FB), fractional error (FE), and the coefficient of determination (R2). Although we present all of these measures, we mainly discuss the meteorology evaluation using MB, ME, and RMSE, and the ozone and PM2.5 evaluations using FB and FE, because benchmark values are available for those metrics.

Ozone evaluation

AP-4 shows the highest coefficient of determination (R2) value, 0.68. From AP-3 to AP-4, most evaluation metric values are quite similar, which indicates that reducing the grid spacing from 12 km to 4 km did not help improve the O3 predictions. Updating to AP-5 shows noticeable improvements: all the bias terms are reduced by nearly half and the error terms decrease by a few percent, even though AP-5 shows the lowest R2 value, 0.54. Note that the O3 evaluation at individual AQS sites is presented in S-Table 3. Figure 1 shows ratios of forecast to measured DM8A O3 against the corresponding measured O3 levels at the 26 AQS sites for each AIRPACT version. Note that a ratio of 1 means perfect agreement between forecast and measurement. All AIRPACT versions performed well, mostly within a factor of two. AIRPACT shows better agreement in the higher concentration regime (over 30 ppbv) than in the lower concentration regime (below 30 ppbv). A systematic overprediction in the low concentration regime was also reported by Chen et al. (2008), who evaluated AIRPACT-3 during a two-month period (August and September) of 2004.
This problem also appears in other air quality models. For instance, the multi-model intercomparison study on tropospheric ozone (the Air Quality Model Evaluation International Initiative, AQMEII), presented by Im et al. (2015), found a similar systematic overprediction of surface-level ozone below 30 ppbv from all participating air quality modeling systems over North America, including WRF with CMAQ as used in this study (see Figure 9b in Im et al., 2015).
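The aggregate measures used in this evaluation (MB, ME, RMSE, NMB, NME, FB, FE, R2) can all be computed from paired forecast-observation series. The sketch below is illustrative only: the paper does not spell out its exact formulas, so the fractional metrics follow the commonly used mean fractional bias/error definitions, and the `eval_metrics` helper name is hypothetical.

```python
import math

def eval_metrics(forecast, observed):
    """Common air-quality evaluation metrics for paired series.

    Assumes strictly positive concentrations so the fractional-metric
    denominators (f + o) and the normalized-metric denominator (sum of
    observations) are nonzero.
    """
    n = len(forecast)
    diffs = [f - o for f, o in zip(forecast, observed)]
    mb = sum(diffs) / n                                    # mean bias
    me = sum(abs(d) for d in diffs) / n                    # mean error
    rmse = math.sqrt(sum(d * d for d in diffs) / n)        # root mean square error
    obs_sum = sum(observed)
    nmb = 100.0 * sum(diffs) / obs_sum                     # normalized mean bias (%)
    nme = 100.0 * sum(abs(d) for d in diffs) / obs_sum     # normalized mean error (%)
    fb = 100.0 * sum(2.0 * (f - o) / (f + o)
                     for f, o in zip(forecast, observed)) / n   # fractional bias (%)
    fe = 100.0 * sum(2.0 * abs(f - o) / (f + o)
                     for f, o in zip(forecast, observed)) / n   # fractional error (%)
    # coefficient of determination: squared Pearson correlation
    fm, om = sum(forecast) / n, obs_sum / n
    cov = sum((f - fm) * (o - om) for f, o in zip(forecast, observed))
    var_f = sum((f - fm) ** 2 for f in forecast)
    var_o = sum((o - om) ** 2 for o in observed)
    r2 = cov * cov / (var_f * var_o)
    return {"MB": mb, "ME": me, "RMSE": rmse, "NMB": nmb,
            "NME": nme, "FB": fb, "FE": fe, "R2": r2}
```

Note that FB and FE are bounded (within plus/minus 200%), which is one reason they pair well with fixed benchmark thresholds.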
To understand how AIRPACT's ability to forecast ozone has changed by season, we compared the observed and forecast DM8A O3 distributions by season for each AIRPACT version using box plots (Fig. 2). All AIRPACT versions overpredict O3 in all seasons, except for the AP-5 summer season (see the details of the seasonal evaluation statistics in S-Table 4).
The overprediction is worse during the low-O3 seasons, such as fall and winter (mean bias of 4.4-7.8 ppbv with fractional bias of 13-26%), compared to the high-O3 seasons, such as spring and summer (mean bias of 2.1-5.5 ppbv with fractional bias of -4.5 to 12%). The systematic overprediction of O3 in the low concentration regime, shown in Fig. 1, should come mostly from the fall and winter seasons, as their O3 levels are frequently below 30 ppbv. The observed highest-O3 season is not well captured by any version: AP-4 and AP-5 show spring as the highest while the observations show summer, whereas AP-3 peaks during summer while the observations peak during spring. Figure 3 shows the observed and simulated diurnal ozone profiles in summer for each AIRPACT version. All AIRPACT versions capture the observed diurnal patterns remarkably well (R2 = 0.94-0.96). The simulated maximum and minimum O3 levels occur at 1-2 pm and 5-6 am, respectively, within 1-2 hours of the observations. It is important to understand what contributes to such a large overprediction in the low-O3 regime in air quality models, but most studies have focused on the high-O3 regime because high O3 levels are of more concern for air quality management and public health.
Even though further study is needed to identify the contributing factors, we suspect that this overprediction might be caused by the following: 1) missing nighttime NO titration of O3, especially over urban areas where the observed O3 values often go to zero but the models do not predict such low values; 2) incorrect background levels in the models; and 3) too-weak boundary layer mixing at night.
To examine the forecast skill spatially for each AIRPACT version, the FB and FE values of the AIRPACT DM8A O3 forecasts are presented at individual AQS sites. The worst performance (highest FE) tends to occur at an urban site and the best performance (lowest FE) at a rural site, but we do not find any distinct difference in overall forecast performance among those groups. Figures 6d and 6e show clearly that the model-to-observation agreement became worse from AP-3 to AP-4, despite the finer grid size applied in AP-4, and better from AP-4 to AP-5, likely because AP-5 adopted the CB05 gas-phase chemical mechanism, which produces lower O3 levels than the SAPRC mechanism and thus alleviated the O3 overprediction issue.

PM2.5 evaluation
The statistical summary of the evaluation of AIRPACT daily PM2.5 forecasts at 89 AQS sites, for each AIRPACT version and for the entire 2009-2018 period, is presented in Table 3. First of all, overall PM2.5 performance is roughly twice as poor as the overall O3 performance (e.g., the fractional errors, FE, of O3 and PM2.5 are 16% and 31%, respectively). Over the major AIRPACT updates, PM2.5 performance unfortunately appears to get worse. For example, the magnitude of the FB of PM2.5 has increased from 4.5% to 32%, and the FE from 26% to 38%. The coefficient of determination, R2, is above 0.5 for all versions except AP-4. Even though AP-5 shows the worst PM2.5 performance of the versions compared, it still meets the criteria benchmark for FB, which is ±60%; AP-3 and AP-4 meet the goal benchmark for FB, which is ±30%. The PM2.5 evaluation at individual AQS sites is presented in S-Table 5. The ratios of forecast to measured daily PM2.5 concentrations against the corresponding measured concentrations for each version are shown in Fig. 8. The daily PM2.5 data points are rather evenly distributed around 1 in AP-3 but move below 1 in the newer versions. AP-5 shows many data points (shown as yellow colors in Fig. 8) noticeably below 1, which reflects an underprediction issue in that version. AP-5 also has several extremely poor forecasts (i.e., ratio > 3 or ratio < 0.3), especially in the regime of PM2.5 above 10 µg m-3.
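The factor-of-two agreement used to read the ratio plots (Figs. 1 and 8) can also be summarized as a single number, often called FAC2. The paper reports the ratios graphically rather than this summary statistic, so the sketch below, including the `fac2` helper name, is an illustrative assumption.

```python
def fac2(forecast, observed):
    """Fraction of forecasts within a factor of two of the observation.

    Mirrors the ratio plots in the text: a forecast/observed ratio
    between 0.5 and 2 counts as 'within a factor of two'. Pairs with a
    zero (or negative) observation are skipped to avoid division by zero.
    """
    pairs = [(f, o) for f, o in zip(forecast, observed) if o > 0]
    within = sum(1 for f, o in pairs if 0.5 <= f / o <= 2.0)
    return within / len(pairs)
```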
Daily PM2.5 forecast skill by season is presented in Fig. 9. Modeled PM2.5 is generally underestimated in all seasons, and is particularly poor during summer. Based on the benchmark values, AP-3 performed well in all seasons (|FB| < 30%). AP-4 also meets the goal benchmark for all seasons except summer, which has an FB of -57%. In the case of AP-5, the fall and winter forecasts were good, but spring and summer do not meet the goal benchmarks (FB of -41% and -70%, respectively); summer does not even meet the criteria benchmark (±60%). The poor PM2.5 forecasts during the summer seasons in AP-5 might be partly due to missing the large observed PM2.5 spikes during the summers of 2015 and 2018, as shown in the time series plots of monthly mean PM2.5 concentrations by AQS site (grouped by rural, suburban, and urban) in S-Fig. 3. Air quality forecasts over the PNW appear to be most challenging during the summer (the wildfire season in the PNW) because wildfires can produce large PM2.5 spikes and poor air quality in a region with otherwise good air quality; the area-burned time series in S-Fig. 4 shows a pattern similar to PM2.5, which indicates the influence of wildfires on the spikes.
To show how the PM2.5 forecasts performed spatially across the major AIRPACT updates, we present the FB and FE values of the daily PM2.5 evaluation in a spatial distribution (Fig. 10) and in a scatter plot (Fig. 11). The spatial distribution of the FE difference from AP-3 to AP-5 at individual sites is shown in Fig. 12. AIRPACT PM2.5 performance ranges from underestimates (down to an FB of -126%) to overestimates (up to an FB of 84%), with an FE range of 32-134%. As shown in Fig. 10, AIRPACT tends to underpredict daily PM2.5 across the model domain, particularly in rural areas. Unlike O3, the daily PM2.5 performance shows a distinct difference in overall forecast performance by site location type: rural (25 sites), suburban (34 sites), and urban (30 sites). The sites showing overestimates are primarily in urban and suburban areas (e.g., Seattle, WA and Portland, OR), although some urban and suburban sites are underpredicted. Compared to the rural sites, urban and suburban sites have larger FE values. As shown in Figs. 11 and 12, AIRPACT's PM2.5 FE has increased with each major update. Part of the degradation in AP-5 likely stems from its chemical boundary conditions, which reverted to archived monthly means because the MOZART4 product from NCAR was discontinued; otherwise, it is likely that AP-5 would have performed similarly to the previous versions, if not better.

Conclusion
Since May 2001, the Laboratory for Atmospheric Research (LAR) group at Washington State University has been operating the AIRPACT air quality forecast system, which predicts near-future air quality over the PNW region. Currently, we are running AIRPACT version 5 (AP-5), which forecasts the next 48 hours of high-resolution air quality over the PNW region. Our AIRPACT system comprises three main models: the WRF meteorology model, the SMOKE emissions processing tool, and the CMAQ chemical transport model. The CMAQ simulations in AIRPACT use a comprehensive set of emissions that is based on up-to-date emissions inventories (i.e., the EPA's NEI2014v2 and new state emissions inventories) and emissions models such as the MOVES mobile emissions model, the MEGAN biogenic emissions model, and the BlueSky fire emissions model.
In this paper, we have evaluated the last 10 years of archived AIRPACT forecast data, from 2009 to 2018, against the EPA's AQS monitoring sites. Our evaluation is limited to the forecast products at the EPA's AQS sites. Over this time period, the AIRPACT system went through two major updates that resulted in system version changes: from AP-3 (2007 to 2012) to AP-4 (2013 to 2015) and then to AP-5 (2016 to present). The major updates made to the AIRPACT system include: a) incorporating newer model versions for CMAQ, WRF, and SMOKE; b) switching to a different chemical mechanism (from SAPRC99 to CB05) or to a different sub-model (from MOBILE6 to MOVES; from BEIS to MEGAN); c) using finer horizontal and vertical grids; and d) adopting newer input datasets such as emissions inventories and chemical boundary conditions (see the details in Table 1).
AIRPACT O3 forecasting has improved over time. Between AP-3 and AP-4 there are minimal forecast skill differences; however, the update to AP-5 brought notable improvements. AP-5 improves on AP-3 and AP-4 according to all statistical metrics used, except R2. In AP-5, the MB and FB are nearly twice as good as in either preceding version. The switch from SAPRC99 to the CB05 gas-phase chemical mechanism lowered the forecasted O3 levels and thus lessened AIRPACT's tendency to overpredict O3, which likely explains the better performance of AP-5. For all versions of AIRPACT, the FB and FE meet the goal benchmark values for O3 (i.e., FB of ±15% and FE of ±30%). We find that O3 levels above 30 ppbv were forecasted with higher accuracy than levels below 30 ppbv, and that O3 forecast performance, including the diurnal cycle, is generally better in summer than in winter. All versions of AIRPACT struggle at forecasting wintertime O3, with constant overprediction, especially during the night.
Unlike the O3 forecast performance, the PM2.5 forecast performance has worsened as AIRPACT has progressed through versions, from slight overprediction (FB of 4.5%) in AP-3 to large underprediction (FB of -32%) in AP-5. The poor PM2.5 performance in AP-5 was likely due to large underpredictions during spring and summer, driven by missing wildfire emissions and the use of monthly mean chemical boundary conditions.
However, all versions of AIRPACT meet the criteria benchmark for FB of ±60%. It is important to understand that our comparisons between different AIRPACT versions are not based on the same period, and thus the changes in forecast skill over time are also influenced by changes in extreme air quality events such as stratospheric ozone intrusions and wildfires. For example, the significant underprediction of PM2.5 in AP-5 is attributed to the summers of 2015 and 2018, when the AP-5 forecasts severely underpredicted large PM2.5 concentrations due to wildfires. The PM2.5 underprediction in summer 2018 was largely due to missing smoke from Canadian fires in the chemical boundary conditions: we were using archived monthly mean chemical boundary conditions, as MOZART was no longer available and this was before our transition to WACCM. This suggests that a more reliable forecast system for handling extreme air quality events may improve overall forecast accuracy.
This multi-year evaluation provides a unique opportunity to examine how a regional air quality system has evolved over the last 10 years, particularly how the substantial science advances and technical updates applied to the system have affected air quality forecast ability. Even though our long-term evaluation reveals that some major updates, such as reducing the grid size, did not improve forecast skill, the updates made to the system have been based on the latest science, to the best of our knowledge, and thus AIRPACT has evolved into a more advanced air quality forecast system over time. Compared to the substantial effort that went into the AIRPACT updates, our forecast accuracy has improved little, which reflects the challenges of improving forecast skill in the current AIRPACT system, which is based on 3-D air quality modeling alone. This finding suggests the need to consider new approaches including data assimilation, ensemble forecasting, and statistical post-processing that accounts for systematic model errors and sub-grid processes.

The FAR has an ideal value of zero, and the POD has an ideal value of one. The FAR describes how often the forecast predicted a higher AQI than observations; the POD describes how often the forecast predicted a lower AQI than observations.

Figure 12. Spatial distribution of the daily PM2.5 fractional error (FE) difference between AP-5 and AP-3 at each monitoring site; FE is represented by the color bar.

Due to long-term storage issues, we lost some archived meteorology forecasts, which hinders a portion of the meteorology analysis.
We still evaluated the WRF performance against observations at EPA AQS sites, comparing it to two sets of benchmark values, a simple-case benchmark and a complex-case benchmark from Emery et al. (S-Table 1). First of all, specific humidity and wind direction are well within the simple-case benchmark values for all three AIRPACT versions. Wind speed performance is slightly worse than the simple-case MB benchmark value (i.e., ±0.5 m s-1) but is well within the RMSE benchmark (i.e., 2 m s-1). Temperature is the least satisfactory, as it even falls outside the complex-case MB benchmark value (i.e., ±2.0 K) before the AP-5 period.

Updates made to the WRF model do not necessarily improve the meteorology forecasts in all aspects. Based on the ME/RMSE values, wind speed and wind direction improved by approximately 10-30% from AP-3 to AP-4 and by 0-15% from AP-4 to AP-5. Temperature and humidity worsened by approximately 40-130% from AP-3 to AP-4, although they improved by 30-60% from AP-4 to AP-5. The poor performance of temperature and humidity in AP-4, compared to AP-3, is rather unexpected, as the grid resolution was changed from 12 km x 12 km to 4 km x 4 km along with other updates.

S-Table 4. Seasonal evaluation results. Note that the O3 results are based on a total of 26 AQS sites and the PM2.5 results on a total of 89 sites.