Open Access Weather and Climate Data

Importing and using them in R

Using R
Author

Pedro J. Aphalo

Published

2023-07-31

Modified

2024-08-24

Abstract

Example R code to download, import and plot weather and climate data for Europe and the whole Earth. The databases used in the examples are E-OBS, TerraClimate and NASA POWER.

Keywords

R packages, weather data

Warning

Something has changed in R package ‘tidync’ between April and August 2024 in a way that broke the code examples. When selecting a single value in two out of three grid variables, none of the grid variables are included in the tibble returned by function hyper_tibble(). Another change is that, in the examples below, the values returned in the tibble for the grid variable time are not the numeric values stored in the NetCDF file but instead character strings. However, selection of grid points still seems to work based on the numeric values.

In the case of E-OBS the behaviour also changed when using the same locally saved NetCDF file as earlier.

I have edited the code examples that extracted data using ranges to work with the current version of ‘tidync’ and other packages. I was unable to make the example that extracted data for a single geographic grid point work, so I have deleted it.

Tip

On this page code chunks are “folded” to decrease the clutter when searching for examples. Above each plot you will find one or more “folded” code chunks signalled by a small triangle followed by “Code”. Clicking on the triangle “unfolds” the code chunk, making visible the R code used to produce the plot.

The code in the chunks can be copied by clicking on the top right corner, where an icon appears when the mouse cursor hovers over the code listing.

The Code drop-down menu to the right of the page title makes it possible to unfold all code chunks and to view the Quarto source of the whole web page.

Names of functions and other R objects are linked to the corresponding on-line help pages. The names of R extension packages are linked to their documentation web sites when available.

Note

The content of this page was originally written for the Hints and Tips section of the UV4Plants Bulletin but it was never published, as the Bulletin ceased publication.

1 Introduction

A problem faced when designing studies comparing plant responses over broad regions is how to obtain consistent and reliable weather and climatological data. Inspired by Danielle Griffoni’s presentation at the most recent UV4Plants network meeting and my own need for such data, I have chosen as the subject of the present column Weather and climate: Open data sources. By necessity this article deals with just a few examples out of the many data sources available, and is far from being comprehensive. The included code examples are in R, simply because R is the language I use most frequently. Frequent users of Python and other computer languages would surely be able to produce similar code examples in their favourite language.

We are in the era of Big Data and Open Data but how can we find, download and use these data? As usual, I will use R for the examples. Most spatially and temporally indexed data are nowadays stored in a file format called NetCDF4, the fourth update of the NetCDF format’s definition. The key advantage of these files is that they can be read selectively. In other words, data can be read and used from files that are larger than the available computer memory. This is crucial because for daily data on a dense grid, such as those at 1/24\(^\circ\) of latitude and longitude, with global or continental coverage and many decades of data, a single variable can require tens of gigabytes of storage. How much of the data included in these sets one will actually use depends on the number of sites and the length of time of interest, and in many cases it is only a tiny portion of the whole data set.

Recently I was working with daily data for nearly \(1\,000\) sites over more than 60 years, producing multiple rolling and accumulated summaries in R, and I ended up creating a data frame (\(\approx\) a worksheet) containing nearly \(1\,000\,000\,000\) values! The bright side is that I could do this processing both on my four-year-old slim laptop (ThinkPad T395s) and on a 12-year-old desktop PC (a Lenovo ThinkStation E30 from 2011). Of course, reading and processing the data are time-consuming. In this case my script took over half a day to run for each meteorological variable read and summarised.

Getting access to such a trove of curated and validated meteorological data opens the door to many data analyses. On the other hand, as data on UV irradiance not weighted with the erythemal action spectrum are rather scarce, things are not as easy for UV research as one would like. Global radiation and ozone column are more easily obtainable, opening the door to radiation transfer modelling.

2 The NetCDF format

Network Common Data Form (NetCDF) files are described as self-descriptive, machine and computer-language independent, scalable, appendable, and shareable. What does this all mean in practice?

NetCDF files contain metadata indicating the units of expression of the values stored and a short description for each variable. Some variables are tagged as describing the grid on which data are stored; for geographical data the grid is given by latitude and longitude. In the case of meteorological data, time adds a third dimension to the grid. For transcriptome data from a microarray the grid would describe the coordinates of the “cells” in the array.
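As a minimal illustration of this self-description, the metadata can be listed directly from R. The sketch below uses package ‘ncdf4’ (not used elsewhere on this page); the file path and the variable name "tmax" are placeholders only.

Code
library(ncdf4)

# open a NetCDF file (placeholder path) and list its structure:
# variables, their units and descriptions, and the grid dimensions
nc <- nc_open("data/example.nc")
print(nc)

# read individual attributes of one variable (placeholder name "tmax")
ncatt_get(nc, "tmax", "units")
ncatt_get(nc, "tmax", "long_name")

nc_close(nc)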

A NetCDF file saved under one operating system (e.g., Linux, Windows, etc.) or computer architecture (e.g., Intel 8086, ARM, etc.) can be read and written to under any other. This works not only locally, but also remotely through the internet. Multiple “clients” can read the same file simultaneously, locally or remotely, and selectively.

NetCDF files can be huge (several gigabytes in size), but as they can be selectively read from and written to based on positions along grid dimensions, even remotely, they avoid unnecessary data transmission as well as the need for enough RAM to read the whole file at once. As the files can be reopened to incrementally add data, this too can be done efficiently and frequently. Remote access allowing selective download is implemented using THREDDS, OPeNDAP and similar DAP servers (DAP is an acronym for data access protocol). These servers can be accessed interactively through a web browser or programmatically, for example from R or Python scripts. When the file format is NetCDF, some R packages make remote access consistent with local file access, which is an advantage.

The use of the NetCDF file format is widespread for meteorological and geophysical data, but the format is also used in other fields. NetCDF files are well supported by software, as the core software libraries needed are available for free and are open-source. The format was developed by UniData primarily for storing geophysical data.

3 Gridded data and interpolation

Original data collected at ground level by weather stations are not available on a regular grid; instead, stations are irregularly distributed in space, and measurements describe conditions at these specific locations. Data derived from satellite instrumentation are distributed following the overpass tracks and represent spatial averages as dictated by the resolution of the cameras or other instruments. The merging of data from multiple sources, using models, receives the name of “data assimilation”.

Many of the currently available climatological data sets integrate data from multiple sources, and re-express them on a regular grid to facilitate their use. Spatial interpolation is used to achieve this, and obviously the uncertainty of the interpolated values as re-expressed on the grid depends on the density of observations available as well as on their uncertainty. The interpolation process is also affected by topography, which, even when explicitly taken into account, adds to the uncertainty of the data. In these data sets, the uncertainty varies among variables; for example, more meteorological stations have instrumentation for air temperature than for global radiation. Depending on the data set, long-term trends may or may not be meaningful, as bias can be introduced by changes in the location of weather stations or, more frequently, by changes to their surroundings, say expansion of the built-up area of a city. Some databases include data derived from the primary observations, such as soil moisture or climatic water deficit. Obviously, when variables like potential evapotranspiration (PET) are derived from primary observations, uncertainties accumulate. As with any derived or interpolated data, it is important to understand their origin, the uncertainties involved and the use they have been designed for. Data sets are prepared with an objective in mind and validated for a certain use. For example, in some cases interpolation has prioritized the minimization of temporal biases and in other cases it has not.

One final reminder: spatial grid density after interpolation no longer reflects the spatial density of measurements. So if data are interpolated to a grid with 1/24 degree resolution using data from stations that are hundreds of kilometers away, what we gain by gridding is not new information but only an informed guess of what the conditions may have been at a location for which there are no data available. Sometimes other information is used to improve the predictions, such as the elevation above sea level at each grid point. In any case, topography decreases the reliability of the gridded estimates in mountainous regions.

4 Data use restrictions and acknowledgements

The data in each database may have multiple origins and have been subjected to various processing steps. There is much personal effort and money invested in them, and proper recognition is due to their authors. In practice, continued availability of these data resources depends on their usefulness and contribution to scientific progress and practical applications being demonstrated. This makes it crucial that every user, without exception, correctly acknowledges the use of open data sources.

The use of the data in these databases is subject to restrictions; for example, E-OBS data cannot be used for commercial purposes, at least without obtaining explicit authorization. Each database has instructions about how the source should be acknowledged. Do check these before use and make sure to comply with acknowledgements as requested by the authors. These restrictions, and especially how to cite the databases, can change over time, for example when the data set is updated or when a new article describing the data or the methods is published. This means that in most cases we should use the most recent version of the data, and always check the current use restrictions and include the citation(s) and acknowledgement(s) as required or suggested at the time we prepare the publication-ready version of our manuscripts.

5 The databases

I will show how to retrieve data from several databases containing data of interest for research with plants and vegetation. Most databases provide multiple ways to retrieve data, in most cases using the NetCDF file format described above. Interactive access with a browser is demonstrated for E-OBS in a video (Video S1), while here I demonstrate code-based data access and download, as well as some post-processing, with examples using R.

5.1 TerraClimate

TerraClimate provides monthly data with worldwide coverage at a very high resolution (1/24 degree), and in addition to monthly summaries of weather measurements it contains several derived variables very useful for ecological studies and agriculture (Table 1). The data have been validated against different measurements, but it must be remembered that the availability of primary data is very uneven across the Earth's surface, and uncertainties vary accordingly. The data are derived from the assimilation of multiple data sources, including both surface and remote measurements. It is important to be aware that although the data are provided on a very dense spatial grid, the original data are much sparser, so the true spatial resolution is not as fine as one might think at first sight. On the other hand, the interpolated values can in many cases be the best approximation available to the unknown conditions at a given point in space.

Table 1: Variables in the TerraClimate data set. Variable names used in the data set and brief text descriptions. The metadata included in the NetCDF files provide up-to-date information on the units and basis of expression of the values and on their encoding in the files.
Variable Description
aet Actual Evapotranspiration, monthly total
def Climate Water Deficit, monthly total
pet Potential evapotranspiration, monthly total
ppt Precipitation, monthly total
q Runoff, monthly total
soil Soil Moisture, total column at end of month
srad Downward surface shortwave radiation
swe Snow water equivalent at end of month
tmax Max Temperature, average for month
tmin Min Temperature, average for month
vpd Vapor Pressure Deficit, average for month
vap Vapor pressure, average for month
ws Wind speed, average for month
PDSI Palmer Drought Severity Index, at end of month

TerraClimate also provides future climate projections developed for two different climate futures: (1) when global mean temperatures are 2\(^{\circ}\)C warmer than pre-industrial, and (2) when they are 4\(^{\circ}\)C warmer. The future climate data are based on actual data from years 1985–2015, which for the manipulated data are called pseudo-years. The values of variables for future climate are based on multiple simulation models.

This database supports the OPeNDAP protocol, which allows selective download from a remotely served NetCDF file, in addition to allowing download of the whole data set and of ready-made subsets. The data set has 37.3 million grid points for 14 variables for each of 12 months for 61 years, i.e., \(\approx 382\times10^9\) data values plus the values describing the coordinates of each grid point. This is Big Data.
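As a quick sanity check of these figures (simple arithmetic only, not part of the original examples): a 1/24-degree global grid has 180 × 24 = 4320 latitudes and 360 × 24 = 8640 longitudes.

Code
# back-of-envelope check of the size of the TerraClimate data set
grid_points <- (180 * 24) * (360 * 24)     # 1/24 degree global grid
grid_points                                # about 37.3 million grid points
data_values <- grid_points * 12 * 61 * 14  # months x years x variables
data_values                                # about 382e9 data values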

Once one understands how to remotely read a NetCDF file, and how to select the observations from the 3D grid (latitude, longitude and time), the R code needed is rather simple. In this case we need to download variables one by one. The only information we need is the URL to use; in this database the URLs differ only in the name of the variable.

Package ‘tidync’ can be used to retrieve the data in one step, directly obtaining a tibble. One difficulty when retrieving data is that we need to know the coordinates of the grid points nearest to our target locations.
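A minimal sketch of how the nearest grid-cell centre could be computed is shown below, assuming that cell edges fall on multiples of the grid step (1/24 degree for TerraClimate); the helper nearest_grid_centre() is not part of the original examples, and the single-grid-point extraction itself is not shown (see the warning above).

Code
# helper to find the grid-cell centre nearest to a target coordinate,
# assuming cell edges at multiples of `step` (1/24 degree for TerraClimate)
nearest_grid_centre <- function(x, step = 1/24) {
  (floor(x / step) + 0.5) * step
}

# e.g., cell centre nearest to Joensuu, Finland (62.60 N, 29.76 E)
nearest_grid_centre(62.60) # latitude
nearest_grid_centre(29.76) # longitude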

Here we retrieve data from all grid points within an area and for a time period. To save typing we define a function which, thanks to its name, also makes clear the intention of the code.

Code
# name of variable to fetch

var <- "tmax"

# URL to remotely access data and selectively download them

data_url <- 
  paste0(paste0("http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_terraclimate_",
                var),
         "_1958_CurrentYear_GLOBE.nc")

tnc <- tidync(data_url)
not a file: 
' http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_terraclimate_tmax_1958_CurrentYear_GLOBE.nc '

... attempting remote connection
Connection succeeded.
Code
activate(tnc, "tmax")

Data Source (1): agg_terraclimate_tmax_1958_CurrentYear_GLOBE.nc ...

Grids (5) <dimension family> : <associated variables> 

[1]   D2,D1,D3 : tmax    **ACTIVE GRID** ( 29561241600  values per variable)
[2]   D0       : crs
[3]   D1       : lat
[4]   D2       : lon
[5]   D3       : time

Dimensions 4 (3 active): 
  
  dim   name  length     min     max start count    dmin    dmax unlim coord_dim 
  <chr> <chr>  <dbl>   <dbl>   <dbl> <int> <int>   <dbl>   <dbl> <lgl> <lgl>     
1 D1    lat     4320   -90.0    90.0     1  4320   -90.0    90.0 FALSE TRUE      
2 D2    lon     8640  -180.    180.      1  8640  -180.    180.  FALSE TRUE      
3 D3    time     792 21184   45259       1   792 21184   45259   FALSE TRUE      
  
Inactive dimensions:
  
  dim   name  length   min   max unlim coord_dim 
  <chr> <chr>  <dbl> <dbl> <dbl> <lgl> <lgl>     
1 D0    crs        1     3     3 FALSE TRUE      
Code
print(tnc)

Data Source (1): agg_terraclimate_tmax_1958_CurrentYear_GLOBE.nc ...

Grids (5) <dimension family> : <associated variables> 

[1]   D2,D1,D3 : tmax    **ACTIVE GRID** ( 29561241600  values per variable)
[2]   D0       : crs
[3]   D1       : lat
[4]   D2       : lon
[5]   D3       : time

Dimensions 4 (3 active): 
  
  dim   name  length     min     max start count    dmin    dmax unlim coord_dim 
  <chr> <chr>  <dbl>   <dbl>   <dbl> <int> <int>   <dbl>   <dbl> <lgl> <lgl>     
1 D1    lat     4320   -90.0    90.0     1  4320   -90.0    90.0 FALSE TRUE      
2 D2    lon     8640  -180.    180.      1  8640  -180.    180.  FALSE TRUE      
3 D3    time     792 21184   45259       1   792 21184   45259   FALSE TRUE      
  
Inactive dimensions:
  
  dim   name  length   min   max unlim coord_dim 
  <chr> <chr>  <dbl> <dbl> <dbl> <lgl> <lgl>     
1 D0    crs        1     3     3 FALSE TRUE      
Code
# Function to select the area and period of interest

is_within <- function(x, range) {
  # protect from ranges given backwards
  range <- range(range)
  x > range[1] & x < range[2]
}

# we save the limits to grid variables
my.lon <- c(50, 50.5)
my.lat <- c(30.5, 31)
my.period <- as.integer(ymd(c("1960-03-01", "1960-09-01")) - ymd("1900-01-01"))

hyper_tibble(tnc,
             lon = is_within(lon, my.lon),
             lat = is_within(lat, my.lat),
             time = is_within(time, my.period),
             na.rm = FALSE) %>%
  # convert time from character to a date
  mutate(date = ymd(time),
        tmax = ifelse(abs(tmax) == 32768, NA, tmax)) %>%
  select(-time) -> grid_area_in_time.tb

grid_area_in_time.tb
# A tibble: 720 × 4
    tmax lon              lat              date      
   <dbl> <chr>            <chr>            <date>    
 1  27.8 50.0208333333333 30.9791666666667 1960-04-01
 2  27.5 50.0625          30.9791666666667 1960-04-01
 3  26.6 50.1041666666667 30.9791666666667 1960-04-01
 4  23.2 50.1458333333333 30.9791666666667 1960-04-01
 5  23.8 50.1875          30.9791666666667 1960-04-01
 6  26.0 50.2291666666667 30.9791666666667 1960-04-01
 7  26.8 50.2708333333333 30.9791666666667 1960-04-01
 8  26.7 50.3125          30.9791666666667 1960-04-01
 9  26.6 50.3541666666667 30.9791666666667 1960-04-01
10  26.6 50.3958333333333 30.9791666666667 1960-04-01
# ℹ 710 more rows

In principle we could download the whole files and access them locally, but the catch is the size of the files: up to a few GB per variable. For this reason, unless we plan to use a large chunk of the data or to repeatedly extract different subsets, it is more efficient to use selective downloads. A sketch of a whole-file download is shown below.
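The sketch below downloads one whole NetCDF file for local use; the URL is a placeholder only, as the actual download links are listed on the TerraClimate web site.

Code
# sketch only: download one whole NetCDF file and open it locally;
# the URL below is a placeholder, not an actual TerraClimate download link
library(tidync)

options(timeout = 3600) # large files need a generous download timeout
download.file(url = "https://example.org/terraclimate_tmax.nc",
              destfile = "data/terraclimate_tmax.nc",
              mode = "wb")
local_tnc <- tidync("data/terraclimate_tmax.nc")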

5.2 E-OBS

The E-OBS data set (Cornes et al. 2018) covers Europe plus nearby regions, including land bordering the Mediterranean Sea to the south and east. Data are daily, based on measurements at thousands of weather stations, and extend at the time of writing from 1920-01-01 to 2023-07-30. These data have estimates of uncertainty for each variable, grid point and time point. The maximum spatial resolution available is 1/10 degree for both latitude and longitude. Download of data is free but requires registration, and the data are restricted to non-commercial use. In this case, after logging in to the web site one can download the NetCDF files to a local disk (Video: XXXXX). Once downloaded, as a second step, one can access the file as shown above for TerraClimate data. The most reliable data are from 1950 to the present, and at 1/10 degree resolution. Files contain data for a single variable, and are large (6 to 15 GB each). Estimates of uncertainty are also in individual files, one per variable.

We can adjust the code shown above to work with these files. The first step is to use a file path instead of a URL. When exploring the properties of the file we can notice some differences, even in the names of the grid variables and in how time is encoded (the starting date is different).
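One way to check how time is encoded is to read the attributes of the time variable. The sketch below uses package ‘ncdf4’ (not used elsewhere on this page) and the locally saved file used in the next example.

Code
library(ncdf4)

# read how the 'time' grid variable is encoded in the local E-OBS file;
# the units attribute gives the origin, e.g. "days since 1950-01-01"
nc <- nc_open("data/tg_ens_mean_0.25deg_reg_v27.0e.nc")
ncatt_get(nc, "time", "units")
ncatt_get(nc, "tg", "_FillValue") # marker used for missing values, if set
nc_close(nc)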

OPeNDAP access is available at http://opendap.knmi.nl/knmi/thredds/dodsC/ through a THREDDS server maintained by the Royal Netherlands Meteorological Institute (KNMI), but the version currently available is two years behind the EU's Copernicus server. The file catalog.html lists the data sets; for example, the link ending in e-obs_0.25regular/tg_0.25deg_reg_v17.0.nc gives access to the variable tg, the daily mean air temperature.
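Assuming the catalogue path given above, remote access would look much like the TerraClimate example. This is an untested sketch; the data set version in the URL may have changed.

Code
# sketch only: open the KNMI OPeNDAP version of the E-OBS 'tg' data set;
# the path is taken from catalog.html and the version may have changed
library(tidync)

knmi_url <- paste0("http://opendap.knmi.nl/knmi/thredds/dodsC/",
                   "e-obs_0.25regular/tg_0.25deg_reg_v17.0.nc")
eobs_remote_tnc <- tidync(knmi_url)
print(eobs_remote_tnc)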

Code
# We will use a local file downloaded from the web page.
# This file is nearly 800 MB in size, but online queries are cumbersome for
# this server.

eobs_tnc <- tidync("data/tg_ens_mean_0.25deg_reg_v27.0e.nc")

activate(eobs_tnc, "tg")

Data Source (1): tg_ens_mean_0.25deg_reg_v27.0e.nc ...

Grids (4) <dimension family> : <associated variables> 

[1]   D1,D2,D0 : tg    **ACTIVE GRID** ( 2486698032  values per variable)
[2]   D0       : time
[3]   D1       : longitude
[4]   D2       : latitude

Dimensions 3 (all active): 
  
  dim   name      length   min     max start count  dmin    dmax unlim coord_dim 
  <chr> <chr>      <dbl> <dbl>   <dbl> <int> <int> <dbl>   <dbl> <lgl> <lgl>     
1 D0    time       26663   0   26662       1 26663   0   26662   TRUE  TRUE      
2 D1    longitude    464 -40.4    75.4     1   464 -40.4    75.4 FALSE TRUE      
3 D2    latitude     201  25.4    75.4     1   201  25.4    75.4 FALSE TRUE      
Code
print(eobs_tnc)

Data Source (1): tg_ens_mean_0.25deg_reg_v27.0e.nc ...

Grids (4) <dimension family> : <associated variables> 

[1]   D1,D2,D0 : tg    **ACTIVE GRID** ( 2486698032  values per variable)
[2]   D0       : time
[3]   D1       : longitude
[4]   D2       : latitude

Dimensions 3 (all active): 
  
  dim   name      length   min     max start count  dmin    dmax unlim coord_dim 
  <chr> <chr>      <dbl> <dbl>   <dbl> <int> <int> <dbl>   <dbl> <lgl> <lgl>     
1 D0    time       26663   0   26662       1 26663   0   26662   TRUE  TRUE      
2 D1    longitude    464 -40.4    75.4     1   464 -40.4    75.4 FALSE TRUE      
3 D2    latitude     201  25.4    75.4     1   201  25.4    75.4 FALSE TRUE      
Code
# we save the limits to variables
my.lon <- c(-30, -10)
my.lat <- c(55, 65)
my.period <- as.integer(ymd(c("2020-03-01", "2021-09-01")) - ymd("1950-01-01"))

hyper_tibble(eobs_tnc,
             longitude = is_within(longitude, my.lon),
             latitude = is_within(latitude, my.lat),
             time = is_within(time, my.period),
             na.rm = FALSE) %>%
  # convert time from character to a date
  mutate(date = ymd(time)) %>%
  select(-time) -> grid_area_in_time.tb

sum(!is.na(grid_area_in_time.tb$tg))
[1] 43082
Code
sum(is.na(grid_area_in_time.tb$tg))
[1] 1710518
Code
print(na.omit(grid_area_in_time.tb))
# A tibble: 43,082 × 4
       tg longitude latitude date      
    <dbl> <chr>     <chr>    <date>    
 1  0.760 -15.125   64.375   2020-03-02
 2  1.38  -14.875   64.375   2020-03-02
 3 -0.620 -14.875   64.625   2020-03-02
 4  0.870 -14.625   64.625   2020-03-02
 5  1.18  -14.375   64.625   2020-03-02
 6  0.230 -14.375   64.875   2020-03-02
 7  0.920 -14.125   64.875   2020-03-02
 8  1.25  -13.875   64.875   2020-03-02
 9  0.220 -15.125   64.375   2020-03-03
10  0.800 -14.875   64.375   2020-03-03
# ℹ 43,072 more rows

From here onwards we can continue as shown above for TerraClimate, but replacing the starting date for time, ymd("1900-01-01"), by ymd("1950-01-01"), and the missing-value marker 32768 by 9999. These values are taken from the metadata we printed with the code above.
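A minimal sketch of that adaptation, applied to the tibble obtained above (the marker value 9999 is taken from the sentence above; check the file's metadata before relying on it):

Code
library(dplyr)

# replace the missing-value marker in the E-OBS daily mean temperature;
# the date column was already computed from 'time' in the pipeline above
grid_area_in_time.tb <- grid_area_in_time.tb %>%
  mutate(tg = ifelse(abs(tg) == 9999, NA, tg))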

5.3 NASA POWER

NASA POWER hosts data mainly intended for renewable-energy assessments, such as temperature, relative humidity, precipitation, solar radiation, clouds, wind speed and wind direction; in all, data for 161 parameters. The maximum spatial resolution available is 1/10 degree for both latitude and longitude. Geographic coverage is global, and hourly values are available in near real time. Gridded values are based on remote data acquired with satellite instruments. The available solar and weather data can be useful for other purposes in addition to renewable-energy assessments. These data can be accessed using R package ‘nasapower’, built on top of the online API. This data-set-specific package works differently from the approach above, returning a tibble without exposing NetCDF files to the user.

In the case of NASA POWER any valid longitude and latitude can be passed, so we do not need to define R code to search for the nearest available grid point as we did above. In addition, the value passed as the community parameter affects the units in which the values are returned. By passing "AG", for agro-climatology, we obtain the radiation in MJ m\(^{-2}\) d\(^{-1}\). Metadata are also retrieved and stored in the object containing the data.

Code
library(nasapower)

# long printout, run to see what is available
# 
# query_parameters(community = "ag",
#                  temporal_api = "hourly")

query_parameters(par = "T2M",
                 community = "ag",
                 temporal_api = "hourly")
$T2M
$T2M$type
[1] "METEOROLOGY"

$T2M$temporal
[1] "HOURLY"

$T2M$source
[1] "MERRA2"

$T2M$community
[1] "AG"

$T2M$calculated
[1] FALSE

$T2M$inputs
NULL

$T2M$units
[1] "C"

$T2M$name
[1] "Temperature at 2 Meters"

$T2M$definition
[1] "The average air (dry bulb) temperature at 2 meters above the surface of the earth."
Code
# we download the data for one grid point, Joensuu, Finland
daily_single_ag <- get_power(
  community = "ag",
  lonlat = c(29.76, 62.60),
  pars = c("RH2M", "T2M", "ALLSKY_SFC_SW_DWN"),
  dates = c("2023-06-01", "2023-06-30"),
  temporal_api = "daily"
)

# print the downloaded data
daily_single_ag
NASA/POWER CERES/MERRA2 Native Resolution Daily Data  
 Dates (month/day/year): 06/01/2023 through 06/30/2023  
 Location: Latitude  62.6   Longitude 29.76  
 Elevation from MERRA-2: Average for 0.5 x 0.625 degree lat/lon region = 111.91 meters 
 The value for missing source data that cannot be computed or is outside of the sources availability range: NA  
 Parameter(s):  
 
 Parameters: 
 RH2M                  MERRA-2 Relative Humidity at 2 Meters (%) ;
 T2M                   MERRA-2 Temperature at 2 Meters (C) ;
 ALLSKY_SFC_SW_DWN     CERES SYN1deg All Sky Surface Shortwave Downward Irradiance (MJ/m^2/day)  
 
# A tibble: 30 × 10
     LON   LAT  YEAR    MM    DD   DOY YYYYMMDD    RH2M   T2M ALLSKY_SFC_SW_DWN
   <dbl> <dbl> <dbl> <int> <int> <int> <date>     <dbl> <dbl>             <dbl>
 1  29.8  62.6  2023     6     1   152 2023-06-01  76.6  7.23              17.6
 2  29.8  62.6  2023     6     2   153 2023-06-02  63.1  6.65              19.7
 3  29.8  62.6  2023     6     3   154 2023-06-03  79.4  7.18              14.7
 4  29.8  62.6  2023     6     4   155 2023-06-04  79.4  9.26              13.4
 5  29.8  62.6  2023     6     5   156 2023-06-05  79.2 10.9               19.4
 6  29.8  62.6  2023     6     6   157 2023-06-06  69.3 11.4               24.2
 7  29.8  62.6  2023     6     7   158 2023-06-07  81.3 10.2               16.9
 8  29.8  62.6  2023     6     8   159 2023-06-08  84.1  9.96              12.2
 9  29.8  62.6  2023     6     9   160 2023-06-09  76.7  8.76              15.7
10  29.8  62.6  2023     6    10   161 2023-06-10  69.6 10.2               26.5
# ℹ 20 more rows
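The irradiance values returned are daily totals expressed in MJ m\(^{-2}\) d\(^{-1}\). As a small example of further processing (the new column name is arbitrary), they can be converted into mean irradiance in W m\(^{-2}\), noting that 1 MJ m\(^{-2}\) d\(^{-1}\) = 10\(^{6}\) J m\(^{-2}\) / 86400 s \(\approx\) 11.57 W m\(^{-2}\).

Code
library(dplyr)

# convert daily shortwave totals (MJ m-2 d-1) into mean irradiance (W m-2)
daily_single_ag <- daily_single_ag %>%
  mutate(SW_mean_Wm2 = ALLSKY_SFC_SW_DWN * 1e6 / 86400)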

We must be aware that near-real-time data from remote sensing are not subjected to as thorough a quality control as validated data made available with a longer delay.

5.4 Erythemal UV radiation and ozone column

We will import data from a file downloaded manually (using URL https://avdc.gsfc.nasa.gov/index.php?site=164609600&id=79) and corresponding to Ushuaia, Argentina. When we first download data from a server as a text file, we need to inspect it. Most files will have a header with metadata, which needs to be read separately from the data themselves. In this case, opening the file in a text editor, such as RStudio, Notepad or nano, we can see that lines from 51 onwards contain the data in tabular shape and that line 51 contains the variable (= column) names. From this we know that we need to separately read the first 50 lines as a header, and that for reading the data themselves we need to skip 50 lines at the top of the file. We save the header in the ‘comment’ attribute of the data frame.

Code
library(readr)
library(anytime)
my.file <- "data/aura_omi_l2ovp_omuvb_v03_ushuaia.txt"
header <- readLines(my.file, n = 50)
omiuvb.tb <- read_table(file = my.file,
                        skip = 50,
                        col_types = cols(.default = col_double(),
                                         Datetime = col_character(),
                                         DOY = col_integer(),
                                         Orbit = col_character()))
comment(omiuvb.tb) <- header

We will most likely be interested in a few of the columns rather than in all of them. We can inspect the header of the file itself, or the documentation of the data set, to identify the columns containing the data we need, and then select these columns and study the description of these fields, as shown below.

Code
# cat(grkmisc::pretty_string(comment(omiuvb.tb), wrap_at = 70, truncate_at = Inf), sep =  "\n")
colnames(omiuvb.tb)
 [1] "Datetime"   "MJD2000"    "Year"       "DOY"        "sec.(UT)"  
 [6] "Orbit"      "CTP"        "Lat."       "Lon."       "Dis"       
[11] "SZA"        "VZA"        "GPQF"       "OMAF"       "OMQF"      
[16] "UVBQF"      "OMTO3_O3"   "CSEDDose"   "CSEDRate"   "CSIrd305"  
[21] "CSIrd310"   "CSIrd324"   "CSIrd380"   "CldOpt"     "EDDose"    
[26] "EDRate"     "Ird305"     "Ird310"     "Ird324"     "Ird380"    
[31] "OPEDRate"   "OPIrd305"   "OPIrd310"   "OPIrd324"   "OPIrd380"  
[36] "CSUVindex"  "OPUVindex"  "UVindex"    "LambEquRef" "SufAlbedo" 
[41] "TerrHgt"   
Code
column_selector <-  c("Year", "DOY", "OMTO3_O3", "EDDose", "UVBQF", "OMQF")
pattern <- paste(column_selector, collapse = "|")
header[grepl(pattern, header)]
[1] "Year      : Year"                                        
[2] "DOY       : Day Of Year"                                 
[3] "OMQF      : OMTO3 Quality Flags (dimensionless)"         
[4] "UVBQF     : Quality Flags on Pixel Level (dimensionless)"
[5] "OMTO3_O3  : Total column ozone (DU)"                     
[6] "CSEDDose  : Clear Sky Erythemal Daily Dose (J/m^2)"      
[7] "EDDose    : Erythemal Daily Dose (J/m^2)"                
Code
omiuvb.tb[ , column_selector]
# A tibble: 9,508 × 6
    Year   DOY OMTO3_O3 EDDose UVBQF  OMQF
   <dbl> <int>    <dbl>  <dbl> <dbl> <dbl>
 1  2004   275     282.   447. 20480     0
 2  2004   275     295.   875. 20482 32768
 3  2004   276     317.  1846. 20480     0
 4  2004   276     317.  1438. 20480     0
 5  2004   277     323.  1552. 20482 36864
 6  2004   277     321.  1340. 20480     0
 7  2004   278     320.  1631. 20480     0
 8  2004   278     320.  1650. 20480     0
 9  2004   279     297.  1385. 20480     0
10  2004   279     288.  1433. 20482 36864
# ℹ 9,498 more rows

These data originate from satellite observations and are based on cloudiness and aerosols as observed during overpass. Ozone column values are more reliable than erythemal UV doses, as ozone column thickness varies very little between overpasses while cloudiness can vary much more; EDDose can be overestimated by up to 30%. It is important to read the data product documentation before using the data and to understand the meaning of the quality flags, as some observations may need to be discarded (a sketch of such filtering is shown below). Estimates are especially unreliable when there is snow or ice cover on the ground, as it affects estimates of cloudiness.
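A purely illustrative sketch of such filtering follows; the flag value used is a placeholder, and the actual meaning of the UVBQF and OMQF values must be looked up in the OMI UVB product documentation.

Code
library(dplyr)

# hypothetical filtering on quality flags: keep only rows whose flags match
# the values documented as "good"; the value 0 below is a placeholder only
good_quality.tb <- omiuvb.tb %>%
  filter(OMQF == 0)
nrow(good_quality.tb)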

5.5 Other data sources

The European Centre for Medium-Range Weather Forecasts (ECMWF) is another source of useful data. For example, the “ECMWF Reanalysis v5”, or ERA5, is a reanalysis of world-wide weather going back all the way to 1950. These data rely more strongly on modelling, but the highest temporal resolution available is hourly. The file format used is not NetCDF but, instead, GRIB.

The National Oceanic and Atmospheric Administration (NOAA) in the USA has for many years provided open access to data. Data sets with both national and world-wide coverage are hosted. The R package ‘rNOMADS’ facilitates access to these data.

Various national weather services also make at least some data available on-line as part of the recent emphasis on “open government”. For example, for Spain, precipitation data with less uncertainty than those in E-OBS are available. In the case of Finland, historical data from meteorological stations are among the data sets available on-line through an API giving access to several variables from FMI. In this case there is a well-documented R package which has not yet been published in CRAN but can be installed from a Git repository hosted at GitHub.

NASA maintains a metadata catalogue for Earth Observation Data. At this site there is a link “Find Data” allowing searches. Searching the catalogue can be overwhelming unless one starts by limiting the scope of the data sources to search, say by type of instrument or some other broad criterion.

The CEOS International Directory Network (IDN) can be also of help. For example, searching for “ozone layer” and then selecting “OMI” as instrument provides a list of useful data sets of manageable length.

NASA’s World View makes it possible to interactively explore data sets graphically as “layers” on a world map, but also provides links to download the same data as numerical values. Some data are also available through the Google Earth Engine after creating a user account.

The UNdata site gives access to world-wide and country statistics including the data from the World Meteorological Organization.

Although off-topic, it is worthwhile mentioning that other types of data useful in our research are also becoming available on-line. These data sets include several related to species distributions, phenology and other data relevant to ecological research. In many cases, specific R packages are available for the retrieval and import of the data. For R users it is worthwhile keeping a close eye on the rOpenSci web site and on their catalogue of peer-reviewed R packages.

Few researchers in our field, including myself, extract all the information contained in the data they collect. Fewer extract all the information obtainable by combining the data they collect with data about the context in which they do research. Even fewer quantitatively combine data over long periods of time, across research groups, institutions, countries, continents, etc. In contrast, meteorologists do it all the time: they extract information about very slow rates of change against background variation that is comparatively huge. How do they achieve this? Extreme care about methods: calibrations, cross-calibrations, and all sorts of validations, cross-validations and inter-comparisons. What does this mean in practice? With careful computations and further validation they can produce long time series of data covering most of our planet, based on measurements done by many thousands, possibly even millions, of different observers, sometimes using different methods, over hundreds of years! This is the source of the data we have now in our hands, for free. And how valuable this is!

I see this as the most cogent demonstration of what can be achieved when most researchers or “data collectors” in a field aim consistently over a long time at what we now call “reproducible research”. In biology, we can seldom even get compatible results from a repeat of a single experiment. The complexity of biological systems is no excuse for this, as long as we do not even try to have enough replicates, document methods in enough detail and spend enough time and effort validating and calibrating methods against each other. I think it is a good time for biological research to catch up and follow the meteorologists’ example. To achieve this, we need to get researchers in our field to think about the long-term value of the data we produce and how to enhance it. This is made difficult when the evaluation of our productivity is so often based on the most recent five years and on the newsworthiness of individual papers rather than on the long-term impact and reliability of experimental design and methods.

The World Meteorological Organization (WMO) has encouraged open sharing of data and consistency of methods since its creation as the International Meteorological Organization in 1873. In my opinion, this demonstrates that open availability of original data, allowing comparison and cross-checking, can be a strong incentive for the adoption of reproducible research approaches. As demonstrated by meteorological observations, the value added by generating data and their metadata in ways that allow the build-up of large data sets by combining a multitude of individual sources can be enormous. Just how could we have dealt with, or even detected, global climate change without consistently collected data from well before we realized that the climate is changing?

Considering the shorter time frame of the current pandemic, the fact that useful data are becoming so easily available opens an opportunity for keeping us busy. We can extract new information from these freely available data sets, also by combining them with the data we have earlier collected ourselves. So even if we are confined at home because of the COVID epidemic, we can do original research!

7 Acknowledgements

I warmly thank Titta Kotilainen for the discussions and thoughtful comments that helped me write and revise this column. I also thank Anders Lindfors for introducing me to the use of some of these sources of information and Heikki Hänninen for sharing a manuscript from his research group, a manuscript that further awoke my curiosity about open data. This last step made the “energy level” surpass my activation threshold and got me playing with these databases, updating an old computer and downloading many gigabytes of climate, weather, phenological and other data. What started as tinkering several weeks ago ended up providing useful context data for three manuscripts that I am working on at the moment, and ideas for a couple of future articles! Not bad at all, and so an experience worth sharing with our readers.

8 References

Aphalo, Pedro J. 2020. Learn R: As a Language. The R Series. CRC/Taylor & Francis Ltd. https://www.learnr-book.info/.
Arola, A., S. Kazadzis, A. Lindfors, N. Krotkov, J. Kujanpää, J. Tamminen, A. Bais, et al. 2009. “A New Approach to Correct for Absorbing Aerosols in OMI UV.” Geophysical Research Letters 36 (22). https://doi.org/10.1029/2009gl041137.
Bai, Jinshun, Xinping Chen, Achim Dobermann, Haishun Yang, Kenneth G. Cassman, and Fusuo Zhang. 2010. “Evaluation of NASA Satellite- and Model-Derived Weather Data for Simulation of Maize Yield Potential in China.” Agronomy Journal 102 (1): 9–16. https://doi.org/10.2134/agronj2009.0085.
Cornes, Richard C., Gerard van der Schrier, Else J. M. van den Besselaar, and Philip D. Jones. 2018. “An Ensemble Version of the e-OBS Temperature and Precipitation Data Sets.” Journal of Geophysical Research: Atmospheres 123 (17): 9391–9409. https://doi.org/10.1029/2017jd028200.
Kosmopoulos, P. G., S. Kazadzis, A. W. Schmalwieser, P. I. Raptis, K. Papachristopoulou, I. Fountoulakis, A. Masoom, et al. 2021. “Real-Time UV-Index Retrieval in Europe Using Earth Observation Based Techniques and Validation Against Ground-Based Measurements.” Atmospheric Measurement Techniques Discussions 2021: 1–65. https://doi.org/10.5194/amt-2020-506.
Parisi, Alfio V., Damien Igoe, Nathan J. Downs, Joanna Turner, Abdurazaq Amar, and Mustapha A. A Jebar. 2021. “Satellite Monitoring of Environmental Solar Ultraviolet a (UVA) Exposure and Irradiance: A Review of OMI and GOME-2.” Remote Sensing 13 (4): 752. https://doi.org/10.3390/rs13040752.