This month, the National Oceanic and Atmospheric Administration (NOAA), the National Aeronautics and Space Administration (NASA), and the tech startup Brightband are releasing the first version of a new observational archive formatted for training AI models (NNJA-AI v01). This AI-ready archive builds on the NOAA-NASA Joint Archive (NNJA) of observations for reanalysis, which includes observations of the atmosphere, ocean, ice, and land from 1979 to near real time and is freely available in legacy formats from the archive hosted on Amazon Web Services (AWS). The NNJA is a unique contribution to reanalysis science that resulted from a three-year collaboration between the NOAA Physical Sciences Laboratory (PSL), the NOAA Environmental Modeling Center (EMC), and the NASA Global Modeling and Assimilation Office (GMAO). The NNJA project emerged as a direct response to workshop findings and was sponsored for the last three years through coordinated funding from NASA, the NOAA National Environmental Satellite, Data, and Information Service (NESDIS), and the Weather Program Office (WPO) of NOAA’s Office of Oceanic and Atmospheric Research (OAR). This blog post describes the journey that the government labs and the tech startup undertook to provide this high-value dataset to the general public in a contemporary, AI-ready format.
Creating a Curated NNJA dataset: Central role of Reanalysis
Over the past two centuries – and particularly during the satellite era of the past fifty years – scientists have collected vast amounts of observations of the Earth system. But piecing these observations together to answer critical questions about how our weather works can be extremely challenging. To help achieve this understanding, scientists combine these historical observations with physics-based models to create “reanalysis” products. Unlike the individual datasets, such as those from weather balloons or satellites, reanalysis products provide a continuous and physically consistent, movie-like playback of many different important variables such as temperature, humidity, winds, and precipitation for the entire atmosphere.
The majority of historical Earth observations used to create a reanalysis are available from public agencies free of charge and are distributed through archives managed by national weather and satellite agencies. While these archives are accessible, variations in file formats, quality control, and continuity of records present major barriers to the public’s ability to use these invaluable data for either analysis or training of AI models. Developing a homogeneous observational record from these data is a major undertaking that requires years of testing and curation. This curation is also a necessary step for creating modern gridded reanalysis datasets such as NASA’s Modern-Era Retrospective analysis for Research and Applications version 2 (MERRA-2), NOAA’s Climate Forecast System Reanalysis (CFSR), or the European Centre for Medium-Range Weather Forecasts’ (ECMWF) Reanalysis version 5 (ERA5).
NOAA and NASA are pioneers in the creation of reanalysis datasets, starting with NASA GEOS-1 in 1993 and NCEP/NCAR R1 in 1995. The observational data used for these early reanalyses have since been reused and refined in the production of modern reanalyses such as NASA MERRA (2008), NOAA CFSR (2010), NASA MERRA-2 (2017), and ECMWF ERA5 (2020). While informal exchange of observational data between reanalysis centers is common, these data usually reside behind institutional firewalls and are inaccessible to the wider public.
In 2022, a workshop on the future of U.S. reanalysis efforts identified several strategies to expedite the development and production of the next generation of reanalysis. Creating a joint, curated, quality-controlled archive of observations, developed in collaboration between NOAA and NASA, was identified as the top opportunity for enabling next-generation Earth system and regional atmospheric reanalysis.
To create the NNJA dataset, NOAA extracted historical archives of observations from deep tape storage and made them readily available through Amazon Web Services (AWS). The historical archives included observations prepared for the production of the NOAA CFSR reanalysis (1979 to 2010), the historical archive of operational observational files (2010 to present), the archive of ocean and ice observations used in production of the NOAA marine reanalysis (1979 to 2022), and the archive of snow observations used in production of the Unified Forecast System (UFS)-replay dataset. NASA contributed reprocessed data for the most recent MERRA-21C reanalysis, including ozone data and microwave imager data. NASA also provided knowledge about the quality of the historic satellite record in the form of so-called black and white lists. Beyond the NOAA and NASA archival observations, NNJA also includes reprocessed observations from European partners, including reprocessed GPS radio occultations and retrievals of cloud motion vectors.
Part of creating NNJA involved reconciling the observational records at NOAA and NASA. During this process, we found that most of the files in the archives are bit-wise identical, as they originated from the NOAA operational data stream. Bit-wise non-identical file pairs contained data generated using different processing steps, resulting in slight differences in file size and observation count. If the difference was scientifically neutral, we retained the NOAA version; if it was scientifically meaningful, both the NOAA and NASA versions were included in the NNJA archive. A great example of the latter case is the data from the Atmospheric Infrared Sounder (AIRS) and Advanced Microwave Sounding Unit-A (AMSU-A) sensors onboard NASA’s Aqua satellite, with the NASA stream offering better data coverage.
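The bit-wise comparison step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling the NNJA team used: the function names `file_digest` and `classify_pair` are hypothetical, and the sketch only separates identical pairs from differing ones. Deciding whether a difference is scientifically neutral or meaningful still requires expert review.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_pair(noaa_file: Path, nasa_file: Path) -> str:
    """Label a file pair as 'identical' or 'differs' (hypothetical helper).

    A size mismatch settles the question cheaply; otherwise the
    content digests are compared.
    """
    if noaa_file.stat().st_size != nasa_file.stat().st_size:
        return "differs"
    if file_digest(noaa_file) == file_digest(nasa_file):
        return "identical"
    return "differs"
```

Pairs labeled "differs" would then go to a manual review, which decides whether to keep only the NOAA version or both versions.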
Diving Deeper into NNJA
NNJA includes a comprehensive archive of observations of the Earth’s components, including atmosphere, ocean, ice, and land. Figure 1 shows the number of observations in NNJA (on a log10 scale) for major components of the Earth observation system.
A complete inventory of the observations shared through NNJA is maintained on the NNJA Observations for Earth System Reanalysis webpage. Figure 2 below provides a quick overview of the chronology of satellite radiance sensors in the archive that form the core of the reanalysis record. The satellite sensor coverage in NNJA is comprehensive and comparable to the record used in modern reanalyses such as ERA5. NNJA is focused on the modern satellite observational record, which began with the launch of the NOAA Television Infrared Observation Satellite – N series (TIROS-N) satellite in October of 1978.
Figure 2: Observation coverage of atmospheric satellite data in NNJA. For up-to-date coverage information, see the NNJA webpage.
To enable public release of the data, we had to be cognizant of “restricted data” that have varying limitations on data sharing. Some datasets have restrictions that expire a short time after the data are collected. To honor these restrictions, we publish our operational (real-time) data once a day with a 72-hour delay. In some cases, there are permanent restrictions on data sharing; such data are permanently removed from the archive before it is published. For access to the operational data stream within the first 72 hours, users can switch to the NOAA NOMADS server, which provides more granular control over data restrictions and serves real-time operational data in the same file format as NNJA. However, unlike NNJA, the NOMADS server does not provide a continuous, long-term archive of observational data.
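The publication rule described above amounts to a simple check, sketched here in Python. The function name `is_publishable` and the `permanently_restricted` flag are hypothetical and only illustrate the logic; the actual publishing pipeline and the way restrictions are recorded are not described in this post.

```python
from datetime import datetime, timedelta, timezone

# The 72-hour delay stated in the text.
EMBARGO = timedelta(hours=72)


def is_publishable(
    obs_time: datetime,
    now: datetime,
    permanently_restricted: bool = False,
) -> bool:
    """Return True if an observation may be published at time `now`.

    Permanently restricted data are never published; everything else
    becomes publishable once it is at least 72 hours old.
    """
    if permanently_restricted:
        return False
    return now - obs_time >= EMBARGO


# Example with an arbitrary reference time.
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
recent = now - timedelta(hours=24)
print(is_publishable(recent, now))  # still inside the embargo window
```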
In addition to the core observations, NNJA also includes so-called black and white lists, which specify known periods when a satellite’s sensors had degraded performance, as well as observational errors for periods when a sensor is deemed operational. This information is critical for users applying NNJA data in a variety of scientific projects.
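As an illustration of how such a list might be applied, the sketch below screens observations against a list of degraded-performance periods. Everything in it is an assumption made for the example: the `BlacklistEntry` structure, the `is_blacklisted` function, and the sensor names and dates are invented and do not reflect the actual format or contents of the NNJA lists. The sketch covers only the screening by time period, not the observational error information.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class BlacklistEntry(NamedTuple):
    """One degraded-performance period (hypothetical layout)."""
    sensor: str
    satellite: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def is_blacklisted(
    sensor: str,
    satellite: str,
    obs_time: datetime,
    blacklist: Iterable[BlacklistEntry],
) -> bool:
    """Return True if the observation falls inside any listed period."""
    return any(
        entry.sensor == sensor
        and entry.satellite == satellite
        and entry.start <= obs_time < entry.end
        for entry in blacklist
    )


# Made-up entry, for illustration only.
blacklist = [
    BlacklistEntry("sensor_a", "sat_1", datetime(2000, 1, 1), datetime(2000, 2, 1)),
]
print(is_blacklisted("sensor_a", "sat_1", datetime(2000, 1, 15), blacklist))
```

A user would typically drop or down-weight observations for which this check returns True before using them for analysis or training.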
The archive of the legacy NNJA files is hosted on an AWS S3 bucket with public access. Egress-free access to the data is sponsored by AWS and is facilitated by the NOAA Open Data Dissemination (NODD) program.
From legacy to AI-ready

Although the NNJA archive is comprehensive, working with the data it contains can be challenging. The vast majority of the data are distributed in a special-purpose binary format called Binary Universal Form for the Representation of meteorological data (BUFR), which was developed by the World Meteorological Organization (WMO) in the late 1980s. Although some weather and climate modeling software can read and work with BUFR-format data, the format is incompatible with contemporary big data frameworks and toolkits used widely beyond the meteorology community. It also makes the data difficult to integrate with machine learning or AI applications.
To overcome these challenges, the teams at Brightband and NOAA sought to reformat the NNJA dataset into something more suitable for working with today’s large-scale data processing tools (Figure 3). To do this, they first developed proprietary software to efficiently read BUFR data and serialize it to a collection of Apache Avro-formatted records. Data archived in BUFR format are intrinsically “record-oriented” – think the rows of a spreadsheet or more generally a database. Avro preserves this structure, but provides a modern set of software tools for working with extremely large collections of records (easily handling trillions of them). A data scientist today can use cloud-based data processing tools like Google Cloud BigQuery or Apache Spark to analyze such data with ease.
But the team opted to take things one step further. Many applications – such as pulling out data for feeding into the training loops of a machine learning system – require filtering on a few attributes of the data and then reading just a handful of columns. Oftentimes, huge swaths of the archive are simply unneeded for these workflows. To empower this common access pattern, the team further re-processed the data into a “column-oriented” format called Apache Parquet. When the data are chronologically sorted and written to Parquet, end users can rapidly filter and query the different sensors in the NNJA dataset over small time windows, and extract portions of the archive (say, a few channels of brightness temperature from one sensor flying on several satellites) with very simple code.
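The benefit of a time-sorted, column-oriented layout can be shown with a small pure-Python sketch. It does not use Parquet itself and is not the NNJA-AI implementation; the helper names `to_columns` and `query`, the column names, and the values are all invented for the example. The idea is that once records are stored as one array per column and sorted by time, a time-window query needs only a binary search on the time column and a slice of the requested columns.

```python
from bisect import bisect_left


def to_columns(rows, sort_key="time"):
    """Turn row-oriented records into time-sorted arrays, one per column."""
    ordered = sorted(rows, key=lambda row: row[sort_key])
    names = list(ordered[0].keys()) if ordered else []
    return {name: [row[name] for row in ordered] for name in names}


def query(columns, start, end, wanted, sort_key="time"):
    """Return only the `wanted` columns for start <= time < end."""
    times = columns[sort_key]
    lo = bisect_left(times, start)  # first index with time >= start
    hi = bisect_left(times, end)    # first index with time >= end
    return {name: columns[name][lo:hi] for name in wanted}


# Made-up records: time in hours, two brightness temperature channels.
rows = [
    {"time": 3, "satellite": "sat_b", "bt_ch1": 251.0, "bt_ch2": 230.5},
    {"time": 1, "satellite": "sat_a", "bt_ch1": 250.0, "bt_ch2": 229.0},
    {"time": 2, "satellite": "sat_a", "bt_ch1": 250.5, "bt_ch2": 229.5},
]
columns = to_columns(rows)
print(query(columns, start=1, end=3, wanted=["bt_ch1"]))
# {'bt_ch1': [250.0, 250.5]}
```

The `satellite` and `bt_ch2` columns are never touched by this query, which is the property that makes column-oriented storage attractive for training workflows that read only a few variables.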
Rewriting the NNJA archive to Parquet required defining a new schema or standard for the original BUFR-format data. The original BUFR schema for many sensors is very flexible but also highly complex, and requires users to have intimate knowledge of the actual data contents before they can read any of it. A major goal of the re-formatted archive was to empower more interactive, exploratory data analyses by a larger group of users – many of whom are not meteorologists. To aid with this, the Brightband team has built an open-source Python Software Development Kit (SDK), which allows users to interact with and query the contents of the archive using a simple catalog. Once users have identified the data they want to use for an analysis, the SDK lets them easily pull it into common Python analytics tools, or re-write the data to an output format of their choice.
In the first release of the re-formatted archive, NNJA-AI v1, the team has reprocessed a large collection of both satellite (including microwave sounders, infrared sounders, and geostationary imagery) and conventional (surface weather stations, weather balloons, and aircraft observations) data, spanning their complete online histories (see Figure 4 for illustration). The data are hosted in a publicly accessible Google Cloud Platform storage bucket, coordinated with the NODD team, and a replica will be available on Amazon Web Services shortly after launch.
Figure 4: The first version of the NNJA-AI archive converts a subset of the full NNJA archive into AI-ready formats that are all accessible through a single Python-based Application Programming Interface (API). See, for example, this notebook.
What is next for NNJA and NNJA-AI?
The original NNJA collaboration between NOAA and NASA concluded in the spring of 2025. At the conclusion of this project, the NNJA archive will enter its maintenance phase, driven by the needs of the NOAA and NASA reanalysis projects. This includes continuous archival of the operational data streams at NOAA, and continuous curation, testing, and repair of the existing archival record. It is likely that the reanalysis efforts at NOAA and NASA will also motivate an update of the historic observational record with modern, re-processed versions of the same observations (for example, as used in the ECMWF ERA6 reanalysis). The collaboration between the NOAA Office of Oceanic and Atmospheric Research (OAR) and Brightband is also planned to continue and will focus on expanding the list of sensors and the duration of the record on the NNJA-AI website.
The NNJA and NNJA-AI projects demonstrated a new way to curate and distribute high-quality observational data about the Earth system. NNJA provides a continuously evolving, homogenized observational record with the dual goal of reproducibility and continuing improvement. Complementary to NNJA, NNJA-AI provides a modern access pattern to the historic observations that is revolutionizing data access for training of novel AI models and for traditional numerical weather prediction applications.
What would a possible future for NNJA and NNJA-AI look like? Requests from early users indicate that tighter integration of NNJA and NNJA-AI with the producers of satellite observations (including re-processed observations), such as the European Organization for the Exploitation of Meteorological Satellites (EUMETSAT) and NESDIS, would be welcomed. In such a collaboration, satellite data producers could directly publish their data in a format compatible with NNJA/NNJA-AI, making it readily accessible to a wide variety of customers. A similar collaboration with producers of real-time data could enable seamless integration between the historic observational archive and the development and delivery of innovative real-time products.