GI for Sustainability

GI & Sustainability III

5708.3 - BigQuery: Parallel Query Analytics of Big Climate Data

Wednesday, July 5
4:50 PM - 5:10 PM
Location: Maryland C

Observations, model simulations, and reanalyses produce vast amounts of climate data. Investigating global problems such as climate change, natural disasters, diseases, and other environmental issues requires analysing these data efficiently. However, this requirement poses grand challenges due to the unprecedented data volume and the intrinsic complexity of geospatial statistics and analysis. Addressing these challenges requires efficient data management strategies, complex parallel algorithms, and scalable computing resources. Existing solutions are not sufficient because they 1) lack flexibility: most climate analysis tools (web-based or standalone) only allow users to conduct analysis with pre-built, non-customizable functions and operations; and 2) have limitations in handling multi-dimensional, large-scale climate data, since they lack seamless integration of parallel computing and geospatial analytical functions.

To tackle these challenges and bridge the gap, we propose BigQuery, a parallel query analytical framework for big climate data that leverages Hive and cloud computing technologies. BigQuery enables large-scale climate data to be analysed in parallel through intuitive SQL-style queries, using a highly scalable Hadoop cluster as the parallel processing engine. Specifically, in this framework, 1) massive climate datasets are abstracted as a pool of grids. A grid here refers to a two-dimensional image in which each pixel represents the value of a specific climate variable at a specific spatial location and time. Such a grid-based abstraction offers an integrated space and time framework for managing, querying, and processing big climate data; 2) a novel Grid Transformation concept is proposed to view climate analysis from a new perspective: complex climate analysis can be conducted by applying a series of atomic grid transformations to a large grid pool (abstracted from big climate data). Four categories of atomic grid transformations are introduced: temporal, spatial, focal, and arithmetic. Numerous climate analyses (from basic spatiotemporal aggregations to more sophisticated spatiotemporal anomaly detection) can be created by combining different atomic transformation functions, chaining them in different orders, and using different spatiotemporal criteria; and 3) these atomic grid transformations are implemented as user-defined functions that are embedded in the SQL-style queries. The queries are executed in parallel as MapReduce jobs within a high-performance Hadoop environment, and the Hadoop Distributed File System (HDFS) is employed as scalable distributed storage that holds massive amounts of climate data in their original format (Li et al., 2016). The results are presented in different formats, including tables, charts, maps, and map animations.
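The idea of chaining atomic grid transformations can be illustrated with a minimal sketch. It is written in plain Python rather than as Hive user-defined functions, and it is not the framework's actual implementation: the `Grid` class, all function names, and the demo values are hypothetical and chosen only for illustration. The sketch shows one transformation from each of the four categories (temporal, spatial, focal, arithmetic) and combines them into a simple anomaly calculation.

```python
from dataclasses import dataclass
from typing import Callable, List

Matrix = List[List[float]]


@dataclass(frozen=True)
class Grid:
    """Hypothetical grid: one 2-D slice of one variable at one time step."""
    variable: str
    year: int
    month: int
    values: Matrix  # rows = latitude index, columns = longitude index


def temporal_select(pool: List[Grid], keep: Callable[[Grid], bool]) -> List[Grid]:
    """Temporal transformation: keep grids whose time stamp passes `keep`."""
    return [g for g in pool if keep(g)]


def spatial_subset(pool: List[Grid], r0: int, r1: int, c0: int, c1: int) -> List[Grid]:
    """Spatial transformation: clip each grid to rows [r0, r1), columns [c0, c1)."""
    return [Grid(g.variable, g.year, g.month,
                 [row[c0:c1] for row in g.values[r0:r1]]) for g in pool]


def temporal_mean(pool: List[Grid]) -> Matrix:
    """Temporal aggregation: cell-wise mean over all grids in the pool."""
    n = len(pool)
    rows, cols = len(pool[0].values), len(pool[0].values[0])
    return [[sum(g.values[r][c] for g in pool) / n for c in range(cols)]
            for r in range(rows)]


def focal_mean(values: Matrix) -> Matrix:
    """Focal transformation: mean of each cell's 3x3 window, clipped at edges."""
    rows, cols = len(values), len(values[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [values[i][j]
                      for i in range(max(0, r - 1), min(rows, r + 2))
                      for j in range(max(0, c - 1), min(cols, c + 2))]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def subtract(a: Matrix, b: Matrix) -> Matrix:
    """Arithmetic transformation: cell-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def july_anomaly(pool: List[Grid], year: int) -> Matrix:
    """Chain the transformations: July of `year` minus the all-year July mean."""
    julys = temporal_select(pool, lambda g: g.month == 7)
    climatology = temporal_mean(julys)
    target = temporal_mean(temporal_select(julys, lambda g: g.year == year))
    return subtract(target, climatology)


def demo_pool() -> List[Grid]:
    """Tiny made-up grid pool (2x2 cells, three years, two months per year)."""
    july = {2008: [[20.0, 21.0], [22.0, 23.0]],
            2009: [[21.0, 22.0], [23.0, 24.0]],
            2010: [[25.0, 26.0], [27.0, 28.0]]}
    pool = [Grid("tsurf", y, 7, v) for y, v in july.items()]
    pool += [Grid("tsurf", y, 1, [[0.0, 0.0], [0.0, 0.0]]) for y in july]
    return pool


if __name__ == "__main__":
    print(july_anomaly(demo_pool(), 2010))  # [[3.0, 3.0], [3.0, 3.0]]
```

Because every function takes and returns either a grid pool or a single grid, the steps can be reordered or extended, for example by applying `spatial_subset` before `july_anomaly` to restrict the analysis to a region, or `focal_mean` afterwards to smooth the result. In the framework described above, comparable steps would be expressed as user-defined functions inside SQL-style queries and executed in parallel on the cluster.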

To demonstrate the feasibility and performance of BigQuery, a proof-of-concept prototype was implemented and tested using the MERRA-Land data product (Rienecker et al., 2011). The prototype is deployed on a Hadoop cluster of fourteen nodes connected by 1 Gbps Ethernet. Each node is configured with 12 physical CPU cores (2.35 GHz) and 20 GB of RAM. Over 30 atomic grid transformation functions were developed and integrated into the system. Experimental results show that the proposed framework can support various data-intensive climate applications, such as exploring how urbanization may affect temperature trends by analysing the spatiotemporal variability of surface temperature. While climate data are used as a demonstration, this research sheds light on potential solutions to contemporary data challenges in a variety of applications beyond climate studies.

Zhenlong Li

Assistant Professor
University of South Carolina

I am an Assistant Professor in the Department of Geography at the University of South Carolina. I received my Ph.D. in Earth Systems and Geoinformation Sciences from George Mason University (GMU) in 2015. Previously, I received my B.S. (2006) in GIS from Wuhan University and my M.S. (2010) in Earth System Science from GMU.

My primary research field is GIScience, with a focus on spatial computing, big data mining, and geospatial cyberinfrastructure. My research aims to accelerate geospatial information extraction and thereby advance geospatial knowledge discovery in data- and computation-intensive geography and GIScience.

Qunying Huang

Gregory Carbone

Fei Hu

Horst Kremers

CODATA-Germany Chair, Berlin (Germany)
