Case Study overview
We will outline and explain in detail the implementation of a framework built on top of Spark to enable agile and iterative data discovery between legacy systems and new data sources generated by IoT devices.
The internet of things (IoT) is certainly bringing new challenges for data practitioners. It’s basically a network of connected objects (the “things”) that can communicate and exchange information. The concept is not new, however there is a significant shift in the list of objects that could be part of this network. Very soon, it will be nearly every object on the planet. There are already few applications emerging around the internet of things, such as smart cities, smart agriculture, demotic & home automation, smart industrial control…etc.
The data* for our case study contain various information such as:
• Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the Province of Trento.
• Measurements about temperature, precipitation and wind speed/direction taken in 36 Weather Stations.
• Amount of current flowing through the electrical grid of the Trentino province at specific instants.
The primary focus of the framework is not to collect in real-‐time the data from the devices although it can be extended to support such capability thanks to Spark.
The ultimate goal of the framework approach here is to speed up development effort to ingest new data sources available and reduce time to market for data consumers. It aims to avoid rewriting new scripts for every new data sources available and enables a team of data engineer to easily collaborate on a project using the same core engine.
Overview of the framework
The combination of Spark and Shell scripts enables seamless integration of the data. The data is first stored as parquet files in a staging area. It enables Spark to execute fast SQL queries for processing the data. These SQL queries are pre-‐defined in templates files that will be executed by the framework engine at the run time. Thus data engineers can focus on writing their transformation to extract meaningful information and let the framework engine process these queries. The results are stored in hive for data consumers. The key benefit of Spark here is to be able to execute various workloads. The framework engine can then be easily extended to support different types of operations in addition to processing SQL templates files.
Framework architecture & technologies
The framework is composed by 2 independents modules and one administration module that contains common functions. The choice has been made to create independents modules, as they achieve very different logical functions:
- Data collection: get the various data sources as they are available and stored the data in hdfs.
- Data integration: apply transformation to the data to extract meaningful information and store the results in hive.
The administration module contains common functions that are not directly related to the data collection and ingestion such as installation scripts, generic scripts to handle log message, global properties files…etc.
Data collection module
This module is designed to collect data from various data sources as they are available and stored them in Hadoop file system.
The data sources to be ingested are described in data catalog which contains each data source properties such as the type (local, aws, api..etc), connection properties..etc.
Data integration module
The data ingestion is realized in two steps:
Step 1: Collect all data available in the hdfs hub, convert them in parquet and save them in a staging folder. The goal of this step is to build a source agnostic representation of the data. Spark does not really need to know if the data was originally in csv, txt, json…etc. Parquet format has been chosen to enable fast execution of Spark queries.
Step 2: The purpose of this function is to use Spark SQL engine to execute pre-‐defined sql template that contains the transformations to apply to the data. Most of the time, the requirement for the data to be extracted can change. We need to avoid rewriting the entire script, every time that we need to extract some pieces of information for the data consumers. To support this functionality, we rely on two concepts:
- Configuration files that contains the information about the data that we want to compute and where we want to store the results in Hive
- Templates files that contains SQL queries to be executed to apply transformation/extraction
Summary & Benefits
This concept of a framework aims to avoid rewriting new scripts for every new data sources available. Spark has been identified as a good candidate, as it offers the ability to execute various workloads and support multiple type of computations.
The framework has been designed with flexibility in mind. The current two core modules are independent and can be replaced or customized without impacting the whole solution.
In addition, it enables a team of data engineer to easily collaborate on a project using the same core engine. Each member can focus on writing their templates and updating configuration files. Everyone can benefit from evolution to the core modules.
Many companies are looking to enrich their conventional data warehouse or data platform with new sources of information that usually come in a variety of format. With the rise of the internet of the things, this concern will certainly become more and more important, as the companies would like to ingest and derive value from these new data-sets. They need to do it in an agile way:
- Acquire the data
- Test concepts
- Validate and industrialize
A framework is definitely the way to go. The key benefits are:
- Agility & time to market – give the ability to quickly acquire new data sources
- Re-usability – data engineers need to focus on data integration without re-‐writing core functionalities for every new project or proof of concept.
- Accessibility – core ETL transformations are written in SQL making therefore this framework accessible to most data engineers.
- Flexibility – each module of the framework can be enhanced independently and everyone will benefit from these evolution’s.