Course Code

ΜΔΑ-210

Semester

1st Semester

ECTS Credits

7.5

Type of Course

Mandatory

Faculty

Dimosthenis Kyriazis

A. Vlachou

 

Cloud Computing with Hadoop

Objective

The main objective of this course is to present both architecture- and implementation-related topics in the domain of cloud and edge computing, with an emphasis on delivering the infrastructure required for big data management and processing. The course provides the necessary theoretical background on cloud and edge / fog environments, and aims to familiarize participants with functional cloud technologies as well as with the implementation and execution of cloud-based data analytics and processing tasks (through the corresponding laboratory seminars). Approaches and methodologies across all infrastructure layers will be addressed, with emphasis on emerging cloud architectures (computing, storage, event-based, etc.) and their use for data management, as well as on the main building blocks of cloud environments (resource types, service categories, service and event level agreements, multi-level workflow management). In addition, emphasis will be placed on technologies for data management and processing based on Hadoop, exploiting the MapReduce framework. Open-source middleware for building cloud infrastructures (e.g. OpenStack) will be analysed and used in the laboratory sessions, in which the corresponding Hadoop clusters will also be set up. Furthermore, the course will focus on declarative querying and high-level languages for specifying the data analytics tasks to be performed.

 

Course Contents

Cloud and edge computing concepts and architectures

Definitions. Goals. Challenges. Application areas. Service level agreements. Service phases. Distinct layers based on the Service-Platform-Infrastructure (SPI) model. Architectural design. Service oriented architecture. Next generation architectures / internet of services. Fog and edge computing concepts.

Platform as a Service and Software as a Service layers

Service level agreement negotiation. Service selection. Execution. Monitoring. Evaluation. Accounting and billing. Workflow management. Wrappers for control, monitoring and configuration of application service components. Methodology for developing, modeling and deploying applications. Performance estimation through analytical models and artificial neural networks. Application classification based on stereotypes.

Hands-on laboratory 1

Development, configuration and execution of applications in Google Cloud, using the Google App Engine platform.

Infrastructure as a service

Virtualization types (native, hardware, OS-level, application). Cloud network infrastructure management. Performance management. Connectivity. Routing. Traffic engineering. Security policies in cloud monitoring systems.

Hands-on laboratory 2

Installation of cloud computing infrastructure using the mainstream middleware OpenStack.

Storage cloud technologies

Architectures addressing various issues (e.g. scalability, data integrity, namespace management, replication) in distributed object data management approaches. Computational storage. Computational and data issues. Execution constraints. Triggering conditions. Interactivity with other data or services. Content-centric access to data. Metadata annotation. Content network implementation techniques (based on content linking). Storage object access mechanisms.

Batch processing and analysis of Big Data

Scalability. Efficiency. Fault-tolerance. Programming solutions for batch processing and analysis of Big Data. MapReduce framework. Programming in MapReduce/Hadoop, HDFS.
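
As a rough illustration of the MapReduce programming model named above, the following sketch runs the map, shuffle and reduce phases of a word count inside a single Python process. It is not Hadoop code: the function names (map_words, reduce_counts, run_job) and the sample input are assumptions made for this example, and a real Hadoop job would distribute the same phases across cluster nodes and read its input from HDFS.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(_key: int, line: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Iterator[tuple[str, int]]:
    """Reduce phase: sum all counts that were grouped under one word."""
    yield word, sum(counts)


def run_job(records: Iterable[tuple], mapper: Callable, reducer: Callable) -> list:
    """Run map, shuffle (group values by key) and reduce in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group by output key
    results = []
    for out_key in sorted(groups):  # keys reach the reducer in sorted order
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Hypothetical input: (line offset, line text) pairs, as a text input format would supply.
lines = ["big data on the cloud", "the cloud stores big data"]
word_counts = dict(run_job(enumerate(lines), map_words, reduce_counts))
```

The mapper and reducer never share state; all communication goes through the grouped key-value pairs, which is what allows a framework to run them on different machines and re-execute failed tasks.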

Hadoop cluster development and configuration. Hands-on laboratory 3

Installation, configuration and use of a Hadoop cluster on top of an OpenStack cloud environment for the execution of batch processing and analytics tasks (using MapReduce).

Processing joins in MapReduce

The case of join processing. Lack of inherent support in MapReduce. Types of joins and processing algorithms in MapReduce. Equi-joins. Theta joins. Set similarity joins. Top-k joins. Overview of techniques for efficient processing of joins.
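
Because MapReduce has no built-in join operator, an equi-join is commonly expressed by tagging each record with its source relation in the map phase and combining the two sides per join key in the reduce phase (a reduce-side, or repartition, join). The sketch below simulates that idea in a single Python process. The relation tags "L" and "R", the function names and the sample data are assumptions for illustration only; it is one possible formulation, not the specific algorithm taught in the course.

```python
from collections import defaultdict
from itertools import product


def map_tagged(relation, records, key_index):
    """Map phase: emit (join_key, (relation_tag, record)) for each record."""
    for record in records:
        yield record[key_index], (relation, record)


def reduce_join(tagged_values):
    """Reduce phase: pair every left record with every right record of one key."""
    left = [rec for tag, rec in tagged_values if tag == "L"]
    right = [rec for tag, rec in tagged_values if tag == "R"]
    for l_rec, r_rec in product(left, right):
        yield l_rec + r_rec  # concatenate the two tuples into one joined row


def repartition_join(left, right, left_key, right_key):
    """Simulate the shuffle: group tagged records from both inputs by join key."""
    groups = defaultdict(list)
    for key, tagged in map_tagged("L", left, left_key):
        groups[key].append(tagged)
    for key, tagged in map_tagged("R", right, right_key):
        groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(groups[key]))
    return joined


# Hypothetical relations: users(id, name) and orders(order_id, user_id, item).
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(101, 1, "disk"), (102, 1, "cpu"), (103, 3, "ram")]
joined = repartition_join(users, orders, left_key=0, right_key=1)
```

In this formulation every record of both inputs crosses the shuffle, and one reducer must hold all records sharing a key, which is why skewed keys and large inputs motivate the more efficient join techniques listed above.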

Read/Write Access to Big Data

Random access to Big Data. Read/write functionality vs. read-only. Google’s BigTable. Column-oriented key-value stores. Apache HBase. Developing data-intensive processing in HBase.
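
To make the column-oriented key-value data model more concrete, the sketch below implements a toy in-memory table in which each cell is addressed by a row key, a "family:qualifier" column name and a timestamp, and rows are kept sorted so that key ranges can be scanned. The class ToyWideColumnTable and its methods are hypothetical names chosen for this example; they only mimic the data model and are not the Apache HBase client API.

```python
from bisect import bisect_left, insort


class ToyWideColumnTable:
    """Sparse, sorted map: (row key, column, timestamp) -> value."""

    def __init__(self):
        self._row_keys = []  # row keys kept in sorted order for range scans
        self._cells = {}     # row -> {column -> [(timestamp, value), ...]}

    def put(self, row, column, value, timestamp):
        """Write one cell version; older versions of the cell are retained."""
        if row not in self._cells:
            insort(self._row_keys, row)
            self._cells[row] = {}
        versions = self._cells[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])  # newest version last

    def get(self, row, column):
        """Random read: return the newest value of one cell, or None."""
        versions = self._cells.get(row, {}).get(column)
        return versions[-1][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield (row, newest cell values) for start_row <= row < stop_row."""
        lo = bisect_left(self._row_keys, start_row)
        hi = bisect_left(self._row_keys, stop_row)
        for row in self._row_keys[lo:hi]:
            cells = self._cells[row]
            yield row, {col: vers[-1][1] for col, vers in cells.items()}


# Hypothetical data: row keys share a prefix so related rows sort together.
table = ToyWideColumnTable()
table.put("user#002", "info:name", "bob", timestamp=1)
table.put("user#001", "info:name", "alice", timestamp=1)
table.put("user#001", "info:name", "alice b.", timestamp=2)
table.put("user#001", "stats:logins", "7", timestamp=1)
table.put("video#001", "info:title", "intro", timestamp=1)

latest_name = table.get("user#001", "info:name")
user_rows = list(table.scan("user#", "user#~"))
```

Rows may carry different columns, so the table stays sparse, and the choice of row key determines which rows a range scan can retrieve together.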

Declarative querying and high-level languages

The case of declarative query languages. Advantages of declarative querying. Data warehousing for Big Data. Apache Hive. High-level languages for writing data analysis programs. Apache Pig platform. Pig Latin language. Processing workflow jobs.

Limitations of Hadoop

Restrictions of Hadoop. Cases with lack of efficiency. Data layouts. Indexing. Early termination. Load balancing. Recomputation. Iterative processing. In-memory processing. Research prototypes. Systems that improve Hadoop.

Recommended Readings

  • M. Trovati, R. Hill, A. Anjum, “Big-Data Analytics and Cloud Computing: Theory, Algorithms and Applications”, January 2016
  • T. Erl, “Cloud Computing: Concepts, Technology & Architecture”, May 2013
  • T. White, “Hadoop: The Definitive Guide”, September 2015
  • A. Holmes, “Hadoop in Practice”, October 2012