Code: ΜΔΑ-285
Semester: 2nd
ECTS: 7.5
E-Services
Category: Obligatory
Instructors:

Objective

The main objective of this course is to introduce students to modern techniques, systems, and platforms for Big Data management and scalable processing.

Emphasis is placed on scalability, efficiency, and fault tolerance across the complete Big Data life-cycle, from data acquisition and integration to data processing and interpretation. In terms of expected results, students will acquire strong technical skills in Big Data management and will be able to design and implement algorithms for data processing at scale.

After successfully completing the course, students will be able to:

  • develop data-centric applications with an emphasis on performance and scalability
  • select and use the most appropriate big data processing tools and systems
  • evaluate and improve computationally intensive parts of a big data processing algorithm
  • apply the data processing techniques best suited to the data under analysis
  • develop efficient big data processing algorithms

Learning outcomes

  • Search for, analysis and synthesis of data and information, with the use of the necessary technology
  • Adapting to new situations
  • Decision-making
  • Working independently
  • Production of new research ideas
  • Criticism and self-criticism

Syllabus

  • Big data, advanced modeling techniques and MapReduce

    Basic concepts. Applications. Use cases. Definitions. The 6 Vs: Volume, Variety, Velocity, Veracity, Validity, and Volatility. Advanced modeling techniques related to Big Data. Problem formulation. Requirements for large-scale data management platforms. Research opportunities and challenges. The process of analyzing Big Data. Challenges associated with large-scale data. The MapReduce programming framework.
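
    As an illustration of the MapReduce programming framework, the sketch below simulates the map, shuffle, and reduce phases in plain Python on a word-count task (the input documents are made up for the example):

      from collections import defaultdict

      def map_phase(document):
          # Map: emit a (key, value) pair for every word in the record.
          for word in document.split():
              yield (word.lower(), 1)

      def shuffle(pairs):
          # Shuffle: group all emitted values by key.
          groups = defaultdict(list)
          for key, value in pairs:
              groups[key].append(value)
          return groups

      def reduce_phase(key, values):
          # Reduce: aggregate the values of a single key.
          return key, sum(values)

      documents = ["big data needs big systems", "data at scale"]
      pairs = [p for doc in documents for p in map_phase(doc)]
      counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
      print(counts)  # {'big': 2, 'data': 2, ...}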

  • Hadoop & HDFS

    The Hadoop Distributed File System (HDFS), replication, fault tolerance, high read throughput. Apache Hadoop as an implementation of MapReduce. Limitations of Hadoop. Designing MapReduce jobs. Data partitioning techniques. Simple operations (counting, summation) and complex operations (joins).
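
    A rough sketch of a word-count job written in the Hadoop Streaming style, where the mapper and reducer are plain scripts reading tab-separated records from standard input; the script name and task are illustrative, and Hadoop delivers the reducer its input already partitioned and sorted by key:

      import sys
      from itertools import groupby

      def mapper(stdin):
          # Mapper: emit one tab-separated "word<TAB>1" line per input word.
          for line in stdin:
              for word in line.split():
                  print(f"{word.lower()}\t1")

      def reducer(stdin):
          # Reducer: input arrives grouped by key, so counts can be summed per word.
          pairs = (line.rstrip("\n").split("\t") for line in stdin)
          for word, group in groupby(pairs, key=lambda kv: kv[0]):
              print(f"{word}\t{sum(int(count) for _, count in group)}")

      if __name__ == "__main__":
          # Run as "python wordcount.py map" or "python wordcount.py reduce".
          mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)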

  • Batch Processing I (Apache Spark)

    Parallel processing, in-memory processing, DataFrames in Spark, columnar and row-wise storage, example usage.
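
    A minimal PySpark DataFrame sketch, assuming a local Spark installation; the data and the output path are illustrative:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

      # Build a small DataFrame in memory; in practice the data would be read
      # from a distributed source such as HDFS or Parquet files.
      df = spark.createDataFrame(
          [("alice", 34), ("bob", 29), ("carol", 41)],
          ["name", "age"],
      )

      # Column-oriented operations: filtering and aggregation.
      df.filter(df.age > 30).groupBy().avg("age").show()

      # Writing to Parquet persists the data in a columnar file format.
      df.write.mode("overwrite").parquet("/tmp/people.parquet")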

  • Batch Processing II (Apache Spark)

    Resilient Distributed Datasets (RDDs), immutable variables, actions and transformations, lazy evaluation, the Spark shell, comparison between Spark and Hadoop.
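
    A small PySpark sketch of RDD transformations and actions, illustrating lazy evaluation; the input strings are made up:

      from pyspark import SparkContext

      sc = SparkContext("local[*]", "rdd-demo")

      # Transformations (flatMap, map, reduceByKey) are lazy: they only build
      # a lineage graph, and nothing runs until an action is called.
      words = sc.parallelize(["big data", "data at scale", "big systems"])
      counts = (words.flatMap(lambda line: line.split())
                     .map(lambda w: (w, 1))
                     .reduceByKey(lambda a, b: a + b))

      # collect() is an action: it triggers execution of the whole lineage.
      print(counts.collect())
      sc.stop()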

  • Batch Processing III (Apache Spark)

    Declarative query processing, Spark SQL, programming with DataFrames, Spark’s processing engine, data partitioning, working with JSON data.
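
    A short Spark SQL sketch; the JSON input path, view name, and column names are illustrative:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

      # Read semi-structured JSON; the schema is inferred automatically.
      events = spark.read.json("/data/events.json")

      # Register a temporary view and query it declaratively.
      events.createOrReplaceTempView("events")
      spark.sql("""
          SELECT user, COUNT(*) AS n_events
          FROM events
          GROUP BY user
          ORDER BY n_events DESC
      """).show()

      # Explicit repartitioning by a column controls how the data is partitioned.
      events.repartition(8, "user").write.mode("overwrite").parquet("/data/events_by_user")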

  • Real-Time Processing I (Apache Storm)

    Dataflow management systems, dataflow processing, programming in Apache Storm, bolts and spouts, topologies in Storm.
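
    One way to write spouts and bolts from Python is the third-party streamparse library; the rough sketch below assumes streamparse is installed and a Storm cluster is configured, and the component names and data are illustrative:

      from itertools import cycle
      from streamparse import Spout, Bolt

      class SentenceSpout(Spout):
          # A spout is a source of tuples; this one cycles over fixed sentences.
          outputs = ["sentence"]

          def initialize(self, stormconf, context):
              self.sentences = cycle(["big data at scale", "streams of big data"])

          def next_tuple(self):
              self.emit([next(self.sentences)])

      class SplitBolt(Bolt):
          # A bolt consumes tuples from upstream components and emits new ones;
          # a topology wires spouts and bolts together.
          outputs = ["word"]

          def process(self, tup):
              for word in tup.values[0].split():
                  self.emit([word])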

  • Real-Time Processing II (Spark Streaming)

    Micro-batching, Spark Streaming, stateless and stateful processing, windowing mechanisms.
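
    A minimal Spark Streaming (DStream) sketch with 5-second micro-batches and a sliding window, assuming a text stream on a local socket; host, port, and checkpoint path are illustrative:

      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext

      sc = SparkContext("local[2]", "streaming-demo")
      ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches
      ssc.checkpoint("/tmp/streaming-checkpoint")   # needed for windowed state

      lines = ssc.socketTextStream("localhost", 9999)

      # Word counts over a 60-second window, recomputed every 10 seconds;
      # the inverse function lets Spark subtract data leaving the window.
      counts = (lines.flatMap(lambda line: line.split())
                     .map(lambda w: (w, 1))
                     .reduceByKeyAndWindow(lambda a, b: a + b,
                                           lambda a, b: a - b,
                                           windowDuration=60,
                                           slideDuration=10))
      counts.pprint()

      ssc.start()
      ssc.awaitTermination()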

  • Real-Time Processing III (Apache Kafka)

    Apache Kafka, basic concepts, publish/subscribe architecture, real-time pipelined data processing.
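
    A minimal publish/subscribe sketch with the kafka-python client, assuming a broker at localhost:9092; the topic, consumer group, and message contents are illustrative:

      import json
      from kafka import KafkaProducer, KafkaConsumer

      # Producer: publishes JSON-encoded events to a topic.
      producer = KafkaProducer(
          bootstrap_servers="localhost:9092",
          value_serializer=lambda v: json.dumps(v).encode("utf-8"),
      )
      producer.send("clicks", {"user": "alice", "page": "/home"})
      producer.flush()

      # Consumer: subscribes to the topic as part of a consumer group.
      consumer = KafkaConsumer(
          "clicks",
          bootstrap_servers="localhost:9092",
          group_id="analytics",
          auto_offset_reset="earliest",
          value_deserializer=lambda v: json.loads(v.decode("utf-8")),
      )
      for message in consumer:
          print(message.value)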

  • The HBase system

    Storing data for random access, columnar storage, basic HBase concepts, advanced concepts and features.
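
    A small sketch using the happybase client over HBase's Thrift gateway, assuming a table named web_metrics with a "stats" column family already exists; all names and values are illustrative:

      import happybase

      connection = happybase.Connection("localhost")
      table = connection.table("web_metrics")

      # Rows are keyed for random access; values live in column families.
      table.put(b"page#/home", {b"stats:views": b"1042", b"stats:uniques": b"311"})

      # Point lookup by row key.
      print(table.row(b"page#/home")[b"stats:views"])

      # Range scan over a contiguous block of row keys.
      for key, data in table.scan(row_prefix=b"page#"):
          print(key, data)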

  • Big Data Research Topics

    Selected research topics for Big Data management and processing.

Bibliography