Code: ΜΔΑ-285
Semester: 2nd
ECTS: 7.5
E-Services
Category: Obligatory
Instructors:

Objective

The main objective of this course is to introduce students to modern techniques, systems, and platforms for Big Data management and scalable processing.

Emphasis is placed on scalability, efficiency, and fault tolerance across the complete Big Data life-cycle, from data acquisition and integration to data processing and interpretation. In terms of expected results, students will acquire strong technical skills in Big Data management and will be able to design and implement algorithms for data processing at scale.

After successfully completing the course, students will be able to:

  • develop data-centric applications with an emphasis on performance and scalability
  • select and use the most appropriate big data processing tools and systems
  • evaluate and improve computationally intensive parts of a big data processing algorithm
  • apply the data processing techniques best suited to the data under analysis
  • develop efficient big data processing algorithms

Learning outcomes

  • Search for, analysis and synthesis of data and information, with the use of the necessary technology
  • Adapting to new situations
  • Decision-making
  • Working independently
  • Production of new research ideas
  • Criticism and self-criticism

Syllabus

  • Big data, advanced modeling techniques and MapReduce

    Basic concepts. Applications. Use cases. Definitions. The 6 Vs: Volume, Variety, Velocity, Veracity, Validity, and Volatility. Advanced modeling techniques related to Big Data. Problem formulation. Requirements for large-scale data management platforms. Research opportunities and challenges. The process of analyzing Big Data. Challenges associated with large-scale data. The MapReduce programming framework.
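
    As an illustration of the MapReduce programming framework, the sketch below simulates the map, shuffle, and reduce phases in plain Python on a word-count task (the input documents are made up for the example):

      from collections import defaultdict

      def map_phase(document):
          # Map: emit a (key, value) pair for every word in the record.
          for word in document.split():
              yield (word.lower(), 1)

      def shuffle(pairs):
          # Shuffle: group all emitted values by key.
          groups = defaultdict(list)
          for key, value in pairs:
              groups[key].append(value)
          return groups

      def reduce_phase(key, values):
          # Reduce: aggregate the values of a single key.
          return key, sum(values)

      documents = ["big data needs big systems", "data at scale"]
      pairs = [p for doc in documents for p in map_phase(doc)]
      counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
      print(counts)  # {'big': 2, 'data': 2, ...}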

  • Hadoop & HDFS

    The Hadoop Distributed File System (HDFS), replication, fault tolerance, high read throughput. Apache Hadoop as an implementation of MapReduce. Limitations of Hadoop. Designing MapReduce jobs. Data partitioning techniques. Simple operations (counting, summation) and complex operations (joins).
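
    A rough sketch of a word-count job written in the Hadoop Streaming style, where the mapper and reducer are plain scripts reading tab-separated records from standard input; the script name and task are illustrative, and Hadoop delivers the reducer its input already partitioned and sorted by key:

      import sys
      from itertools import groupby

      def mapper(stdin):
          # Mapper: emit one tab-separated "word<TAB>1" line per input word.
          for line in stdin:
              for word in line.split():
                  print(f"{word.lower()}\t1")

      def reducer(stdin):
          # Reducer: input arrives grouped by key, so counts can be summed per word.
          pairs = (line.rstrip("\n").split("\t") for line in stdin)
          for word, group in groupby(pairs, key=lambda kv: kv[0]):
              print(f"{word}\t{sum(int(count) for _, count in group)}")

      if __name__ == "__main__":
          # Run as "python wordcount.py map" or "python wordcount.py reduce".
          mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)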

  • Batch Processing I (Apache Spark)

    Parallel processing, in-memory processing, DataFrames in Spark, columnar and row-wise storage, example usage.
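
    A minimal PySpark DataFrame sketch, assuming a local Spark installation; the data and the output path are illustrative:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

      # Build a small DataFrame in memory; in practice the data would be read
      # from a distributed source such as HDFS or Parquet files.
      df = spark.createDataFrame(
          [("alice", 34), ("bob", 29), ("carol", 41)],
          ["name", "age"],
      )

      # Column-oriented operations: filtering and aggregation.
      df.filter(df.age > 30).groupBy().avg("age").show()

      # Writing to Parquet persists the data in a columnar file format.
      df.write.mode("overwrite").parquet("/tmp/people.parquet")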

  • Batch Processing II (Apache Spark)

    Resilient Distributed Datasets (RDDs), immutable variables, actions and transformations, lazy evaluation, the Spark shell, comparison between Spark and Hadoop.
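
    A small PySpark sketch of RDD transformations and actions, illustrating lazy evaluation; the input strings are made up:

      from pyspark import SparkContext

      sc = SparkContext("local[*]", "rdd-demo")

      # Transformations (flatMap, map, reduceByKey) are lazy: they only build
      # a lineage graph, and nothing runs until an action is called.
      words = sc.parallelize(["big data", "data at scale", "big systems"])
      counts = (words.flatMap(lambda line: line.split())
                     .map(lambda w: (w, 1))
                     .reduceByKey(lambda a, b: a + b))

      # collect() is an action: it triggers execution of the whole lineage.
      print(counts.collect())
      sc.stop()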

  • Batch Processing III (Apache Spark)

    Declarative query processing, Spark SQL, programming with DataFrames, Spark’s processing engine, data partitioning, working with JSON data.
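
    A short Spark SQL sketch; the JSON input path, view name, and column names are illustrative:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

      # Read semi-structured JSON; the schema is inferred automatically.
      events = spark.read.json("/data/events.json")

      # Register a temporary view and query it declaratively.
      events.createOrReplaceTempView("events")
      spark.sql("""
          SELECT user, COUNT(*) AS n_events
          FROM events
          GROUP BY user
          ORDER BY n_events DESC
      """).show()

      # Explicit repartitioning by a column controls how the data is partitioned.
      events.repartition(8, "user").write.mode("overwrite").parquet("/data/events_by_user")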

  • Real-Time Processing I (Apache Storm)

    Dataflow management systems, dataflow processing, programming in Apache Storm, bolts and spouts, topologies in Storm.
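
    One way to write spouts and bolts from Python is the third-party streamparse library; the rough sketch below assumes streamparse is installed and a Storm cluster is configured, and the component names and data are illustrative:

      from itertools import cycle
      from streamparse import Spout, Bolt

      class SentenceSpout(Spout):
          # A spout is a source of tuples; this one cycles over fixed sentences.
          outputs = ["sentence"]

          def initialize(self, stormconf, context):
              self.sentences = cycle(["big data at scale", "streams of big data"])

          def next_tuple(self):
              self.emit([next(self.sentences)])

      class SplitBolt(Bolt):
          # A bolt consumes tuples from upstream components and emits new ones;
          # a topology wires spouts and bolts together.
          outputs = ["word"]

          def process(self, tup):
              for word in tup.values[0].split():
                  self.emit([word])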

  • Real-Time Processing II (Spark Streaming)

    Micro-batching, Spark Streaming, stateless and stateful processing, windowing mechanisms.
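
    A minimal Spark Streaming (DStream) sketch with 5-second micro-batches and a sliding window, assuming a text stream on a local socket; host, port, and checkpoint path are illustrative:

      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext

      sc = SparkContext("local[2]", "streaming-demo")
      ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches
      ssc.checkpoint("/tmp/streaming-checkpoint")   # needed for windowed state

      lines = ssc.socketTextStream("localhost", 9999)

      # Word counts over a 60-second window, recomputed every 10 seconds;
      # the inverse function lets Spark subtract data leaving the window.
      counts = (lines.flatMap(lambda line: line.split())
                     .map(lambda w: (w, 1))
                     .reduceByKeyAndWindow(lambda a, b: a + b,
                                           lambda a, b: a - b,
                                           windowDuration=60,
                                           slideDuration=10))
      counts.pprint()

      ssc.start()
      ssc.awaitTermination()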

  • Real-Time Processing III (Apache Kafka)

    Apache Kafka, basic concepts, publish/subscribe architecture, real-time pipelined data processing.
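
    A minimal publish/subscribe sketch with the kafka-python client, assuming a broker at localhost:9092; the topic, consumer group, and message contents are illustrative:

      import json
      from kafka import KafkaProducer, KafkaConsumer

      # Producer: publishes JSON-encoded events to a topic.
      producer = KafkaProducer(
          bootstrap_servers="localhost:9092",
          value_serializer=lambda v: json.dumps(v).encode("utf-8"),
      )
      producer.send("clicks", {"user": "alice", "page": "/home"})
      producer.flush()

      # Consumer: subscribes to the topic as part of a consumer group.
      consumer = KafkaConsumer(
          "clicks",
          bootstrap_servers="localhost:9092",
          group_id="analytics",
          auto_offset_reset="earliest",
          value_deserializer=lambda v: json.loads(v.decode("utf-8")),
      )
      for message in consumer:
          print(message.value)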

  • The HBase system

    Storing data for random access, columnar storage, basic HBase concepts, advanced concepts and features.
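
    A small sketch using the happybase client over HBase's Thrift gateway, assuming a table named web_metrics with a "stats" column family already exists; all names and values are illustrative:

      import happybase

      connection = happybase.Connection("localhost")
      table = connection.table("web_metrics")

      # Rows are keyed for random access; values live in column families.
      table.put(b"page#/home", {b"stats:views": b"1042", b"stats:uniques": b"311"})

      # Point lookup by row key.
      print(table.row(b"page#/home")[b"stats:views"])

      # Range scan over a contiguous block of row keys.
      for key, data in table.scan(row_prefix=b"page#"):
          print(key, data)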

  • Big Data Research Topics

    Selected research topics for Big Data management and processing.

Bibliography