Course Code



1st Semester

ECTS Credits


Type of Course



Christos Doulkeridis

A. Vlachou

Big Data and Analytics I: Techniques and Tools


The main objective of this course is to present modern techniques, systems and platforms for Big Data management and scalable data analytics. Emphasis is given to scalability, efficiency and fault tolerance across the complete life cycle of Big Data, from data acquisition and integration to data processing and interpretation. Another important direction is data analytics over diverse data types, including text, web data and social data. On completing the course, students are expected to have acquired strong technical skills in Big Data management and to be able to design and implement algorithms for data analytics at scale.


Course Contents

Big Data and advanced modelling techniques

Basic concepts. Applications. Use cases. Definitions. 6Vs – Volume, Variety, Velocity, Veracity, Validity and Volatility. Advanced modelling techniques in relation to Big Data. Problem formulation. Requirements for Big Data management platforms. Opportunities and research challenges. The Big Data analysis pipeline. Challenges related to Big Data.

Principles of distributed and parallel data management

Physical storage. Row vs. column layout. Local-global indexing. Partitioning techniques. Distributed query processing. Query optimization. Load balancing.

Data integration

Data types (text, semi-structured, structured, multidimensional). From data to information to knowledge. Data acquisition. Data cleaning. Data transformation. Data fusion. Data integration. Semantic data integration. Privacy and security issues.

NoSQL stores

Motivation for NoSQL stores. Comparison with relational databases. ACID properties. BASE properties. Eventual consistency. Key-value stores. Document stores (MongoDB, CouchDB). Extensible record stores (Google’s BigTable, Cassandra).


Presentation of MongoDB. Architecture of MongoDB. Query router. Config servers. Shards. Replicas. Operations supported by MongoDB.

Data mining and analytics

Challenges for large-scale data analytics. Apache Mahout. Data mining and analytics. Clustering. Classification. Recommender systems.

Real-time analytics I

Stream processing and analytics. Analytics in real-time. Programming with Apache Storm. High-level abstractions over Storm (Trident).

Real-time analytics II

The case of in-memory processing and analytics. Complex event processing (CEP). Apache Spark. Comparison between Spark and Hadoop. Micro-batching and Spark Streaming. SparkSQL.

Web analytics

Web science. Search algorithms. Ranking. Log analysis. Analysing website traffic: web logs, click streams, query logs, and page views.

Time series analysis

Examples and motivation. Trend detection. Moving averages. Smoothing. The correlation function.

Recommended Readings

  • Özsu, M. T., Valduriez, P. (2011): Principles of Distributed Database Systems, Third Edition. Springer, ISBN 978-1-4419-8833-1, pp. I-XIX, 1-845.
  • Jagadish, H. V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J. M., Ramakrishnan, R., Shahabi, C. (2014): Big Data and Its Technical Challenges. Communications of the ACM, Vol. 57 No. 7, pages 86-94.
  • Cattell, R. (2010): Scalable SQL and NoSQL data stores. ACM SIGMOD Record, Volume 39 Issue 4, December 2010, pages 12-27.
  • White, T. (2012): Hadoop: The Definitive Guide, 3rd Edition. O’Reilly Media, ISBN-10: 1449311520.
  • Abadi, D. et al. (2016): The Beckman Report on Database Research. Communications of the ACM, Vol. 59 No. 2, pages 92-99.

Additional Readings

  • Golab, L., Özsu, M.T. (2010): Data Stream Management. Morgan & Claypool Publishers, Synthesis Lectures on Data Management.
  • Aggarwal, C.C. (2011): Social Network Data Analytics, Springer, ISBN: 978-1-4419-8462-3.
  • Mohan, C. (2013): History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla. Proceedings of EDBT’13, Genoa, Italy.
  • Selected research articles.