The main goal of the course is to introduce students to modern techniques, systems and platforms for the efficient management and processing of large-scale data. Emphasis is placed on scalability, efficiency and fault tolerance across the full lifecycle of large-scale data, from data collection to data integration and interpretation. In addition, the course examines the processing of multi-source data using flows of processing services, with the aim of combining the information produced by the different services in a flow. Through this course, students are expected to acquire important technical skills in large-scale data management and to learn to design and implement large-scale data processing algorithms.
Big data and advanced modeling techniques
Basic concepts. Applications. Use cases. Definitions. The 6 Vs: Volume, Variety, Velocity, Veracity, Validity and Volatility. Advanced modeling techniques related to Big Data. Problem formulation. Requirements for large-scale data management platforms. Research opportunities and challenges. The Big Data analysis process. Challenges related to large-scale data.
The MapReduce programming framework
MapReduce job design. Data partitioning techniques. Simple functions (counting, aggregation) and complex functions (joins).
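As an illustration of the counting case, the following is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It does not use Hadoop; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)


def word_count(documents):
    # Each document plays the role of one input split.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big systems", "data streams"])
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here all three steps run in one process to keep the structure visible.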
Hadoop & HDFS (Laboratory)
The Hadoop Distributed File System (HDFS): replication, fault tolerance, high availability. Apache Hadoop as an implementation of MapReduce. Limitations of Hadoop.
Batch processing I (Laboratory)
Parallel processing, in-memory processing, Apache Spark, Resilient Distributed Datasets (RDDs), immutable variables, actions and transformations, lazy evaluation, the Spark shell, comparison between Spark and Hadoop.
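The split between transformations and actions can be sketched without Spark. The toy class below is an assumption made for illustration, not Spark's API: LazyDataset only records map and filter steps and returns a new immutable object each time, and no work happens until the collect action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, and transformations are only recorded."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, runs nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new dataset, runs nothing yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded pipeline executed.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)  # records when the function is really invoked
    return x * x


numbers = LazyDataset(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
work_before_action = len(calls)  # still 0: nothing has been computed
result = even_squares.collect()  # triggers the computation
print(work_before_action, result)
```

Deferring the work until an action is requested is what allows an engine to see the whole pipeline before running it.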
Batch processing II (Laboratory)
DataFrames in Spark, columnar and row-oriented storage, declarative query processing, Spark SQL.
Real-time processing I
Data processing, programming in Apache Storm, bolts and spouts, Storm topologies, Apache Kafka, real-time pipelined data processing.
Real-time processing II (Laboratory)
Micro-batching, stateful and stateless processing, windowing mechanisms, Spark Streaming.
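One common window mechanism is the fixed-size (tumbling) window. The sketch below, in plain Python rather than Spark Streaming, assigns each event to a window by its timestamp and keeps a running count per window. The function name tumbling_window_counts and the (timestamp, payload) event layout are assumptions made for this example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size, non-overlapping time window.

    events: iterable of (timestamp_in_seconds, payload) pairs.
    Returns a dict mapping each window start time to its event count.
    """
    counts = Counter()  # state that is carried across events
    for timestamp, _payload in events:
        # Every timestamp falls into exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
per_window = tumbling_window_counts(events, 10)
print(per_window)
```

The counter is what makes this computation keep state between events; a per-event transformation that needs no such memory would not.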
Lambda architectures for data analysis
Approaches to storing, using and analyzing data through flows of data analytics services. Batch layer for data storage, serving layer for indexing, and real-time processing layer.
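The interplay of the three layers can be sketched in a few lines of Python, assuming a simple page-view counting task. All names here (build_batch_view, update_realtime_view, query) and the event layout are hypothetical: the batch layer recomputes a view from the stored data, the real-time layer counts only events that arrived since the last batch run, and a query merges both.

```python
from collections import Counter


def build_batch_view(stored_events):
    """Batch layer: recompute the view from all stored events."""
    return Counter(event["page"] for event in stored_events)


def update_realtime_view(realtime_view, event):
    """Real-time layer: incrementally absorb one newly arrived event."""
    realtime_view[event["page"]] += 1


def query(page, batch_view, realtime_view):
    """Serving side: merge the batch view with the recent increments."""
    return batch_view[page] + realtime_view[page]


stored_events = [{"page": "home"}, {"page": "home"}, {"page": "docs"}]
batch_view = build_batch_view(stored_events)

realtime_view = Counter()
for event in [{"page": "home"}, {"page": "pricing"}]:
    update_realtime_view(realtime_view, event)

home_views = query("home", batch_view, realtime_view)
print(home_views)
```

Once the next batch run has absorbed the recent events, the real-time view can be discarded and rebuilt from empty.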
Asynchronous data processing
Server-side techniques for serving many simultaneous connections with minimal overhead (CPU / memory) per processing service, the concept of callbacks, use of a single thread to implement event loops. The Node.js framework.
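The single-threaded event loop with callbacks can be illustrated with Python's asyncio, used here as a stand-in for Node.js. This is a sketch under stated assumptions: the serve function and the (request_id, delay) pairs are hypothetical, and each delay simulates waiting on I/O.

```python
import asyncio


def serve(requests):
    """Handle several simulated requests on one thread using callbacks.

    requests: list of (request_id, delay_in_seconds) pairs.
    Returns the order in which events happened.
    """
    log = []
    if not requests:
        return log

    loop = asyncio.new_event_loop()
    pending = len(requests)

    def on_done(request_id):
        # Callback invoked by the event loop when the simulated I/O finishes.
        nonlocal pending
        log.append(f"done {request_id}")
        pending -= 1
        if pending == 0:
            loop.stop()

    for request_id, delay in requests:
        log.append(f"start {request_id}")
        # Register the callback and return immediately, without blocking.
        loop.call_later(delay, on_done, request_id)

    loop.run_forever()
    loop.close()
    return log


log = serve([("A", 0.02), ("B", 0.01)])
print(log)
```

Both requests are accepted before either finishes, and the shorter one completes first, although only one thread is involved.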
Node-RED for data processing service flows (Laboratory)
Presentation of nodes / services for various functions, data format transformation, creation / use of REST services, file reading, creation of data processing services / functions.
- Özsu, M. T., Valduriez P. (2011): Principles of Distributed Database Systems, Third Edition. Springer, ISBN 978-1-4419-8833-1, pp. I-XIX, 1-845.
- Jagadish, H. V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J. M., Ramakrishnan, R., Shahabi, C. (2014): Big Data and Its Technical Challenges. Communications of the ACM, Vol. 57 No. 7, pages 86-94.
- Marz, N., Warren, J. (2015): Big Data: Principles and best practices of scalable realtime systems. Manning Publications. ISBN: 9781617290343.
- White, T. (2012): Hadoop: The Definitive Guide, 3rd Edition. O’Reilly Media, ISBN-10: 1449311520.
- Karau, H., Konwinski, A., Wendell, P., Zaharia, M. (2015): Learning Spark: Lightning-fast big data analysis. O’Reilly Media. ISBN-10: 1449358624.
- Golab, L., Özsu, M.T. (2010): Data Stream Management. Morgan & Claypool Publishers, Synthesis Lectures on Data Management.
- Kleppmann, M. (2017): Designing Data-Intensive Applications. O’Reilly Media. ISBN-10: 1449373321.