Spark

[FR] Moving from EMR to Kubernetes for Spark workloads

Introduction: AWS EMR is a widely used AWS service, mainly for processing massive datasets with Apache Spark on a dedicated Hadoop cluster. Beyond its main function, EMR ships with a good number of open-source tools, some for monitoring (Ganglia) and others for querying data (Hive). More information can be found here. Depending on the context, EMR can be used either as an ephemeral cluster (for example, launching a cluster every 6 hours to run Spark jobs) or as a permanent cluster. The latter is notably the case when the cluster is shared by several teams, runs streaming jobs, or when waiting for it to start costs more than leaving it running permanently. This article is not necessarily meant to compare EMR with Kubernetes, since the two do not address the same needs. Kubernetes is increasingly establishing itself today for a variety of reasons, and Spark natively supports Kubernetes as a scheduler and resource manager, so it would have been a shame not to look into it. ...

[EN] Migrating from a plain Spark Application to ZparkIO

Migrating from a plain Spark Application to ZIO with ZparkIO. In this article, we’ll see how you can migrate your Spark Application to ZIO and ZparkIO, so you can benefit from all the wonderful features that ZIO offers and that we’ll be discussing. What is ZIO? According to the official documentation, ZIO is a library for asynchronous and concurrent programming that is based on pure functional programming. In other words, ZIO helps us write type-safe, composable and easily testable code, all while keeping it safe and free of side effects. ZIO is a data type. Its signature, ZIO[R, E, A], shows us that it has three parameters: ...

[EN] CI/CD pipeline using GitHub Actions, SBT and AWS S3 - Part 1

GitHub now allows us to build continuous integration and continuous deployment workflows for our GitHub repositories thanks to GitHub Actions, available on almost all GitHub plans. In this tutorial, we’re going to go through building a CI/CD pipeline based on a Scala / Spark project. We will be using SBT, the Scala Build Tool, to build a jar that we’re then going to deploy to AWS S3 using a custom GitHub Action. ...

Why combine asynchronous and distributed calculations to tackle the biggest data quality challenges

Article co-authored by Martin Delobel and available on Medium.

[EN] 10+ Great Books for Apache Spark

This article was co-authored by Matthew Rathbone. Image by Ed Robertson. Apache Spark is a super useful distributed processing framework that works well with Hadoop and YARN. Many industry users have reported it to be 100x faster than Hadoop MapReduce for certain memory-heavy tasks, and 10x faster while processing data on disk. While Spark has incredible power, it is not always easy to find good resources or books to learn more about it, so I thought I’d compile a list. I’ll keep this list up to date as new resources come out. ...