Independent Big Data Engineer
Please tell us a bit about the work you do. How did you first get involved in this work?
From doing ETL on very large data warehouses modeled using Kimball’s methodology 10 years ago, to more recently working with large Cassandra clusters, to nowadays developing distributed pipelines on top of Apache Spark, I have worked with large amounts of data for my entire professional life. I was passionate about data in all shapes and sizes even before the term “big data” was coined.
What is the most challenging part of your work, and what is the most rewarding one?
Going from requirements to an architectural plan is always where most of the outside-the-box thinking happens. I find it very rewarding to see how other developers build extensions and plug-ins for a system I architected and engineered.
Do you have any advice for someone interested in pursuing the same career?
There are many freely available datasets out there. Find a dataset that interests you and ask yourself what insights you could extract from it by applying transformations, groupings, and aggregations. Then write single-threaded code to do that. Then imagine that you have a similar dataset, but a billion times bigger. Does it fit into the memory of the largest cloud server you can possibly rent? How much longer would your single-threaded code take to process the bigger dataset? Once you have answered these questions, start reading about the fallacies of distributed computing and the CAP theorem. Then look into how you might load your dataset into Cassandra or HDFS and parallelize your single-threaded code with Apache Spark.
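To make the exercise concrete, here is a minimal sketch of the single-threaded starting point: a grouping and aggregation over a tiny dataset, followed by a naive scaling estimate. The sample data, the column names, and the helper names `average_by_key` and `scaled_estimate` are all hypothetical, and the linear extrapolation is a deliberate simplification.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a real public dataset.
SAMPLE_CSV = """city,temperature
Bucharest,21.0
Cluj,18.5
Bucharest,23.0
Cluj,17.5
Iasi,20.0
"""


def average_by_key(rows, key, value):
    """Group rows by `key` and return the mean of `value` for each group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}


def scaled_estimate(size_bytes, seconds, factor):
    """Naive linear extrapolation of dataset size and single-threaded runtime.

    Assumes cost grows linearly with input size, which is optimistic.
    """
    return size_bytes * factor, seconds * factor


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
averages = average_by_key(rows, "city", "temperature")
print(averages)

# Suppose the real dataset were 1 GB and took 60 seconds to process.
big_size, big_seconds = scaled_estimate(10**9, 60, 10**9)
print(big_size, "bytes,", big_seconds, "seconds")
```

Under these assumptions, a 1 GB dataset that takes a minute to process becomes 10^18 bytes (an exabyte) and roughly 1,900 years of single-threaded work, which is the point where distributing the computation stops being optional.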
Can you give us a sneak peek into your GDG DevFest Romania talk? Why did you choose this subject?
Running Spark workloads on GCP is easy with Cloud Dataproc, a managed solution. Your Spark cluster spins up in just over a minute, and you start thinking differently about your ETL jobs. Have the input data available on GCS, spin up an ad-hoc cluster, run a specific job that reads input from that GCS bucket, and have Dataproc immediately shut the cluster down as soon as the job is finished.
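One way to picture this ad-hoc cluster lifecycle is as three `gcloud dataproc` commands: create, submit, delete. The sketch below only builds and prints those commands; it does not run them. The cluster name, region, bucket paths, and the helper `ephemeral_cluster_commands` are hypothetical, and the explicit delete step is an assumption about how the shutdown could be done by hand, not necessarily the mechanism used in the talk.

```python
import shlex


def ephemeral_cluster_commands(cluster, region, job_uri, input_uri):
    """Return the gcloud commands for a create -> submit -> delete lifecycle.

    All argument values are placeholders chosen by the caller.
    """
    create = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
    ]
    # Everything after the bare "--" is passed to the job as its own arguments.
    submit = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", job_uri,
        f"--cluster={cluster}", f"--region={region}",
        "--", input_uri,
    ]
    # --quiet skips the interactive confirmation prompt.
    delete = [
        "gcloud", "dataproc", "clusters", "delete", cluster,
        f"--region={region}", "--quiet",
    ]
    return [create, submit, delete]


commands = ephemeral_cluster_commands(
    cluster="adhoc-etl",                        # hypothetical cluster name
    region="europe-west1",                      # hypothetical region
    job_uri="gs://example-bucket/jobs/etl.py",  # hypothetical job script
    input_uri="gs://example-bucket/input/",     # hypothetical input location
)
for command in commands:
    print(shlex.join(command))
```

If you ran these for real, for example with `subprocess.run(command, check=True)`, it would be worth wrapping the submit step in `try`/`finally` so the delete command still runs when the job fails.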
In your opinion, what are the benefits of attending events like GDG DevFest Romania?
Networking, and getting to understand how other people think about the same technical challenges you are facing.