Where data engineering is concerned, agility is everything.
Generally speaking, data engineers are responsible for building and maintaining data systems and architecture, helping their organizations move, store and process data at scale — be it structured, semi-structured or unstructured. That work can involve evolving data stacks, parsing raw data and constructing pipelines for data scientists to use, not to mention scaling all this infrastructure as a company grows. But the range of responsibilities data engineers can take on is virtually as limitless as data itself.
Those who succeed in the field do so by embracing its constant sense of challenge and possibility. To thrive as a data engineer, one must master data literacy and be technically skilled, with a knowledge of coding languages and an acumen for developing and designing data models; yet it’s equally important to know how to apply all that knowledge in unique, practical, hands-on environments.
Just ask Richard Dederian, data engineer at Tempo Automation. An industry veteran who’s seen computing evolve dramatically across his lifetime, Dederian has consistently scaled his own expertise in data collection and insight even as innovations in AI, cloud-based storage and customer data platforms have revolutionized the industry time and time again.
At Tempo Automation, a San Francisco-based, software-accelerated electronics manufacturer, Dederian finds that his daily challenges are endless and interdisciplinary, though they all revolve around the data sets and statistics he’s been analyzing for his entire career.
“Data is always the heart of any computer system and business,” he said.
Below, Dederian and three other data engineers shed light on the inner workings of the essential roles they play at their respective companies and what drives them forward in a field split between traditional software engineering and data science.
Tempo Automation is an electronics manufacturer, wielding a differentiated smart manufacturing platform that uses software and data to accelerate the digital transformation of electronics prototyping and on-demand production.
What led you into a career as a data engineer?
As a young tinkerer in the late ‘80s and early ‘90s, I found computers fascinating, even if they were far simpler then in their mission of storing data. The choice was easy for me, as I continued to follow the trends of computers and their growing capabilities to process and store information. That ultimately led me to make information throughput and retrieval my career focus; data just keeps moving along, seemingly in proportion to Moore’s Law.
How does data engineering differ from more traditional software development, and what are the key technical skills that you use most often during your work day?
Data engineering is interdisciplinary, necessitating work with many elements, stakeholders and technologies; I think this evolutionary process keeps the field full of newer, faster and often sensational ideas and team-based collaboration. NoSQL, MapReduce and normal forms come to mind as technical topics, but most important is profiling, implementing and supporting the correct tools for a given job in an open and transparent forum. Everyone has ideas, and data engineers need to balance the technical with constructive data stewardship and attention to quality.
Critically, SQL and Python seem like obvious and ubiquitous choices, as does having a background in application development and researching good design patterns. Also important is being consistently supportive of the software development life cycle, as data is always the heart of any computer system and business.
Describe a project you’re working on right now.
My current project covers all of the bases in data engineering. On one side is attention to the small details of our parts data across our business, with data integration of all major systems of an automated factory. On the other side, we have concurrent initiatives to grow our entire data platform to continue our mission of learning from every order, and to provide that data to train our data science models in support of automation, quality and AI. It’s an exciting time to be a data engineer, data steward and data collaborator all-in-one here at Tempo Automation.
Lark Technologies offers a chronic disease prevention and management platform that uses a cognitive behavioral therapy framework, conversational AI and connected devices to help people stay healthy and in control of their conditions.
What led you into a career as a data engineer?
I have always been drawn to more challenging and more interesting problems. Around 25 years ago, I was working on the cutting edge of high-performance computing: Beowulf clusters, MOSIX, DIPC, etc. That led naturally into distributed systems and eventually into very large-scale and hyperscale systems. Such systems are inherently data-intensive, and so I found myself working with and building data tools and big data tools.
Once the job title of “data engineer” started to pop up, I realized that was what I had already been doing: building large-scale, data-intensive applications and making data ready for serving, analytics, machine learning and just in general ready to be used at large.
For me, data engineering is a combined expertise, building robust high-quality software as applied to data-intensive applications. And that is what I enjoy working on.
How does data engineering differ from more traditional software development?
I have always seen and continue to see data engineering as an engineering-first field. Data engineers must be strong software engineers. Data engineers must rapidly write high-quality software that’s easy to test, easy to extend, easy for other engineers to intuit and understand, and easy to live with.
What’s different is that every bug, for us, is a “data-affecting” bug. If we ship a bug to production, we can’t just ship the fix and call it a day. We will have corrupt data, and we have to address that. So software quality is key; data quality starts with software quality.
Another difference from other related disciplines is that functional programming concepts crop up all the time in data engineering. Repeatable, side-effect-free transformations can be run in parallel and, as a consequence, data engineers are often thinking functionally. We tend toward functional programming as a matter of course.
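To make that idea concrete, here is a minimal, hypothetical Python sketch of the style he describes: a pure, side-effect-free transformation applied in parallel. The record fields and function names are illustrative assumptions, not taken from Lark’s codebase.

from concurrent.futures import ProcessPoolExecutor

def normalize_record(record: dict) -> dict:
    # Pure function: the same input always yields the same output and no
    # shared state is touched, so calls can run in any order or in parallel.
    return {
        "user_id": record["user_id"],
        "email": record["email"].strip().lower(),
        "signup_date": record["signup_date"][:10],  # keep the ISO date portion
    }

def normalize_all(records: list[dict]) -> list[dict]:
    # Because normalize_record has no side effects, the degree of parallelism
    # never changes the result.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(normalize_record, records))

Because the transformation is deterministic, retries, reordering and repartitioning all leave the output unchanged, which is exactly the property that makes this style attractive for data pipelines.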
I think the most important skills for data engineers are that they’re excited about learning new technologies, that they deeply understand time — as well as data races and how to avoid them — and that they know how to navigate distributed systems. Being familiar with developing software for a production environment is also important, as is knowing software development best practices and how to apply them to data. You should know Apache Spark, Scala, Python and AWS.
Describe a project you’re working on right now.
Currently, we’re focused on a few key initiatives: building a fifth-generation data platform, building critical ETL/ELT components, building Lark’s first data warehouse, and modernizing our production ML infrastructure.
As our primary warehousing strategy, we’re leveraging Spark, Databricks and Delta.io in a so-called “lakehouse” architecture, utilizing a data lake as a warehouse. Around this, we’re developing capabilities for self-service but with guarantees of semantic meaning, availability, data governance, data security, data quality, change control, and so on.
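For illustration, here is a minimal PySpark sketch of the kind of Delta table write a lakehouse architecture is built on. The paths, table and column names are hypothetical assumptions rather than details of Lark’s platform; on Databricks, the SparkSession and Delta support are already configured.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Read raw events from the data lake, apply a light transformation, and append
# them to a governed Delta table that downstream consumers can query.
raw = spark.read.json("s3://example-bucket/raw/events/")  # hypothetical path
curated = (
    raw.withColumn("event_date", F.to_date("event_ts"))
       .dropDuplicates(["event_id"])
)
(curated.write
    .format("delta")
    .mode("append")
    .partitionBy("event_date")
    .saveAsTable("analytics.events"))  # hypothetical table name

Layering governance, quality checks and access controls around tables like this is what turns a raw data lake into the warehouse-like “lakehouse” described above.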
Additionally, utilizing the same infrastructure we use to manage our ODS and lakehouse, we’re providing data virtualization, data federation and platform services for a variety of streaming and batch applications.
What’s exciting about all of this is that it’s largely greenfield: replacing legacy systems and breaking new ground on things that have never existed here. We’re on the ground floor for a lot of the work we’re doing.
Curebase is a software startup reinventing decentralized clinical trials to help the vast majority of potential patients that currently cannot access clinical research.
What led you into a career as a data engineer?
I am currently an engineering manager at Curebase, which means my main focal point is the development of engineering team members. I am excited to grow in this space, because I consider this the area where I can most contribute to the entire organization. Having said that, I did work in previous companies as a data engineer, and I have always enjoyed back-end tasks, specifically working with data itself.
What leads you toward a career as a data engineer is, first and foremost, a passion or even obsession for getting the data right. This means developing skills across infrastructure, database logic, database modeling, mathematics and proper usage of types. Most importantly, it means understanding the nature of the problem you are trying to solve. Data drives tactical and strategic decision-making processes, and it’s also an important feature for any team at a company, not necessarily just a tech team. Everyone benefits from having the data they need be easily accessible and consumable, in order to support their goals and objectives.
How does data engineering differ from more traditional software development?
Essentially, the responsibilities of a data engineer differ from those of a more traditional software development role because of one keyword: data, and everything that comes with it, which is a great deal.
Some of the key skills that I use during my day-to-day work are as follows: analyzing and organizing raw data; building data systems and pipelines; evaluating business needs and objectives; interpreting trends and patterns; preparing data for prescriptive and predictive modeling; building algorithms and prototypes; developing analytical tools and programs; and collaborating with data scientists and architects on projects.
All of the projects a data engineer works on are, of course, related to data, so it is important to be familiar with where data is stored, how it is consumed, how it is presented, how to organize it from different sources (for example, by structuring data lakes), and how to make decisions and evaluate those made by others. These might not all be skills required on a daily basis, but they are important for developing strategy and working with managers to implement efficient processes.
Describe a project you’re working on right now.
Data projects are an essential component of all of our initiatives at Curebase. We work with sensitive data on a daily basis, related to participants involved in each clinical trial. In the coming months, we are excited to be launching clinical trials across different countries and regions in which various privacy laws apply. This presents unique challenges for many disciplines besides engineering, as you can imagine. But on the engineering side, we are working to ensure we have all the right localization resources and the correct infrastructure in place. This work involves utilizing many of the skills I mentioned previously, in addition to project-specific responsibilities. At the end of the day, to be a data engineer at Curebase means enabling patients to be part of clinical trials wherever they are.
GRAIL is a healthcare company utilizing the power of high-intensity sequencing, population-scale clinical studies, and state-of-the-art computer science and data science to enhance our scientific understanding of cancer biology, in an effort to detect it earlier.
What led you into a career as a data engineer?
After spending my undergraduate years studying applied mathematics and biology, I worked in a few different fields as a data scientist before coming to GRAIL and moving into data engineering. Although I love uncovering patterns in data, complex data analysis can’t even start to happen without clean, organized, easily accessible datasets. As I learned the importance of data infrastructure and curation, I realized that adequately managing large, disparate datasets is a difficult and common challenge in the biotech and tech industries. The tech world rewards collecting as much data as possible, but many companies struggle with planning ahead, scaling, reproducibility, data silos and documentation. As a data engineer, those are the problems I get to solve. I have dealt with those challenges across research and industry as a user of data systems. Now, as a data engineer, I get to advocate for the needs of data analysts and data scientists when shaping the systems they will use on a daily basis.
How does data engineering differ from more traditional software development?
Data engineering is a combination of data science and software development. You are “close” to the data in a way that most software engineers are not, but you also work with software engineers to design and create infrastructure in a way that data scientists do not. As an expert on your company’s data, you’re in the unique position to drive change where it is needed. As a data engineer, you need to be flexible and tool-agnostic, able to switch between different software and datasets with ease. I use software engineering skills like writing unit tests and designing and creating tools for myself and others to pull, combine and manipulate data. But I also use data science skills like identifying anomalies, cleaning and combining data from different sources, and curating datasets and results for different audiences.
Describe a project you’re working on right now.
As a data engineer at GRAIL, I work with data from many different projects and many different data sources on a daily basis. I am currently working on a project with external partners to transfer clinical, lab and results data back and forth smoothly. The main goals of this project are to support the operations of a clinical trial while upholding data quality and integrity. This requires technical skills, knowledge of available tools and people skills to synthesize the needs of multiple different parties and come up with the best workflow. It also pays off to think about how smaller collaborative projects like this fit into the larger data ecosystem at GRAIL, and how we can be consistent in data collection and storage within our own systems. There are many challenges in overcoming, or fitting in with, legacy software and data systems while also working with companies across different industries and keeping the big picture in mind to accommodate possible future needs. Those challenges make data engineering fun.