By Ivan Liao
What is a data science engineer?
The data science engineer sits at the center of the big data revolution, in which companies are scrambling to integrate machine learning, deep learning, artificial intelligence, and data-driven decision-making.
As a refresher, let's go over what a data engineer is. Data engineering is a software engineering approach to designing and developing information systems.
You can think of it as creating a digital road that data travels on. The industry term for this is the data pipeline or ETL pipeline. ETL stands for Extract, Transform, and Load. It's a fancy term for taking data from one place, transforming it or running some algorithm on it, and then putting the result somewhere else.
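To make the three ETL steps concrete, here is a minimal, illustrative sketch in Python. It is not tied to any particular company's stack; the record fields and table name are made up for the example, and SQLite stands in for a real destination warehouse.

```python
import sqlite3

# Extract: raw records, as they might arrive from a CSV export or intake form.
raw_rows = [
    {"name": "alice", "signups": "3"},
    {"name": "bob", "signups": "5"},
]

def transform(rows):
    """Transform: normalize names and cast string counts to integers."""
    return [(r["name"].title(), int(r["signups"])) for r in rows]

def load(rows, conn):
    """Load: write the transformed rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS signups (name TEXT, count INTEGER)")
    conn.executemany("INSERT INTO signups VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
load(transform(raw_rows), conn)
print(conn.execute("SELECT name, count FROM signups").fetchall())
# [('Alice', 3), ('Bob', 5)]
```

Real pipelines swap each stage for production tools (an S3 bucket for extraction, Spark for transformation, Redshift for loading), but the extract-transform-load shape stays the same.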
It may also be useful to know what a data scientist is. According to Wikipedia,
data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
To sum it up, a data science engineer is a specialized role that applies the responsibilities of the data engineer to the specific problems that data scientists solve. Because data science engineers are engineers, the emphasis of the job is on implementing solutions by creating digital infrastructure through the use of coding and graphical user interfaces (GUI).
What is a data science engineer career like?
What kind of companies would you work for? What opportunities are there as a data science engineer? What is the salary?
You will be working with companies at the forefront of technological advances.
- FAANG companies like Facebook, Apple, Amazon, Netflix, and Google.
- Popular sharing economy companies like Uber, Lyft, and Airbnb.
- Aggregation platforms like Youtube, Twitch, Spotify, and Valve.
- Social media companies like LinkedIn, Instagram, Snapchat, Twitter.
- Software as a Service (SaaS) companies like Salesforce, SAP, Square.
LinkedIn's 2020 Emerging Jobs Report shows Data Engineering as the 8th fastest-growing job; Data Scientist was the 3rd fastest-growing job in the same report. Industry insiders also talk about companies needing data engineers even more than data scientists.
As of February 3rd, there were 86 results for "Data Science Engineer" compared to 9,271 for "Data Scientist" and 9,251 for "Data Engineer" on LinkedIn in the United States.
We also highly recommend reading the actual job description because the role is a combination of data scientist and data engineer and your day-to-day responsibilities may be somewhat ambiguous. Do your due diligence and make sure you're applying for a job that fits your skill set.
One job, for instance, asks for the data scientist's ability to
develop custom data models, algorithms, and predictive models to perform multifaceted analysis
as well as the data engineer's ability to
ETL new data sources along with mining for insights.
It's often the case that companies will completely mislabel jobs as well. They may actually want a data analyst or statistician instead.
According to Glassdoor, the average yearly salary for data science engineers is $102,864, ranging from $72,000 to $158,000. The average salary has trended consistently upward in recent years.
Data Science Engineer Qualifications
What education, experience, and skills do you need to become a data science engineer?
Education and Experience
An entry level data science engineer job will require at least an undergraduate computer science or related engineering degree or 3+ years of relevant work experience.
We've made a list of the main skills data engineers need, each followed by a few examples of specific tools. Don't feel overwhelmed by the sheer number of tools. Often, data science engineers choose to focus on only a few of them and become experts at those.
For example, it's entirely possible to get a job as a data science engineer by simply knowing Python, Amazon S3, and Postgres. Many of these tools are also interchangeable.
- Soft skills - communication within the technical team, with the managerial department, and with clients
- Relational Databases - Oracle, PostgreSQL, MySQL
- Non-relational Databases - MongoDB, Google Firebase, Apache Cassandra
- Data Warehouses - Snowflake, Amazon Redshift, Google BigQuery
- Data Lakes - Amazon S3, Google Cloud Storage
- Distributed Computing Frameworks - Apache Spark, Apache Hadoop
- DevOps tools - Git, Docker, Kubernetes, Airflow, AWS Lambda, Jenkins, JIRA
Data science engineer day to day responsibilities
Let's follow Mary, a data science engineer at the United States Centers for Disease Control and Prevention (CDC). Mary works with a team of other data science engineers to set up an ETL pipeline. At the local level, it is standard for hospitals to report data using intake forms that go into Excel tables or .csv files.
Mary uses this knowledge to create a data pipeline where .csv files are stored on Amazon S3 cloud storage. An AWS Lambda function is written to automate loading and structuring the data into Amazon Redshift. Mary's team sets up the Redshift data warehouse in order to allow scientists and other health organizations to access, query, and analyze the nation's pandemic data.
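A sketch of what a Lambda function like Mary's might look like, using boto3 and the Redshift Data API. This is an illustration, not the CDC's actual code: the bucket, table, cluster, database, user, and IAM role names below are all hypothetical. The SQL-building helper is kept separate from the AWS calls so it can be understood (and tested) on its own.

```python
def build_copy_statement(table, bucket, key, iam_role):
    """Build the Redshift COPY statement that bulk-loads a CSV file from S3."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

def handler(event, context):
    """Triggered by an S3 upload event; loads the new .csv into Redshift."""
    import boto3  # only available (and needed) inside the AWS runtime

    record = event["Records"][0]["s3"]
    sql = build_copy_statement(
        table="hospital_intake",  # hypothetical destination table
        bucket=record["bucket"]["name"],
        key=record["object"]["key"],
        iam_role="arn:aws:iam::123456789012:role/redshift-loader",  # hypothetical
    )
    # The Redshift Data API runs the statement without a persistent connection.
    boto3.client("redshift-data").execute_statement(
        ClusterIdentifier="cdc-warehouse",  # hypothetical cluster
        Database="pandemic",                # hypothetical database
        DbUser="etl_user",                  # hypothetical user
        Sql=sql,
    )
```

Wiring the S3 bucket's object-created event to this handler is what makes the pipeline "automated": each new .csv lands in the warehouse with no manual step.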
In this case Mary had the following responsibilities:
- Pooling various sources of data into one data warehouse
- Communicating status of data engineering projects to management and other technical departments
- Documenting data infrastructure configurations
- Maintaining and updating existing data infrastructure
Other typical responsibilities a data science engineer may have include:
- Connecting the data pipeline between data sources and machine learning models or artificial intelligence programs
- Migrating data from an old location to a new location. Most often this is from company-owned servers to serverless, cloud-native services.
- Recommending cost-effective changes to existing data infrastructure
- Developing custom solutions for the needs of other technical departments
But what about a real life example of the data infrastructure that data science engineers build? Let's look at Airbnb's data infrastructure in the following section.
Airbnb: A Data Infrastructure Example
There are various elements in the Airbnb data infrastructure:
- MySQL, a relational database that serves as one of Airbnb's data sources
- Airflow, a DevOps workflow scheduling tool
- Tableau, a visualization and dashboard creation tool at the end of the ETL pipeline
- Hive clusters for data integrity and data manipulation
- Amazon S3, a data lake
Example data science engineer interview questions
The data science engineer role is a specialization of the data engineer role. The landscape is new, constantly shifting, and filled with challenges. It may sound a little daunting, but if you enjoy coding and working with large streams of data, the data science engineer role may just be for you. Check out our free interview questions from real interviews at companies like Amazon, Google, and Facebook to start preparing for your next interview.
Here are three of our recommended data science engineer interview questions:
'transactions' table

+------------+----------+
| column     | type     |
+------------+----------+
| id         | integer  |
| user_id    | integer  |
| created_at | datetime |
| product_id | integer  |
| quantity   | integer  |
+------------+----------+
We're given a table of product purchases. Each row in the table represents an individual user product purchase.
Write a query to get the number of customers that were upsold by purchasing additional products.
Note that if a customer purchased two things on the same day, that does not count as an upsell, since they were purchased within the same timeframe.
Here's a hint:
An upsell is determined by purchases on multiple days by the same user. Therefore we have to group by both the date field and user_id to break each transaction out by day and user.
SELECT
    user_id,
    DATE(created_at) AS date
FROM transactions
GROUP BY 1, 2
See if you can figure out this question using the SQL editor on Interview Query.
One Million Rides
Let's say we have 1 million app rider journey trips in the city of Seattle. We want to build a model to predict ETA after a rider makes a ride request.
How would we know if we have enough data to create an accurate enough model?
Here's a hint:
Are there any metrics that we could use to baseline how well a model developed off the data would actually perform?
Do you have it solved? Check out the solution on Interview Query.
Click Data Schema
How would you create a schema to represent client click data on the web?
Here's a hint:
These types of questions are more architecture based and are generally given to test experience within developing databases, setting up architectures, and in this case, representing client side tracking in the form of clicks.
What exactly does click data on the web mean? Any button click, scroll, or other interaction with the client interface, in this case desktop, would be represented in a schema for the end user to query. This does not include client views, however.
A simple but effective schema design is to first represent each action with a specific label, in this case assigning each click event a name or label describing its specific action.
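To make that labeling idea concrete, here is one possible schema, sketched with SQLite. The table and column names are illustrative choices, not a standard; a production version would likely add indexing, sessions, and device columns.

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# One row per client-side event; the label column names the specific action.
conn.execute("""
    CREATE TABLE click_events (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER,
        label      TEXT,   -- e.g. 'signup_button_click', 'page_scroll'
        page_url   TEXT,
        metadata   TEXT,   -- free-form JSON for event-specific details
        created_at REAL
    )
""")
conn.execute(
    "INSERT INTO click_events (user_id, label, page_url, metadata, created_at) "
    "VALUES (?, ?, ?, ?, ?)",
    (42, "signup_button_click", "/home",
     json.dumps({"x": 120, "y": 80}), time.time()),
)
# End users can then query events by user and action label.
rows = conn.execute(
    "SELECT label, page_url FROM click_events WHERE user_id = 42"
).fetchall()
print(rows)  # [('signup_button_click', '/home')]
```

Putting event-specific details in a JSON metadata column keeps the core schema stable as new event types are added, at the cost of making those details harder to query directly.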
See if you have what it takes to solve this problem on Interview Query.