Data engineers are called the backbone of all data science and engineering. Running data analytics, building predictive models, and developing software is how all modern businesses are built and scaled, but this isn’t possible without having good data to begin with.

Data engineers are responsible for ingesting data in a way that is reliable, timely, and accurate. In more technical terms, the responsibilities of data engineers are in developing interfaces and processes for the flow and access of data across a business.

There’s a cultural shorthand that’s applicable here. You’ve probably heard it: garbage in, garbage out. A data analyst or data scientist’s ability to derive value is limited by the quality of the data itself. The data engineer ensures the quality of that data. Since no company wants to generate insights based on garbage, data engineering is one of the fastest-growing jobs, with a 50% year-over-year growth in 2019.

Data Engineer Interview Questions

Skills for data engineers correlate with their responsibilities. Data engineers need to focus on these main topics for interview questions:

  • SQL Queries
  • Database Design and Data Modeling
  • ETL (Extract, Transfer, Load)
  • Data Structures and Algorithms
  • Behavioral

We analyzed 10K+ data engineering interview experiences and found out that SQL questions are the most frequent question asked on data engineering interviews.

Typical Data Engineer Interview Format

SQL Interview Questions

(NULL ALT)
Image by Do Exploit from Pixabay

SQL databases are the most commonly used tool that data engineers use to facilitate data transactions. It is the standard programming language for developing and managing relational database systems. Other tools such as Redshift, Talend, and Informatica are used to develop large distributed data storages (noSQL), cloud warehouses, or deploy data into data platforms.

SQL is a constant presence in technical interviews for the data engineer role. Being able to demonstrate your technical skills during this portion of the interview goes a long way towards landing the data science job you want.

Q1. Write a query that returns all of the neighborhoods that have 0 users.

users
column type
id int
name varchar
neighborhood_id int
created_at datetime

neighborhoods
column type
id int
name varchar
city_id int

Whenever the question asks about finding values with 0 something (users, employees, posts, etc..) immediately think of the concept of LEFT JOIN! An inner join finds any values that are in both tables, a left join keeps only the values in the first table.

Our predicament is to find all the neighborhoods without users. To do this we must do a left join from the neighborhoods table to the users table which will give us an output like this:

neighborhoods.name users.id
castro 123
castro 124
cole valley null
castro heights 534

How do we remove the NULL?

And what SQL query would we use to get our data so that we can display the neighborhood of Cole Valley as not having any users?

Answer this data engineer interview question on Interview Query.

Q2. Select the top 3 departments with at least ten employees and rank them according to the percentage of their employees making over 100K in salary.

employees
column type
id int
first_name varchar
last_name varchar
salary int
department_id int

departments
column type
id int
name varchar

Let's approach this problem by looking at the output of what the response would look like.

We know that we need to calculate the total number of employees that are making over $100K by each department. This means that we're going to have to run a GROUP BY on the department name since we want a new row for each department.

We also need a formula to represent how we can differentiate employees that make over $100K and those that make less. We can calculate that by formulating:

(Number of people making over $100K) / (Total number of people in that department)

Now in terms of implementation, if we first do a JOIN between the employees table and the departments table, then we can get all of the data points we need together. Then all that is left is:

  • a function to get all of the employees
  • a function to get all employees making over 100K
  • dividing those values by each other

See the full solution to this data engineer interview question on Interview Query.

Q3. Write a query that returns true or false whether or not each user has a subscription date range that overlaps with any other user. (Asked by Twitch)

subscriptions
column type
user_id int
start_date date
end_date date

Let's take a look at each of the conditions first and see how they could be triggered. Given two date ranges, what determines if the subscriptions would overlap?

If one range is neither completely after the other, nor completely before the other, then the two ranges must overlap.

De Morgan's law says that:

Not (A Or B) <=> Not A And Not B.

What is that equivalent to? And how could we model that out for a SQL query?

See the full solution to this data engineer interview question on Interview Query.

Data Modeling and Database Design Interview Questions

A man in the middle of whiteboarding database design
Photo by Campaign Creators on Unsplash

The data model is an important part of the data engineering pipeline. During the analysis and design phase, data models are built to make sure that the requirements of a new application are completely understood.

A data engineer should have knowledge of entity-relationship modeling, along with a clear understanding of normalization and denormalization tradeoffs. Moreover, they should be familiar with dimensional modeling and other related concepts.

Key concepts to know within data modeling:

  1. Many times the interviewer will present you with an application and have you build the schema for it.
  2. Subsequently, the interviewer will ask you to write a query to produce some metric for the business.
  3. Lastly, if the question is super hard, the interviewer will ask you to brainstorm new metrics that MATTER for the business itself.

Q4. Design a database to represent a Tinder-style dating app. What does the schema look like?

Let’s first approach this problem by understanding the scope of the dating app and what functionality we must design around.

If we were to list the key Tinder app capabilities, it would be something like:

  • Onboarding - User opens up, adds preferences, add pictures, and starts swiping on users.
  • Matching - If the user matches with another user, we notify them and create a messaging system.
  • Messaging - Users can message each other only if both have been matched. Users can also leave conversations at any time.

Next, we should figure out if we have specific feature goals that we have to account for within engineering and system design.

See the full solution to this data engineer interview question on Interview Query.

Q5. Let's say we want to run some data collection on the Golden Gate bridge.

What would the table schema look like if we wanted to track how long each car took coming into San Francisco to enter and exit the bridge?

In this case, we're given that we have to track time entered and exited leaving the bridge, but also the car make and model along with license plate information. We know that the car model to license plate information will be one-to-many, given that each license plate represents a single car, and a car model can be replicated many times.

Let's model the schema in this format then of each drive across the bridge as a distinct event. We can set an enter_time and exit_time for each car as they go through the bridge.

Q6. Write a query using the given tables to get the time of the fastest car on the current day.

crossings
column type
id integer
license_plate varchar
enter_time datetime
exit_time datetime
car_model_id integer

car_models
column type
id integer
model_name varchar

Since we only need the crossing time and the license plate number, we can write a query that only uses the crossings table.

Let's set conditions where the enter_time and exit_time are both set to the current date. Let's compute the fastest time as being the minimum difference between the exit_time and the enter_time.

Q7. Write a query on the given tables to get the car model with the fastest average times for the current day.

(Example: Let's say three Ferraris crossed the bridge in an average of one minute).

Let's compute the average fastest car model as the sum of all of the crossings by the specific car models divided by all of the crossings. We don't care about duplicates in this scenario given that total crossings should be the denominator and total crossing time as the numerator.

We can join the two tables together on the car_model_id field and then group by the model name. Since we only need the fastest car model we can order by our computation metric, then limit the table size to one.

See the full solution to this data engineer interview question on Interview Query.

Q8. How would you create a schema to represent client click data on the web?

These types of questions are more architecture-based and are generally given to test experience within developing databases, setting up architectures, and in this case, representing client-side tracking in the form of clicks.

What exactly does click data on the web mean? Any form of button clicks, scrolls, or action at all as an interaction with the client interface, in this case, desktop, would be somehow represented into a schema form for the end-user to query. This does not include client views, however.

A simple but effective design schema would be to first represent each action with a specific label. In this case, assigning each click event a name or label describing its specific action.

See the full solution to this data engineer interview question on Interview Query.

ETL (Extract, Transform, and Load) Data Engineering Questions

Data engineers and data scientists work hand in hand. Data engineers are responsible for developing ETL processes, analytical tools, and storage tools and software. Thus, expertise with existing ETL and BI solutions is a much-needed requirement.

ETL refers to how the data is taken (extraction) from a data source, converted (transformation) into a format that can be easily analyzed, and stored (loading) into a data warehouse. The ETL process then loads the converted data into a database or BI platform in order to be used and viewed by anyone in the organization.

This skill set becomes especially important when dealing with big data.

Q9. Let's say we want to create an aggregate table called lifetime_plays that records the song count by date for each user.

Write a SQL query that could make this ETL each day.

song_plays
column type
id integer
created_at datetime
user_id integer
song_id date

For this problem, we use the INSERT INTO keywords to add rows to the lifetime_plays table. If we set this query to run daily, it becomes a daily extract, transform, and load (ETL) process.

The rows we add are selected from the subquery that selects the created_at date, user_id, song_id, and count columns from the song_plays table for the current date.

See the full solution to this data engineer interview question on Interview Query.

Q10. Let's say you have a table with a billion rows. How would you add a column inserting data from the original source without affecting the user experience?

Before jumping into the question we should remember to clarify a few details that we can potentially get out of the interviewer. It helps to ask questions to understand and show that you can think holistically about the problem. Rushing too fast into a solution is a red flag for many interviewers.

Given the problem statement, let's clarify. What is the potential impact of downtime?

We always have to remember to get more context out of the question. In this scenario when we are talking about a table with a billion rows, this could be a table that is powering an entire company in which downtime would affect thousands of dollars in sales or could just be an offline analytics events table that would cause impact to internal employees.

It's crucial to then mention to the interviewer and assess, what are the potential effects of downtime for seconds, minutes, and hours. Figuring out the impact is pertinent then to determine our strategy going forward.

See the full solution to this data engineer interview question on Interview Query.

Data Structures and Algorithms

Data engineers focus mostly on data modeling and data architecture, but a basic knowledge of algorithms and data structure is also needed.

Of particular importance is the data engineer’s ability to develop inexpensive methods for the transfer of large amounts of data. If you’re responsible for a database with potentially millions (let alone billions) of records, it’s important to find the most efficient solution.

These questions are common for software engineers and similarly apply to data engineers as well.

Q11. Write a function that returns the missing number in the array. Complexity of O(N) required.

Example:

nums = [0, 1, 2, 4, 5]
missingNumber(nums) -> 3

There are two ways we can solve this problem. One way through logical iteration and another through mathematical formulation. We can look at both as they both hold O(N) complexity.

The first would be through general iteration through the array. We can pass in the array and create a set that will hold each value in the input array. Then we create a for loop that will span the range from 0 to n, and look to see if each number is in the set we just created. If it isn't, we return the missing number.

def missingNumber(nums):
    num_set = set(nums)
    n = len(nums) + 1
    for number in range(n):
        if number not in num_set:
            return number

See the full solution to this data engineer interview question on Interview Query.

Q12. Given a grid and a start and end, find what the max water height that you can traverse to before there exists no path. You can only go in horizontal and vertical directions.

Example

S 3 4 5 6
2 4 6 1 1
9 9 9 9 E

-> 3 since our best path is 3->4->9->9->9. If water level is 3, then there is no path past S.

Solution:
Recursive backtrack to the end while saving the max path water level on each function call. Track a visited cell set to trim the search space.
O(n^2)

Q13. Given a string, determine whether any permutation of it is a palindrome.

The brute force solution would be to try every permutation, and verify if it's a palindrome. If we find one, then return true, otherwise return false.

The number of possible permutations is on the order of N!, so computing all of them and checking for palindromes would be expensive. Instead, notice that in a palindrome each character must be matched by the other side, except at the middle. That means that if we count up all the characters in our input string, at most one can have an odd count. That means that if we have more than one letter with an odd count, we can discount the string as not being a palindrome.

See the full solution to this data engineer interview question on Interview Query.

Q14. Given a stream of numbers, select a random number from the stream, with O(1) space in selection.

A function that is O(1) means it does not grow with the size of the input data.

That means, for this problem, the function must loop through the stream, inputting 2 entries at a time and choosing between the two of them with a random method.

The input data for the function should be the current entry in the stream, the subsequent entry in the stream, and the count (i.e. total amount of entries cycled through thus far).

What happens if the count is at 1?

See the full solution to this data engineer interview question on Interview Query.

Data Engineering Behavioral Interview Questions

As with any type of job interview, hiring managers will also typically want to assess whether you are a good culture fit, work well with others, and can deal productively with things like failure. They may also want to assess your relevant past experience and quiz you on your familiarity with various tools and architectures. Here are some of the behavioral data engineer interview questions you can expect on your interview:

Q15. What Big Data tools do you use and for what tasks?

Q16. What is your experience using cloud technologies?

Q17. Share some details or insights about projects where you've used huge datasets. What was your approach in solving the data problem?

Q18. When would you use Java instead of Python?

Q19. How would you design a data warehouse given X criteria?

Q20. How would you design a data pipeline?

Q21. What is the architecture of Spark?

Q22. What is a challenge you've faced handling big data for a company?

Q23. Tell me about a situation where you dealt with a technology that you weren't familiar with.  

Data Engineering Roles and Responsibilities

Regardless of the domain they’re working in (finance, social media, etc.), data engineers have common responsibilities that combine the knowledge and skills of information technology, engineering, and databases.

  • Data Pipeline Testing/ Maintenance: Data engineers test the performance and efficiency of each part of a system or model during the initial phase. Moreover, they collaborate with the testing team to make sure everything is running smoothly.
  • Designing Data Platform Architecture: At its core, data engineers have to design the architecture of a data platform, from the implementation of a fact table to the various dimension tables that inform it.
  • Provide Data-Access Tools: In the majority of cases, data-access tools are not needed, as data scientists can use data lakes to fetch or extract data from storage. But if the enterprise requires business intelligence (BI) for analysts and other users, data engineers set up the tools to create visuals, observe data, and generate reports.
  • Development of Data Instances: Data engineers use their coding skills to build, customize, and overlook integration tools, warehouses, databases, and analytical systems.
  • Handle Data and Metadata: Data is stored in a warehouse in a structured or unstructured manner. However, storage systems may also contain meta-data. A data engineer is responsible for managing all the data stored in the warehouse. Moreover, they have to structure it precisely through database management systems (DBMS).
  • Monitoring the Stability of Pipelines: Managing the entire performance, efficiency, and stability of the system is essential while the warehouse is being cleaned after equal time intervals. The automated parts of the system also need to be tracked and upgraded according to the data and requirements.

The responsibilities of a data engineer differ according to their team size, platform size, the complexity of the projects, and seniority level. In some enterprises, the roles of a data engineer and data scientist may be much more detailed and granular.

Conclusion

Over the last few years, the demand for the role of a data engineer has been on the rise exponentially. Companies of all sizes are actively searching for data engineers to address their data challenges. Everyone is looking to hire data engineers and the demand is far from being flooded like other fields. The skillset is important and those who gain some experience will have the opportunity to earn big. Even if you don’t have the right amount of experience, you can do a relevant certification to land yourself the perfect job!

Thanks for Reading

If you're currently in the process of interviewing for a job in the data science field, or just looking to brush up on your DS skills, consider signing up for a free account on Interview Query for access to our question bank filled with real interview questions from companies like Google, Facebook, Amazon, and more. Our questions are sorted by question type, position, and difficulty to make sure that you're practicing exactly what you need to practice to level up your skills and ace your next interview.