Table of Contents
As a follow-up to the Product Analyst Interview guide, here’s a number of interview questions and answers specifically for product analysts. The questions are divided into three parts: product-specific questions, SQL questions, and statistics questions. Enjoy!
What’s the first change that you would make to product X? / What is the first new feature that you would add to product X?
If the interviewer asks this question, he or she is looking to see that you’ve done your research. Do you understand what the product is and the product features? Do you know who the target audience of the product is? You should understand the problems that the product solves and speak about how a particular change or additional feature will enhance the solution or address a related problem. Know the customers, know the product, and know the company’s overarching objective.
What metrics would you use to evaluate the performance of a particular feature?
There are a number of metrics that you can use to evaluate the performance of a product feature — the specific metric that you choose ultimately depends on the objective of the feature and the objective of the company overall.
For example, mature companies may solely be focused on reducing churn rates, and thus, it should be the desired metric used to evaluate a feature. Sometimes, however, a product feature is not always aligned with the company’s main objective at the time. For example, a company’s main objective is to increase conversions, but they may have a feature that is focused on retention. Coursera is an example of this, as they implemented a feature that offers users a 50% discount before users attempt to cancel any subscription.
Overall, there are a number of metrics that you can consider, including but not limited to:
- Lifetime value (LTV)
- Customer Acquisition Cost (CAC)
- Monthly Active Users
- Session duration
- Retention rate
- Net Promoter Score (NPS)
Q: How do you create a histogram using SQL?
For example, if you wanted to create a histogram for ‘variable_name’ with bin sizes of 5 then you could use the following code.
SELECT floor(variable/5.00)*5 as variable_name, count(*) as count FROM table_name GROUP BY 1 ORDER BY 1
Q: What is the difference between the WHERE and HAVING clause?
Both WHERE and HAVING are used to filter a table to meet the conditions that you set. The difference between the two is shown when they are used in conjunction with the GROUP BY clause. The WHERE clause is used to filter rows before grouping (before the GROUP BY clause) and HAVING is used to filter rows after grouping.
Q: What are the different types of joins? Explain them each.
There are four different types of joins:
- Inner join: Returns records that have matching values in both tables
- Left join: Returns all records from the left table and the matched records from the right table
- Right join: Returns all records from the right table and the matched records from the left table
- Full join: Returns all records when there is a match in either left or right table
Q: Sample Problem: Duplicate Emails
Write a SQL query to find all duplicate emails in a table named
+----+---------+ | Id | Email | +----+---------+ | 1 | firstname.lastname@example.org | | 2 | email@example.com | | 3 | firstname.lastname@example.org | +----+---------+
SELECT Email FROM Person GROUP BY Email HAVING count(Email) > 1
Q: How do you assess the statistical significance of an insight?
You would perform hypothesis testing to determine statistical significance. First, you would state the null hypothesis and alternative hypothesis. Second, you would calculate the p-value, the probability of obtaining the observed results of a test assuming that the null hypothesis is true. Last, you would set the level of the significance (alpha) and if the p-value is less than the alpha, you would reject the null — in other words, the result is statistically significant.
Q: What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset.
An outlier is a data point that differs significantly from other observations.
Depending on the cause of the outlier, they can be bad from a machine learning perspective because they can worsen the accuracy of a model. If the outlier is caused by a measurement error, it’s important to remove them from the dataset. There are a couple of ways to identify outliers:
Z-score/standard deviations: if we know that 99.7% of data in a data set lie within three standard deviations, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points that are outside of this range. Likewise, we can calculate the z-score of a given point, and if it’s equal to +/- 3, then it’s an outlier.
Note: that there are a few contingencies that need to be considered when using this method; the data must be normally distributed, this is not applicable for small data sets, and the presence of too many outliers can throw off z-score.
Interquartile Range (IQR): IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. You can then identify if a point is an outlier if it is less than Q1–1.5*IRQ or greater than Q3 + 1.5*IQR. This comes to approximately 2.698 standard deviations.
Q: You’re about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it’s raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?
You can tell that this question is related to Bayesian theory because of the last statement which essentially follows the structure, “What is the probability A is true given B is true?” Therefore we need to know the probability of it raining in London on a given day. Let’s assume it’s 25%.
P(A) = probability of it raining = 25%
P(B) = probability of all 3 friends say that it’s raining
P(A|B) probability that it’s raining given they’re telling that it is raining
P(B|A) probability that all 3 friends say that it’s raining given it’s raining = (2/3)³ = 8/27
Step 1: Solve for P(B)
P(A|B) = P(B|A) * P(A) / P(B), can be rewritten as
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
P(B) = (2/3)³ * 0.25 + (1/3)³ * 0.75 = 0.25*8/27 + 0.75*1/27
Step 2: Solve for P(A|B)
P(A|B) = 0.25 * (8/27) / ( 0.25*8/27 + 0.75*1/27)
P(A|B) = 8 / (8 + 3) = 8/11
Therefore, if all three friends say that it’s raining, then there’s an 8/11 chance that it’s actually raining.
Q: Given two fair dices, what is the probability of getting scores that sum to 4? to 8?
There are 4 combinations of rolling a 4 (1+3, 3+1, 2+2):
P(rolling a 4) = 3/36 = 1/12
There are combinations of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4):
P(rolling an 8) = 5/36