Detailed Data Science interview questions covering statistics, SQL, Python, data cleaning, visualization, experiments, metrics, machine learning basics, and business analysis.
Data Science is the practice of using data, statistics, programming, domain knowledge, and machine learning to discover insights and support decisions. A data scientist does not only build models; they define the problem, collect and clean data, analyze patterns, communicate findings, and help teams act on evidence. For example, an e-commerce team may use data science to understand why checkout conversion dropped and which customer segments are affected.
Data Analytics focuses on answering business questions from data, such as what happened and why. Data Science is broader and includes analytics, experimentation, predictive modeling, machine learning, and decision support. Machine Learning is a set of techniques, used within Data Science, that train algorithms to learn patterns from data. For example, monthly sales reporting is analytics, churn prediction is machine learning, and deciding how to reduce churn using analysis plus modeling is data science.
A typical project starts with problem framing, then data collection, data validation, exploratory data analysis, cleaning, feature engineering, modeling or statistical analysis, evaluation, communication, deployment if needed, and monitoring. In interviews, explain that the workflow is iterative: findings from EDA or error analysis often send you back to improve data quality, metric definitions, features, or assumptions.
Exploratory Data Analysis, or EDA, is the process of understanding a dataset before formal modeling or decision-making. It includes checking shape, missing values, distributions, outliers, correlations, target balance, time patterns, and segment-level behavior. EDA helps catch data quality issues early and often reveals the simplest useful business insight before any complex model is needed.
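import pandas as pd
df = pd.read_csv("customers.csv")
print(df.shape)  # (rows, columns)
df.info()  # prints dtypes and non-null counts itself and returns None
print(df.describe())  # summarizes numeric columns by default
print(df.isna().mean().sort_values(ascending=False).head())  # columns with the highest missing rate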
Missing values can be handled by deletion, simple imputation, model-based imputation, adding missing indicators, or treating missing as its own category. The best method depends on why the value is missing. If income is missing because a user skipped the field, that missingness may be meaningful. Always check missingness by segment, and fit imputation rules on the training split only: computing them on the full dataset before the train-test split leaks test information into training.
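# Numeric column: fill with the median (for modeling, compute it on the training split only).
df["age"] = df["age"].fillna(df["age"].median())
# Categorical column: treat missing as its own category.
df["city"] = df["city"].fillna("Unknown")
# Missing indicator: keeps the information that income was not provided.
df["income_missing"] = df["income"].isna().astype(int)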
Outliers are observations that are unusually far from the rest of the data. They may be valid rare events, data entry errors, fraud signals, or measurement issues. Treatment depends on context. You may investigate, cap values, transform skewed features, remove clear errors, use robust statistics, or keep them if they represent important business behavior.
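One common way to flag and cap outliers is the 1.5 × IQR rule. The sketch below is only an illustration: the `amounts` values are made up, and the right threshold and treatment depend on the business context described above.

```python
import pandas as pd

# Hypothetical order amounts; 5000 is far from the rest of the data.
amounts = pd.Series([20, 22, 25, 27, 30, 31, 35, 5000], name="amount")

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (amounts < lower) | (amounts > upper)

# Capping keeps the row but limits its influence on means and models.
capped = amounts.clip(lower=lower, upper=upper)

print(amounts[is_outlier])
print(capped.max())
```

Flagging first and capping second keeps the investigation step separate from the treatment step, so valid rare events are not silently altered.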
Data cleaning fixes or standardizes issues that make analysis unreliable. It includes handling missing values, duplicate rows, inconsistent categories, invalid dates, impossible values, mixed units, whitespace, incorrect types, and broken joins. For example, the values "NY", "New York", and "newyork" may need to be normalized into one category before analysis.
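The category example above could be handled roughly as follows. This is a minimal sketch: the `df_city` frame and the `CITY_MAP` lookup are hypothetical, and a real mapping would come from a reviewed reference list.

```python
import pandas as pd

df_city = pd.DataFrame({"city": ["NY", " New York", "newyork", "Boston", "boston "]})

# Hypothetical mapping from cleaned spellings to one canonical label.
CITY_MAP = {"ny": "New York", "newyork": "New York", "boston": "Boston"}

# Lowercase and drop all whitespace so spelling variants collapse to one key.
key = df_city["city"].str.lower().str.replace(r"\s+", "", regex=True)

# Keep the stripped original when no mapping exists, so unknown values stay visible.
df_city["city_clean"] = key.map(CITY_MAP).fillna(df_city["city"].str.strip())

print(df_city["city_clean"].value_counts())
```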
Data profiling is a systematic summary of data quality and structure. It checks row counts, column types, null percentages, uniqueness, ranges, duplicate records, invalid values, category frequencies, and distribution changes. Profiling is useful before EDA, before pipeline deployment, and after upstream data changes.
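A small profiling helper might look like the sketch below. The `profile` function and the `sample` frame are illustrative assumptions, not a standard API; dedicated profiling tools report far more.

```python
import pandas as pd

def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with basic data-quality facts."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "null_pct": frame.isna().mean().round(3),
        "n_unique": frame.nunique(dropna=True),
    })

# Hypothetical data with one duplicate row and one missing plan.
sample = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "plan": ["free", "pro", "pro", None],
})

report = profile(sample)
print(report)
print("duplicate rows:", sample.duplicated().sum())
```

Running the same helper before and after an upstream change makes shifts in null rates or cardinality easy to spot.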
Sampling bias occurs when collected data does not represent the population you want to understand. For example, analyzing only active app users may ignore churned users, leading to overly positive conclusions. Sampling bias can produce wrong metrics and unfair or ineffective models. A strong answer should mention how the sample was collected and which groups may be missing.
Descriptive statistics summarize the main properties of data. Common measures include mean, median, mode, variance, standard deviation, percentiles, minimum, maximum, and frequency counts. They help quickly understand scale, spread, skewness, and unusual values. For skewed data such as income, median and percentiles are often more useful than mean.
summary = df.groupby("plan")["monthly_spend"].agg(
    count="count",
    mean="mean",
    median="median",
    p90=lambda x: x.quantile(0.90),
)
print(summary)
Inferential statistics uses sample data to make conclusions about a larger population. It includes confidence intervals, hypothesis tests, regression inference, and estimation. For example, instead of saying a sample conversion rate is 8 percent, inferential statistics helps estimate the likely population conversion rate and the uncertainty around it.
Hypothesis testing evaluates whether observed data provides enough evidence against a default assumption called the null hypothesis. For example, the null hypothesis may be that a new landing page has the same conversion rate as the old one. A test calculates how surprising the observed difference is if the null hypothesis were true.
A p-value is the probability of observing results at least as extreme as your sample result, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true. A small p-value suggests the observed result would be unlikely under the null, but practical importance still depends on effect size and business impact.
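For the landing page example, a two-sided two-proportion z-test is one common choice. The sketch below uses only the standard library; the conversion counts are hypothetical, and the normal approximation assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if the null is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a |z| at least this large under the null hypothesis.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 400/5000 on the old page, 460/5000 on the new page.
z, p_value = two_proportion_z_test(400, 5000, 460, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The p-value here answers only how surprising the gap is under the null; whether a lift of this size matters is a separate business question.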
A confidence interval gives a range of plausible values for a population parameter. For example, a 95 percent confidence interval for conversion lift may be 1.2 percent to 3.8 percent. This communicates uncertainty better than a single estimate. If the interval is wide, the sample may be too small or the data too noisy for a confident decision.
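To see how sample size drives interval width, here is a minimal sketch of a normal-approximation (Wald) interval for a single conversion rate. The counts are hypothetical, and other interval methods, such as Wilson, behave better for small samples or extreme rates.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Same 8 percent sample rate, two hypothetical sample sizes.
small = proportion_ci(80, 1_000)
large = proportion_ci(8_000, 100_000)
print(small)
print(large)
```

Both samples show the same point estimate, but the larger one gives a much narrower interval.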
Statistical power is the probability that a test detects a real effect when one exists. Low power increases the chance of missing meaningful changes. Power depends on sample size, effect size, variance, and significance level. In A/B testing, power analysis helps decide how long an experiment should run before interpreting the result.
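A rough power calculation for a two-group conversion test can be sketched with the standard normal-approximation formula. The baseline and target rates below are assumptions for illustration; dedicated power-analysis tools handle more designs.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_target - p_base) ** 2)

# Hypothetical goal: detect a lift from 8.0% to 9.0% conversion.
n_one_point = sample_size_per_group(0.08, 0.09)
# Halving the detectable effect needs far more users.
n_half_point = sample_size_per_group(0.08, 0.085)
print(n_one_point, n_half_point)
```

Dividing the required sample by expected daily traffic per group gives a first estimate of how long the experiment needs to run.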
A/B testing compares two or more versions of a product, page, message, or model by randomly assigning users to groups and measuring outcomes. For example, a company may test whether a new checkout page increases purchases. A good A/B test has random assignment, clear metrics, guardrail metrics, enough sample size, and a pre-defined decision rule.
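Random assignment is often implemented as deterministic hash-based bucketing, so a returning user always sees the same variant. The sketch below shows that approach as one possible implementation; the `assign_variant` helper, experiment name, and user IDs are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")):
    """Deterministically map a user to a variant using a salted hash."""
    # Salting with the experiment name keeps assignments independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "checkout_v2")] += 1
print(counts)
```

Checking that the observed split is close to the intended ratio is a useful sanity check before reading any outcome metric.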
Correlation means two variables move together. Causation means one variable directly influences another. For example, ice cream sales and drowning incidents may both increase in summer, but ice cream does not cause drowning. Temperature is a confounding variable. To support causation, use experiments, natural experiments, causal methods, or strong domain reasoning.
Feature engineering creates or transforms variables so patterns become easier to analyze or model. Examples include extracting weekday from a timestamp, calculating customer tenure, creating average order value, counting failed login attempts, or aggregating transactions over the last 30 days. Good features reflect domain knowledge and avoid using future information.
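Several of these features can be sketched in pandas. The table, column names, and `as_of` cutoff below are assumptions for illustration; the cutoff is what keeps future information out of the features.

```python
import pandas as pd

# Hypothetical transaction log and signup dates.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2024-01-05", "2024-02-20", "2024-03-01", "2024-02-25"]),
    "amount": [50.0, 30.0, 20.0, 100.0],
})
signup = pd.Series(pd.to_datetime(["2023-12-01", "2024-02-01"]), index=[1, 2])

# Prediction date: only rows strictly before this moment may be used.
as_of = pd.Timestamp("2024-03-02")
history = tx[tx["ts"] < as_of]
recent = history[history["ts"] >= as_of - pd.Timedelta(days=30)]

features = pd.DataFrame({
    "tenure_days": (as_of - signup).dt.days,
    "avg_order_value": history.groupby("customer_id")["amount"].mean(),
    "spend_30d": recent.groupby("customer_id")["amount"].sum(),
}).fillna({"spend_30d": 0.0})

# Calendar feature extracted from the timestamp.
tx["weekday"] = tx["ts"].dt.day_name()
print(features)
```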
Data leakage happens when analysis or model training uses information that would not be available at decision time. For example, predicting churn using a cancellation_reason field leaks future information because the reason is known only after churn. Leakage leads to overly optimistic results and poor production performance.
Safe joins require checking key uniqueness, row counts before and after the join, duplicate keys, missing matches, and whether the join type matches the business question. A bad join can multiply rows and inflate metrics. For example, joining customers to orders without aggregating orders first can duplicate customer-level fields.
SELECT
    c.customer_id,
    c.segment,
    COUNT(o.order_id) AS order_count,
    COALESCE(SUM(o.amount), 0) AS total_spend
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.segment;
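The same join checks can be sketched in pandas. The two small frames are hypothetical; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "segment": ["a", "b", "a"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 15.0, 7.0]})

# The key must be unique on the side that should have one row per customer.
assert customers["customer_id"].is_unique

# Aggregate orders to one row per customer before joining.
order_totals = orders.groupby("customer_id", as_index=False).agg(
    order_count=("amount", "size"),
    total_spend=("amount", "sum"),
)

# validate raises MergeError if either side has duplicate keys;
# indicator adds a _merge column showing which rows found a match.
joined = customers.merge(
    order_totals, on="customer_id", how="left",
    validate="one_to_one", indicator=True,
)

assert len(joined) == len(customers)  # the join must not multiply rows
print(joined)
```

Rows marked `left_only` in `_merge` are customers without orders, which is the missing-match check in code form.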
Important SQL concepts include filtering, aggregation, joins, subqueries, common table expressions, window functions, date functions, ranking, conditional aggregation, and handling nulls. Data scientists use SQL to extract clean datasets, build metrics, analyze funnels, create cohorts, and validate dashboards.
Window functions calculate values across related rows without collapsing rows like GROUP BY does. They are useful for ranking, running totals, moving averages, session analysis, retention analysis, and deduplication. Common functions include ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, and AVG OVER.
SELECT
    user_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY user_id
        ORDER BY order_date
    ) AS running_spend
FROM orders;
Pandas is used for loading, cleaning, transforming, grouping, joining, and summarizing tabular data. Common operations include filtering rows, selecting columns, handling missing values, grouping by categories, merging datasets, pivoting, and calculating new features. In interviews, show that you understand both syntax and data quality checks.
active_users = (
    df[df["status"] == "active"]
    .groupby("plan", as_index=False)
    .agg(users=("user_id", "nunique"), avg_spend=("spend", "mean"))
    .sort_values("avg_spend", ascending=False)
)
print(active_users)
Duplicate detection depends on the definition of uniqueness. Sometimes duplicates are exact row matches. Other times they are repeated customer IDs, emails, transaction IDs, or combinations such as user_id plus event_time. After detecting duplicates, decide whether to remove them, aggregate them, keep the latest record, or investigate the source pipeline.
duplicates = df[df.duplicated(subset=["email"], keep=False)]
latest = df.sort_values("updated_at").drop_duplicates("email", keep="last")
A metric is a quantified measure used to track performance or behavior. Metric definition matters because different definitions can lead to different decisions. For example, active users may mean users who logged in, users who performed a core action, or users who spent at least five minutes. A good metric has a clear formula, owner, data source, refresh cadence, and known limitations.
A North Star metric is the primary metric that represents long-term customer value and business success. For a streaming platform it might be weekly engaged viewers. For a marketplace it might be successful transactions. It should guide teams without replacing supporting metrics, because one metric alone rarely captures product health.
Leading indicators move before the final business outcome and can help teams act early. Lagging indicators measure results after the fact. For example, trial activation may be a leading indicator for paid conversion, while quarterly revenue is a lagging indicator. Good dashboards often include both.
Cohort analysis groups users by a shared starting point or behavior and tracks them over time. For example, users who signed up in January can be compared with users who signed up in February to understand retention. It is useful for measuring product changes, onboarding quality, churn, repeat purchases, and lifecycle behavior.
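A monthly retention table can be sketched in pandas as follows. The `events` log is hypothetical, and the cohort is defined here as the month of a user's first recorded activity, which is one of several reasonable definitions.

```python
import pandas as pd

# Hypothetical activity log: one row per user per active day.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2024-01-10", "2024-02-03", "2024-03-15",
        "2024-01-20", "2024-02-11",
        "2024-02-05", "2024-03-01",
    ]),
})

# Cohort = the month of each user's first activity.
first_seen = events.groupby("user_id")["date"].transform("min")
events["cohort"] = first_seen.dt.strftime("%Y-%m")
# Whole calendar months elapsed since the cohort month.
events["period"] = (
    (events["date"].dt.year - first_seen.dt.year) * 12
    + (events["date"].dt.month - first_seen.dt.month)
)

active = events.pivot_table(
    index="cohort", columns="period", values="user_id", aggfunc="nunique"
)
# Divide each row by its period-0 size to get retention rates.
retention = active.div(active[0], axis=0)
print(retention)
```

Missing cells in later periods mean the cohort has not been observed that long yet, not that retention is zero.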
Funnel analysis measures how users move through a sequence of steps, such as visit, signup, add to cart, checkout, and purchase. It helps identify where users drop off. A strong answer should mention step definitions, ordering rules, time windows, segmentation, and whether users can repeat or skip steps.
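A simplified funnel calculation might look like the sketch below. It assumes a hypothetical `log` with one row per user per step reached, and it deliberately ignores ordering rules and time windows, which a production funnel would need to enforce.

```python
import pandas as pd

STEPS = ["visit", "signup", "add_to_cart", "checkout", "purchase"]

# Hypothetical event log: one row per user per step reached.
log = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    "step": ["visit", "signup", "add_to_cart", "checkout", "purchase",
             "visit", "signup", "add_to_cart",
             "visit", "signup",
             "visit"],
})

# Count distinct users per step, in funnel order.
users = log.groupby("step")["user_id"].nunique().reindex(STEPS, fill_value=0)

funnel = pd.DataFrame({
    "users": users,
    "step_conversion": users / users.shift(1),    # share of the previous step
    "overall_conversion": users / users.iloc[0],  # share of the first step
})
print(funnel)
```

The step with the lowest `step_conversion` is the largest drop-off and usually the first place to segment further.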
Churn analysis studies why customers stop using a product or service. It can include churn rate calculation, segmentation, survival analysis, feature usage patterns, customer feedback, and churn prediction models. The first challenge is defining churn correctly: for a subscription product it may be cancellation, while for a marketplace it may be inactivity for a certain number of days.
Segmentation divides users, customers, products, or events into meaningful groups. Segments can be rule-based, such as new versus returning users, or model-based, such as clusters from behavior data. Segmentation helps teams understand different behaviors and avoid average metrics hiding important differences.
Choose charts based on the message. Use line charts for trends over time, bar charts for category comparisons, histograms for distributions, scatter plots for relationships, box plots for spread and outliers, and heatmaps for matrix-style patterns. Avoid decorative charts that make values hard to compare. The goal is clarity, not visual complexity.
A good dashboard has a clear audience, clear metric definitions, useful filters, readable charts, visible time periods, and a layout that supports scanning. It should show key metrics first, then diagnostic breakdowns. Avoid overcrowding, unclear colors, vanity metrics, and charts without actionable context.
Data storytelling is communicating insights through a clear narrative supported by evidence. It connects the business question, data, analysis, finding, impact, and recommendation. Instead of saying "conversion is down," a strong story says which segment changed, when it changed, how large the impact is, what likely caused it, and what action should be taken.
Time series analysis studies data ordered by time, such as daily sales, hourly traffic, or monthly revenue. Important concepts include trend, seasonality, autocorrelation, stationarity, lag features, moving averages, and forecast error. Time-series validation should respect chronological order instead of randomly splitting rows.
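Lag features and a chronological split can be sketched as follows. The `sales` series is made up; the key point is shifting before rolling so each feature uses only past values.

```python
import pandas as pd

# Hypothetical daily sales series.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "units": [10, 12, 11, 13, 15, 14, 16, 18, 17, 19],
})

# Shift first so features at day t only use values from before day t.
sales["lag_1"] = sales["units"].shift(1)
sales["ma_3"] = sales["units"].shift(1).rolling(3).mean()

# Chronological split: the last 20% of rows form the test set.
cutoff = int(len(sales) * 0.8)
train, test = sales.iloc[:cutoff], sales.iloc[cutoff:]

# Every training date must come before every test date.
assert train["date"].max() < test["date"].min()
print(train.tail(2))
print(test)
```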
Linear regression predicts a continuous outcome and helps estimate relationships between variables. It can be used for forecasting, baseline modeling, and explanation. For example, a data scientist may use regression to estimate how marketing spend, price, and seasonality relate to revenue. Check assumptions, residuals, multicollinearity, and whether the relationship is reasonably linear.
Logistic regression is used for classification, especially binary classification. It estimates the probability of an event such as churn, fraud, conversion, or loan default. It is popular because it is fast, interpretable, and a strong baseline. Coefficients can help explain which features increase or decrease the odds of the outcome.
Regression predicts continuous values, such as revenue or delivery time. Classification predicts categories, such as churned or not churned. The target type determines the algorithm and metric. Regression may use MAE, RMSE, or R-squared. Classification may use accuracy, precision, recall, F1, ROC AUC, or PR AUC.
Classification evaluation depends on business cost. Accuracy works only when classes are balanced and error costs are similar. Precision matters when false positives are costly. Recall matters when false negatives are costly. F1 balances precision and recall. ROC AUC and PR AUC evaluate ranking quality across thresholds.
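from sklearn.metrics import classification_report, roc_auc_score
# Assumes `model` is an already fitted binary classifier and X_test, y_test are a held-out split.
y_pred = model.predict(X_test)  # hard class labels at the default threshold
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))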
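To see why accuracy can mislead on imbalanced classes, the sketch below computes the metrics by hand from confusion-matrix counts. The labels are invented for illustration.

```python
# Hypothetical imbalanced labels: 5 positives out of 100.
y_true = [1] * 5 + [0] * 95
# A model that finds 3 of the 5 positives and raises 2 false alarms.
y_hat = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hat))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Always predicting the majority class scores well on accuracy but finds no positives.
baseline_accuracy = y_true.count(0) / len(y_true)

print(accuracy, precision, recall, f1, baseline_accuracy)
```

The model and the do-nothing baseline differ by one point of accuracy, while precision and recall show that the model still misses two of five positives.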
Regression models are evaluated using metrics such as MAE, MSE, RMSE, R-squared, and MAPE. MAE is easy to explain because it is the average absolute error in the target's own units. RMSE penalizes larger errors more heavily. R-squared is the proportion of variance in the target that the model explains, but it can be misleading if used alone. MAPE is undefined when an actual value is zero and unstable when actuals are close to zero. Always compare against a simple baseline.
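A baseline comparison can be sketched with the standard library alone. The delivery times are hypothetical, and the baseline here predicts the mean of the same values purely for illustration; in practice the baseline would be computed from training data.

```python
from math import sqrt
from statistics import mean

# Hypothetical delivery times in minutes.
actual = [30, 45, 28, 60, 37]
predicted = [32, 40, 30, 50, 36]

def mae(y, y_hat):
    """Mean absolute error."""
    return mean(abs(a - b) for a, b in zip(y, y_hat))

def rmse(y, y_hat):
    """Root mean squared error; large misses count more than in MAE."""
    return sqrt(mean((a - b) ** 2 for a, b in zip(y, y_hat)))

# Naive baseline: always predict the average.
baseline = [mean(actual)] * len(actual)

print("model    MAE:", mae(actual, predicted), "RMSE:", rmse(actual, predicted))
print("baseline MAE:", mae(actual, baseline), "RMSE:", rmse(actual, baseline))
```

The single 10-minute miss pulls RMSE above MAE, which is the extra penalty on large errors in action.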
Train-test split separates data used for model training from data used for final evaluation. The test set should represent unseen data. For time-based problems, split chronologically. For classification, stratify if class balance matters. Do not tune hyperparameters or make repeated modeling decisions against the final test set: once it influences your choices, it no longer gives an unbiased estimate of performance. Use a validation set or cross-validation for tuning instead.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)
Cross-validation evaluates a model over multiple train-validation splits. In k-fold cross-validation, the data is divided into k folds, and each fold is used once for validation. This gives a more stable estimate than a single split, especially with smaller datasets. For time series, use time-aware validation instead of random folds.
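Both ideas can be sketched with scikit-learn. The synthetic data from `make_classification` stands in for a real dataset, and logistic regression is just a convenient example model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

# Synthetic data stands in for a real feature matrix and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# 5-fold cross-validation: each fold is used once for validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="roc_auc"
)
print("fold AUCs:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))

# Time-aware validation: training rows always precede validation rows.
splits = list(TimeSeriesSplit(n_splits=3).split(np.arange(8).reshape(-1, 1)))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Reporting the spread across folds alongside the mean shows how stable the estimate is.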
Overfitting happens when a model learns noise or accidental patterns from training data and performs poorly on new data. Symptoms include high training performance and low validation performance. Common fixes include simpler models, regularization, more data, feature selection, pruning, dropout, or early stopping. Cross-validation does not fix overfitting by itself, but it helps detect it and supports choosing a model that generalizes.
Underfitting happens when a model is too simple to capture real patterns. It performs poorly on both training and validation data. Causes include weak features, excessive regularization, insufficient training, or using a model that cannot represent the relationship. Fixes include better features, more flexible models, and reducing unnecessary constraints.
Bias is error from overly simple assumptions. Variance is error from sensitivity to training data noise. High-bias models underfit, while high-variance models overfit. The goal is to choose a model that captures real structure while generalizing well. Ensemble methods, regularization, and cross-validation help manage this tradeoff.
Feature scaling puts numeric features on comparable scales. It is important for algorithms that depend on distance or gradients, such as KNN, SVM, regularized logistic regression, neural networks, PCA, and k-means. Tree-based models usually do not require scaling. Common methods include standardization and min-max normalization. Fit the scaler on the training data only and reuse it on validation and test data.
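Standardization can be sketched with scikit-learn's `StandardScaler`. The age and income values are hypothetical; the scaler learns its statistics from the training rows and reuses them on the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age in years and income in dollars.
X_tr = np.array([[25, 40_000], [35, 60_000], [45, 80_000]], dtype=float)
X_te = np.array([[30, 100_000]], dtype=float)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from training data only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test data

print(X_tr_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_te_scaled)
```

After scaling, a one-unit change means one training standard deviation for both features, so income no longer dominates distance calculations.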
A data pipeline moves and transforms data from sources to usable outputs such as tables, dashboards, reports, features, or models. A reliable pipeline includes validation, logging, error handling, freshness checks, lineage, and monitoring. In Data Science, pipelines reduce manual notebook work and make analysis reproducible.
Reproducible analysis means another person can rerun the work and get the same result using the same data, code, environment, and assumptions. It requires versioned code, documented data sources, fixed random seeds where needed, dependency tracking, clear notebooks or scripts, and saved outputs. Reproducibility makes insights trustworthy.
Start with the business question and answer, then explain the evidence, impact, confidence, limitations, and recommended action. Avoid leading with technical details unless they are necessary. For example, instead of explaining every model parameter, say which customer segment is at risk, expected revenue impact, and what intervention should be tested.
A complete case study starts by clarifying the goal, defining metrics, checking data quality, performing EDA, segmenting results, testing assumptions, building a baseline or model if needed, evaluating impact, and recommending action. In interviews, explain tradeoffs and ask clarifying questions. The best answer is not just technically correct; it is useful for the business decision.