Data Science Interview Questions

Data Science is a combination of algorithms, tools, and machine learning techniques which helps you to find common hidden patterns from the given raw data.

This article walks you through the Data Science interview questions most frequently asked by interview panels. Curated by industry experts at HKR Trainings, they will help you prepare to crack your interview.

Let us have a quick review of the Data Science interview questions.

1. What is Machine Learning?

Ans: Machine Learning is the process of exploring and constructing algorithms that can learn from data and make predictions on it. It is closely related to computational statistics. Building such models and algorithms to make predictions is known as predictive analytics.

2. What is Supervised Learning?


  • Supervised learning is the machine learning task of inferring a function from labelled training data. 
  • The training data consist of a set of training examples.
  • Algorithms in supervised learning include: 
  1. Support Vector Machines. 
  2. Regression. 
  3. Naive Bayes.
  4. Decision Trees. 
  5. K-nearest Neighbor Algorithm.  
  6. Neural Networks.

Example: If you build a flower classifier, the labels would be “this is a lotus”, “this is a rose”, and “this is a sunflower”, assigned after showing the classifier labelled examples of lotuses, roses, and sunflowers.
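As a minimal sketch of supervised learning, here is a 1-nearest-neighbour classifier in plain Python. The flower measurements and labels below are made-up illustrative values, not a real dataset.

```python
# 1-nearest-neighbour: predict the label of the closest labelled training example.
def nearest_neighbor(train, query):
    """Return the label of the training example closest to `query`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda ex: dist(ex[0], query))
    return label

# (petal length, petal width) -> flower label (hypothetical numbers)
train = [((6.0, 3.5), "lotus"), ((4.5, 1.5), "rose"), ((9.0, 4.0), "sunflower")]
print(nearest_neighbor(train, (4.7, 1.4)))  # closest to the rose example
```

The key property of supervised learning is visible here: every training example carries a label, and the model's only job is to map new inputs to those known labels.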

3. What is Unsupervised Learning?


  • Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labelled responses.
  • Algorithms in unsupervised learning include: 
  1. Clustering. 
  2. Anomaly Detection.
  3. Neural Networks.
  4. Latent Variable Models.

Example: In the same flower example, clustering might group the flowers into categories such as “flowers of various colours”, “flowers which are fresh”, and “flowers which are dry”, without ever being told any labels.

4. What is Data Science? List the differences between supervised and unsupervised learning.

Ans:  Data Science is a combination of various tools, algorithms, and machine learning principles with the goal to discover hidden patterns from the raw data. 

Supervised Learning:

  • Input data is labelled.
  • Uses a training dataset.
  • It is used for prediction.
  • It enables classification and regression.

Unsupervised Learning:

  • Input data is unlabelled.
  • Uses only an input dataset, with no labels.
  • It is used for analysis.
  • It enables Clustering, Density Estimation, & Dimensionality Reduction.

5. Differentiate between univariate, bivariate and multivariate analysis.


Univariate analysis:

Univariate analysis comprises descriptive statistical techniques that involve only one variable at a given point in time.

Example: The pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.

Bivariate analysis:

If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, then it is referred to as bivariate analysis.

Example: Analysing the volume of sale and spending can be considered as an example of bivariate analysis.

Multivariate analysis:

Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.

6. What do you understand by the term Normal Distribution?


  • Data can be distributed in different ways: skewed to the left or to the right, or jumbled up with no clear pattern. 
  • Data can also be distributed around a central value with no bias to the left or right; this is a normal distribution, which takes the form of a bell-shaped curve.
  • The properties of a Normal Distribution are as follows:
  1. Unimodal: the data has only one mode.
  2. Symmetrical: the left and right halves are mirror images of each other.
  3. Bell-shaped: the curve reaches its maximum height (the mode) at the mean.
  4. Mean, Mode, and Median are all located at the centre.
  5. Asymptotic: the tails approach, but never touch, the horizontal axis.
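These properties can be checked empirically. The sketch below draws samples from a normal distribution with Python's standard library (the mean of 50 and standard deviation of 5 are arbitrary illustrative choices) and verifies that the mean and median nearly coincide and that roughly 68% of values fall within one standard deviation of the mean.

```python
import random
import statistics

random.seed(0)
samples = [random.gauss(mu=50, sigma=5) for _ in range(20000)]

mean = statistics.mean(samples)
median = statistics.median(samples)
sd = statistics.stdev(samples)

# Symmetry: mean ≈ median.  Bell shape: ~68% of data within one sd of the mean.
within_1sd = sum(abs(x - mean) <= sd for x in samples) / len(samples)
```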

7. What are the differences between overfitting and under-fitting?

Ans: One of the most common tasks in machine learning is to fit a model to a set of training data so that reliable predictions can be made on general, unseen data.


Overfitting:

  • In overfitting, a statistical model describes random error or noise instead of the underlying relationship. 
  • It occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. 
  • An overfitted model has low predictive performance, as it overreacts to minor fluctuations in the training data.


Underfitting:

  • Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data, for example when fitting a linear model to non-linear data.
  • Such a model also has low predictive performance.
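The contrast can be made concrete with a deliberately extreme sketch (the data points are invented): a "model" that simply memorises its training points achieves zero training error but fails completely on unseen inputs, while a constant mean predictor is too simple to capture any trend at all.

```python
# Overfitting vs. underfitting on toy data (roughly y = 2x).
train = [(1, 2.0), (2, 4.1), (3, 5.9)]
test = [(4, 8.0), (5, 10.2)]

memorised = dict(train)                          # overfit: one "parameter" per point
mean_y = sum(y for _, y in train) / len(train)   # underfit: ignores x entirely

def error(model, data):
    return sum(abs(model(x) - y) for x, y in data) / len(data)

overfit_train = error(lambda x: memorised.get(x, 0.0), train)  # perfect on training data
overfit_test = error(lambda x: memorised.get(x, 0.0), test)    # useless on unseen data
underfit_train = error(lambda x: mean_y, train)                # mediocre everywhere
```

The memorising model has zero training error yet the worst test error, which is exactly the signature of overfitting described above.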

8. How can outlier values be treated?


Outlier values can be identified by using univariate plots or other graphical analysis methods. If there are only a few outlier values, they can be assessed individually; for a large number of outliers, the values can be substituted with either the 99th or the 1st percentile values.

Not all extreme values are outlier values. The most common ways to treat outlier values are:

  1. Change the value to bring it within a range (capping).
  2. Simply remove the value.
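The capping approach can be sketched as a small winsorising helper in plain Python (the percentile helper below uses a simple nearest-rank approximation; the data is made up):

```python
# Cap extreme values at chosen low/high percentiles instead of dropping them.
def percentile(values, p):
    s = sorted(values)
    idx = int(p / 100 * (len(s) - 1))   # simple nearest-rank approximation
    return s[idx]

def cap_outliers(values, low_p=1, high_p=99):
    lo, hi = percentile(values, low_p), percentile(values, high_p)
    return [min(max(v, lo), hi) for v in values]

data = [12, 14, 13, 15, 14, 250, 13]   # 250 is an obvious outlier
capped = cap_outliers(data)            # 250 is pulled down into the normal range
```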

9. Explain the steps for a Data analytics project.

Ans: The data analytics project involves the following steps.

  1. Understanding the Business problem.
  2. Exploring the data and studying it carefully.
  3. Preparing the data for modelling by finding missing values and transforming variables.
  4. Start running the model and analyze the result.
  5. Validate the model with a new data set.
  6. Implement the model and track the result to analyze the performance of the model for a specific period.



10. What is the difference between machine learning and deep learning?


Machine Learning:

It is a field of computer science that gives computers the ability to learn without being explicitly programmed. There are three categories of machine learning.

  • Supervised machine learning.
  • Unsupervised machine learning.
  • Reinforcement learning.

Deep Learning:

It is a subfield of machine learning which is concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.

11. What are dimensionality reduction and its benefits?


  • Dimensionality reduction is the process of converting a dataset with a huge number of dimensions into one with fewer dimensions while conveying similar information.
  • Its advantages are that it compresses the data and reduces storage space, computation time, and redundant features. 

12. What is the significance of p-value?

Ans: While performing a hypothesis test in statistics, a p-value helps in determining the strength of your results. A p-value is a number between 0 and 1. Based on the value it will denote the strength of the results. The claim which is on trial is called the Null Hypothesis.

  1. p-value typically ≤ 0.05
    This indicates strong evidence against the null hypothesis; so you can reject the null hypothesis.
  2. p-value typically > 0.05
    This indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis. 
  3. p-value at cutoff 0.05 
    This is considered to be marginal, meaning it could go either way.
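One intuitive way to see where a p-value comes from is a permutation test, sketched below in plain Python with invented measurements for two groups: the p-value is the fraction of random label shufflings that produce a difference in group means at least as large as the observed one.

```python
import random
random.seed(42)

a = [2.1, 2.5, 2.3, 2.7, 2.4]   # hypothetical group A measurements
b = [3.1, 3.4, 3.2, 3.6, 3.3]   # hypothetical group B measurements
observed = abs(sum(a) / len(a) - sum(b) / len(b))

# Under the null hypothesis the group labels are arbitrary, so shuffle them
# and count how often the shuffled difference is at least as extreme.
pooled = a + b
trials, count = 2000, 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5)
    if diff >= observed:
        count += 1

p_value = count / trials   # small p-value -> strong evidence against the null
```

Because the two groups here barely overlap, very few shufflings reproduce the observed gap, so the p-value comes out well below 0.05.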

13. What are the steps in making a decision tree?

Ans: The following are the steps that need to be implemented for making a decision tree.

  1. Consider the entire data set as input.
  2. Look for a split that maximizes the separation of the classes. A split is a test that divides the data into two sets.
  3. Apply the split to the input data to divide the data.
  4. Re-apply steps one and two to the divided data.
  5. Stop when you meet the stopping criteria.
  6. Clean up the tree if you have made too many splits by exploring too far. This step is called pruning.
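Step two above, finding the split that best separates the classes, can be sketched for a single numeric feature using Gini impurity (a standard split criterion; the toy data is made up):

```python
# Scan candidate thresholds and pick the one with the lowest weighted Gini impurity.
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]     # one feature
ys = [0, 0, 0, 1, 1, 1]        # two cleanly separable classes
threshold, impurity = best_split(xs, ys)
```

Because the classes are perfectly separable here, the best threshold (x ≤ 3) achieves zero impurity; a real tree repeats this search recursively on each resulting subset.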

14. What is correlation and covariance in statistics?

Ans: Correlation and covariance are two mathematical concepts used in statistics. Both establish the relationship between, and measure the dependency of, two random variables. 


Correlation:

  • It is a standard technique for measuring and estimating the quantitative relationship between two variables. 
  • Correlation measures how strongly two variables are related.


Covariance:

  • Covariance measures how two variables vary together: it indicates the extent to which two random variables change in tandem. 
  • It captures the systematic relation between a pair of random variables, wherein a change in one variable is accompanied by a corresponding change in the other.
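Both quantities can be computed by hand in a few lines (population versions shown; the data is illustrative). Note that covariance is scale-dependent, while correlation normalises it into the range [-1, 1]:

```python
import math

def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    # Pearson correlation = cov(x, y) / (std(x) * std(y))
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]        # ys = 2 * xs: a perfect linear relationship
cov = covariance(xs, ys)     # scale-dependent value
r = correlation(xs, ys)      # scale-free: exactly 1 for a perfect positive relation
```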

15. Can you explain the difference between a Validation Set and a Test Set?


Validation Set:

It is a set which can be considered as a part of the training set as it is used for parameter selection and for avoiding the overfitting of the model being built.

Test Set:

It is the set that is used for testing and evaluating the performance of a trained machine learning model. The training set is used to fit the parameters, such as weights, while the test set assesses the model's predictive power and generalization.
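A common way to produce these sets is to shuffle the data and carve it into fixed proportions. The sketch below uses a hypothetical 60/20/20 train/validation/test split; the exact ratios are a convention, not a rule.

```python
import random
random.seed(7)

data = list(range(100))     # stand-in for 100 samples
random.shuffle(data)

train = data[:60]           # fit model parameters (e.g. weights)
val = data[60:80]           # tune hyperparameters, detect overfitting
test = data[80:]            # final, untouched performance estimate
```

Keeping the three sets disjoint is the whole point: the test set must never influence training or tuning decisions.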

16. Explain cross-validation.


  • It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent dataset. 
  • It is mainly applied where the objective is forecasting and one needs to estimate how accurately a model will perform in practice.
  • The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (the validation set), so as to limit overfitting and to gain insight into how the model will generalize to an independent dataset.
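The most common variant, k-fold cross-validation, can be sketched as a plain-Python index generator: the data is cut into k folds, and each fold serves as the validation set exactly once.

```python
# Generate (train_indices, val_indices) pairs for k-fold cross-validation.
def kfold(n, k):
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, val))
        start += size
    return folds

folds = kfold(n=10, k=5)   # 5 folds of 2 validation samples each
```

Averaging the model's score across all k validation folds gives a more stable estimate of generalization than a single train/validation split.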

17. What do you understand by linear regression and logistic regression?


Linear Regression:

  • It is a statistical technique in which the score of some variable “Y” is predicted on the basis of the score of a second variable “X”.
  • The “X” variable is referred to as the predictor variable while the “Y” variable is known as the criterion variable.

Logistic Regression:

  • It is a statistical technique applied for predicting the binary outcome from a linear combination of predictor variables. 
  • Logistic Regression is also known as the logit model.
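For the linear case, the least-squares fit has a simple closed form that can be written in a few lines of plain Python (the data below is invented and lies exactly on y = 2x + 1, so the fit recovers those coefficients):

```python
# Simple linear regression: fit y = slope * x + intercept by least squares.
def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx   # (slope, intercept)

xs = [1, 2, 3, 4]          # predictor variable "X"
ys = [3, 5, 7, 9]          # criterion variable "Y", exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
```

Logistic regression replaces this closed form with an iterative fit of a sigmoid over the same kind of linear combination, producing a probability for a binary outcome instead of a continuous value.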

18. What are the different types of Deep Learning Frameworks?

Ans: The different types of Deep Learning Frameworks include the following:

  • Caffe.
  • Keras.
  • TensorFlow.
  • Pytorch.
  • Chainer.
  • Microsoft Cognitive Toolkit.

19. What are the differences between Deep Learning and Machine Learning?


Machine Learning:

  • It gives computers the ability to learn and act without being explicitly programmed for every task. It includes supervised, unsupervised, and reinforcement machine learning processes.
  • It includes Deep Learning as one of its components.

Deep Learning:

  • It learns by passing data through multiple layers of processing, extracting progressively higher-level features without manual feature engineering.
  • It is a subcomponent of machine learning that is concerned with algorithms that are inspired by the structure and functions of the human brains called the Artificial Neural Networks.



20. What are the different kinds of Ensemble learning?

Ans: There are two different kinds of Ensemble learning.

  • Bagging: It is a technique which trains the same simple learner on multiple bootstrap samples of the population and takes the mean of their predictions for estimation purposes.
  • Boosting: It is a technique which iteratively adjusts the weight of each observation, so that subsequent learners focus on previously misclassified examples, before the final outcome prediction is made.
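The bagging idea can be sketched with a deliberately simple "learner", the sample mean, trained on bootstrap resamples and then aggregated (the data values are arbitrary):

```python
import random
random.seed(0)

# Bagging sketch: each "learner" is just the mean of one bootstrap resample;
# the final estimate aggregates (averages) all the individual learners.
def bagged_mean(data, n_learners=200):
    estimates = []
    for _ in range(n_learners):
        sample = [random.choice(data) for _ in data]   # bootstrap resample
        estimates.append(sum(sample) / len(sample))    # one simple learner
    return sum(estimates) / len(estimates)

data = [4, 8, 15, 16, 23, 42]
estimate = bagged_mean(data)   # close to the plain mean (18), with reduced variance
```

In practice the simple learner is usually a decision tree rather than a mean, which is exactly the construction behind random forests.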

21. What are the various Machine Learning Libraries and their benefits?

Ans: The various machine learning libraries and their benefits include the following.

  • Numpy: It is used for scientific computation.
  • Statsmodels: It is used for time-series analysis.
  • Pandas: It is used for tabular data analysis.
  • Scikit-learn: It is used for data modelling and pre-processing.
  • Tensorflow: It is used for the deep learning process.
  • Regular Expressions: It is used for text processing.
  • Pytorch: It is used for the deep learning process.
  • NLTK: It is used for text processing.

22. How do you build a random forest model?

Ans: A random forest is built from a number of decision trees. The data is split into different packages, a decision tree is constructed on each group of data, and the random forest brings all those trees together.

Steps to build a random forest model:

  1. Randomly select “k” features from a total of “m” features where k << m.
  2. Among the “k” features, calculate the node “D” using the best split point.
  3. Split the node into daughter nodes using the best split.
  4. Repeat steps two and three until leaf nodes are finalized. 
  5. Build a forest by repeating steps one to four for “n” times to create an “n” number of trees. 

23. Please explain the role of data cleaning in data analysis.

Ans: Data cleaning can help in the analysis because:

  • Cleaning the data from multiple sources helps in transforming the data into a format by which data analysts or data scientists are able to work with it.
  • Data Cleaning helps in increasing the accuracy of the model in machine learning.
  • It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources.
  • Data cleaning consumes approximately 80% of the analysis time, making it a crucial part of the analysis task.

24. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables? K-means clustering, Linear regression, K-NN (k-nearest neighbor), Decision trees

Ans: The K-nearest neighbor algorithm can be used, as it computes the nearest neighbors based on all the other features and fills in a missing value, whether categorical or continuous, from those neighbors. 

While dealing with K-means clustering or linear regression, you have to handle missing values during pre-processing, otherwise they'll crash. Decision trees also have the same problem, although there is some variance.

25. How are weights initialized in a neural network?

Ans: There are two methods for initializing the weights where you can either initialize the weights to zero or assign them randomly.

  • Initializing all weights to 0: This makes your model similar to a linear model. All the neurons and every layer perform the same operation, giving the same output and making the deep net useless.
  • Initializing all weights randomly: Here, the weights are assigned randomly by initializing them very close to 0. It gives better accuracy to the model since every neuron performs different computations. This is the most commonly used method.
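The symmetry problem with zero initialisation can be shown directly. In this sketch, one hypothetical layer of 3 neurons with 4 inputs gets zero weights (all rows identical, so every neuron computes the same function) versus small random weights (every row different, so neurons can learn different things):

```python
import random
random.seed(1)

n_neurons, n_inputs = 3, 4

# Zero init: every neuron starts identical and stays identical under training.
zero_weights = [[0.0] * n_inputs for _ in range(n_neurons)]

# Random init: small values near 0 break the symmetry between neurons.
random_weights = [[random.gauss(0, 0.01) for _ in range(n_inputs)]
                  for _ in range(n_neurons)]

all_same = all(row == zero_weights[0] for row in zero_weights)
all_diff = len({tuple(row) for row in random_weights}) == n_neurons
```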

26. What Is the Difference Between Epoch, Batch, and Iteration in Deep Learning?


  • Epoch: It represents one iteration over the entire dataset i.e. everything is put into the training model.
  • Batch: It refers to when we cannot pass the entire dataset into the neural network all at once, so the dataset is divided into several batches.
  • Iteration: If we have 10,000 images as data and a batch size of 200, then an epoch runs 50 iterations, i.e. 10,000 divided by 200.
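The relationship reduces to one line of arithmetic, rounding up when the last batch is partial:

```python
import math

# iterations per epoch = ceil(dataset size / batch size)
def iterations_per_epoch(n_samples, batch_size):
    return math.ceil(n_samples / batch_size)

iters = iterations_per_epoch(10_000, 200)   # 50 iterations per epoch
```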



27. What are the different layers on CNN?

Ans: CNN stands for Convolutional Neural Network that contains four layers.

  1. Convolutional Layer: This layer performs a convolutional operation and creates several smaller picture windows to go over the data.
  2. ReLU Layer: It brings non-linearity to the network and converts all the negative pixels to zero. The output is a rectified feature map.
  3. Pooling Layer: The pooling is a down-sampling operation that reduces the dimensionality of the feature map.
  4. Fully Connected Layer: This layer recognizes and classifies the objects in the image.

28. You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?

Ans: The missing data values are handled in the following ways:

  • If the data set is large, you can simply remove the rows with missing data values. This is the quickest approach, and the model can then use the rest of the data for prediction.
  • For smaller data sets, you can substitute missing values with the mean of the rest of the data using the pandas DataFrame in Python, e.g. “df.fillna(df.mean())”.
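The mean-imputation approach can also be sketched without pandas, which makes the mechanics explicit. Here missing entries are represented as `None` and the column values are invented:

```python
# Replace missing entries (None) in a column with the column's mean,
# mirroring what pandas' df.fillna(df.mean()) does per column.
def impute_mean(column):
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

ages = [25, None, 35, 40, None]    # hypothetical column with missing values
filled = impute_mean(ages)         # Nones replaced by the mean of 25, 35, 40
```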

29. What is the difference between Point Estimates and Confidence Interval?


Point Estimation:

  • It gives a particular value as an estimate of a population parameter. 
  • Method of Moments and Maximum Likelihood estimator methods are used to derive Point Estimators for population parameters.

Confidence Interval:

  • It gives a range of values which is likely to contain the population parameter.
  • A confidence interval is generally preferred over a point estimate, since it tells how likely the interval is to contain the population parameter. 
  • This likeliness or probability is called a Confidence Level or Confidence coefficient and is represented by 1-alpha, where alpha is the level of significance.
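For the mean of a reasonably large sample, a 95% confidence interval is commonly approximated as mean ± 1.96 · s/√n (the z-value 1.96 assumes a normal approximation; the data below is made up):

```python
import statistics

# 95% CI for a mean using the normal approximation: mean ± 1.96 * s / sqrt(n).
def ci_95(values):
    mean = statistics.mean(values)
    half = 1.96 * statistics.stdev(values) / len(values) ** 0.5
    return mean - half, mean + half

data = [52, 48, 50, 49, 51, 50, 53, 47, 50, 50]   # sample mean is 50
low, high = ci_95(data)
```

The point estimate (the sample mean, 50) sits at the centre of the interval; the interval's width conveys the uncertainty that the point estimate alone hides.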

30. While working on a data set, how can you select important variables?

Ans: The following are the variable selections which can be used.

  • Remove the correlated variables before selecting important variables.
  • Apply linear regression and select variables based on their p-values.
  • Apply Backward, Forward Selection, and Stepwise Selection.
  • Apply Xgboost, Random Forest, and plot variable importance chart.
  • Measure information gain for the given set of features and select top “n” features accordingly.


Research Analyst
As a senior Technical Content Writer for HKR Trainings, Gayathri has a good comprehension of current technical innovations, including areas like Business Intelligence and Analytics. She conveys advanced technical ideas precisely and vividly to the target audience, ensuring the content is accessible to readers. She writes quality content in the fields of Data Warehousing & ETL, Big Data Analytics, and ERP Tools. Connect with her on LinkedIn.