Best Data Science Interview Questions 2019

About Data Science

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract information and insights from structured and unstructured data. It is closely related to data mining and big data, and it relies on powerful hardware, capable programming systems, and efficient algorithms to solve problems. Data science is an umbrella term that includes data analytics, data mining, machine learning, and several other connected disciplines. A data scientist is expected to forecast the future based on past patterns by analyzing data extracted from a variety of sources.

Data Science Interview Questions And Answers

1. What is Data Science?

Data Science is a blend of various tools, algorithms, and machine learning principles with the goal to discover hidden patterns from the raw data.

2. What are the important skills to have in Python with regard to data analysis?

The following are some of the important skills that come in handy when performing data analysis using Python.

Good understanding of the built-in data types especially lists, dictionaries, tuples, and sets.
Mastery of N-dimensional NumPy Arrays.
Mastery of Pandas dataframes.
Ability to perform element-wise vector and matrix operations on NumPy arrays.
Knowing that you should use the Anaconda distribution and the conda package manager.
Familiarity with Scikit-learn.
Ability to write efficient list comprehensions instead of traditional for loops.
Ability to write small, clean functions (important for any developer), preferably pure functions that don’t alter objects.
Knowing how to profile the performance of a Python script and how to optimize bottlenecks.

These skills will help you tackle most problems in data analytics and machine learning.

3. What is Selection Bias?

Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. It is usually associated with research where the selection of participants isn’t random. It is sometimes referred to as the selection effect. It is the distortion of statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.

The types of selection bias include:

Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.

4. What is the goal of A/B Testing?

It is statistical hypothesis testing for a randomized experiment with two variants, A and B.

The goal of A/B testing is to identify which changes to a web page maximize or increase an outcome of interest. A/B testing is a useful method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads.

An example of this could be identifying the click-through rate for a banner ad.

5. What do you understand by statistical power of sensitivity and how do you calculate it?

Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest etc.).

Sensitivity is simply "correctly predicted true events / total true events". True events here are the events that were actually true and that the model also predicted as true.

Calculating sensitivity is straightforward.

Sensitivity = ( True Positives ) / ( Positives in Actual Dependent Variable )

*where true positives are positive events which are correctly classified as positives.
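
As a minimal sketch of the formula above, the function below counts true positives and divides by the number of actual positives. The label lists are made up for illustration, and the function name `sensitivity` is an assumption rather than part of any library.

```python
def sensitivity(actual, predicted):
    """Return true positives / all actual positives (also called recall)."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    actual_positives = sum(1 for a in actual if a == 1)
    if actual_positives == 0:
        raise ValueError("sensitivity is undefined when there are no actual positives")
    return true_positives / actual_positives

# Hypothetical labels: 1 = event occurred, 0 = it did not.
actual    = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]
print(sensitivity(actual, predicted))  # 3 of 5 actual positives found -> 0.6
```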

6. What are the differences between overfitting and underfitting?

In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data, so as to be able to make reliable predictions on general, unseen data.

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model too would have poor predictive performance.
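
The contrast can be illustrated with two deliberately extreme models on made-up data. This is only a sketch: the quadratic trend, the noise level, and both model choices are assumptions picked for illustration. A model that always predicts the training mean underfits, while a 1-nearest-neighbour model memorizes the training set, noise included.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def make_data(xs):
    # Quadratic trend plus Gaussian noise.
    return [(x, x * x + rng.gauss(0, 0.5)) for x in xs]

train = make_data([i / 2 for i in range(20)])         # x = 0.0, 0.5, ..., 9.5
test = make_data([i / 2 + 0.25 for i in range(20)])   # x values between the training points

def mean_model(x):
    # Underfits: ignores x and always predicts the training mean.
    return sum(y for _, y in train) / len(train)

def nearest_neighbour_model(x):
    # Overfits: returns the (noisy) target of the closest training point.
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

for name, model in [("mean", mean_model), ("1-NN", nearest_neighbour_model)]:
    print(name, "train MSE:", round(mse(model, train), 3),
          "test MSE:", round(mse(model, test), 3))
```

The 1-NN model scores a training error of exactly zero because it reproduces every noisy training target, yet its error on the held-out points is not zero. The mean model is poor on both sets because it cannot follow the trend at all.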

7. Python or R – Which one would you prefer for text analytics?

The best possible answer for this would be Python, because it has the Pandas library, which provides easy-to-use data structures and high-performance data analysis tools.

8. What is logistic regression? Or State an example when you have used logistic regression recently

Logistic regression, often referred to as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election or not. In this case, the outcome of the prediction is binary, i.e. 0 or 1 (Lose/Win). The predictor variables here would be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, etc.
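
The election example can be sketched with a single predictor. The spend figures and outcomes below are invented, and the fitting routine (plain batch gradient descent on the log loss, with an assumed learning rate and epoch count) is one simple way to fit the model, not the method any particular library uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up data: campaign spend (arbitrary units) and outcome (1 = win, 0 = lose).
spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
won = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(spend, won)
print("P(win | spend=1):", sigmoid(w * 1.0 + b))
print("P(win | spend=8):", sigmoid(w * 8.0 + b))
```

The fitted weight is positive, so the predicted probability of winning rises with spend, and the 0/1 prediction is obtained by thresholding that probability (commonly at 0.5).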

9. What are Recommender Systems?

A subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.

10. Why does data cleaning play a vital role in analysis?

Data usually arrives from multiple sources and must be cleaned and transformed into a format that data analysts or data scientists can work with. This is a cumbersome process: as the number of data sources increases, the time taken to clean the data grows sharply because of both the number of sources and the volume of data they generate. Cleaning can take up to 80% of the total time, which makes it a critical part of the analysis task.

11. Differentiate between univariate, bivariate and multivariate analysis.

These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analysing the volume of sales against spending is an example of bivariate analysis.

Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.

12. What is Linear Regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
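
For a single predictor, the least-squares line can be computed directly. The sketch below uses made-up points that lie exactly on y = 2x + 1; the function name `fit_line` is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]     # predictor variable X
ys = [3, 5, 7, 9, 11]    # criterion variable Y, exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```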

13. What is Interpolation and Extrapolation?

Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value that lies outside the known range by extending a known set of values or facts.
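
A minimal sketch, assuming a straight line through two known points: the same formula interpolates when the query lies between the known x values and extrapolates when it lies beyond them. The points used are made up.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Known points: (2, 10) and (4, 20).
print(linear_estimate(3, 2, 10, 4, 20))  # interpolation, x lies between 2 and 4 -> 15.0
print(linear_estimate(6, 2, 10, 4, 20))  # extrapolation, x lies beyond 4 -> 30.0
```

Extrapolated values are generally less trustworthy, because nothing in the known data confirms that the trend continues outside the observed range.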

14. What is power analysis?

An experimental design technique for determining the sample size needed to detect an effect of a given size with a given degree of confidence.

15. What are exploding gradients?

Gradient:

The gradient is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount.

“Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.” At an extreme, the values of the weights can become so large that they overflow and result in NaN values.

This makes your model unstable and unable to learn from your training data.

16. What are the different kernel functions in SVM?

There are four commonly used types of kernels in SVM.

Linear kernel
Polynomial kernel
Radial basis function (RBF) kernel
Sigmoid kernel
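
The four kernels can be written as plain functions of two vectors. This is a sketch of the standard textbook forms; the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=1.0):
    # Depends only on the squared distance between the two points.
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))                # 11.0
print(polynomial_kernel(u, v, degree=2))  # (11 + 1) ** 2 = 144.0
print(rbf_kernel(u, u))                   # identical vectors -> 1.0
print(sigmoid_kernel(u, v))
```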

17. What is pruning in a Decision Tree?

Removing the sub-nodes of a decision node is called pruning. It is the opposite of splitting.

18. What is Random Forest? How does it work?

Random forest is a versatile machine learning method capable of performing both regression and classification tasks. It is also used for dimensionality reduction and can handle missing values and outlier values. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

In Random Forest, we grow multiple trees as opposed to a single tree. To classify a new object based on its attributes, each tree gives a classification. The forest chooses the classification that has the most votes (over all the trees in the forest), and in the case of regression, it takes the average of the outputs of the different trees.
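
The aggregation step described above can be sketched on its own. The per-tree outputs below are made up, and growing the trees themselves is out of scope for this sketch; the function names are hypothetical.

```python
from collections import Counter

def forest_classify(tree_votes):
    """Return the class predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Return the average of the individual tree outputs."""
    return sum(tree_outputs) / len(tree_outputs)

print(forest_classify(["spam", "ham", "spam", "spam", "ham"]))  # spam
print(forest_regress([10.0, 12.0, 14.0]))                       # 12.0
```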

19. Explain what regularization is and why it is useful.

Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a penalty term to the loss function: a constant multiple of a norm of the weight vector. The norm is often L1 (Lasso) or L2 (Ridge). The model is then fit by minimizing this regularized loss function on the training set.
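
As a rough sketch, a regularized loss is the ordinary loss plus a penalty on the weights. The mean squared error loss, the tiny dataset, and the name `lam` for the tuning parameter are all assumptions chosen for illustration.

```python
def mse(weights, rows, targets):
    predictions = [sum(w * x for w, x in zip(weights, row)) for row in rows]
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def l1_penalty(weights):
    # Lasso-style penalty: sum of absolute weights.
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    # Ridge-style penalty: sum of squared weights.
    return sum(w * w for w in weights)

def regularized_loss(weights, rows, targets, lam, penalty):
    # lam is the tuning parameter: larger values push the weights towards zero.
    return mse(weights, rows, targets) + lam * penalty(weights)

rows = [[1.0, 2.0], [2.0, 1.0]]
targets = [5.0, 4.0]
weights = [1.0, 2.0]  # fits both rows exactly, so the MSE term is 0

print(regularized_loss(weights, rows, targets, lam=0.1, penalty=l1_penalty))
print(regularized_loss(weights, rows, targets, lam=0.1, penalty=l2_penalty))
```

Even though these weights fit the data perfectly, the regularized loss is not zero: the penalty makes large weights costly, which is how regularization discourages overly complex fits.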

20. What is TF/IDF vectorization?

tf-idf, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
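
A minimal sketch of one common unsmoothed form, where tf is the share of a document's words equal to the term and idf is log(N / number of documents containing the term). Libraries often use smoothed or normalized variants, so treat this definition and the toy corpus as assumptions.

```python
import math

def tf_idf(term, document, corpus):
    """tf-idf of a term in one tokenized document, relative to a tokenized corpus."""
    tf = document.count(term) / len(document)
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)
    return tf * idf

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
print(tf_idf("the", corpus[0], corpus))  # appears in every document -> 0.0
print(tf_idf("mat", corpus[0], corpus))  # appears in one document only -> positive
```

The word "the" is frequent in the first document but carries no weight because it occurs in every document, which is exactly the offsetting effect described above.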

21. What is p-value?

When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. p-value is a number between 0 and 1. Based on the value it will denote the strength of the results. The claim which is on trial is called Null Hypothesis.

A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it (this is not the same as proving it true). A p-value very close to 0.05 is marginal and could go either way. To put it another way,

High P values: your data are likely with a true null. Low P values: your data are unlikely with a true null.

22. Explain Cross-Validation.

It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction, and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e., a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
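
One common form is k-fold cross-validation, where each fold takes a turn as the validation set. The sketch below only produces the index splits; the function name is hypothetical, and in practice the indices are usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train, validation in k_fold_indices(6, 3):
    print(train, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

The model is trained on each `train` split and scored on the matching `validation` split, and the k scores are averaged.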

23. What Is Collaborative Filtering?

The process of filtering used by most recommender systems to find patterns and information by combining the perspectives of many users, numerous data sources, and several agents.

24. What Are Confounding Variables?

These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. If the estimate fails to account for the confounding factor, the measured relationship between the dependent and independent variables is distorted.

25. Explain Star Schema.

It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.

26. What Are Eigenvalue and Eigenvector?

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching. The eigenvalue is the factor by which the transformation scales its eigenvector.
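
A small check makes the definition concrete: multiplying a matrix by one of its eigenvectors gives back the same vector scaled by the eigenvalue. The 2x2 matrix below is chosen for illustration.

```python
def mat_vec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

A = [[2.0, 1.0],
     [1.0, 2.0]]

# (1, 1) and (1, -1) are eigenvectors of A, with eigenvalues 3 and 1.
print(mat_vec(A, [1.0, 1.0]))    # [3.0, 3.0]  = 3 * (1, 1)
print(mat_vec(A, [1.0, -1.0]))   # [1.0, -1.0] = 1 * (1, -1)
```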

27. Why Is Resampling Done?

Resampling is done in any of these cases:

Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
Substituting labels on data points when performing significance tests
Validating models by using random subsets (bootstrapping, cross-validation)
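
The first case, drawing randomly with replacement, is the bootstrap. The sketch below estimates the standard error of a sample mean; the data values, the number of resamples, and the seed are assumptions for illustration.

```python
import random
import statistics

def bootstrap_standard_error(data, n_resamples=1000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Each resample has the same size as the data and may repeat points.
        resample = [rng.choice(data) for _ in data]
        means.append(statistics.mean(resample))
    return statistics.stdev(means)

data = [2.1, 2.5, 3.0, 3.2, 3.8, 4.1, 4.4, 5.0]
print(bootstrap_standard_error(data))
```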

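28. Why is Naive Bayes called ‘naive’?

The algorithm is ‘naive’ because it makes assumptions that may or may not turn out to be correct. In particular, it assumes that the features are independent of one another given the class.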

29. Explain Gradient Descent.

To understand gradient descent, let’s first understand what a gradient is.

A gradient measures how much the output of a function changes if you change the inputs a little bit. In a neural network, it measures the change in error with regard to a change in the weights. You can also think of a gradient as the slope of a function.

Gradient descent can be thought of as climbing down to the bottom of a valley, instead of climbing up a hill. This is because it is a minimization algorithm that minimizes a given function (the cost function).
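
A minimal one-dimensional sketch: repeatedly step against the slope until the bottom of the valley is reached. The cost function, learning rate, and step count are assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)   # step against the slope
    return x

# Cost function f(x) = (x - 3) ** 2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))  # 3.0
```

A learning rate that is too large makes the steps overshoot the minimum, while one that is too small makes convergence slow.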

30. What is an Auto-Encoder?

Autoencoders are simple learning networks that aim to transform inputs into outputs with the minimum possible error. This means that we want the output to be as close to input as possible. We add a couple of layers between the input and the output, and the sizes of these layers are smaller than the input layer. The autoencoder receives unlabeled input which is then encoded to reconstruct the input.

31. What are the variants of Back Propagation?

Stochastic Gradient Descent: We use only a single training example to calculate the gradient and update the parameters.
Batch Gradient Descent: We calculate the gradient for the whole dataset and perform one update at each iteration.
Mini-batch Gradient Descent: It’s one of the most popular optimization algorithms. It’s a variant of Stochastic Gradient Descent in which a mini-batch of samples is used instead of a single training example.
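
The three variants differ only in how many samples feed each parameter update. As a sketch, a hypothetical batching helper shows the difference; the gradient computation itself is left out.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))
print(list(iterate_batches(samples, 1))[:3])    # stochastic: [[0], [1], [2]]
print(list(iterate_batches(samples, 4)))        # mini-batch: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(len(list(iterate_batches(samples, 10))))  # batch: 1 (one update per pass)
```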

32. How can outlier values be treated?

Outlier values can be identified by using univariate or any other graphical analysis method. If the number of outlier values is small, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outlier values. The most common ways to treat outlier values are:

To change the value and bring it within a range.
To just remove the value.
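
Bringing values within a range can be sketched as capping at chosen percentiles. The nearest-rank percentile definition and the function names below are assumptions. The toy sample has only ten values, so the 20th and 80th percentiles are used in place of the 1st and 99th, which would simply equal the minimum and maximum here.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def cap_outliers(values, lower_pct=1, upper_pct=99):
    low, high = percentile(values, lower_pct), percentile(values, upper_pct)
    # Values below `low` are raised to it; values above `high` are lowered to it.
    return [min(max(v, low), high) for v in values]

values = [12, 15, 14, 13, 16, 15, 14, 500, 13, -90]
print(cap_outliers(values, lower_pct=20, upper_pct=80))
# [12, 15, 14, 13, 15, 15, 14, 15, 13, 12]
```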

33. How can you assess a good logistic model?

There are various methods to assess the results of a logistic regression analysis:

Using Classification Matrix to look at the true negatives and false positives.
Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening.
Lift helps assess the logistic model by comparing it with random selection.

Career scopes and salary scale

Many industries are currently seeing a shortage of jobs. Data Science is an exception: job seekers in this field are advancing well in their careers. Artificial Intelligence (AI), Machine Learning (ML), the Internet of Things (IoT), Python programming, and related fields depend heavily on Data Science analysis. Data Analysts and Data Scientists are a driving force behind current technological development, and Data Science holds a leading position among AI, ML, IoT, etc. A newly hired Data Science professional can expect a minimum salary of about 45,000 dollars per annum, and an experienced Data Science expert can earn about double that. Salaries depend heavily on the location, the business, and the company’s requirements.

Conclusion

This article, ‘Data Science interview questions’, has answered a range of advanced Data Science interview questions, with the approach for experienced candidates laid out by our trainers and team of experts. They have drawn on their knowledge to help professionals resolve doubts and unclear points. If learners still need more detail about Data Science, they may leave a message for our experts about Data Science interview questions for experienced professionals. Our trainers would be happy to help resolve students’ Data Science programming issues. Join Data Science Training in Noida, Data Science Training in Delhi, or Data Science Training in Gurgaon.
