Important RGPV Question
ME 803 (A) Data Analytics
VIII Sem, ME
UNIT 1- Descriptive Statistics
Q.1) Why is it important to screen the data prior to the analysis task?
Q.2) What do you understand by the technique “Use a global constant to fill in the missing value”?
Q.3) Differentiate between classification and numeric prediction.
Q.4) What are the terminating conditions for stopping the partitioning in the decision tree induction algorithm?
Q.5) Give the use of attribute selection measures in a decision tree.
Q.6) What is the use of a confusion matrix? Define all the terms related to a confusion matrix.
Q.7) What is linear regression? How does it differ from logistic regression?
Q.8) What do you understand by overfitting in classification? Give solutions for it.
Q.9) Compare simple discriminant analysis and multiple discriminant analysis.
Q.10) What is the use of variance? Give the basic properties of the standard deviation, as a measure of spread.
Q.11) Why is the sigmoid function used in logistic regression?
Q.12) What is multivariate analysis? Explain the following multivariate analysis techniques by taking any suitable examples – (a) Multiple Logistic Regression (b) Multivariate analysis of variance (MANOVA)
UNIT 2- Introduction To Big Data
Q.1) What is the need for dimensionality reduction of a dataset?
Q.2) Define principal components in PCA.
Q.3) Define Big Data and explain its importance in today’s business environment. Provide real-world examples to illustrate your points.
Q.4) Describe the Four V’s of Big Data (Volume, Velocity, Variety, Veracity) and explain how each contributes to the challenges and opportunities of Big Data analytics.
Q.5) What are the key drivers for the growth of Big Data? Discuss the technological, economic, and social factors that have led to its emergence.
UNIT 3- Processing Big Data
Q.1) Explain the challenges involved in integrating disparate data stores for Big Data processing. Provide examples of common data sources and integration techniques.
Q.2) Describe the process of mapping data to a programming framework for Big Data analytics. Why is this step crucial, and what considerations should be taken into account?
Q.3) Discuss the various methods for connecting to and extracting data from storage in a Big Data environment. Compare and contrast different approaches, highlighting their advantages and disadvantages.
Q.4) Explain the importance of transforming data for processing in Big Data. What are some common data transformation techniques, and how do they prepare data for analysis?
Q.5) Describe the process of subdividing data in preparation for Hadoop MapReduce. Why is this step necessary, and how does it impact the efficiency of data processing?
UNIT 4- Hadoop MapReduce
Q.1) Explain the concept of Hadoop MapReduce. Describe the key components and their roles in the MapReduce framework.
Q.2) Describe the process of creating the components of Hadoop MapReduce jobs. Provide an example of a simple MapReduce job and explain the code involved.
Q.3) Discuss the challenges and considerations involved in distributing data processing across server farms using Hadoop MapReduce.
Q.4) Explain the steps involved in executing Hadoop MapReduce jobs. How can the progress of job flows be monitored and managed?
Q.5) Describe the Building Blocks of Hadoop MapReduce. Explain the functions of different Hadoop daemons and their interactions.
Q.6) Explain the Hadoop Distributed File System (HDFS). What are its key features, and how does it contribute to the efficiency of Hadoop MapReduce?
UNIT 5- Big Data Tools And Techniques
Q.1) Write any 4 requirements of clustering.
Q.2) What is a dissimilarity matrix in clustering?
Q.3) Write the important steps of the ARIMA model for time series data analysis.
Q.4) Give the working, convergence conditions, strengths, and weaknesses of the K-means clustering algorithm.
Q.5) Consider the following data set consisting of the scores of two variables on each of seven individuals… K = 2, and the distance measure is Euclidean distance. Find the final allocation in each cluster and the centroids using the K-means clustering algorithm.
Q.6) Use single-link and complete-link agglomerative clustering to group the data described by the following distance matrix. Show all the steps and construct the dendrogram.
— Best of Luck for Exam —