IT-8003 (2) (CBGS)
B.E. VIII Semester Examination, June 2020
Choice Based Grading System (CBGS)
a) Explain big data management architecture with neat block diagram. 7
Big data architecture style – microsoft.com
- The article above describes the big data architecture style, which is designed to handle the ingestion, processing, analysis etc. of data,
- Refer to the block diagram, which shows how the process starts with data sources and data storage, moves through batch/stream processing to the data store, and ends with analytics followed by reporting,
- Refer to the second part, which describes each of the steps shown in the block diagram,
- Refer to the 3rd part, i.e. “When to use this architecture”, to understand that it is used when the dataset is too large, when data must be processed in real time, when unstructured data must be transformed etc.
- The 4th part covers its benefits: it can be mixed with other technologies/tools, and its parallelism gives scalability etc.
- The 5th part covers its challenges: complexity, the need for a specialised skillset, technologies that are still evolving and changing very quickly etc.
- The 6th part covers best practices, in case you are interested in them, and the 7th part covers IoT architecture.
b) Explain the role of relational databases in big data. 7
Basics of relational database (RDB)? – cloud.google.com
- Refer to the first 3 paragraphs for the basic definition of a relational database,
- The link above defines a relational database: it organises data into rows and columns and can establish links between different tables, i.e. the tables need a common key so that they can be joined to extract meaningful information.
- The 4th topic gives an example of an RDB, i.e. it visually shows two tables and the common key between them,
- The 5th topic is about its benefits, which are flexibility, ease of use, collaboration etc.
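The idea of joining two tables on a common key can be sketched with Python's built-in `sqlite3` module. The table names, column names and rows below are hypothetical and only serve as an illustration; they are not taken from the linked article.

```python
import sqlite3

# Hypothetical tables and rows, made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
""")

# customer_id is the common key that links the two tables.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
""").fetchall()
conn.close()
print(rows)
```

The join turns two separate tables into one answer (total order amount per customer), which is the kind of insight the common key makes possible.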
a) Classify the different data types with reference to big data. 7
Characteristics of Big Data: Types, & Examples
So there are 3 types of data with reference to big data. They are:
- Structured (Refer 2nd part 1st topic):
- The article explains the characteristics of structured data, i.e. it has a fixed structure and is in tabular form, hence it is easy to analyse etc.
- Unstructured (Refer 2nd part 2nd topic):
- It has no predefined structure and is not easy to interpret or analyse, e.g. text-heavy content that also contains information such as dates, facts etc.
- Semi-structured (Refer 2nd part 3rd topic):
- e.g. JSON or XML data that needs to be flattened out before we can use it and make sense of it. Such flattening is easy to perform, and articles available online show how to do it.
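As a rough sketch of what flattening semi-structured data means, the snippet below turns a nested JSON record into a single-level dictionary with dotted keys. The function name `flatten` and the sample record are assumptions made for illustration; real projects often use a library for this.

```python
import json

def flatten(value, prefix=""):
    """Flatten nested dicts/lists into one dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated key.
        flat[prefix.rstrip(".")] = value
    return flat

# Hypothetical record, made up for illustration only.
record = json.loads('{"id": 7, "user": {"name": "Asha", "tags": ["a", "b"]}}')
flat_record = flatten(record)
print(flat_record)
```

Each flattened key can then be treated as a column, which brings the record closer to the tabular form of structured data.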
After the above topics, if you are interested in the characteristics of big data, you can go on to the third part, which talks about the 5Vs of big data, i.e. volume, variety, velocity, value and veracity.
b) Define a normal distribution and derive the mean, median, mode of a normal distribution. 7
- A normal distribution is an arrangement of a data set in which most of the values cluster in the middle of the range and the remaining values taper off symmetrically towards both extremes.
- Some characteristics: the area under the curve is unity (1), it is unimodal, it is symmetric around a point (its mean) etc. (for more characteristics refer to the normal distribution Wikipedia article – symmetries and derivatives part)
- The mean is the average of the numbers, calculated by adding all the numbers and dividing the sum by the count of numbers, e.g. for (1, 4, 5, 6) the average is (1+4+5+6)/4 = 4.
- The median is the middle value in the list when sorted from smallest to largest, e.g. (2, 3, 7, 9, 14) – for this list the median is 7.
- The mode is the most frequently occurring value in the list, e.g. (1, 4, 1, 5, 6, 7) – for this list the mode is 1.
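The three worked examples above can be checked with Python's standard `statistics` module. This is only a quick verification of the arithmetic, using the same lists as in the text.

```python
from statistics import mean, median, mode

mean_value = mean([1, 4, 5, 6])          # (1+4+5+6)/4
median_value = median([2, 3, 7, 9, 14])  # middle value of the sorted list
mode_value = mode([1, 4, 1, 5, 6, 7])    # most frequent value
print(mean_value, median_value, mode_value)
```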
Relevant links for the mean, median and mode of a normal distribution:
- What are the median and the mode of the standard normal distribution?
- What is the value of mean, median and mode if it follows the normal distribution?
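A short sketch of the derivation asked for in the question, assuming the usual notation (not taken from the links above): mean parameter \mu, standard deviation \sigma and density f(x). All three measures come out equal to \mu.

```latex
\begin{align*}
f(x) &= \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad -\infty < x < \infty \\[4pt]
\text{Mean: } E[X] &= \int_{-\infty}^{\infty} x\,f(x)\,dx
  = \int_{-\infty}^{\infty} (\mu + \sigma z)\,\frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz
  \quad \left(z = \tfrac{x-\mu}{\sigma}\right) \\
 &= \mu \cdot 1 + \sigma \cdot 0 = \mu \\[4pt]
\text{Median: } & f(\mu + t) = f(\mu - t)
  \;\Rightarrow\; P(X \le \mu) = P(X \ge \mu) = \tfrac{1}{2}
  \;\Rightarrow\; \text{median} = \mu \\[4pt]
\text{Mode: } f'(x) &= -\frac{x-\mu}{\sigma^2}\,f(x) = 0 \;\Rightarrow\; x = \mu,
  \quad f''(\mu) = -\frac{f(\mu)}{\sigma^2} < 0 \;\Rightarrow\; \text{mode} = \mu
\end{align*}
```

In the mean, the term with z vanishes because the integrand is an odd function, and the remaining integral is the total area under the standard normal curve, which is 1.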
a) Derive the mean and the variance of a binomial distribution. 7
- Mean and Variance of Binomial Distribution
- Refer to this link; the answer is quite short and easy to understand.
- What are the 4 conditions of a binomial distribution?
- Some further interesting reading.
- The Three Assumptions of the Binomial Distribution
- Each trial has two possible outcomes, every trial has the same probability of success, and the trials are independent of each other.
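Under the assumptions listed above, the mean and variance can be derived by writing the binomial variable as a sum of independent trials. The notation (n trials, success probability p, indicator X_i for the i-th trial) is assumed here for illustration and may differ from the linked answer.

```latex
\begin{align*}
X &= X_1 + X_2 + \dots + X_n, \qquad P(X_i = 1) = p,\quad P(X_i = 0) = 1 - p \\
E[X_i] &= 1 \cdot p + 0 \cdot (1-p) = p \\
E[X_i^2] &= 1^2 \cdot p + 0^2 \cdot (1-p) = p \\
\operatorname{Var}(X_i) &= E[X_i^2] - (E[X_i])^2 = p - p^2 = p(1-p) \\
E[X] &= \sum_{i=1}^{n} E[X_i] = np \\
\operatorname{Var}(X) &= \sum_{i=1}^{n} \operatorname{Var}(X_i) = np(1-p)
  \quad \text{(using independence of the trials)}
\end{align*}
```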
b) Explain the role of support vector machine in data analysis. 7
The Support Vector Machine (SVM) is a fast and reliable classification algorithm (i.e. it is supervised, whereas clustering algorithms are unsupervised) that performs very well even with a small dataset.
It creates a hyperplane that divides the data points into two classes, one on each side.
- Visual explanation of support vector machine algorithm – this link shows visually how SVM works, so you get a feel for it.
The hyperplane that gets created is the one with the largest distance to the nearest data points of either class, i.e. the maximum margin. (Hyperplane distance understanding – this link explains this better).
Other relevant topics to talk about:
- Linear as well as Non-Linear,
- How does SVM work?
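The "largest distance to the nearest element" criterion can be illustrated without a full SVM trainer. The sketch below only compares two candidate lines on a made-up 2D dataset and measures each one's margin; the helper name `margin`, the data and the candidate lines are all assumptions for illustration, and a real SVM would search for the best hyperplane itself.

```python
import math

def margin(weights, bias, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A negative result means at least one point is on the wrong side.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    return min(
        label * (sum(w * x for w, x in zip(weights, point)) + bias) / norm
        for point, label in zip(points, labels)
    )

# Hypothetical toy data: two classes labelled -1 and +1.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

# Two candidate separating lines: x + y = 6 and x + y = 5.
margin_a = margin((1.0, 1.0), -6.0, points, labels)
margin_b = margin((1.0, 1.0), -5.0, points, labels)
print(margin_a, margin_b)
```

Both lines separate the classes, but the first one sits midway between the closest points of each class and therefore has the larger margin, which is the one an SVM would prefer.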
4. Explain the following: 14
i) Naive bayes classifier
- A probabilistic classifier, i.e. it works based on probability and is used for classification problems,
- Based on applying Bayes' theorem,
- Assumption: features are independent of each other given the class,
- Highly scalable,
- Simple and easy to implement,
- Handles both continuous as well as discrete data,
- The difference between Bayes and Naive Bayes is that Naive Bayes assumes conditional independence between the features whereas Bayes' theorem itself does not.
- The steps to apply Naive Bayes in data science or machine learning are the same standard ones: collecting and handling data, removing outliers, making predictions, evaluating the final results etc.
- One can rely on the Naive Bayes classifier for multi-class prediction problems.
- Some applications of Naive Bayes include sentiment analysis, recommender systems etc.
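The points above (Bayes' theorem plus the independence assumption) can be sketched as a tiny categorical Naive Bayes classifier in plain Python. The function names `train` and `predict`, the add-one smoothing and the toy weather data are assumptions made for illustration, not a reference implementation.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (feature_tuple, label). Returns the counts needed to predict."""
    label_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)  # (position, label) -> value -> count
    values_seen = defaultdict(set)         # position -> distinct values
    for features, label in samples:
        for position, value in enumerate(features):
            feature_counts[(position, label)][value] += 1
            values_seen[position].add(value)
    return label_counts, feature_counts, values_seen

def predict(model, features):
    label_counts, feature_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # log P(label) + sum of log P(feature | label), with add-one smoothing.
        # Summing the per-feature terms is where independence is assumed.
        score = math.log(count / total)
        for position, value in enumerate(features):
            seen = feature_counts[(position, label)][value]
            score += math.log((seen + 1) / (count + len(values_seen[position])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: (outlook, wind) -> play
samples = [
    (("sunny", "weak"), "yes"),
    (("sunny", "strong"), "yes"),
    (("rainy", "strong"), "no"),
    (("rainy", "weak"), "no"),
    (("sunny", "weak"), "yes"),
]
model = train(samples)
prediction = predict(model, ("sunny", "strong"))
print(prediction)
```

Because training is only counting, the approach stays simple and scales well, which matches the "highly scalable" and "easy to implement" points in the list.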
ii) Linear regression with nonlinear substitution
iii) Evolution of data management
Data management goes back to the 1950s, when computers were really slow and a lot of space was required just to store a few megabytes of data.
The above article has divided the evolution into three generations:
- 1st Generation Data Management – Data Warehousing
- 2nd Generation Data Management – Data Lakes
- 3rd Generation Data Management – Catalogs, Hubs, and Fabrics: This tackled the problems of lake-centric architecture.
Difference between data warehouse and data lake:
- A data lake is a repository of both structured and unstructured data, whereas a data warehouse is a collection of structured data kept for some specific purpose.
In a few decades many things have changed: retention volume has increased from small to large; earlier data was used only for operational purposes but now it is used for operational as well as analytical purposes; schemas have become dynamic whereas earlier they used to be static etc.
Apart from this, hardware has also advanced, with better processors, transfer rates, storage, connection speeds etc.
Work in progress:
a) Explain parallel vs distributed processing. 7
b) Explain data warehousing architecture. 7
a) Explain data preparation, model planning and model building. 7
b) Differentiate cluster architecture and traditional architecture. 7
a) Explain core components of hadoop ecosystem with neat diagram. 7
b) Explain loops and conditional execution in R language. 7
a) Explain data manipulation and statistical analysis with R. 7
b) Explain various data structures used in R.