Unit1_Kumod_deeplearning.pptx DEEP LEARNING

Dr. Kumod Kumar Gupta Deep Learning Unit I 1
Deep Learning
(BCSML0552)
Dr. Kumod kr. Gupta
(Associate Professor)
AI Department
Unit: I
INTRODUCTION
Course Details
(B. Tech. 5th
Sem)
Noida Institute of Engineering and Technology

Faculty Introduction
Name Dr. Kumod Kr. Gupta
Qualification Ph.D., M. Tech
Designation Associate Professor
Department AI
Total Experience 17 years
NIET Experience 12 years
Subject Taught Python Basics, Advanced Python, ML, DL

Evaluation Scheme
Sl.
No.
Subject Codes
Subject Name
Periods Evaluation Scheme End
Semester Total Credit
L T P CT TA TOTAL PS TE PE
1 ACSML0602 Deep Learning 3 0 0 30 20 50 100 150 3
2 ACSML0603 Advanced Database Management
Systems
3 1 0 30 20 50 100 150 4
3 ACSE0603 Software Engineering 3 0 0 30 20 50 100 150 3
4 Departmental Elective-III 3 0 0 30 20 50 100 150 3
5 Departmental Elective-IV 3 0 0 30 20 50 100 150 3
6 Open Elective-I 3 0 0 30 20 50 100 150 3
7 ACSML0652 Deep Learning Lab 0 0 2 25 25 50 1
8 ACSML0653 Advanced Database Management Systems
Lab
0 0 2 25 25 50 1
9 ACSE0653 Software Engineering Lab 0 0 2 25 25 50 1
10 ACSE0659 Mini Project 0 0 2 50 50 1
11
ANC0602 /
ANC0601
Essence of Indian Traditional Knowledge /
Constitution of India, Law and
Engineering (Non
Credit)
2 0 0 30 20 50 50 100
12 MOOCs (For B.Tech. Hons. Degree)
GRAND TOTAL 1100 23
Bachelor of Technology
Computer Science And Engineering (Artificial Intelligence & Machine Learning)
EVALUATION SCHEME
SEMESTER-VI

Course Contents / Syllabus
Module 1 Introduction 14 hours
Model Improvement and Performance: Curse of Dimensionality, Bias and Variance Trade off, Overfitting and underfitting, Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-Value,
Classification - Precision, Recall, F1, Other topics, K-Fold Cross validation, RoC curve, Hyper-Parameter Tuning Introduction – Grid search, random search, Introduction to Deep Learning. Artificial Neural
Network: Neuron, Nerve structure and synapse, Artificial Neuron and its model, activation functions, Neural network architecture: Single layer and Multilayer feed forward networks, recurrent networks.
Various learning techniques; Perception and Convergence rule, Hebb Learning. Perceptron, Multilayer perceptron, Gradient descent and the Delta rule, Multilayer networks, Derivation of Backpropagation
Algorithm
Module 2 CONVOLUTION NEURAL NETWORK 14 hours
What is computer vision? Why Convolutions (CNN)? Introduction to CNN, Train a simple convolutional neural net, Explore the design space for convolutional nets, Pooling layer motivation in CNN, Design a
convolutional layered application, Understanding and visualizing a CNN, Transfer learning and fine-tuning CNN, Image classification, Text classification, Image classification and hyper-parameter tuning,
Emerging NN architectures.
Module 3 DETECTION & RECOGNITION 14 hours
Padding & Edge Detection, Strided Convolutions, Networks in Networks and 1x1 Convolutions, Inception Network Motivation, Object Detection, YOLO Algorithm.
Module 4 RECURRENT NEURAL NETWORKS 15 hours
Why use sequence models? Recurrent Neural Network Model, Notation, Backpropagation through time (BTT), Different types of RNNs, Language model and sequence generation, Sampling novel sequences,
Vanishing gradients with RNNs, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Bidirectional RNN, Deep RNNs
Module 5 AUTO ENCODERS IN DEEP LEARNING 15 hours
Auto-encoders and unsupervised learning, Stacked auto-encoders and semi-supervised learning, Regularization - Dropout and Batch normalization.
Syllabus

Syllabus
UNIT-I: Model Improvement and Performance
Curse of Dimensionality, Bias and Variance Trade off, Overfitting and underfitting,
Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-Value, Classification -
Precision, Recall, F1, Other topics, K-Fold Cross validation, RoC curve, Hyper-Parameter
Tuning Introduction – Grid search, random search, Introduction to Deep Learning.
Artificial Neural Network: Neuron, Nerve structure and synapse, Artificial Neuron and its
model, activation functions, Neural network architecture: Single layer and Multilayer feed
forward networks, recurrent networks. Various learning techniques; Perception and
Convergence rule, Hebb Learning. Perceptron’s, Multilayer perceptron, Gradient descent and
the Delta rule, Multilayer networks, Derivation of Backpropagation Algorithm.

Syllabus
UNIT-II: CONVOLUTION NEURAL NETWORK
What is computer vision? Why Convolutions (CNN)?
Introduction to CNN, Train a simple convolutional neural net, Explore the design space for convolutional nets,
Pooling layer motivation in CNN, Design a convolutional layered application, Understanding and visualizing a
CNN, Transfer learning and fine-tuning CNN, Image classification, Text classification, Image classification
and hyper-parameter tuning, Emerging NN architectures

Syllabus
UNIT-III:DETECTION & RECOGNITION
Padding & Edge Detection, Strided Convolutions, Networks in Networks and
1x1Convolutions, Inception Network Motivation, Object Detection, YOLO Algorithm.

Syllabus
UNIT-IV: RECURRENT NEURAL NETWORKS
Why use sequence models? Recurrent Neural Network Model, Notation, Back-propagation
through time (BTT), Different types of RNNs, Language model and sequence generation,
Sampling novel sequences, Vanishing gradients with RNNs, Gated Recurrent Unit (GRU),
Long Short-Term Memory (LSTM), Bidirectional RNN, Deep RNNs

Syllabus
UNIT-V: AUTO ENCODERS IN DEEP LEARNING
Auto-encoders and unsupervised learning, Stacked auto-encoders and semi-supervised
learning,
Regularization - Dropout and Batch normalization.

Course Objective
To be able to learn unsupervised techniques and provide continuous
improvement in accuracy and outcomes of various datasets with more reliable
and concise analysis results.

Course Outcome (CO)
Course
Outcome
( CO)
At the end of course , the student will be able to:
Bloom’s
Knowledge
Level (KL)
CO1 Analyze ANN model and understand the ways of accuracy
measurement.
K4
CO2 Develop a convolutional neural network for multi-class
classification in images
K6
CO3 Apply Deep Learning algorithm to detect and recognize an
object.
K3
CO4 Apply RNNs to Time Series Forecasting, NLP, Text and
Image Classification
K4
CO5 Apply Lower-dimensional representation over higher-
dimensional data for dimensionality reduction and capture
the important features of an object.
K3

Program Outcomes (POs)
Engineering Graduates will be able to:
PO1 : Engineering Knowledge
PO2 : Problem Analysis
PO3 : Design/Development of solutions
PO4 : Conduct Investigations of complex problems
PO5 : Modern tool usage
PO6 : The engineer and society

Program Outcomes (POs)
Engineering Graduates will be able to:
PO7 : Environment and sustainability
PO8 : Ethics
PO9 : Individual and teamwork
PO10 : Communication
PO11 : Project management and finance
PO12 : Life-long learning

CO-PO Mapping
CO.K PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 3 3 3 3 2 2 1 - 1 - 2 2
CO2 3 3 3 3 2 2 1 - 1 1 2 2
CO3 3 3 3 3 3 2 2 - 2 1 2 3
CO4 3 3 3 3 3 2 2 1 2 1 2 3
CO5 3 3 3 3 3 2 2 1 2 1 2 2
AVG 3.0 3.0 3.0 3.0 2.6 2.0 1.6 0.4 1.6 0.8 2.0 2.4

Result Analysis 2022-2023 (Even semester )
Institute Result
FACULTY NAME BRANCH/SECTION RESULT

Pattern of Online External Exam Question Paper (100 marks)

Model Improvement and Performance:
• Curse of Dimensionality,
• Bias and Variance Trade off
• Overfitting and underfitting,
• Regression - MAE, MSE, RMSE,
• R Squared, Adjusted R Squared, p-Value,
• Classification - Precision, Recall, F1,
• Other topics, K-Fold Cross validation,
• RoC curve,
• Hyper-Parameter Tuning Introduction –
Grid search, random search,
• Introduction to Deep Learning.
Artificial Neural Network:
• Neuron, Nerve structure and synapse,
• Artificial Neuron and its model,
• activation functions,
• Neural network architecture: Single
layer and Multilayer feed forward
networks, recurrent networks.
• Various learning techniques; Perception
and Convergence rule, Hebb Learning.
Perceptron’s, Multilayer perceptron,
Gradient descent and the Delta rule,
• Multilayer networks,
• Derivation of Backpropagation
Algorithm.
Unit I Content

Analyze ANN model and understand the ways of accuracy measurement.
Unit I Objective

• Python, Basic Modeling Concepts
Topis Prerequisite

To be able to learn unsupervised techniques and provide continuous improvement in accuracy
and outcomes of various datasets with more reliable and concise analysis results.
Analyze ANN model and understand the ways of accuracy measurement.
Topic Objective

Model Improvement and Performance
Unit 1 Introduction
Curse of Dimensionality,
Bias and Variance Trade off,
Overfitting and underfitting,
Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-Value,
Classification - Precision, Recall, F1, Other topics, K-Fold Cross validation, RoC curve,
Hyper-Parameter Tuning Introduction – Grid search, random search,
Introduction to Deep Learning.

• Increasing the number of features will not always improve
classification accuracy.
• In practice, the inclusion of more features might actually lead
to worse performance.
• The number of training examples required increases
exponentially with dimensionality d (i.e., kd
).
32
bins
33
bins
31
bins
k=3
CURSE OF DIMENSIONALITY

CURSE OF DIMENSIONALITY
Problem Effect in High Dimensions
Data sparsity
• Data Sparsity means that in a given dataset, most
of the possible values or combinations of
features are empty or have very few data points.
• Hard to find dense regions or clusters;
neighborhood methods (k-NN) fail.
Overfitting
Too many features → model memorizes noise rather
than learning patterns.
Distance metrics degrade
Distances between points become similar, reducing
discrimination power.
Exponential growth of computation
More features mean heavier calculations and storage
requirements.
Increased sample requirement
Need exponentially more samples to maintain
statistical significance.

29
• What is the objective?
– Choose an optimum set of features of lower dimensionality to improve classification
accuracy.
• Different methods can be used to reduce dimensionality:
– Feature extraction
– Feature selection
Dimensionality Reduction (CO1)
Dr. Kumod Kumar Gupta Deep Learning Unit I

30
There are two components of dimensionality reduction:
•Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a
smaller subset which can be used to model the problem. It usually involves three ways:
• Filter
• Wrapper
• Embedded
•Feature extraction: This reduces the data in a high dimensional space to a lower dimension space, i.e. a
space with lesser no. of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
•Principal Component Analysis (PCA)
•Linear Discriminant Analysis (LDA)
•Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear and non-linear, depending upon the method used. The
prime linear method, called Principal Component Analysis, or PCA, is discussed below.

31
Type How it Works
Forward Selection
Start with no features → add one at a time → keep if
performance improves.
Backward Elimination Start with all features → remove one at a time → drop if
performance improves or stays the same.
Recursive Feature Elimination (RFE)
Train model → remove least important feature(s) →
repeat until desired number remains.
Types of Wrapper Methods

32
Advantages of Dimensionality Reduction
•It helps in data compression, and hence reduced storage space.
•It reduces computation time.
•It also helps remove redundant features, if any.
•Improved Visualization: High dimensional data is difficult to visualize, and dimensionality reduction
techniques can help in visualizing the data in 2D or 3D, which can help in better understanding and analysis.
•Overfitting Prevention: High dimensional data may lead to overfitting in machine learning models, which can
lead to poor generalization performance. Dimensionality reduction can help in reducing the complexity of the
data, and hence prevent overfitting.
•Feature Extraction: Dimensionality reduction can help in extracting important features from high dimensional
data, which can be useful in feature selection for machine learning models.
•Data Preprocessing: Dimensionality reduction can be used as a preprocessing step before applying machine
learning algorithms to reduce the dimensionality of the data and hence improve the performance of the model.
•Improved Performance: Dimensionality reduction can help in improving the performance of machine learning
models by reducing the complexity of the data, and hence reducing the noise and irrelevant information in the
data.

33
Disadvantages of Dimensionality Reduction
•It may lead to some amount of data loss.
•PCA tends to find linear correlations between variables, which is sometimes undesirable.
•PCA fails in cases where mean and covariance are not enough to define datasets.
•We may not know how many principal components to keep- in practice, some thumb rules are applied.
•Interpretability: The reduced dimensions may not be easily interpretable, and it may be difficult to
understand the relationship between the original features and the reduced dimensions.
•Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially when the number
of components is chosen based on the training data.
•Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to outliers, which can result
in a biased representation of the data.
•Computational complexity: Some dimensionality reduction techniques, such as manifold learning, can be
computationally intensive, especially when dealing with large datasets.

Bias-Variance Tradeoff (CO1)
• It is important to understand prediction errors (bias and variance) when it comes to accuracy in any
machine-learning algorithm.
• There is a tradeoff between a model’s ability to minimize bias and variance which is referred to as the best
solution for selecting a value of Regularization constant.
• A proper understanding of these errors would help to avoid the overfitting and underfitting of a data set
while training the algorithm.

Bias(CO1)
What is Bias?
• The bias is known as the difference between the prediction of the values by the Machine Learning model
and the correct value.
• Being high in biasing gives a large error in training as well as testing data.
• It recommended that an algorithm should always be low-biased to avoid the problem of underfitting.
• By high bias, the data predicted is in a straight line format, thus not fitting accurately in the data in the data
set. Such fitting is known as the Underfitting of Data. This happens when the hypothesis is too simple or
linear in nature.
High Bias in the Model

Variance(CO1)
What is Variance?
• The variability of model prediction for a given data point which tells us the spread of our data is called the
variance of the model.
• The model with high variance has a very complex fit to the training data and thus is not able to fit accurately
on the data which it hasn’t seen before. As a result, such models perform very well on training data but
have high error rates on test data.
• When a model is high on variance, it is then said to as Overfitting of Data.
• Overfitting is fitting the training set accurately via complex curve and high order hypothesis but is not the
solution as the error with unseen data is high. While training a data model variance should be kept low. The
high variance data looks as follows.
High Variance in the Model

Variance(CO1)
Bias and Variance Trade-Off

Bias- Variance trade off (CO1)
Bias- Variance Trade-off
Bias and variance should be low

• In supervised learning, underfitting happens when a model unable to capture the underlying pattern of the data.
These models usually have high bias and low variance. It happens when we have very less amount of data to
build an accurate model or when we try to build a linear model with a nonlinear data. Also, these kind of models
are very simple to capture the complex patterns in data like Linear and logistic regression.
Underfitting(CO1)
Reasons for Underfitting
1.High bias and low variance.
2.The size of the training dataset used is not enough.
3.The model is too simple.
4.Training data is not cleaned and also contains noise in it.
Techniques to Reduce Underfitting
5.Increase model complexity.
6.Increase the number of features, performing feature engineering.
7.Remove noise from the data.
8.Increase the number of epochs or increase the duration of training to get better results.

• In supervised learning, Overfitting happens when our model captures the noise along with the underlying pattern
in data. It happens when we train our model a lot over noisy dataset. These models have low bias and high
variance. These models are very complex like Decision trees which are prone to overfitting.
Overfitting(CO1)
Overfitting is a problem where the evaluation of machine learning algorithms on training data is different from
unseen data.
Reasons for Overfitting:
1. High variance and low bias.
2.The model is too complex.
3.The size of the training data.
Techniques to Reduce Overfitting
4.Increase training data.
5.Reduce model complexity.
6.Early stopping during the training phase (have an eye over the loss over the training period as soon as loss
begins to increase stop training).
7.Ridge Regularization and Lasso Regularization.
8.Use dropout for neural networks to tackle overfitting.
9.Cross- Validation (K- Fold Cross Validation)
10.Batch normalization

Overfitting(CO1)
Underfitting and Overfitting in Machine Learning

Overfitting(CO1)
Regularization
• The word “regularize” means to make things regular or acceptable.
• This is exactly why we use it for. Regularization is a form of regression used to reduce the error by
fitting a function appropriately on the given training set and avoid overfitting.
• It discourages the fitting of a complex model, thus reducing the variance and chances of overfitting. It
is used in the case of multicollinearity (when independent variables are highly correlated).
the equation of Linear Regression. Let be the prediction made.
We also introduced the concept of loss functions. We will use one such loss function in this post -
Residual Sum of Squares (RSS). It can be mathematically given as:

Solution of Overfitting(CO1)
Regularization can be of two kinds,
1. Ridge / L2 Regularization
2. Lasso Regression/L1 Regularization
Ridge Regression / L2 Regularization
In this regression, we add a penalty term to the RSS loss function. Our modified loss function now
becomes:
• Here, λ is called the “tuning parameter” which decides how heavily we want to penalize the
flexibility of our model.
• If we look closely, we might observe that if λ=0, it performs like linear regression
• as λ→inf, the impact of the shrinkage penalty grows, and the ridge regression coe cient estimates
ﬃ
will approach zero.
• As can be seen, selecting a good value of λ is critical. The coefficient estimates produced by this
method are sometimes also known as the “L2 norm”.

Lasso Regression / L1 Regularization
This regression adopts the same idea
as Ridge Regression with a change in
the penalty term. Instead of , we use
Thus our new loss function becomes:
this is sometimes called the “L1 norm”.

• Note:
• The tuning parameter λ controls the impact on bias and variance.
• As the value of λ rises, it reduces the value of coefficients and thus reducing the variance.
• Till a point, this increase in λ is beneficial as it is only reducing the variance (hence avoiding overfitting),
without losing any important properties in the data.
• But after a certain value, the model starts losing important properties, giving rise to bias in the model and
thus underfitting. Therefore, the value of λ should be carefully selected.
• λ is optimized using cross-validation(K –Fold Cross Validation)

Regularization:
• The regularization model promotes smoother functions by creating a new criterion function
that relies not only on the training error, but also on algorithmic intricacy.
• Particularly, the new criterion function punishes extremely complex hypotheses; looking for
the minimum in this criterion is to balance error on the training set with complexity.
• Formally, it is possible to write the new criterion as a sum of the error on the training set plus
a regularization term, which depicts constraints or sought after properties of solutions
• The second term penalizes complex hypotheses with large variance.
• When we minimize augmented error function instead of the error on data only, we penalize
complex hypotheses and thus decrease variance.
• When λ is taken too large, only very simple functions are allowed and we risk introducing bias. λ
is optimized using cross-validation

• We consider here the example of neural network hypotheses class .
• The hypothesis complexity may be expressed as,
• The regularizer encourages smaller weights .
• For small values of weights, the network mapping is approximately linear.
• Relatively large values of weights lead to overfitted mapping with regions of large curvature

Early
Stopping:
• The training of a learning machine corresponds to iterative decrease in the error function defined as
per the training data.
• During a specific training session, this error generally reduces as a function of the number of iterations
in the algorithm.
• Stopping the training before attaining a minimum training error, represents a technique of restricting
the effective hypothesis complexity.
Pruning:
• An alternative solution that sometimes is more successful than early stopping the growth (complexity)
of the hypothesis is pruning the full-grown hypothesis that is likely to be overfitting the training data.
• Pruning is the basis of search in many decision-tree algorithms; weakest branches of large tree
overfitting the training data, which hardly reduce the error rate, are removed.

• Regression analysis is a set of statistical methods used for the estimation of relationships between a
dependent variable and one or more independent variables.
• Regression analysis includes several variations, such as linear, multiple linear, and nonlinear.
• The most common models are simple linear and multiple linear.
• Nonlinear regression analysis is commonly used for more complicated data sets in which the
dependent and independent variables show a nonlinear relationship.
UNIT-1 Regression

• Regression Analysis
– Simple Linear Regression: A model that assesses the relationship
between a dependent variable and an independent variable
Y = mx + c + e
– Where:
• Y – Dependent variable
• x – Independent (explanatory) variable
• c – Intercept
• m – Slope
• e – Residual (error)
UNIT-1 Regression

• Multiple linear regression analysis is essentially similar to the
simple linear model, with the exception that multiple
independent variables are used in the model.
• The mathematical representation of multiple linear regression
is:
Y = a + bX1 + cX2 + dX3 + ϵ
• Where:
• Y – Dependent variable
• X1, X2, X3 – Independent (explanatory) variables
• a – Intercept
• b, c, d – Slopes
• ϵ – Residual (error)
UNIT-1 Regression

• Loss Function
– Loss function is a way to know the performance of a model.
– High Loss function leads to bad train model and low loss function leads to good train model.
– loss function should be as minimum as possible.
– Loss function calculated over a single training data.
L = (Actual_Value - Predicted_Value)2
– Loss function Sometime also known as error function.
• Cost Function
– Cost function calculated for complete batch of data
C = 2
UNIT-1 Regression

– Example for Loss and Cost Function
UNIT-1 Regression
Roll No. CGPA IQ
Actual_Value Predicted_Value
Loss Function Cost Function
Package Predicted
1 5.2 100 6.3 6.4 0.01
3.475
2 4.3 91 4.5 5.3 0.64
3 8.2 83 6.5 5.2 1.69
4 8.9 102 5.5 8.9 11.56
NOTE: Loss function calculated for Individual Data while Cost
Function calculate for Entire Dataset

• MAE (Mean Absolute Error): MAE is a metric that measures
the average absolute difference between the predicted values
and the actual values. It gives an idea of how far off the
predictions are from the true values, regardless of the direction
of the error.
L = |Actual_Value - Predicted_Value|
C =
UNIT-1 Regression

• Advantages
– Easy to Understand
– Same unit as unit of Actual_Value
– It is Robust to Outlier: It means outlier will not affect error, so if there
is no outliers in dataset then it better to use MAE instead of MSE
• Disadvantages
– Grap is not differenciable due which Gradient Descent(GD) algorithm
not easy to implement.
– To implement GD we need to calculate Sub-Gradient.
UNIT-1 Regression

UNIT-1 Regression
Actual values (y): [3, 5, 2, 7,]
Predicted values (ŷ): [2.5, 5.5, 2, 8]

• MSE (Mean Squared Error): MSE is a metric that calculates the average squared difference
between the predicted values and the actual values.
• Squaring the errors gives more weight to larger errors, making it useful for penalizing significant
deviations from the true values.
L = (Actual_Value - Predicted_Value)2
C = 2
UNIT-1 Regression

• Advantages
– Easy to interpret
– Loss function is differenciable that allows to implement GD easily
– One Local Minima: It means function has one minimum value that we have to find.
• Disadvantage
– Unit of error is Square: That creates an confusion to understand it, so to extract accurate error we
have to find square root of MSE.
– It is not Robust to Outlier: If dataset consists outliers then. MSE is not useful
UNIT-1 Regression

• Huber loss
• Huber Loss is applicable when Outlier data is around 25% because 25% is
a significant amount of data and if we use MSE then it will ignore the 75%
data which is correct, because graph will deviate towards Outliers and if
we use MAE, it will ignore 25% outlier data that is also significant. In this
type of situation Huber Loss is useful.
UNIT-1 Regression

• RMSE
• It quantifies the differences between predicted values and actual values, squaring the errors, taking the
mean, and then finding the square root.
• RMSE provides a clear understanding of the model’s performance, with lower values indicating better
predictive accuracy.
• RMSE is computed by taking the square root of MSE
• RMSE value with zero indicates that the model has a perfect fit
UNIT-1 Regression

• RMSE
• The lower the RMSE, the better the model and its predictions.
• A higher RMSE indicates that there is a large deviation from the
residual to the ground truth.
UNIT-1 Regression

• Pros of the RMSE Evaluation Metric:
– RMSE is easy to understand.
– It serves as a heuristic for training models.
– It is computationally simple and easily differentiable which many
optimization algorithms desire.
– RMSE does not penalize the errors as much as MSE does due to the
square root.
• Cons of the RMSE metric:
– Like MSE, RMSE is dependent on the scale of the data. It increases in
magnitude if the scale of the error increases.
– One major drawback of RMSE is its sensitivity to outliers and the
outliers have to be removed for it to function properly.
UNIT-1 Regression

UNIT-1 Regression(USE of MAE, MSE, and RMSE)
•MAE Example use: Predicting delivery time, demand forecasting,
house prices (when big and small errors should be treated equally).
MSE Example use: Medical predictions, credit risk, fault detection
(where a large error is much worse than small ones).
RMSE Example use: Weather forecasting, energy load prediction, traffic
prediction (applications where occasional big errors are unacceptable).
Quick Rules:
•MAE: Robust, easy to explain → Good for reporting general accuracy.
•MSE: Sensitive to large errors → Good for training.
•RMSE: Sensitive + interpretable → Good for evaluation.

• R Squared
• R-squared (Coefficient of Determination) is a statistical measure that
quantifies the proportion of the variance in the dependent variable that is
explained by the independent variables in a regression model.
• Where:
– SSR (Sum of Squares Residual) represents the sum of squared differences between
the observed values and the predicted values by the model.
– SST (Total Sum of Squares) represents the sum of squared differences between the
observed values and the mean of the dependent variable.
UNIT-1 Regression

• R-squared ranges between 0 and 1, with the following
interpretations:
– =0: The model does not explain any of the variability in the dependent
variable. It's a poor fit.
– : The model explains a proportion of the variability. A higher R-squared
indicates a better fit, with 1 indicating a perfect fit where the model
explains all the variability.
– =1: The model perfectly predicts the dependent variable based on the
independent variables.
UNIT-1 Regression

• R-squared evaluates regression model fit but has limitations:
• High R-squared doesn't always mean good fit; high value may imply overfitting, lacking
generalization.
• Including more predictors can inflate R-squared, even if they're weak; adjusted R-squared adjusts for
this.
• "Good" R-squared varies by field; lower values acceptable in data-rich areas.
• R-squared may miss fit quality with nonlinearity or outliers.
UNIT-1 Regression

• Adjusted R Squared
• Where −
– n = the number of points in your data sample.
– k = the number of independent regressors, i.e. the number of variables
in your model, excluding the constant.
UNIT-1 Regression

– Adjusted R-squared adjusts the statistic based on the number of independent variables in the
model
– Adjusted R2
also indicates how well terms fit a curve or line, but adjusts for the number of terms
in a model.
– If you add more and more useless variables to a model, adjusted r-squared will decrease.
– If you add more useful variables, adjusted r-squared will increase.
– Adjusted R2
will always be less than or equal to R2
UNIT-1 Regression

– Problem Statement −
• A fund has a sample R-squared value close to 0.5 and it is doubtlessly offering higher risk
adjusted returns with the sample size of 50 for 5 predictors. Find Adjusted R square value.
– Sample size = 50 Number of predictor = 5 Sample R - square = 0.5.Substitute the qualities in the
equation,
UNIT-1 Regression

• RMSE (Root Mean Squared Error): RMSE is the square root of the MSE and is
commonly used to express the average magnitude of the prediction errors in the same
units as the dependent variable. It provides a measure of the model's accuracy, and
lower values indicate better performance.
• R Squared (Coefficient of Determination): R-squared is a statistical measure that
represents the proportion of the variance in the dependent variable that is explained by
the independent variables in the regression model. It ranges from 0 to 1, where 1
indicates that the model explains all the variance, and 0 indicates that the model
doesn't explain any of the variance.
UNIT-1 Regression

• Adjusted R Squared: Adjusted R-squared is a modified version
of R-squared that takes into account the number of
independent variables in the model. It penalizes the addition of
irrelevant variables that might artificially inflate the R-squared
value.
• p-Value: The p-value is a measure of the evidence against a null
hypothesis in a statistical hypothesis test. In the context of
regression analysis, p-values are used to determine whether
the coefficients of the independent variables are statistically
significant. A low p-value (typically below a significance level
like 0.05) suggests that the variable has a significant impact on
UNIT-1 Regression

• A Fraud Detection Classifier
• Objective: To detect fraud claim
• Assumption:
– The output of your fraud detection model is the probability [0.0–1.0] that a transaction is
fraudulent.
– If this probability is below 0.5, you classify the transaction as non-fraudulent; otherwise, you
classify the transaction as fraudulent.
• Methodology
– Collect 10,000 manually classified transactions, with 300 fraudulent transaction and 9,700 non-
fraudulent transactions.
– You run your classifier on every transaction, predict the class label (fraudulent or non-fraudulent)
and summarise the results in the following confusion matrix:
UNIT-1 Classification

What is the Confusion Matrix? A confusion matrix is a nn matrix that is used for evaluating
the performance of the classification model. For Binary classification — The confusion Matrix is
a 22 matrix. If the target class is 3 means Confusion Matrix is 3*3 matrix and so on.

Terminologies used in Confusion Matrix
•True Positive → Positive class which is predicted as positive.
•*True Negative *→ Negative class which is predicted as negative.
•False Positive → Negative class which is predicted as positive.[Type I Error]
•False Negative →Positive class which is predicted as negative.[Type II Error]
1. Recall Recall is a measure of how many positives your model is able to recall
from the data.
Out of all positive records, how many records are predicted correctly.
Recall is also known as Sensitivity or TPR (True Positive Rate)

2. Precision Precision is the ratio of correct positive predictions to
the total positive predictions.
Out of all positives been predicted, how many are actually positive.

Example Cancer Prediction-For this dataset, if the model predicts cancer
records as non-cancer means it’s risky. All our cancer records should be
predicted correctly.
In this example, recall metrics is more important than precision. The recall
rate should be 100%. All positive records( cancer records) should be predicted
correctly. False Negative should be 0.
For this cancer dataset, recall metrics is given more importance while
evaluating the performance of the model.
If non-cancer records are predicted as cancer means it’s not that risky.

Example. The cancer data set has 100 records, out of which 94 are cancer
records and 6 are non-cancer records. But the model is predicting 90 out of 94
cancer records correctly. Four cancer records are not predicted correctly [ 4 —
FN]

Precision — Example
Email Spam Filtering- For this dataset, if the model predicts good email as spam means it's
risky. We don’t want any of our good emails to be predicted as Spam. So, the precision
metric is given more importance while evaluating this model. False Positive should be 0.
If the spam filtering dataset has 100 records, out of which 94 are predicted as spam
emails. Only 90 out of 94 records is predicted correctly. 4 good emails are classified as
spam. It’s risky. The precision rate is 95%. It should be 100%. No good emails should be
classified as “Spam”. False-positive should be 0 for this model.

F1 Score F1 score is a harmonic mean of precision and recall. F1 score metric is used
when you seek a balance between precision and recall.
F1 score vs Accuracy
Accuracy deals with True positive and True Negative. It doesn't mention about
False-positive and False-negative. So we are not aware of the distribution of False-
positive and False-negative. If accuracy is 95% means, we don't know how the
remaining 5% is distributed between False-positive and False-negative.
F1 Score deals with False-positive and False-negative. For some models, we want to
know about the distribution of False-negative and False positive. For those models,
the F1 Score metric is used for evaluating the performance.

Accuracy: Correctly predicted values out of total given data.

• Area Under Curve
• Area Under Curve(AUC) is one of the most widely used metrics for evaluation.
• It is used for binary classification problem.
• AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen
positive example higher than a randomly chosen negative example.
• Two basic terms used in AUC:
– True Positive Rate (Sensitivity)
– True Negative Rate (Specificity)
UNIT-1 Classification (AUC)

• Few basic terms used in AUC:
– True Positive Rate (Sensitivity) : True Positive Rate is defined as TP/ (FN+TP). True Positive
Rate corresponds to the proportion of positive data points that are correctly considered as positive,
with respect to all positive data points.
– True Negative Rate (Specificity) : True Negative Rate is defined as TN / (FP+TN). False
Positive Rate corresponds to the proportion of negative data points that are correctly considered as
negative, with respect to all negative data points.
UNIT-1 Classification(AUC)

– False Positive Rate : False Positive Rate is defined as FP / (FP+TN). False Positive Rate
corresponds to the proportion of negative data points that are mistakenly considered as
positive, with respect to all negative data points.
• False Positive Rate and True Positive Rate both have values in the range [0, 1].
• FPR and TPR both are computed at varying threshold values such as (0.00, 0.02, 0.04, …., 1.00)
and a graph is drawn.
• AUC is the area under the curve of plot False Positive Rate vs True Positive Rate at different
points in [0, 1].

• As evident, AUC has a range of [0, 1]. The greater the value, the better is the performance of our
model.

An ROC curve (receiver operating characteristic curve) is a graph showing the performance
of a classification model at all classification thresholds. This curve plots two parameters:
•True Positive Rate
•False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR=TP/(TP+FN)
False Positive Rate (FPR) is defined as follows:
FPR=FP/(FP+TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the
classification threshold classifies more items as positive, thus increasing both False Positives and
True Positives. The following figure shows a typical ROC curve.
ROC curve (CO1)

ROC curve (CO1)
AUC-ROC curve
Let’s first understand the meaning of the two
terms ROC and AUC.
•ROC: Receiver Operating Characteristics
•AUC: Area Under Curve
ROC Curve
ROC stands for Receiver Operating Characteristics, and the ROC curve is the graphical representation of the
effectiveness of the binary classification model. It plots the true positive rate (TPR) vs the false positive rate (FPR)
at different classification thresholds.
AUC Curve:
AUC stands for Area Under the Curve, and the AUC curve represents the area under the ROC curve. It measures
the overall performance of the binary classification model. As both TPR and FPR range between 0 to 1, So, the area
will always lie between 0 and 1, and A greater value of AUC denotes better model performance. Our main goal is to
maximize this area in order to have the highest TPR and lowest FPR at the given threshold. The AUC measures the
probability that the model will assign a randomly chosen positive instance a higher predicted probability compared
to a randomly chosen negative instance.

88
ROC curve (CO1)
TPR and FPR
This is the most common definition that you would have encountered when you would Google AUC-ROC.
Basically, the ROC curve is a graph that shows the performance of a classification model at all possible thresholds(
threshold is a particular value beyond which you say a point belongs to a particular class). The curve is plotted
between two parameters
•TPR – True Positive Rate
•FPR – False Positive Rate

89
ROC curve (CO1)
• Specificity measures the proportion of actual negative instances that are correctly
identified by the model as negative.
• It represents the ability of the model to correctly identify negative instances And as said
earlier ROC is nothing but the plot between TPR and FPR across all possible thresholds
and AUC is the entire area beneath this ROC curve.
Sensitivity versus False Positive Rate plot

90
ROC curve (CO1)

91
ROC curve (CO1)
By changing cutoff point false positive increases
By changing cutoff point false negative increases
ROC curve can be used to determine cutoff point, which optimize the sensitivity,
specificity of a given test.

92
ROC curve (CO1)
AUC measures how well a model is able to distinguish between classes.
An AUC of 0.75 would actually mean that let’s say we take two data points belonging to separate classes then there is a 75% chance the model would
be able to segregate them or rank order them correctly i.e positive point has a higher prediction probability than the negative class. (assuming a
higher prediction probability means the point would ideally belong to the positive class). Here is a small example to make things more clear.

93
ROC curve (CO1)

94
ROC curve (CO1)

95
ROC curve (CO1)

• The concept of p-value comes from statistics and widely used in machine learning and data
science.
• p-value is also used as an alternative to determine the point of rejection in order to provide the
smallest significance level at which the null hypothesis is least or rejected.
• it is expressed as the level of significance that lies between 0 and 1, and if there is smaller p-value,
then there would be strong evidence to reject the null hypothesis. if the value of p-value is very
small, then it means the observed output is feasible but doesn't lie under the null hypothesis
conditions (h0).
• the p-value of 0.05 is known as the level of significance (α). usually, it is considered using two
suggestions, which are given below:
– if p-value>0.05: the large p-value shows that the null hypothesis needs to be accepted.
– if p-value<0.05: the small p-value shows that the null hypothesis needs to be rejected, and the
result is declared as statically significant.
P-value (CO1)

• Cross-validation is a statistical method used to estimate the skill of machine learning models.
• It is commonly used in applied machine learning to compare and select a model for a given
predictive modeling problem because it is easy to understand, easy to implement, and results in
skill estimates that generally have a lower bias than other methods.
• Cross-validation is a resampling procedure used to evaluate machine learning models on a
limited data sample.
• The procedure has a single parameter called k that refers to the number of groups that a given
data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
• When a specific value for k is chosen, it may be used in place of k in the reference to the
model, such as k=10 becoming 10-fold cross-validation.
k-Fold Cross-Validation

• We can’t check the ability of this person because 70 math questions are from
algebra but in test 30 questions,10 from calculus so we can’t judge the ability of
person.
• That’s why we are going for K-Fold Cross- Validation, to get good results

Here K=5, total data is divided into 5 fold. First time we are using first fold for test and rest
80% for training purpose, repeat this process 5 times, after that we take average of 5
results.
https://www.youtube.com/watch?v=gJo0uNL-5Qw

• Hyperparameters in Machine learning/Deep learning are those parameters that are explicitly defined by the
user to control the learning process.
• These hyperparameters are used to improve the learning of the model, and their values are set before
starting the learning process of the model.
• They are usually fixed before the actual training process begins.
• These parameters express important properties of the model such as its complexity or how fast it should
learn.
• Some examples of model hyper parameters include:
• The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
• The learning rate for training a neural network.
• The C and sigma hyperparameters for support vector machines.
• The k in k-nearest neighbors.
Hyper parameter tuning(CO1)
https://www.geeksforgeeks.org/hyperparameter-tuning/

Hyper parameter tuning(CO1)
Models can have many hyperparameters and finding the best combination of parameters can be
treated as a search problem. The two best strategies for Hyperparameter tuning are:
•GridSearchCV: Grid search Cross Validation
•RandomizedSearchCV: Randonized search Cross-Validation
In general, if the number of combinations is limited enough, we can use the Grid
Search technique. But when the number of combinations increases, we should
try Random Search or Bayes Search as they are not computationally expensive.

Grid Search technique (CO1)
GridSearchCV is a brute-force technique for hyperparameter tuning. It trains
the model using all possible combinations of specified hyperparameter values
to find the best-performing setup. It is slow and uses a lot of computer power
which makes it hard to use with big datasets or many settings.
It works using below steps:
•Create a grid of potential values for each hyperparameter.
•Train the model for every combination in the grid.
•Evaluate each model using cross-validation.
•Select the combination that gives the highest score.

GridSearchCV
For example,
if we want to set two hyperparameters C and Alpha of the Logistic Regression Classifier model, with
different sets of values. The grid search technique will construct many versions of the model with all
possible combinations of hyperparameters and will return the best one.
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4]. For a combination
of C=0.3 and Alpha=0.2, the performance score comes out to be 0.726(Highest), therefore it is selected.

Grid Search technique Code (CO1)
# Necessary imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
# Instantiating logistic regression classifier
logreg = LogisticRegression()
# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv = 5)
logreg_cv.fit(X, y)
# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
https://www.geeksforgeeks.org/hyperparameter-tuning/

Drawback: GridSearchCV will go through all the intermediate combinations of hyperparameters
which makes grid search computationally very expensive.
• RandomizedSearchCV
RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes through only a fixed number
of hyperparameter settings.
• It moves within the grid in a random fashion to find the best set of hyperparameters. This approach
reduces unnecessary computation.

Random Search Code (CO1)
# Necessary imports
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
# Creating the hyperparameter grid
param_dist = {"max_depth": [3, None],
"max_features": randint(1, 9),
"min_samples_leaf": randint(1, 9),
"criterion": ["gini", "entropy"]}
# Instantiating Decision Tree classifier
tree = DecisionTreeClassifier()
# Instantiating RandomizedSearchCV object
tree_cv = RandomizedSearchCV(tree, param_dist, cv = 5)
tree_cv.fit(X, y)
# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Random Search Code (CO1)
Tuned Decision Tree Parameters: {‘min_samples_leaf’: 5, ‘max_depth’: 3, ‘max_features’: 5,
‘criterion’: ‘gini’} Best score is 0.7265625
OutPut

• Deep learning is a class of machine learning algorithms that use several layers of nonlinear processing
units for feature extraction and transformation. Each successive layer uses the output from the
previous layer as input.
• Deep neural networks, deep belief networks and recurrent neural networks have been applied to fields
such as computer vision, speech recognition, natural language processing, audio recognition, social
network filtering, machine translation, and bioinformatics where they produced results comparable to
and in some cases better than human experts have.
• Deep Learning Algorithms and Networks −
are based on the unsupervised learning of multiple levels of features or representations of the data.
Higher-level features are derived from lower level features to form a hierarchical representation.
use some form of gradient descent for training.
Introduction to Deep Learning(CO1)

109
Here are just a few examples of deep learning at work:
• A self-driving vehicle slows down as it approaches a
pedestrian crosswalk.
• An ATM rejects a counterfeit bank note.
• A smartphone app gives an instant translation of a
foreign street sign.
• Deep learning is especially well-suited to identification
applications such as face recognition, text translation,
voice recognition, and advanced driver assistance
systems, including, lane classification and traffic sign
recognition.
Deep Learning Applications (CO1)

110
Some other Applications (CO1)
Used for speed of machine Digital imaging
Fraud Detection Increasing phone efficiency

111
In a word, accuracy. Advanced tools and techniques have dramatically improved deep learning algorithms
—to the point where they can outperform humans at classifying images, win against the world’s best GO
player, or enable a voice-controlled assistant like Amazon Echo® and Google Home to find and download
that new song you like.
What Makes Deep Learning State-of-the-Art? (CO1)

112
Three technology enablers make this degree of accuracy possible:
Easy access to massive sets of labeled data Data sets such as
ImageNet and PASCAL VoC are freely available, and are useful for
training on many different types of objects.
What Makes Deep Learning State-of-the-Art? (CO1)
Increased computing power High-performance GPUs accelerate
the training of the massive amounts of data needed for deep
learning, reducing training time from weeks to hours.
Pretrained models built by experts Models such as AlexNet can be retrained to perform new recognition
tasks using a technique called transfer learning. While AlexNet was trained on 1.3 million high-resolution
images to recognize 1000 different objects, accurate transfer learning can be achieved with much smaller
datasets.

AI ML DL
DL
ML
AI
Difference between AI, ML, DL (CO1)

1. Huge amount of data
(Initially we started with ML, its major drawback is, its efficiency is degraded with higher data
or data sets)
(x-axis: number of data, Y-axis: efficiency )
And the solution given by deep learning, that can handled huge amount of data, which may
be structured or unstructured.
2. Complex problem
These are basically included the real time data analysis, medical diagnosis system etc., which
are handled by deep learning
Why it needed deep learning (CO1)

• The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain.
• An Artificial neural network is usually a computational network based on biological neural networks
that construct the structure of the human brain.
• Similar to a human brain has neurons interconnected to each other, artificial neural networks also
have neurons that are linked to each other in various layers of the networks. These neurons are known
as nodes.
Artificial Neural Network(CO1)

 The Brain is A massively parallel information processing system.
 Our brains are a huge network of processing elements. A typical brain contains a network of 10 billion
neurons.
How do our brains work?(CO1)

09/01/2025 Dr. kumod Kumar Gupta Programming for Data analyti
cs Unit 4
117
Neural Network
To begin understanding deep Learning, We will build up our model
abstractions
• Single Biological Neuron
• Perceptron
• Multi-Layer Perceptron Model
• Deep Learning Neural Network

 A processing element
Dendrites: Input
Cell body: Processor
Synaptic: Link
Axon: Output
Synapse: weight

An artificial neuron is an imitation of a human neuron

• Dendrites from Biological Neural Network represent inputs in
Artificial Neural Networks, cell nucleus represents Nodes,
synapse represents Weights, and Axon represents Output.
• Relationship between Biological neural network and artificial
neural network:
Biological Neural Network Artificial Neural Network
Dendrites Inputs
Cell nucleus Nodes
Synapse Weights
Axon Output
Biological neural network and artificial neural network(CO1)

Artificial neural network(CO1)

cs Unit 4
124
Neural Network

cs Unit 4
125
Neural Network

cs Unit 4
126
Neural Network

cs Unit 4
127
Neural Network

cs Unit 4
128
Neural Network

cs Unit 4
129
Neural Network

cs Unit 4
130
Neural Network

cs Unit 4
131
Neural Network

cs Unit 4
132
Neural Network

cs Unit 4
133
Neural Network

cs Unit 4
134
Neural Network
Activation function is using to introduce Non-Linearity

Our basic computational element (model neuron) is often called a node or unit. It receives input from some other
units, or perhaps from an external source. Each input has an associated weight w, which can be modified so as to
model synaptic learning. The unit computes some function f of the weighted sum of its inputs:
The typical Artificial Neural Network looks something like the given figure(CO1)

ARCHITECTURE OF AN ARTIFICIAL
NEURAL NETWORK
Kumod kumar Gupta Machine Learning Unit 2
Artificial Neural Network primarily consists of three layers:
Input Layer:
As the name suggests, it accepts inputs in several different formats
provided by the programmer.

Hidden Layer:
• The hidden layer presents in-between input and output layers. It
performs all the calculations to find hidden features and patterns.
Output Layer:
• The input goes through a series of transformations using the hidden
layer, which finally results in output that is conveyed using this
layer.
• The artificial neural network takes input and computes the weighted
sum of the inputs and includes a bias. This computation is
represented in the form of a transfer function.
NEURAL NETWORK

NEURAL NETWORK

• Bipolar binary and unipolar binary are called as hard limiting activation functions used in discrete
neuron model
• Unipolar continuous and bipolar continuous are called soft limiting activation functions are called
sigmoidal characteristics.
Activation function (CO1)

Feedforward Network
• It is a non-recurrent network having processing units/nodes in layers and all the nodes in a
layer are connected with the nodes of the previous layers.
• The connection has different weights upon them.
• There is no feedback loop means the signal can only flow in one direction, from input to
output. It may be divided into the following two types −
Neural network architecture

Neural network architecture Cont…(CO1)
•Single layer feedforward network − The concept is of feedforward ANN having only one
weighted layer. In other words, we can say the input layer is fully connected to the output layer.

Single layer Feedforward Network

• Multilayer feedforward network − The concept is of feedforward ANN having more than
one weighted layer. As this network has one or more layers between the input and the output
layer, it is called hidden layers.

Can be used to solve complicated problems
Multilayer feed forward network(CO1)

• Feedback Network :As the name suggests, a feedback network has feedback
paths, which means the signal can flow in both directions using loops. This
makes it a non-linear dynamic system, which changes continuously until it
reaches a state of equilibrium.
• Recurrent networks − They are feedback networks with closed loops. It is a
closed loop network in which the output will go to the input again as feedback
as shown in the following diagram.

When outputs are directed back as inputs to same
or preceding layer nodes it results in the formation
of feedback networks
Feedback network(CO1)

• Single node with own feedback
• Competitive nets
• Single-layer recurrent network
• Multilayer recurrent networks
Feedback networks with closed loop are called Recurrent Networks. The response at the k+1’th instant depends on
the entire history of the network starting at k=0.
Automaton: A system with discrete time inputs and a discrete data representation is called an automaton
Recurrent network(CO1)

FEED FORWARD UNSUPERVISED LEARNING
Hebbian Learning Rule(CO1)

• The learning signal is equal to the neuron’s output

• Feedforward unsupervised learning
• “When an axon of a cell A is near enough to exicite a cell B and repeatedly and persistently takes
place in firing it, some growth process or change takes place in one or both cells increasing the
efficiency”
• If oixj is positive the results is increase in weight else vice versa
Features of Hebbian Learning(CO1)

Final answer:

• For the same inputs for bipolar continuous activation function
the final updated weight is given by

• Learning signal is the difference between the desired and actual neuron’s
response
• Learning is supervised
Perceptron Learning Rule(CO1)

Perceptron Learning Rule(CO1)

• Only valid for continuous activation function
• Used in supervised training mode
• Learning signal for this rule is called delta
• The aim of the delta rule is to minimize the error over all training patterns
Delta Learning Rule(CO1)

Learning rule is derived from the condition of least squared error.
Calculating the gradient vector with respect to wi
Minimization of error requires the weight changes to be in the negative gradient direction
Delta Learning Rule Contd.(CO1)

A Multi-Layer Perceptron (MLP) neural network trained using the Backpropagation learning algorithm is
one of the most powerful forms of supervised neural network system.
The training of such a network involves three stages:
• feedforward of the input training pattern,
• calculation and backpropagation of the associated error
• adjustment of the weights
This procedure is repeated for each pattern over several complete passes (epochs) through the training
set.
After training, application of the net only involves the computations of the feedforward phase.
MLP training algorithm(CO1)

Feed Forward phase:
• Xi = input[i]
• Yj = f( bj + XiWij)
• Zk = f( bk + YjWjk)
Backpropagation of errors:
• k = Zk[1 - Zk](dk - Zk)
• j = Yj[1 - Yj]  k Wjk
Weight updating:
• Wjk(t+1) = Wjk(t) + kYj + [Wjk(t) - Wjk(t - 1)]
• bk(t+1) = bk(t) + kYtn +[bk(t) - bk(t - 1)]
• Wij(t+1) = Wij(t) + jXi + [Wij(t) - Wij(t - 1)]
• bj(t+1) = bj(t) + jXtn +[bj(t) - bj(t - 1)]
Backpropagation Learning Algorithm(CO1)

183
• 1. https://nptel.ac.in/courses/117/105/117105084/
• 4.https://www.youtube.com/watch?
v=DKSZHN7jftI&list=PLZoTAELRMXVPGU70ZGsckrMdr0FteeRUi
• 5.https://www.youtube.com/watch?
v=aPfkYu_qiF4&list=PLyqSpQzTE6M9gCgajvQbc68Hk_JKGBAYT
Faculty Video Links, Youtube & NPTEL Video Links and Online Courses Details

184
Quiz
1. Which element method for computing is started early days.
(a)Machine learning (b) Artificial intelligence (c) Deep learning (d) none of these
2. In which method for efficiency is higher with more in data set.
(b)Machine learning (b) Artificial intelligence (c) Deep learning (d) none of these
3. What is tensor flow.
(c) all the mathematics in the form of flow chart (b) Artificial intelligence algorithm (c)
Deep learning algorithm (d) none of these

185
Quiz
4. What are the benefits of TensorFlow over other libraries?
(a)Scalability (b) Visualization of Data (c) Pipelining (d) all of these
5. What do you mean by pipelining?
(b)Whole work doing at a time (b) whole work divide in small segments and then
execute in parallel manner (c) copying a work from another processor (d) none of
these

186
QUIZ
6. What is API?
(a)A programming interface(b) After programming interface (c)
Application Programming Interface (d) none of these.
7. What is the main operation in TensorFlow?
(b)Computing(b) calculation (c) Pipelining (d) passing values and
assigning the output to another tensor.
8. TensorFlow is the product of which company?
(c) Google research team (b) Amazon technical team (c) PayPal
(d) none of these
9. What is the execution speed of brain neuron?
(d) (b) (c) (d) none of these

187
Weekly Assignment
Q1. For which purpose Convolutional Neural Network is used?
Q2. What is the biggest advantage utilizing CNN?Q3. Discuss the history of deep learning.
Q4. What is the difference between Neural Networks and Deep Learning?
Q5. How can a neural network learn itself?
Q6.Explain the Concept of ANN with the help of an example.
Q7. Define the term Gradient descent. Also discuss its importance.
Q8. Explain Perceptron Convergence Theorem.
Q9.Define the term Bias.
Q10. Why ReLu function is required ?

188
MCQ
Q1. Which neural network has only one hidden layer between the input and output?
A. Shallow neural network
B. Deep neural network
C. Feed-forward neural networks
D. Recurrent neural networks
Q2. Which of the following is/are Limitations of deep learning?
A. Data labeling
B. Obtain huge training datasets
C. Both A and B
D. None of the above

189
MCQ
Q3.Deep learning algorithms are _______ more accurate than machine learning algorithm in image
classification.
A. 33%
B. 37%
C. 40%
D. 41%
Q4. Which of the following functions can be used as an activation function in the output layer if we wish
to predict the probabilities of n classes (p1, p2..pk) such that sum of p over all n equals to 1?
A. Softmax
B. ReLu
C. Sigmoid
D. Tanh

190
MCQ
Q5. Which of the following would have a constant input in each epoch of training a Deep Learning model?
A. Weight between input and hidden layer
B. Weight between hidden and output layer
C. Biases of all hidden layer neurons
D. Activation function of output layer
6. If in the training method we are not obtained the accurate output then which value the neural network
changes to get accurate output?
(a) bias (b) perceptron (c) weight (d) all value can change
B. 7. What are benefit of using graph in the tensor flow?
(a) parallelism (b) high execution speed (c) less complexity (d) all of these
C. 8. In between CPU and GPU which have high execution speed?
(a) GPU (b) CPU (c) Both have same speed (d) cannot distinguished

191
Old Question Papers

192
Old Question Papers

193
Expected Questions for University Exam
Q1. Define Batch Normalization. Why Batch Normalization helps in faster
convergence?
Q2. Define Deep Learning . Also discuss its importance.
Q3. Discuss the history of deep learning.
Q4. What is the difference between Neural Networks and Deep Learning?
Q5. How can a neural network learn itself?
Q6.Explain the Concept of ANN with the help of an example.
Q7. Define the term Gradient descent. Also discuss its importance.
Q8. Explain Perceptron Convergence Theorem.
Q9.Define the term Bias.
Q10. Why ReLu function is required ?

194
Summary
 Deep Learning is a subfield of machine learning concerned with algorithms inspired
by the structure and function of the brain called artificial neural networks.
 If you are just starting out in the field of deep learning or you had some experience
with neural networks some time ago, you may be confused. I know I was confused
initially and so were many of my colleagues and friends who learned and used
neural networks in the 1990s and early 2000s.

195
 1. https://www.slideshare.net/lablogga/deep-learning-explained2. Qin, T. (2020).
 Deep Learning Basics. In Dual Learning (pp. 25-46). Springer, Singapore.
 3.http://people.uncw.edu/chenc
/STT592_Deep%20Learning/STT592DeepLearning_Index.html
 4. Gulli, Antonio, and Sujit Pal. Deep learning with Keras. Packt Publishing Ltd, 2017.
References
Thank You

THANK YOU

Unit1_Kumod_deeplearning.pptx DEEP LEARNING

More Related Content

Similar to Unit1_Kumod_deeplearning.pptx DEEP LEARNING

Recently uploaded

Unit1_Kumod_deeplearning.pptx DEEP LEARNING