Data science#
Data science is a field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Data processing#
There are methods for preparing data for model fitting that can improve the final model’s performance.
Standardization#
Standardization: Makes the mean value of the data 0 and the standard deviation 1.
Sometimes confusion arises regarding the terms standardization and normalization. Normalization, in contrast, involves rescaling data to a fixed range (typically [0, 1]).
Performing standardization yields the following benefits:
Better interpretability of the model.
Prevention of model bias toward features with a larger scale.
Enhanced convergence: makes the values of the loss function more controllable.
Some methods require standardized data:
Linear models.
Support Vector Machines (SVM).
K-Nearest Neighbours (KNN).
Neural networks.
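The transformation described above can be sketched in a few lines of NumPy; the toy matrix `X` is an assumption for illustration:

```python
import numpy as np

# Toy feature matrix whose two columns have very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Standardization: subtract the column mean, divide by the column std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column mean is (numerically) 0
print(X_std.std(axis=0))   # each column std is 1
```

In practice the same transform is usually done with a fitted scaler (e.g. scikit-learn's `StandardScaler`), so the statistics computed on the training set can be reapplied to new data.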
Categorical encoding#
Most machine learning approaches cannot use categorical features directly. However, there are approaches that convert them into a numerical form:
One-Hot Encoding: Creates a binary variable for each category. The variable takes a value of 1 if the corresponding value appears in the original column and 0 otherwise.
Label encoding: Assigns an integer to each category.
Target encoding: Replaces each category with a statistic of the target variable (typically the mean) computed over the samples belonging to that category.
Embeddings: Assigns a special vector to each category that captures the relationships between the encoded categories.
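The first three encodings above can be sketched with pandas; the `color`/`price` columns are hypothetical example data, with `price` standing in for the target:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue"],
                   "price": [10, 20, 12, 30]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to an integer code
labels = df["color"].astype("category").cat.codes

# Target encoding: replace each category with the mean target value
target_enc = df["color"].map(df.groupby("color")["price"].mean())
```

Note that naive target encoding leaks information about the target; in practice it is computed on the training split only (often with smoothing or cross-fitting).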
Splitting data#
A crucial step in validating a machine learning model is splitting the data: the model is trained on one split and validated on the other. This approach ensures that the model's good performance isn't just due to overfitting - memorizing the entire training dataset - but rather due to the identification of general rules that accurately describe the process under consideration.
In terms of data splitting, 3 subsets of data are usually created:
train: a set of data that is used for training a model.
valid: a set of data on which all the candidate models are evaluated during the experimentation phase.
test: a set of data that is used to estimate the final model. This step is necessary to eliminate the bias associated with the process of selecting optimal hyperparameters.
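The three-way split above is often produced with two successive calls to scikit-learn's `train_test_split`; the 60/20/20 proportions below are an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve out the test set, then split the rest into train and valid
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)  # 0.25 of 80% = 20%

print(len(X_train), len(X_valid), len(X_test))  # 60 20 20
```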
Sampling is the approach used to decide which samples go into each split. A few approaches are reasonable in most cases:
Random: Selects points at random for each split.
Stratified: Each split has a proportional representation of the specified groups, as in the original dataset.
Cluster: Selects the entire group instead of individual points.
Cross-validation is an approach in which a train-test split is performed multiple times. Each time, the model is trained from scratch on the training set and evaluated on the corresponding test set. There are a few typical approaches to cross-validation:
K-Fold: Define \(k\) different, non-overlapping test subsets. The remaining data for each test subset is considered training data.
Stratified K-Fold: In each train-test split, the strata are proportional to those in the entire dataset.
Time-series: In each split, the data is separated according to the timing of the observations. Earlier data goes to the training set, and later data goes to the test set.
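The K-Fold scheme described above can be sketched with scikit-learn; the tiny array of 10 samples is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

# 5-fold CV: every sample lands in exactly one test fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = [test_idx for _, test_idx in kf.split(X)]
for i, test_idx in enumerate(folds):
    print(f"fold {i}: test indices {test_idx}")
```

`StratifiedKFold` and `TimeSeriesSplit` from the same module implement the other two schemes with an analogous `split` interface.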
Classification#
Classification is a set of supervised machine learning tasks whose goal is to predict the label to which a particular object corresponds.
Binary classification: a task that separates samples into two groups.
Multiclass classification: separates data into more than two groups.
Multiclass#
Multiclass classification is a group of machine learning approaches that enables the separation of data into \(n\) groups.
There are two groups of machine learning algorithms for multiclass classification:
One-vs-Rest trains a binary classification model to distinguish samples belonging to a particular class from all others: \(f^i(x)=P(y=i|x), i = \overline{1,n}\).
Multinomial builds a transformation of the input data to produce a vector containing a score that describes the likelihood that each class corresponds to the given \(x\): \(f(x) = \left[P(y=1|x), P(y=2|x), \ldots, P(y=n|x)\right]\).
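In the multinomial case, the score vector \(f(x)\) is typically obtained by passing raw per-class scores through a softmax; a minimal sketch, where the raw scores for \(n = 3\) classes are made-up numbers:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical raw per-class scores for a single input x
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)  # [P(y=1|x), P(y=2|x), P(y=3|x)]

print(probs)         # non-negative values that sum to 1
print(probs.argmax())  # index of the predicted class
```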
Hyperparameter tuning#
Hyperparameter is a parameter whose value is used to control the training process.
Hyperparameter tuning is the process of adjusting the hyperparameters of a model to find the best configuration that maximizes its performance on a given dataset.
There are three typical approaches to perform a hyperparameter search:
Grid Search: For each hyperparameter, a set of possible values is defined, and the algorithm tries all possible combinations.
Random Search: Random selection of the combinations of hyperparameters.
Bayesian Optimization: A set of approaches that introduces a probability model of the objective function. These approaches guide the search based on previous evaluations.
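Grid search, the first approach above, can be sketched with scikit-learn's `GridSearchCV`; the KNN estimator, the iris dataset, and the parameter values are assumptions chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Grid search: every combination of the listed values is tried,
# each evaluated with 5-fold cross-validation
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7],
                "weights": ["uniform", "distance"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

`RandomizedSearchCV` from the same module implements random search with a nearly identical interface; Bayesian optimization is provided by third-party libraries such as Optuna.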
DL mechanisms#
Deep Learning is an advanced form of machine learning. It implements the concept of the human brain in mathematical terms: a set of signals (actions) that arises when a specific input is received. In mathematics, this is usually implemented as a deep composition of functions - that's where the term "deep learning" comes from.
DL is typically used in computer vision and complex NLP tasks; their applications are covered in other sections. This section considers general deep learning approaches.
Check details in the DL mechanisms page.
Computer Vision#
Computer vision tasks are commonly divided into categories. It is important to understand the differences in order to select the correct approach.
Classification: a form of computer vision in which a model is trained on a set of labeled images. The goal of the model is to predict the label to which an image belongs.
Object Detection: Identifying the location of a specific object in an image. In the most typical case, this simply means drawing a box around the object in the image.
Semantic segmentation: Assigning each pixel a class that determines which object the pixel belongs to.
Image generation.
NLP#
Natural Language Processing (NLP) is a set of tasks related to processing texts.
This field considers approaches related to:
Text classification: Defines which class a given text belongs to from a given set of classes.
Sentiment analysis: Identifying the tone of the text. For example, it can be positive, negative or neutral. Note: Some sources treat this as a separate task.
Named Entity Recognition (NER): This task involves identifying specific components (named entities) within texts that contain particular pieces of information.
Text generation.
For more details, check the NLP page.
Model deployment#
There are a few methods for machine learning model deployment. The following table lists and describes them:
| Deployment Method | Throughput | Latency | Example Application |
|---|---|---|---|
| Batch | High | High (hours to days) | Periodic customer churn prediction |
| Streaming | Moderate | Moderate (seconds to minutes) | Dynamic pricing application |
| Real-time | Low | Low (milliseconds) | Recommendation systems & chatbots |
| Edge/Embedded | Low | Low (dependent on device processing power) | IoT applications, e.g. a farm sensor for detecting humidity |