Complete Life Cycle of a Data Science/Machine Learning Project

Introduction:

Data is the core element of any Data Science project: without data, no science can be applied and nothing can be achieved. This raises many questions, such as:

  • Why do we need the data?
  • What kind of data is required?
  • How to get the data?
  • What to do with the data?

And the list goes on. Answering these questions requires a predefined path or flow, which is termed the Data Science project life cycle. The entire process involves several steps such as data cleaning, preparation, modeling and model evaluation. It is a long process and may take several months to complete, so it is important to have a general structure to follow for every Data Science problem. A widely acknowledged structure for solving analytical problems is the Cross Industry Standard Process for Data Mining, or CRISP-DM framework.

Image for post

Source : quora

Life Cycle:

Below are the stages in the life cycle of a Data Science/Machine Learning project.

1. Business Understanding

Business understanding plays a very important role in the success of any project, as the entire life cycle revolves around the business goal. To acquire the correct data, we must first understand the business. Asking questions about the dataset and agreeing on a clear business objective makes the data acquisition process much easier.

2. Data Understanding

After business understanding, the next step is data understanding. This step involves collecting all the available data. If you are working on a real-world project in your company, you need to work closely with the business team, as they know what data exists and what data could be used for the business problem. If you are building your own Data Science/Machine Learning project, you can find free datasets on many websites.

This step also involves describing the data: its structure, its data types and other properties. Explore the data using graphical plots. In short, extract whatever information you can get just by exploring the data.
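As a rough illustration of describing a dataset, the sketch below summarises each column's observed types, missing-value count and basic statistics. It uses only the Python standard library so it runs anywhere. The `records` sample and the `describe` helper are hypothetical names invented for this example, not part of any particular toolkit.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample records; None marks a missing value.
records = [
    {"age": 34, "city": "Pune", "income": 52000.0},
    {"age": 29, "city": "Delhi", "income": None},
    {"age": 41, "city": "Pune", "income": 61000.0},
    {"age": None, "city": "Mumbai", "income": 48000.0},
]


def describe(rows):
    """Summarise each column: observed types, missing count, basic statistics."""
    summary = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            # Numeric column: report range and central tendency.
            info.update(min=min(present), max=max(present),
                        mean=mean(present), median=median(present))
        elif present:
            # Non-numeric column: report the most frequent value.
            info["top"] = Counter(present).most_common(1)[0][0]
        summary[col] = info
    return summary


summary = describe(records)
for col, info in summary.items():
    print(col, info)
```

A summary like this is usually the first thing to look at before plotting, because it shows at a glance which columns are numeric, which are categorical and where values are missing.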

3. Data Preparation

After data understanding, the next step in the life cycle is data preparation, also known as data cleaning or data wrangling. It includes selecting the relevant data, integrating it by merging data sets, cleaning it, handling missing values by either removing them or imputing them with relevant values, treating erroneous data by removing it, and checking for outliers and handling them. It also covers constructing new data by deriving new features from existing ones through feature engineering, formatting the data into the desired structure, and removing unwanted columns and features. Data preparation is the most time-consuming step, often said to take 70%-90% of the overall project time, yet it is also the most important step in the entire life cycle.
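Three of the preparation tasks mentioned above (imputing missing values, handling outliers and deriving a new feature) can be sketched as follows. This is a minimal standard-library example under stated assumptions: the `raw` rows are made up, median imputation and IQR-based clipping are just one reasonable choice among several, and the helper names are hypothetical.

```python
from statistics import median, quantiles

# Hypothetical raw rows: None marks a missing value, 990.0 is an obvious outlier.
raw = [
    {"height_cm": 170.0, "weight_kg": 65.0},
    {"height_cm": 165.0, "weight_kg": None},
    {"height_cm": 180.0, "weight_kg": 82.0},
    {"height_cm": 175.0, "weight_kg": 990.0},
    {"height_cm": 160.0, "weight_kg": 55.0},
    {"height_cm": 172.0, "weight_kg": 70.0},
]


def impute_median(rows, col):
    """Replace missing values in `col` with the median of the observed values."""
    fill = median(r[col] for r in rows if r[col] is not None)
    return [{**r, col: fill if r[col] is None else r[col]} for r in rows]


def clip_outliers(rows, col, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1, _, q3 = quantiles([r[col] for r in rows], n=4, method="inclusive")
    spread = k * (q3 - q1)
    low, high = q1 - spread, q3 + spread
    return [{**r, col: min(max(r[col], low), high)} for r in rows]


def add_bmi(rows):
    """Feature engineering: derive body mass index from weight and height."""
    return [
        {**r, "bmi": r["weight_kg"] / (r["height_cm"] / 100) ** 2}
        for r in rows
    ]


# Each helper returns new rows, so the raw data is left untouched.
prepared = add_bmi(clip_outliers(impute_median(raw, "weight_kg"), "weight_kg"))
for row in prepared:
    print(row)
```

Whether to clip, remove or keep an outlier depends on the business problem; clipping is used here only to keep the example short.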

Exploratory Data Analysis (EDA) plays an important role at this stage, as summarizing the clean data helps in identifying the structure, outliers, anomalies and patterns present in it. These insights can help in choosing the right set of features and the algorithm to use when building the model.
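One simple way EDA can point toward useful features is to check how strongly each candidate feature correlates with the target. The sketch below computes the Pearson correlation by hand and ranks the features by its absolute value. The data and column names are invented for illustration, and correlation only captures linear relationships, so treat the ranking as a hint rather than a verdict.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical cleaned data: two candidate features and one target.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [42, 38, 41, 39, 40],
}
score = [52, 58, 65, 70, 78]

# Rank features by the strength of their linear relationship with the target.
ranked = sorted(
    ((name, pearson(values, score)) for name, values in features.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: r = {r:+.3f}")
```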

4. Data Modeling

Data modeling is considered the heart of data analysis. A model takes the prepared data from the previous step (Data Preparation) as input and provides the desired output. This step includes choosing the appropriate type of model, depending on whether the problem is a classification, regression or clustering problem. After choosing the model type, we select among the various algorithms available for it. We then need to tune the hyperparameters of each model to achieve the desired performance.
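To make hyperparameter tuning concrete, here is a small sketch for a classification problem. It assumes a k-nearest-neighbours classifier written from scratch and tunes its single hyperparameter `k` by checking accuracy on a separate validation set. The data points, labels and the candidate grid are all made up; one training point is deliberately mislabelled to show why the choice of `k` matters.

```python
from collections import Counter
from math import dist


def knn_predict(train, point, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda item: dist(item[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, data, k):
    """Fraction of points in `data` whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)


# Hypothetical training set: two clusters plus one mislabelled (noisy) point.
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"), ((2, 2), "a"),
    ((6, 6), "b"), ((6, 7), "b"), ((7, 6), "b"), ((7, 7), "b"),
    ((6.5, 6.5), "a"),  # noisy label sitting inside the "b" cluster
]
validation = [
    ((6.4, 6.4), "b"), ((1.5, 1.5), "a"),
    ((6.6, 6.8), "b"), ((2.2, 1.8), "a"),
]

# Try each candidate value of k and keep the one with the best validation accuracy.
grid = [1, 3, 5]
scores = {k: accuracy(train, validation, k) for k in grid}
best_k = max(grid, key=lambda k: scores[k])
print(scores, "best k:", best_k)
```

With `k = 1` the classifier copies the noisy label, while larger values let the surrounding points outvote it. The same loop structure applies to any model: define a grid of candidate settings, score each on held-out data and keep the best.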

In the end we need to evaluate the model by measuring its accuracy (how well the model performs, i.e. does it describe the data accurately?) and its relevance (does it answer the original question it set out to answer?). We also need to strike the right balance between performance and generalizability: the model should not be biased toward the data it was trained on and should generalize to new data.
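The balance between performance and generalizability can be checked by comparing accuracy on the training data with accuracy on data the model has never seen. The sketch below contrasts two hypothetical models on made-up one-dimensional data with two noisy training labels: a "memorizer" that copies the label of the closest training point, and a hand-picked threshold rule. Both models and the threshold value of 5 are assumptions chosen for illustration.

```python
def accuracy(model, data):
    """Fraction of (x, label) pairs that the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical data: the true pattern is label 1 when x >= 5.
# Two training labels, (3, 1) and (8, 0), are noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
test = [(1.4, 0), (2.4, 0), (3.2, 0), (5.5, 1), (7.8, 1), (8.6, 1)]


def memorizer(x):
    """Return the label of the closest training point (memorizes noise too)."""
    _, label = min(train, key=lambda item: abs(item[0] - x))
    return label


def threshold_rule(x):
    """A simple rule that ignores individual noisy points."""
    return 1 if x >= 5 else 0


report = {
    name: (accuracy(model, train), accuracy(model, test))
    for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]
}
for name, (train_acc, test_acc) in report.items():
    print(f"{name}: train = {train_acc:.2f}, test = {test_acc:.2f}")
```

The memorizer looks perfect on the training data but drops on the test data, while the simpler rule scores lower in training and better on unseen data. A large gap between the two numbers is a common warning sign that a model will not generalize.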

5. Model Deployment

After rigorous evaluation, the model is finally deployed in the desired format and channel. This is the final step in the data science life cycle. Each step explained above should be worked on carefully. If any step is executed improperly, it affects the next step and the entire effort may be wasted. For example, if data is not collected properly, you will lose information and will not be able to build a good model. If data is not cleaned properly, the model will not work properly. If the model is not evaluated properly, it will fail to give reliable output in the real world. From business understanding to model deployment, each step should be given proper attention, time and effort.
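What "deployed in the desired format" means varies by project, but a common first step is saving the trained model so that another program can load it and make predictions. The sketch below assumes a tiny linear model whose parameters are stored as JSON. The parameter values, file name and helper functions are hypothetical, and a real deployment would add versioning, input validation and monitoring.

```python
import json
import tempfile
from pathlib import Path


def save_model(params, path):
    """Write the model parameters to disk as JSON."""
    Path(path).write_text(json.dumps(params))


def load_model(path):
    """Read the model parameters back from disk."""
    return json.loads(Path(path).read_text())


def predict(params, x):
    """Apply a linear model: slope * x + intercept."""
    return params["slope"] * x + params["intercept"]


# A temporary directory stands in for wherever the model would really be stored.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.json"
    save_model({"slope": 2.0, "intercept": 1.0}, model_path)  # training side
    served = load_model(model_path)                           # serving side
    prediction = predict(served, 3.0)

print(prediction)  # 7.0
```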

Together, the above steps make up a complete Data Science project, but the process is iterative, and various steps are repeated until we are able to fine-tune the methodology for the specific business case. Python and R are the most widely used languages for Data Science.

Thank you for reading! 😊