Batch vs Online Machine Learning Tutorial
1] Batch/Offline ML –
Batch learning means training models at regular intervals such as weekly, bi-weekly, monthly, or quarterly, rather than continuously. Data accumulates over a period of time, and the model is periodically retrained on the full accumulated dataset. Batch learning is also called offline learning. Models trained this way are moved into production only at these intervals, based on the performance of the model trained on the new data.
For example, predicting whether a student will get placed based on past years' data: placement data does not accumulate frequently, but at specific intervals such as every six months or every year. After a year, you pull the model down from production to development, retrain it on the accumulated data, and push it back to production, replacing the old model.
Another example is image classification of animals: a dog will still look like a dog after 10 years, so the data distribution is stable and infrequent batch retraining is enough.
Disadvantage-
1] Not useful if data accumulates frequently.
2] Not useful when the data distribution changes frequently.
3] Takes a lot of time and resources such as CPU and memory.
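The batch workflow described above can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is available; the placement features and labels are synthetic stand-ins, not a real dataset.

```python
# Minimal batch/offline learning sketch (assumes scikit-learn is installed).
# A hypothetical placement model is retrained from scratch on the FULL
# accumulated dataset at each periodic interval, then replaces the old model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def retrain(accumulated_X, accumulated_y):
    """Train a fresh model on all data collected so far (offline learning)."""
    model = LogisticRegression()
    model.fit(accumulated_X, accumulated_y)
    return model  # this new model replaces the old one in production

# Year 1: initial data -> first production model
X = rng.normal(size=(200, 3))             # e.g. CGPA, test score, projects
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # placed / not placed (synthetic)
model = retrain(X, y)

# Year 2: new records arrive; append them and retrain on everything
X_new = rng.normal(size=(50, 3))
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
X, y = np.vstack([X, X_new]), np.concatenate([y, y_new])
model = retrain(X, y)                     # old model is replaced wholesale
```

Note that the cost of `retrain` grows with the full dataset, which is exactly the resource disadvantage listed above.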
2] Online ML –
In online learning, training happens frequently and incrementally, by feeding data to the model as it arrives, either one instance at a time or in small groups (mini-batches). Each learning step is fast and cheap, so the system can learn from new data on the fly. The model is updated directly in the production phase.
For example, suppose Netflix trains a model in development and moves it to production to recommend movies based on user preferences. Within an hour of deployment, many new movies may arrive for which the model was never trained. With batch learning, Netflix would have to bring the model back to development, retrain it, and redeploy it, replacing the old model. This is where batch learning fails.
In such cases, online learning is used: whenever new data accumulates, the model is updated on it incrementally, directly in the production phase.
Algorithm: Stochastic Gradient Descent (SGD). In scikit-learn, the SGD-based estimators fit the same kinds of linear models as ordinary linear or logistic regression, but they support online learning through the partial_fit method. Whenever new data points accumulate, you call partial_fit on just those points, which takes far less time than retraining offline on the full dataset.
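A minimal sketch of the partial_fit workflow with scikit-learn's SGDClassifier, using synthetic mini-batches to stand in for data arriving over time:

```python
# Online-learning sketch (assumes scikit-learn is installed).
# partial_fit updates the deployed model incrementally on each small batch
# of new data instead of retraining from scratch on everything.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier(random_state=42)

classes = np.array([0, 1])  # all classes must be declared on the first call

# Simulate data arriving in small batches, e.g. one batch per hour
for _ in range(20):
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)  # synthetic labels
    model.partial_fit(X_batch, y_batch, classes=classes)  # fast, incremental

# The model can serve predictions at any point between updates
print(model.predict(rng.normal(size=(3, 4))))
```

Each partial_fit call touches only the new batch, so the update cost stays constant no matter how much historical data exists.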
Other libraries used for online learning are River and Vowpal Wabbit.
Advantage-
1] Useful if data accumulates frequently, e.g. a stock exchange.
2] Cost-effective, as it uses fewer resources.
3] Fast.
Disadvantage-
1] Tricky to use.
2] Risky: if a hacker feeds biased data to the live model, the model degrades. Prevention: run anomaly detection on incoming data, and keep the ability to roll the model back to an old state.
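The mitigation in the last point can be sketched with plain Python. This is an illustrative design, not a standard recipe: the z-score check, the threshold, and the learn_one method (a River-style incremental API) are all assumptions for the sketch.

```python
# Hedged sketch: screen each incoming point with a simple z-score anomaly
# check before letting it update an online model, and snapshot the model
# before each update so it can be rolled back if bad data slips through.
import copy
import statistics

class GuardedOnlineModel:
    def __init__(self, model, z_threshold=3.0):
        self.model = model
        self.z_threshold = z_threshold
        self.history = []      # recent target values seen so far
        self.snapshot = None   # last known-good copy, for rollback

    def is_anomalous(self, y):
        if len(self.history) < 30:   # too little data to judge yet
            return False
        mean = statistics.fmean(self.history)
        stdev = statistics.stdev(self.history) or 1e-9
        return abs(y - mean) / stdev > self.z_threshold

    def update(self, x, y):
        if self.is_anomalous(y):
            return False                              # reject suspicious point
        self.snapshot = copy.deepcopy(self.model)     # checkpoint first
        self.model.learn_one(x, y)  # hypothetical incremental API (River-style)
        self.history.append(y)
        return True

    def rollback(self):
        """Restore the model to its last checkpoint."""
        if self.snapshot is not None:
            self.model = self.snapshot
```

In a real system the anomaly detector would look at the features as well as the target, but the gate-then-checkpoint structure is the point of the sketch.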