# The Fundamental Tool That Data Scientists Can’t Miss

- Posted by Daitan Innovation Team
- On December 20, 2019
- Convex Optimization, Data Science, Deep Learning

## How Business Requirements can Prevent You from Using Available Machine Learning Tools and What to do About That

## Introduction

When we search the internet for the required technical abilities of a successful data scientist, we will find variations of the following list:

- Analytical skills
- Programming
- Linear algebra
- Multi-variate calculus
- Statistics
- Machine learning
- Data visualization
- Software engineering

And so on. But there is a common subject to some items on this list that does not get the same attention, despite being the main engine of several tools used by a data scientist: **Convex Optimization.**

When hearing this term, most people will immediately start talking about how gradient descent is the most awesome thing there is, how we can add momentum to it, how can we choose, adapt, or even circumvent the choice of the step-size parameter, and so on.

However, in reality, convex optimization goes well beyond gradient descent and its variants.

As shown in figure 1, this is a subject that combines several of the above-listed abilities, such as linear algebra, calculus, statistics, and the learning word in machine learning.

Not only that, with the increased volume of data both in industry and academia, the desire for faster optimization methods, and the popularization of parallel computing, both fields hugely influence each other.

Machine learning is posing new optimization problems and recovering forgotten optimization methods, while convex optimization research is allowing statisticians and computer scientists to tackle new and bigger problems.

## Refresher: What Is Optimization?

In short, optimization is a branch of mathematics that focuses on finding a solution that minimizes an objective function, possibly with some constraints on the potential values of our problem variables.

The most general optimization problem has the form:

In this setting, ** x ∈ ℝⁿ** is our variable and we want to find the values for this variable that achieve the minimum value of

**, our objective function (to maximize the function, you can minimize its negative version).**

*f(*x*)*We also may be restricted by some functions on our variables limited by scalar values, called constraints. Sometimes, we can find an analytical formula to solve an optimization problem, but usually, there is no closed-form to solve a problem like this.

And even if we do have an analytical solution, there may be other business constraints that will prevent us from using it in practice.

But here is the trick: depending on some properties of the domain of the variable x, the function ** f(x)**, and the constraints

**, we can use methods that will certainly find the best solution.**

*f*ᵢ(.)If we do not find interesting properties, we may have a problem that will not be solvable in a reasonable amount of time, or we will not be capable of finding the best solution.

By what we have exposed, you can guess the most important property we can find: * convexity*. When the domain of our variable, our objective function, and our constraints are convex, we have a convex problem and a large literature!

Given the plethora of resources on the web on the subject, the goal of this post is not to be a comprehensive guide to convex optimization and machine learning. This would be impossible even for a book with 300 pages (personal guess or limitation).

Instead, we want to show you how a canonical formulation that every data scientist is familiar with (or will be one day) — a regularized linear problem — can become impractical if you are restricted to off-the-shelf libraries.

We will also see how the business constraints can guide us to a practical solution, based on prior knowledge that we may have about the problem. Let’s get started.

## Canonical Formulation: Cost Function + Regularization

The example that we will focus on considers a regression task, but the concepts that we will discuss are much broader. Take as an example the famous linear regression with least squares:

Notice that the formulation does not present any regularization term, but the common practice tells us to regularize. Together with the Ridge, LASSO, and Elastic Net regularization terms, this problem makes a very popular set of regression methods. Here is a more general representation of this problem:

Where **𝔉(x)** is an empirical loss function, **ℜ(x)** is a regularization term, and λ is the regularization parameter.

In the case of regression, a common choice for **𝔉(.)** is the mean squared error, but we can make different choices. We focus here on the MSE loss function.

For **ℜ(.)**, it is common to use the l₂ norm, the** l₁** norm for the LASSO problem, or both for the Elastic Net regularization. Let * A *be our design matrix, of

**m× n**dimensions,

**b ∈ ℝᵐ**our label vector.

We show in the table below the most common configuration for these functions.

Common choices for the loss function and regularization terms.

Of course, these are well-studied problems, with stable libraries in several programming languages, and in practice, you should not be concerned with their implementation details for a business project.

You do not want to spend too much time debugging your implementation; or dealing with edge cases. But let’s walk through some regular use cases in the data science practice and see how this simple problem can get impractical without the proper abilities.

## Use Case 1: Fit and Predict

Suppose that you are facing a small regression problem. All your data fits in memory, having less than 2GB or so, and the inference can be done locally on the same machine. Any UCI and most Kaggle datasets will fall into this use case.

Without tricks to fool you, this is the most common scenario where we can readily use an off-the-shelf implementation (for instance, Sklearn).

If you are not using any regularization term — and are also feeling a little bit adventurous — you can just use your favorite programming language to compute the analytical solution to the least-squares problem:

Here are examples of Python functions that: solve Eq. (1) directly; and use the NumPy library for a stable solution, which is the underlying method operating in the Scikit-Learn library.

## Use Case 2: Data Doesn’t Fit in Memory

The second use case is more interesting.

Suppose now that the training set is too big to fit in memory, therefore, loading the entire dataset is not an option for this problem.

Clearly, we are not able to use the analytical solution, as we will need to load portions of the data sequentially. We will need to use optimization in order to find a solution to this problem.

In this case, several techniques can be handy, such as the gradient descent (if regularizing with the l₂-norm); the fast iterative shrinkage-thresholding algorithm FISTA (Beck, 2009), or the alternating directions method of multipliers (ADMM) (Boyd, 2011).

Methods that can handle non-smooth functions. Instead of computing the exact solution, we run an iterative process that finds a near-optimal solution, given enough iterations.

Knowing which algorithms we can use, the bigger problem now is to manage loading portions of the data and properly freeing memory. Some packages can even do this as default behavior, such as the *biglm* R package, based on (Miller, 1992).

Here is a toy example in Python. We use pandas to lazily load a CSV file, train a least-squared model using Scikit-Learn, and make use of the parameters from the previous iteration as warm-start.

Every time we call the fit method, it uses the previous solution as a starting point of the optimization procedure, thus accelerating convergence in most cases.

The adopted optimization strategy here can be called “mini-batch stochastic gradient descent” since we load the data in batches and optimize the loss function with a stochastic gradient descent method.

## Use Case 3: Features Are Not Completely Available

Let’s get a little bit more serious now. Suppose that you are working on a regression model for a problem with sensitive data. You have access to the labels, but the features are stored individually in distinct locations.

You can have a copy of the training labels on every data site, but due to privacy constraints, you are not allowed to move or copy the features to different locations: you need to process them on site.

See figure 2 for a pictorial representation of the problem.

Summarizing, the requirements of this use case are:

**We need a single model that uses data stored in different locations.****We are not allowed to move data, except for the labels.**

Since we will not be able to access all data at the same time, we can lead the computations closer to the data. Here is a possible overall strategy for this problem:

**Copy the labels to all data sites.****For each step, we update the variables***xᵢ*locally with all remaining variables fixed:

Where the indexing *A_ᵢ *indicates all columns of *A *except from column *i*.

Obs.: the derivation of this solution can be seen in the reference (Gordon, G and Tibshirani, R. Lecture Notes).

**At the end of the step, we synchronize***x*among all sites and repeat the process until convergence is reached.

This is a fairly common approach, known as *coordinate descent*. Instead of computing the derivative of the whole variable *x*, we update one feature at a time (or a block of features when block coordinate descending) and synchronize the results before the next update step.

It is important to mention that coordinate descent is a highly adaptive optimization approach. You can combine it with other approaches, distribute steps in a parallel fashion, update a random set of variables at each iteration, etc. For a broader view, refer to the seminal paper (Wright, 2015).

The function below implements this solution.

## Practical Example

Let’s dive into a practical example with the abalone regression dataset.

Abalone is a marine snail with a number of recognized species ranging from 30 to 130, worldwide. Our goal with this dataset is to predict the age of abalone from physical measurements.

Usually, the age of an Abalone is estimated by a precise cut through the shell, followed by a microscopic counting of rings. Since this is time-consuming, error-prone, and potentially boring to do, we are interested in more effective ways of estimating the age of an abalone.

This dataset presents us with several other measurements that are easy to obtain, together with the age of the subjects. We start by loading, preparing, and splitting the data into two sets: train and test.

To make our job easier, we will define a prediction function that returns the predicted labels and residuals from the predictions:

Let’s compare the solutions found by the three approaches that were defined previously:

- Close form solution.
- Least squares NumPy solution.
- Coordinate descent solution.

Now we are ready to train the models and use the estimated parameters to make predictions on the test set.

Let’s compare the predictions made by these approaches against the original labels visually.

The next figure shows the true labels as blue dots, and shows the labels predicted by the closed-form solution as orange dots.

The next figure presents the predictions of the NumPy least squares method.

Now we see the predictions made by the coordinate descent solution.

Notice how the predictions are close to each other: we did not reinvent the wheel, we just used the wheel in a way that it rolls for our scenario.

The next figure shows the predictions made by our solutions for 10 samples, together with their true labels, highlighting the equivalence of the solutions.

Despite the fact that all use cases can be solved with a linear regression method, depending on other factors, a public library may not be able to solve even a linear regression.

There are also several other sources of restriction on how the model should be running when connected to the product architecture and this can lead us to different optimization approaches.

Knowing a little bit about convex optimization allows the practitioner to handle more diverse situations.

** Obs.:** These may not be the best approaches for a practical scenario, but those are options that emerge once you have a better understanding of convex optimization.

## Conclusion

As we could see through the use cases, convex optimization provides a powerful toolbox for the data scientist.

Even when the model is well-covered, present in several libraries and APIs, there may be business constraints that will prevent us from using off-the-shelf implementations.

Convex optimization will give you options to repose problems in a nicer, more practical way, allowing you to implement a solution designed for your application.

Of course, the presented content is not enough for you to go ahead and start implementing ML algorithms from scratch. But, my hope is to spark your motivation to learn about this subject through these practical possibilities. If you are interested in learning more about the subject, we have prepared a list with some awesome learning resources below.

**Thanks for reading!**

Daitan's Data Scientist, Saullo Haniell Galvão de Oliveira authored this article, which can also be found on ** Medium**, courtesy of

**at this link. Additional thanks goes to Daitan's Innovation team for editorial support and publication.**

*Better Programming*## Recommended Learning Resources

### Video lectures

### Books

- Boyd, S. and Vandenberghe, L. (2004).
*Convex Optimization*. Cambridge University Press. - Bertsekas, D. (1999).
*Nonlinear Programming*. Athena Scientific.

### Seminal papers

- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
- Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.
- Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
- Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1):3–34.