Some ideas on refactoring the polynomial regression sub-package

We are currently working out how to refactor the various developments related to polynomial regression in Minterpy. These include, but may not be limited to:

Ordinary (least-squares) regression
Windowed regression (domain decomposition)
Sparse regression (LASSO, etc.)

Perhaps we could create an abstract class that encompasses all these regression approaches. The design of this abstract class depends strongly on how the eventual concrete instances should behave (what makes a polynomial regression a polynomial regression? What should it be able to do? etc.). My take on such an abstract class is very basic and is drawn from my own typical use of a polynomial regression model.

import abc

import numpy as np


class PolynomialRegression(abc.ABC):
    """The abstract base class for all polynomial regression models."""
    
    @abc.abstractmethod
    def fit(self, xx, yy, *args, **kwargs):
        """Abstract container for fitting a polynomial regression."""
        pass
    
    @abc.abstractmethod
    def predict(self, xx):
        """Abstract container for making prediction using a polynomial regression."""
        pass
    
    @abc.abstractmethod
    def show(self):
        """Abstract container for printing out the details of a polynomial regression model."""
        pass
        
    def compute_validation_error(self, xx_valid, yy_valid, normalized=True):
        """A common method to evaluate the validation error of a polynomial regression model."""
        normalization = 1.0
        if normalized:
            normalization = np.var(yy_valid)
        
        return np.mean((self(xx_valid) - yy_valid)**2) / normalization
    
    def __call__(self, xx):
        """Evaluation of the polynomial regression model."""
        return self.predict(xx)

This means that all concrete subclasses of PolynomialRegression should be able to:

be fitted on the available data (pairs of inputs and responses)
predict the response at new data points (the instance itself is also callable)
show the necessary diagnostics on the terminal

Of course, each concrete class can extend these with details specific to that class.

For instance, OrdinaryRegression may implement the abstract class for ordinary least-squares regression without all the bells and whistles (I don't show the code here).
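Purely as an illustration of what a concrete class could look like, here is a rough sketch. It is not the Minterpy implementation: it assumes a one-dimensional input and a plain monomial basis built with NumPy, takes a polynomial degree instead of a multi-index set, and uses the hypothetical name MonomialRegression. The base class from above is repeated in compact form so that the block runs on its own.

```python
import abc

import numpy as np
from numpy.polynomial import polynomial as npoly


class PolynomialRegression(abc.ABC):
    """Compact copy of the abstract base class from above."""

    @abc.abstractmethod
    def fit(self, xx, yy, *args, **kwargs):
        """Fit the model on training data."""

    @abc.abstractmethod
    def predict(self, xx):
        """Predict the response at new points."""

    @abc.abstractmethod
    def show(self):
        """Print a summary of the model."""

    def compute_validation_error(self, xx_valid, yy_valid, normalized=True):
        """Mean squared error, optionally divided by the variance of the data."""
        normalization = np.var(yy_valid) if normalized else 1.0
        return np.mean((self(xx_valid) - yy_valid) ** 2) / normalization

    def __call__(self, xx):
        return self.predict(xx)


class MonomialRegression(PolynomialRegression):
    """Hypothetical 1D least-squares fit in the monomial basis (not Minterpy)."""

    def __init__(self, degree):
        self.degree = degree
        self.coeffs = None  # set by fit()

    def fit(self, xx, yy, *args, **kwargs):
        # Design matrix with columns 1, x, x^2, ..., x^degree
        design = npoly.polyvander(np.asarray(xx), self.degree)
        self.coeffs, *_ = np.linalg.lstsq(design, np.asarray(yy), rcond=None)
        return self

    def predict(self, xx):
        if self.coeffs is None:
            raise RuntimeError("Call fit() before predict().")
        return npoly.polyval(np.asarray(xx), self.coeffs)

    def show(self):
        print(f"MonomialRegression(degree={self.degree})")
        print(f"coefficients: {self.coeffs}")


# Small demonstration on the same kind of data as in the use case
rng = np.random.default_rng(42)
xx_train = rng.uniform(-1, 1, 50)
yy_train = np.sin(3.0 * xx_train) * xx_train

model = MonomialRegression(degree=10).fit(xx_train, yy_train)

xx_valid = rng.uniform(-1, 1, 1000)
yy_valid = np.sin(3.0 * xx_valid) * xx_valid
valid_error = model.compute_validation_error(xx_valid, yy_valid)
```

The point of the sketch is only the division of labour: the subclass supplies fit, predict, and show, while compute_validation_error and the callable behaviour come for free from the base class.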

Here is a use case example:

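import matplotlib.pyplot as plt
import minterpy as mp
import numpy as np

# OrdinaryRegression is the concrete class mentioned above (code not shown)


def xsinx(xx):
    """Test function: x * sin(3x)."""
    return np.sin(3.0 * xx) * xx
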
# Generate training dataset
xx_train = -1 + 2 * np.random.rand(50)
yy_train = xsinx(xx_train)

# To create a polynomial model, a multi-index set is required as the starting point
# (arguments: spatial dimension, polynomial degree, lp-degree)
mi = mp.MultiIndexSet.from_degree(1, 10, 1)

# Create an OrdinaryRegression instance
my_regression = OrdinaryRegression(mi)

# Fit the regression model on the available data (least-squares estimates).
my_regression.fit(xx_train, yy_train)

# Some attributes are available post-fitting...
# Leave-one-out cross-validation error
print(my_regression.loocv_error)   # 2.1044483152230322e-10
# Regression fit error
print(my_regression.regfit_error)  # 1.8407138019799618e-11

# Generate validation dataset
xx_valid = -1 + 2 * np.random.rand(1000)
yy_valid = xsinx(xx_valid)

# Predict the response on the new data points (directly call the instance)
plt.scatter(yy_valid, my_regression(xx_valid))

# or by using the implemented predict method
plt.scatter(yy_valid, my_regression.predict(xx_valid))

# To compute the validation error
my_regression.compute_validation_error(xx_valid, yy_valid)  # 1.959210481524291e-10
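The two post-fitting attributes are not defined in this post. For an ordinary least-squares fit, one common way to get a leave-one-out error without refitting once per data point uses the diagonal of the hat matrix: the leave-one-out residual at point i equals the ordinary residual divided by (1 - h_ii). The sketch below assumes that definition. The function name loo_errors, the monomial design matrix, and the normalization by the variance of the data are my assumptions here and may differ from what loocv_error and regfit_error actually compute.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly


def loo_errors(design, yy, normalized=True):
    """Return (fit error, leave-one-out CV error) of a least-squares fit.

    Assumes `design` has full column rank and more rows than columns.
    """
    # Thin QR: the hat matrix is H = Q Q^T, so h_ii = sum_j Q_ij^2
    q, _ = np.linalg.qr(design)
    leverage = np.sum(q**2, axis=1)

    # Ordinary residuals of the least-squares fit, y - H y
    residuals = yy - q @ (q.T @ yy)

    # Leave-one-out residuals via the closed-form shortcut
    loo_residuals = residuals / (1.0 - leverage)

    normalization = np.var(yy) if normalized else 1.0
    fit_error = np.mean(residuals**2) / normalization
    loocv_error = np.mean(loo_residuals**2) / normalization
    return fit_error, loocv_error


rng = np.random.default_rng(0)
xx = rng.uniform(-1, 1, 50)
yy = np.sin(3.0 * xx) * xx
design = npoly.polyvander(xx, 10)  # columns 1, x, ..., x^10

fit_error, loocv_error = loo_errors(design, yy)
```

Because every leverage lies between 0 and 1, the leave-one-out error computed this way is never smaller than the fit error, which matches the ordering of the two numbers printed above.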

This is a bare-bones prototype, but it hides a lot of detail from the users, at least for typical use cases. For example, the full underlying fitted polynomial is stored and may be accessed for more advanced usage. However, depending on how a polynomial regression should be used within the Minterpy ecosystem, additional methods and properties should be added to the class and exposed to the users.

Additional possible concrete classes would be WindowedRegression and SparseRegression. I think sparse regression is an umbrella term and LASSO is one approach to achieve sparsity, so we could either make one big class for the various supported sparse regressions or make a separate class for each, like LassoRegression.
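The two options need not be mutually exclusive. A possible middle path is an abstract SparseRegression that holds what every sparse method shares, plus one small concrete class per method. Everything in the sketch below is hypothetical: the class layout, the _solve hook, the n_active property, the one-dimensional monomial basis, and the plain ISTA (proximal gradient) loop, which merely stands in for whatever LASSO solver would be used in practice. The base class is trimmed so that the block runs on its own.

```python
import abc

import numpy as np
from numpy.polynomial import polynomial as npoly


class PolynomialRegression(abc.ABC):
    """Trimmed copy of the abstract base class (show() omitted for brevity)."""

    @abc.abstractmethod
    def fit(self, xx, yy, *args, **kwargs):
        """Fit the model on training data."""

    @abc.abstractmethod
    def predict(self, xx):
        """Predict the response at new points."""

    def __call__(self, xx):
        return self.predict(xx)


class SparseRegression(PolynomialRegression):
    """Hypothetical shared layer for all sparsity-promoting regressions."""

    def __init__(self, degree):
        self.degree = degree
        self.coeffs = None

    @abc.abstractmethod
    def _solve(self, design, yy):
        """Return the coefficient vector; each sparse method overrides this."""

    def fit(self, xx, yy, *args, **kwargs):
        design = npoly.polyvander(np.asarray(xx), self.degree)
        self.coeffs = self._solve(design, np.asarray(yy))
        return self

    def predict(self, xx):
        return npoly.polyval(np.asarray(xx), self.coeffs)

    @property
    def n_active(self):
        """Number of non-zero coefficients after fitting."""
        return int(np.count_nonzero(self.coeffs))


class LassoRegression(SparseRegression):
    """Hypothetical LASSO: min 1/(2N) ||A c - y||^2 + penalty * ||c||_1."""

    def __init__(self, degree, penalty, n_iter=5000):
        super().__init__(degree)
        self.penalty = penalty
        self.n_iter = n_iter

    def _solve(self, design, yy):
        n_samples = design.shape[0]
        # Step size 1/L, with L the Lipschitz constant of the smooth part
        step = n_samples / np.linalg.norm(design, 2) ** 2
        threshold = step * self.penalty
        coeffs = np.zeros(design.shape[1])
        for _ in range(self.n_iter):
            gradient = design.T @ (design @ coeffs - yy) / n_samples
            shifted = coeffs - step * gradient
            # Soft-thresholding: the proximal operator of the L1 penalty
            coeffs = np.sign(shifted) * np.maximum(np.abs(shifted) - threshold, 0.0)
        return coeffs


rng = np.random.default_rng(1)
xx = rng.uniform(-1, 1, 200)
yy = 1.0 + 2.0 * xx**3  # only two of the eleven monomials are active

lasso_fit = LassoRegression(degree=10, penalty=1e-3).fit(xx, yy)
zero_fit = LassoRegression(degree=10, penalty=10.0).fit(xx, yy)
```

With this layout, adding another sparse method means writing one new _solve, and a large enough penalty drives every coefficient to zero (zero_fit.n_active is 0 here), which makes for a quick sanity check.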

The overall idea is the same, though; we should strike the right balance between hiding as many details as possible for typical usages and exposing things that may be needed for more advanced usages.

I don't exactly consider the Minterpy design philosophy here because I'm not (yet) familiar with it, especially regarding how polynomial regression will be used in the long-term Minterpy development roadmap. So your feedback or your own take on the refactoring is welcome!

Edited Feb 17, 2022 by Damar Wicaksono