The FSinR package contains functions for performing the feature selection task. More specifically, it contains a large number of filter and wrapper functions widely used in the literature. These can be integrated into search methods, although they can also be executed individually. To generate wrapper measures, FSinR uses the functions for training classification and regression models available in the R caret package, which gives it access to a broad collection of learning methods and options. In addition, the package has been implemented so that its use is as easy and intuitive as possible. This is why the calls to all search methods and all filter and wrapper functions follow the same structure.
The way to install the package from the CRAN repository is as follows:
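A minimal sketch of the installation step, assuming a standard R setup and that the package is currently available from your CRAN mirror:

```r
# Install FSinR from CRAN (only needed once per R library)
install.packages("FSinR")

# Attach the package for the current session
library(FSinR)
```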
As mentioned above, the package contains numerous filter and wrapper methods that can be executed as evaluation measures within a search algorithm in order to find a subset of features. This subset of features is used to generate models that represent the data set in a better manner. Therefore, the best way to use the filter and wrapper methods is through the search functions (although they can also be run independently). The search functions present in the package are the following:
These search functions are the main functions on which the package works. They all share the same structure and the same three parameters: the dataset, the name of the class variable, and the wrapper or filter method used as evaluation measure.
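To make this common structure concrete, here is a simplified base R sketch of a greedy forward search that takes a dataset, a class name and an evaluation function. It is not FSinR's implementation: `forward_selection` and `centroid_accuracy` are hypothetical names introduced only for illustration, and the sketch assumes that the measure is to be maximized.

```r
# Hypothetical evaluator with the (data, class, features) signature:
# resubstitution accuracy of a nearest-centroid classifier (higher is better).
centroid_accuracy <- function(data, class, features) {
  x <- as.matrix(data[, features, drop = FALSE])
  y <- data[[class]]
  # Per-class feature means: class sums divided by class counts
  centroids <- rowsum(x, y) / as.vector(table(y))
  # Squared Euclidean distance from every observation to every centroid
  d <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(x, 2, centroids[k, ])^2)
  })
  pred <- rownames(centroids)[apply(d, 1, which.min)]
  mean(pred == as.character(y))
}

# Greedy forward search: add the feature that most improves the measure,
# and stop as soon as no candidate improves it.
forward_selection <- function(data, class, evaluator) {
  remaining <- setdiff(names(data), class)
  selected <- character(0)
  best <- -Inf
  while (length(remaining) > 0) {
    scores <- vapply(remaining,
                     function(f) evaluator(data, class, c(selected, f)),
                     numeric(1))
    if (max(scores) <= best) break
    best <- max(scores)
    chosen <- remaining[which.max(scores)]
    selected <- c(selected, chosen)
    remaining <- setdiff(remaining, chosen)
  }
  list(features = selected, value = best)
}

result <- forward_selection(iris, "Species", centroid_accuracy)
print(result)
```

Because the evaluator is just an argument, the same search loop works with any filter or wrapper measure that follows this signature.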
It is important to note that the FSinR package does not split the data into training data and test data, but instead applies the feature selection over the entire data set passed to it as a parameter. Therefore, in a modeling process, the user should split the data into training and test sets before running the feature selection, and apply it to the training set only. Missing data must also be processed before using the package. The following are some examples of using the package along with a brief description of how it works.
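As an illustration of this preparation step, the following base R sketch removes incomplete rows and holds out a test set. The 70/30 proportion, the seed and the use of listwise deletion are arbitrary choices made for this example, not recommendations of the package.

```r
set.seed(42)  # arbitrary seed, only for reproducibility of the example

# One simple way to handle missing data: keep complete rows only
data <- iris[complete.cases(iris), ]

# Hold out 30% of the rows; feature selection would use 'train' only
train_idx <- sample(seq_len(nrow(data)), size = round(0.7 * nrow(data)))
train <- data[train_idx, ]
test  <- data[-train_idx, ]
```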
To demonstrate in a simple manner how wrapper methods work, the iris dataset will be used in this example. The dataset consists of 150 instances of 4 variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) that determine the type of iris plant. The target variable, Species, has 3 possible classes (setosa, versicolor, virginica). For a correct use of the package, the data on which the feature selection is performed must be the training data. However, the main objective of this vignette is to illustrate the use of the package, not the complete modeling process, so in this case the whole dataset will be used without partitioning.
In the package, wrapper methods are passed as an evaluation measure to the search algorithms. The possible wrapper methods to use are the 238 models available in caret. In addition, the caret package offers the possibility of establishing a group of options to personalize the models (e.g. resampling techniques, evaluation measurement, grid parameters, …) using the train functions. In FSinR, the wrapperGenerator function is used to set all these parameters and use them to generate the wrapper model, with the caret methods working in the background. The wrapperGenerator function has as parameters, among others, the arguments of the train function (x, y, method and trainControl are not necessary).
In the example we use a knn model, since the iris problem is a classification problem. Depending on the metric, the FSinR package automatically detects whether the objective of the problem is to maximize or to minimize it. To tune the model, the resampling method is set to 10-fold cross-validation, the dataset is centered and scaled, accuracy is used as the metric, and a grid search over the k parameter is performed.
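The configuration just described could be sketched as follows. This assumes the interface of the FSinR version described in this text, where wrapperGenerator receives the caret model name followed by two lists of options. The names `resamplingParams` and `fittingParams` and the range of k values are assumptions made for this example, and other releases of the package may name these functions differently.

```r
library(FSinR)

# Options intended for caret::trainControl(): 10-fold cross-validation
resamplingParams <- list(method = "cv", number = 10)

# Options intended for caret::train(): center and scale the data,
# use accuracy as the metric and search a grid over k (range chosen
# arbitrarily for this sketch)
fittingParams <- list(
  preProcess = c("center", "scale"),
  metric     = "Accuracy",
  tuneGrid   = expand.grid(k = 1:20)
)

# Assumed argument order: model name, resampling options, fitting options
wrapper <- wrapperGenerator("knn", resamplingParams, fittingParams)
```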
For more details, the way in which caret trains and tunes a model can be seen here. The link contains tutorials on how to use the caret functions, and also shows the parameters that the functions accept and the possible values they can take. A list of available models in caret can be found here.
The wrapper model is obtained as a result of the call to the wrapperGenerator function, and is passed as a parameter to the search function. The search algorithms used in this example are sfs (Sequential Forward Selection) and ts (Tabu Search). As mentioned earlier, the search methods have 3 parameters that are always present: the dataset, the class name, and the wrapper or filter method. In addition, each algorithm has its own parameters for optimal modeling. Examples of calls to the search functions with default parameters are as follows:
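A hedged sketch of such default calls, again assuming the FSinR interface described in this text (wrapperGenerator, sfs and ts). The wrapper settings and the k range are illustrative, and `str()` is used to inspect the result so that no field names have to be assumed.

```r
library(FSinR)
library(caret)

# Wrapper built on a caret knn model (option values are illustrative)
wrapper <- wrapperGenerator(
  "knn",
  list(method = "cv", number = 10),
  list(preProcess = c("center", "scale"),
       metric     = "Accuracy",
       tuneGrid   = expand.grid(k = 1:20))
)

set.seed(1)  # cross-validation folds are random

# The three parameters that are always present: data, class name, measure
result_sfs <- sfs(iris, "Species", wrapper)

# FSinR's ts is called with its namespace to avoid confusion with stats::ts
result_ts <- FSinR::ts(iris, "Species", wrapper)

# Inspect the structure of the returned list
str(result_sfs)
```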
It is also possible to set some specific parameters of the search method. In this case, the number of algorithm iterations, the size of the tabu list, and an intensification phase and a diversification phase of 5 iterations each are established. Although it has not been taken into account in this example, it is important to note that numNeigh, the number of neighbors considered and evaluated in each iteration of the algorithm, defaults to all available neighbors, and that a high value of this parameter considerably increases the calculation time. Most FSinR search algorithms include a parameter called verbose which, if set to TRUE, prints the progress and information of the iterations on the console.
The search algorithm call returns a list of the most important results and the most important details of the process.
In the example, the output only shows the best subset of features chosen in the feature selection process and the accuracy measurement obtained. In this particular case, the tabu search result also returns the status of the tabu list in each iteration, as well as the subset of features chosen in each iteration, although this is not shown here because it is too extensive.
The wrapper method generated above can also be used directly, without being inside a search algorithm. To do this, the data set, the name of the variable to be predicted, and a vector with the names of the features to be taken into account are passed to the method as parameters.
This returns the evaluation measure of the wrapper method on the data set restricted to the variables passed as parameter, in this case all of them. As can be seen, this value is lower, and therefore worse, than the one obtained with the feature selection process.
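The direct call described above might look like the following sketch. It assumes the same wrapperGenerator interface as the rest of this text, and the wrapper options are illustrative.

```r
library(FSinR)
library(caret)

wrapper <- wrapperGenerator(
  "knn",
  list(method = "cv", number = 10),
  list(preProcess = c("center", "scale"),
       metric     = "Accuracy",
       tuneGrid   = expand.grid(k = 1:20))
)

# All four predictors of iris, i.e. every column except the class
features <- setdiff(names(iris), "Species")

set.seed(1)
# Arguments: data set, name of the class variable, vector of feature names
value_all <- wrapper(iris, "Species", features)
print(value_all)
```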
For the example of the filter method, the iris dataset is used again as in the previous example. Again, it is important to note that the purpose of this example is to show how the package works, rather than a complete modeling process, so variable selection methods are applied to the entire dataset without partitioning. Otherwise the dataset should be partitioned into training and test data, and the feature selection should be applied over the training data.
The following filter methods are implemented in the package:
These methods can be passed as parameters to the search algorithms. Unlike wrapper methods, however, filter methods are implemented directly in the package, so there is no need to generate a function before using them in the search methods, as was the case with wrapperGenerator in the wrapper example.
Therefore, the search method is executed directly with the name of the filter function as parameter. In this case the search algorithm used in the example is again Sequential Forward Selection, sfs. This algorithm is simple, so no parameters beyond the required ones are set. The chosen filter measure is the Gini Index.
This search algorithm returns a list with the best subset of features it has found, and the value of the measurement obtained.
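A sketch of this call, assuming that the Gini Index filter is exported under the name `giniIndex` in the FSinR version described here:

```r
library(FSinR)

# The filter function itself is passed as the evaluation measure
result_filter <- sfs(iris, "Species", giniIndex)

# Inspect the returned list without assuming its field names
str(result_filter)
```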
As with wrapper measures, filter measures can be used directly without the need to include them in a search algorithm. To do this, the data set, the name of the variable to be predicted, and a vector with the names of the features to be taken into account are passed to the method as parameters.
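For instance, under the same assumption about the `giniIndex` name, a direct evaluation of two features could be sketched as follows. The choice of the two petal variables is arbitrary.

```r
library(FSinR)

# Arguments: data set, name of the class variable, vector of feature names
gini_value <- giniIndex(iris, "Species", c("Petal.Length", "Petal.Width"))
print(gini_value)
```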
Filter methods can also be applied to regression problems. For this, as in the previous wrapper regression example, the mtcars dataset is used. The search method is the same algorithm as in the previous example, sfs, and the filter method is also the same.