1 Introduction
In many real tasks, data are accumulated over time, and thus, learning with streaming data has attracted much attention during the past few years. Many effective approaches have been developed, such as hoeffding tree Domingos and Hulten (2000), Bayes tree Seidl et al. (2009)
, evolving granular neural network (eGNN)
Leite et al. (2009), Core Vector Machine (CVM)
Tsang et al. (2007), etc. Though these approaches are effective for certain scenarios, they have a common assumption, i.e., the data stream comes with a fixed stable feature space. In other words, the data samples are always described by the same set of features. Unfortunately, this assumption does not hold in many streaming tasks. For example, for ecosystem protection one can deploy many sensors in a reserve to collect data, where each sensor corresponds to an attribute/feature. Due to its limitedlifespan, after some periods many sensors will wear out, whereas some new sensors can be spread. Thus, features corresponding to the old sensors vanish while features corresponding to the new sensors appear, and the learning algorithm needs to work well under such evolving environment. Note that the ability of adapting to environmental change is one of the fundamental requirements for learnware Zhou (2016), where an important aspect is the ability of handling evolvable features.A straightforward approach is to rely on the new features and learn a new model to use. However, this solution suffers from some deficiencies. First, when new features just emerge, there are few data samples described by these features, and thus, the training samples might be insufficient to train a strong model. Second, the old model of vanished features is ignored, which is a big waste of our data collection effort. To address these limitations, in this paper we propose a novel learning paradigm: Feature Evolvable Streaming Learning (FESL). We formulate the problem based on a key observation: in general features do not change in an arbitrary way; instead, there are some overlapping periods in which both old and new features are available. Back to the ecosystem protection example, since the lifespan of sensors is known to us, e.g., how long their battery will run out is a prior knowledge, we usually spread a set of new sensors before the old ones wear out. Thus, the data stream arrives in a way as shown in Figure 1, where in period , the original set of features are valid and at the end of , period appears, where the original set of features are still accessible, but some new features are included; then in , the original set of features vanish, only the new features are valid but at the end of , period appears where newer features come. This process will repeat again and again. Note that the and periods are usually long, whereas the and periods are short because, as in the ecosystem protection example, the and periods are just used to switch the sensors and we do not want to waste a lot of lifetime of sensors for such overlapping periods.
In this paper, we propose to solve the FESL problem by utilizing the overlapping period to discover the relationship between the old and new features, and exploiting the old model even when only the new features are available. Specifically, we try to learn a mapping from new features to old features through the samples in the overlapping period. In this way, we are able to reconstruct old features from new ones and thus the old model can still be applied. To benefit from additional features, we develop two ensemble methods, one is in a combination manner and the other in a dynamic selection manner. In the first method, we combine the predictions from two models and theoretically show that with the assistance of old features, the performance on new features can be improved. In the second approach, we dynamically select the best single prediction and establish a better performance guarantee when the best model switches at an arbitrary time. Experiments on synthetic and real datasets validate the effectiveness of our proposal.
The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the formulation of FESL. Our proposed approaches with corresponding analyses are presented in section 4. Section 5 reports experimental results. Finally, Section 6 concludes.
2 Related Work
Data stream mining contains several tasks, including classification, clustering, frequency counting, and time series analysis. Our work is most related to the classification task and we can also solve the regression problem. Existing techniques for data stream classification can be divided into two categories, one only considers a single classifier and the other considers ensemble classifiers. For the former, several methods origin from approaches such as decision tree
Domingos and Hulten (2000), Bayesian classification Seidl et al. (2009), neural networks Leite et al. (2009)Tsang et al. (2007), and nearest neighbour Aggarwal et al. (2006). For the latter, various ensemble methods have been proposed including Online Bagging & Boosting Oza (2005), Weighted Ensemble Classifiers Wang et al. (2003); Nguyen et al. (2012), Adapted OnevsAll Decision Trees (OVA) Hashemi et al. (2009) and Metaknowledge Ensemble Zhang et al. (2011). For more details, please refer to Gaber et al. (2005); Gama and Rodrigues (2009); Aggarwal (2010); de Andrade Silva et al. (2013); Nguyen et al. (2015). These traditional streaming data algorithms often assume that the data samples are described by the same set of features, while in many real streaming tasks feature often changes. We want to emphasize that though conceptdrift happens in streaming data where the underlying data distribution changes over time Aggarwal (2010); Gama and Rodrigues (2009); Bifet et al. (2010), the number of features in conceptdrift never changes which is different from our problem. Most studies correlated to features changing are focusing on feature selection and extraction
Samina et al. (2014); Zhou et al. (2012) and to the best of our knowledge, none of them consider the evolving of feature set during the learning process.Data stream mining is a hot research direction in the area of data mining while online learning Zinkevich (2003); Hoi et al. (2014)
is a related topic from the area of machine learning. Yet online learning can also tackle the streaming data problem since it assumes that the data come in a streaming way. Online learning has been extensively studied under different settings, such as learning with experts
CesaBianchi and Lugosi (2006) and online convex optimization Hazan et al. (2007); ShalevShwartz (2012). There are strong theoretical guarantees for online learning, and it usually uses regret or the number of mistakes to measure the performance of the learning procedure. However, most of existing online learning algorithms are limited to the case that the feature set is fixed. Other related topics involving multiple feature sets include multiview learning Li et al. (2014); Muslea et al. (2002); Xu et al. (2013)Pan and Yang (2010); Raina et al. (2007) and incremental attribute learning Guan and Li (2001). Although both our approaches and multiview learning exploit the relation between different sets of features, there exists a fundamental difference: multiview learning assumes that every sample is described by multiple feature sets simultaneously, whereas in FESL only few samples in the feature switching period have two sets of features, and no matter how many periods there are, the switching part involves only two sets of features. Transfer learning usually assumes that data are in batch mode, few of them consider the streaming cases where data arrives sequentially and cannot be stored completely. One exception is online transfer learning Zhao et al. (2014) in which data from both sets of features arrive sequentially. However, they assume that all the feature spaces must appear simultaneously during the whole learning process while such an assumption is not available in FESL. When it comes to incremental attribute learning, old sets of features do not vanish or do not vanish entirely while in FESL, old ones will vanish thoroughly when new sets of features come.The most related work is Hou and Zhou (2016), which also handles evolving features in streaming data. Different to our setting where there are overlapping periods, Hou and Zhou (2016) handles situations where there is no overlapping period but there are overlapping features. Thus, the technical challenges and solutions are different.
3 Preliminaries
We focus on both classification and regression tasks. On each round of the learning process, the algorithm observes an instance and gives its prediction. After the prediction has been made, the true label is revealed and the algorithm suffers a which reflects the discrepancy between the prediction and the groundtruth. We define “feature space" in our paper by a set of features. That the feature space changes means both the underlying distribution of the feature set and the number of features change. Consider the process with three periods where in the first period large amount of data stream come from the old feature space; then in the second period named as overlapping period, few of data come from both the old and the new feature space; soon afterwards in the third period, data stream only come from the new feature space. We call this whole process a cycle. As can be seen from Figure 1, each cycle merely includes two feature spaces. Thus, we only need to focus on one cycle and it is easy to extend to the case with multiple cycles. Besides, we assume that the old features in one cycle will vanish simultaneously by considering the example that in ecosystem protection, all the sensors share the same expected lifespan and thus they will wear out at the same time. We will study the case where old features do not vanish simultaneously in the future work.
Based on the above discussion, we only consider two feature spaces denoted by and , respectively. Suppose that in the overlapping period, there are rounds of instances both from and . As can be seen from Figure 2, the process can be concluded as follows.

For , in each round, the learner observes a vector sampled from where is the number of features of , is the number of total rounds in .

For , in each round, the learner observes two vectors and from and , respectively where is the number of features of .

For , in each round, the learner observes a vector sampled from where is the number of rounds in . Note that is small, so we can omit the streaming data from on rounds since they have minor effect on training the model in .
We use to denote the norm of a vector . The inner product is denoted by . Let and be two sets of linear models that we are interested in. We define the projection . We restrict our prediction function in th feature space and th round to be linear which takes the form where
. The loss function
is convex in its first argument and in implementing algorithms, we use logistic loss for classification task, namely and square loss for regression task, namely .4 Our Proposed Approach
In this section, we first introduce the basic idea of the solution to FESL, then two different kinds of approaches with the corresponding analyses are proposed.
The major limitation of the baseline algorithm mentioned above is that the model learned on rounds is ignored on rounds . The reason is that from rounds , we cannot observe data from feature space , and thus the model , which operates in , cannot be used directly. To address this challenge, we assume there is a certain relationship between the two feature spaces, and we try to discover it in the overlapping period. There are several methods to learn a relationship between two sets of features including multivariate regression Kibria (2007), streaming multilabel learning Read et al. (2011), etc. In our setting, since the overlapping period is very short, it is unrealistic to learn a complex relationship between the two spaces. Instead, we use a linear mapping to approximate . Assume the coefficient matrix of the linear mapping is , then during rounds
, the estimation of
can be based on least squaresThe optimal solution to the above problem is given by
Then if we only observe an instance from , we can recover an instance in by , to which can be applied. Based on this idea, we will make two changes to the baseline algorithm:

During rounds , we will learn a relationship from , , .

From rounds , we will keep on updating using the recovered data and predict the target by utilizing the predictions of and .
In round , the learner can calculate two base predictions based on models and : and By utilizing the two base predictions in each round, we propose two methods, both of which are able to follow the better base prediction empirically and theoretically. The process to obtain the relationship mapping and during rounds are concluded in Algorithm 1.
4.1 Weighted Combination
We first propose an ensemble method by combining predictions with weights based on exponential of the cumulative loss CesaBianchi and Lugosi (2006). The prediction at time is the weighted average of all the base predictions:
(2) 
where is the weight of the th base prediction. With the previous loss of each base model, we can update the weights of the two base models as follows:
(3) 
where is a tuned parameter. The updating rule of the weights shows that if the loss of one of the models on previous round is large, then its weight will decrease in an exponential rate in next round, which is reasonable and can derive a good theoretical result shown in Theorem 1. Algorithm 2 summarizes our first approach for FESL named as FESLc(ombination). We first learn a model using online gradient descent on rounds , during which, we also learn a relationship for . For , we learn a model on each round and keep updating on the recovered data showed in (4) where is a varied step size:
(4) 
Then we combine the predictions of the two models by weights calculated in (3).
Analysis
In this paragraph, we borrow the regret from online learning to measure the performance of FESLc. Specifically, we give a loss bound as follows which shows that the performance will be improved with assistance of the old feature space. For the sake of soundness, we put the proof of our theorems in the supplementary file. We define that and are two cumulative losses suffered by base models on rounds ,
(5) 
and is the cumulative loss suffered by our methods: Then we have:
Theorem 1.
Assume that the loss function is convex in its first argument and that it takes value in [0,1]. For all and for all with , with parameter satisfies
(6) 
This theorem implies that the cumulative loss of Algorithm 2 over rounds is comparable to the minimum of and . Furthermore, we define . If , it is easy to verify that is smaller than . In summary, on rounds , when is better than to certain degree, the model with assistance from is better than that without assistance.
4.2 Dynamic Selection
The combination approach mentioned in the above subsection combines several base models to improve the overall performance. Generally, combination of several classifiers performs better than selecting only one single classifier Zhou (2012). However, it requires that the performance of base models should not be too bad, for example, in Adaboost the accuracy of the base classifiers should be no less than 0.5 Freund and Schapire (1997). Nevertheless, in our FESL problem, on rounds , cannot satisfy the requirement in the beginning due to insufficient training data and may become worse when more and more data come causing a cumulation of recovered error. Thus, it may not be appropriate to combine the two models all the time, whereas dynamically selecting the best single may be a better choice. Hence we propose a method based on a new strategy, i.e., dynamic selection, similar to the Dynamic Classifier Selection Zhou (2012)
which only uses the best single model rather than combining both of them in each round. Note that, though we only select one of the models, we retain and utilize both of them to update their weights. So it is still an ensemble method. The basic idea of dynamic selection is to select the model of larger weight with higher probability. Algorithm
3 summarizes our second approach for FESL named as FESLs(election). Specifically, the steps in Algorithm 3 on rounds is the same as that in Algorithm 2. For , we still update weights of each model. However, when doing prediction, we do not combine all the models’ prediction, we adopt the result of the “best" model’s according to the distribution of their weights(7) 
To track the best model, we have a different way of updating weights which is given as follows CesaBianchi and Lugosi (2006).
(8) 
where we define , , and is the binary entropy function defined for .
Analysis
From rounds , the first model would become worse due to the cumulative recovered error while the second model will become better by the large amount of coming data. Since is initialized by which is learnt from the old feature space and is initialized randomly, it is reasonable to assume that is better than in the beginning, but inferior to after sufficient large number of rounds. Let be the round after which is worse than . We define we can verify that
(9) 
Then a more ambitious goal is to compare the proposed algorithm against from rounds to , and against the from rounds to , which motivates us to study the following performance measure Because the exact value of is generally unknown, we need to bound the worstcase An upper bound of is given as follows.
Theorem 2.
For all , if the model is run with parameter and , then
(10) 
where is the binary entropy function.
5 Experiments
In this section, we first introduce the datasets we use. We want to emphasize that we collected one real dataset by ourselves since our setting of feature evolving is relatively novel so that the required datasets are not widely available yet. Then we introduce the compared methods and settings. Finally experiment results are given.
Dataset  n  Dataset  Dataset  
Australian  690  42  29  r.ENFR  18,758  21,531  24,892  r.GRIT  29,953  34,279  15,505 
Credita  653  15  10  r.ENGR  18,758  21,531  34,215  r.GRSP  29,953  34,279  11,547 
Creditg  1,000  20  14  r.ENIT  18,758  21,531  15,506  r.ITEN  24,039  15,506  21,517 
Diabetes  768  8  5  r.ENSP  18,758  21,531  11,547  r.ITFR  24,039  15,506  24,892 
DNA  940  180  125  r.FREN  26,648  24,893  21,531  r.ITGR  24,039  15,506  34,278 
German  1,000  59  41  r.FRGR  26,648  24,893  34,287  r.ITSP  24,039  15,506  11,547 
Krvskp  3,196  36  25  r.FRIT  26,648  24,893  15,503  r.SPEN  12,342  11,547  21,530 
Splice  3,175  60  42  r.FRSP  26,648  24,893  11,547  r.SPFR  12,342  11,547  24,892 
Svmguide3  1,284  22  15  r.GREN  29,953  34,279  21,531  r.SPGR  12,342  11,547  34,262 
RFID  940  78  72  r.GRFR  29,953  34,279  24,892  r.SPIT  12,342  11,547  15,500 
5.1 Datasets
We conduct our experiments on 30 datasets consisting of 9 synthetic datasets, 20 Reuter datasets and 1 real dataset. To generate synthetic data, we randomly choose some datasets from different domains including economy and biology, etc^{1}^{1}1Datasets can be found in http://archive.ics.uci.edu/ml/. whose scales vary from 690 to 3,196. They only have one feature space at first. We artificially map the original datasets into another feature space by random Gaussian matrices, then we have data both from feature space and . Since the original data are in batch mode, we manually make them come sequentially. In this way, synthetic data are completely generated. We also conduct our experiments on 20 datasets from Reuter Amini et al. (2009). They are multiview datasets which have large scale varying from 12,342 to 29,963. Each dataset has two views which represent two different kinds of languages, respectively. We regard the two views as the two feature spaces. Now they do have two feature spaces but the original data are in batch mode, so we will artificially make them come in streaming way.
We use the RFID technique to collect the real data which contain 450 instances from and respectively. RFID technique is widely used to do moving goods detection Wang et al. (2016). In our case, we want to utilize the RFID technique to predict the location’s coordinate of the moving goods attached by RFID tags. Concretely, we arranged several RFID aerials around the indoor area. In each round, each RFID aerial received the tag signals, then the goods with tag moved, at the same time, we recorded the goods’ coordinate. Before the aerials expired, we arranged new aerials beside the old ones to avoid the situation without aerials. So in this overlapping period, we have data from both old and new feature spaces. After the old aerials expired, we continue to use the new ones to receive signals. Then we only have data from feature space . So the RFID data we collect totally satisfy our assumptions. The details of all the datasets we use are presented in Table 1.
5.2 Compared Approaches and Settings
We compare our FESLc and FESLs with three approaches. One is mentioned in Section 3, where once the feature space changed, the online gradient descent algorithm will be invoked from scratch, named as NOGD (Naive Online Gradient Descent). The other two approaches utilize the model learned from feature space by online gradient descent to do predictions on the recovered data. The difference between them is that one keeps updating with the recovered data while the other does not. The one which keeps updating is called Updating Recovered Online Gradient Descent (ROGDu) and the other which keeps fixed is called Fixed Recovered Online Gradient Descent (ROGDf). We evaluate the empirical performances of the proposed approaches on classification and regression tasks on rounds . To verify that our analysis is reasonable, we present the trend of average cumulative loss. Concretely, at each time , the loss of every method is the average of the cumulative loss over , namely We also present the classification performance over all instances on rounds on synthetic and Reuter data. The performances of all approaches are obtained by average results over 10 independent runs on synthetic data. Due to the large scale of Reuter data, we only conduct 3 independent runs on Reuter data and report the average results.
The parameters we need to set are the number of instances in overlapping period, i.e., , the number of instances in and , i.e., and and the step size, i.e., where is time. For all baseline methods and our methods, the parameters are the same. In our experiments, we set 5 or 10 for synthetic data, 50 for Reuter data and 40 for RFID data. We set almost and to be half of the number of instances, and to be where is searched in the range . The detailed setting of in for each dataset is presented in supplementary file.
5.3 Results
Here we only present part of the loss trend results, and other results are presented in the supplementary file. Figure 3 gives the trend of average cumulative loss. (ad) are the results on synthetic data, (eh) are the results on Reuter data, (i) is the result of the real data. The smaller the average cumulative loss, the better. From the experimental results, we have the following observations. First, all the curves with circle marks representing NOGD decrease rapidly which conforms to the fact that NOGD on rounds becomes better and better with more and more data coming. Besides, the curves with star marks representing ROGDu also decline but not very apparent since on rounds , ROGDu already learned well and tend to converge, so updating with more recovered data could not bring too much benefits. Moreover, the curves with plus marks representing ROGDf does not drop down but even go up instead, which is also reasonable because it is fixed and if there are some recovering errors, it will perform worse. Lastly, our methods are based on NOGD and ROGDu, so their average cumulative losses also decrease. As can be seen from Figure 3, the average cumulative losses of our methods are comparable to the best of baseline methods on all datasets and are smaller than them on 8 datasets. And FESLs exhibits slightly smaller average cumulative loss than FESLc. You may notice that NOGD is always worse than ROGDu on synthetic data and real data while on Reuter data NOGD becomes better than ROGDu after a few rounds. This is because on synthetic data and real data, we do not have enough rounds to let all methods converge while on Reuter data, large amounts of instances ensure the convergence of every method. So when all the methods converge, we can see that NOGD is better than other baseline methods since it always receives the real instances while ROGDu and ROGDf receive the recovered instances which may contain recovered error. As can be seen from (eh), in the first few rounds, our methods are comparable to ROGDu. When NOGD is better than ROGDu, our methods are comparable to NOGD which shows that our methods are comparable to the best one all the time. Moreover, FESLs performs worse than FESLc in the beginning while afterwards, it becomes slightly better than FESLc.
Table 2 shows the accuracy results on synthetic datasets and Reuter datasets. We can see that for synthetic datasets, FESLs outperforms other methods on 8 datasets, FESLc gets the best on 5 datasets and ROGDu also gets 5. NOGD performs worst since it starts from scratch. ROGDu is better than NOGD and ROGDf because ROGDu exploits the old better trained model from old feature space and keep updating with recovered instances. Our two methods are based on NOGD and ROGDu. We can see that our methods can follow the best baseline method or even outperform it. For Reuter datasets, we can see that FESLc outperforms other methods on 17 datasets, FESLs gets the best on 9 datasets and NOGD gets 8 while ROGDu gets 1. In Reuter datasets, the period on new feature space is longer than that in synthetic datasets so that NOGD can update itself to a good model. Whereas ROGDu updates itself with recovered data, so the model will become worse when recovered error accumulates. ROGDf does not update itself, thus it performs worst. Our two methods can take the advantage of NOGD and ROGDf and perform better than them.
Dataset  NOGD  ROGDu  ROGDf  FESLc  FESLs 
australian  .767.009  .849.009  .809.025  .849.009  .849.009 
credita  .811.006  .826.018  .785.051  .827.014  .831.009 
creditg  .659.010  .733.006  .716.011  .733.006  .733.006 
diabetes  .650.002  .652.009  .651.006  .652.007  .652.009 
dna  .610.013  .691.023  .608.064  .691.023  .692.021 
german  .684.006  .700.002  .700.002  .700.001  .703.004 
krvskp  .612.005  .621.036  .538.024  .626.028  .630.016 
splice  .568.005  .612.022  .567.057  .612.022  .612.022 
svmguide3  .680.010  .779.010  .748.012  .779.010  .778.010 
r.ENFR  .902.004  .849.003  .769.069  .903.003  .902.005 
r.ENGR  .867.005  .836.007  .802.036  .870.002  .870.003 
r.ENIT  .858.014  .847.014  .831.018  .861.010  .863.013 
r.ENSP  .900.002  .848.002  .825.001  .901.001  .899.002 
r.FREN  .858.007  .776.009  .754.012  .858.007  .858.007 
r.FRGR  .869.004  .774.019  .753.021  .870.004  .868.003 
r.FRIT  .874.005  .780.022  .744.040  .874.005  .873.005 
r.FRSP  .872.001  .778.022  .735.013  .872.001  .871.002 
r.GREN  .907.000  .850.007  .801.035  .907.001  .906.000 
r.GRFR  .898.001  .827.009  .802.023  .898.001  .898.000 
r.GRIT  .847.011  .851.017  .816.006  .850.018  .851.017 
r.GRSP  .902.001  .845.003  .797.012  .902.001  .902.001 
r.ITEN  .854.003  .760.006  .730.024  .856.002  .854.003 
r.ITFR  .863.002  .753.012  .730.020  .864.002  .862.003 
r.ITGR  .849.004  .736.022  .702.012  .849.004  .846.004 
r.ITSP  .839.006  .753.014  .726.005  .839.007  .839.006 
r.SPEN  .926.002  .860.005  .814.021  .926.002  .924.001 
r.SPFR  .876.005  .873.017  .833.042  .876.014  .878.012 
r.SPGR  .871.013  .827.025  .810.026  .873.013  .873.013 
r.SPIT  .928.002  .861.005  .826.005  .928.003  .927.002 
Accuracy with its variance on synthetic datasets and Reuter datasets. The larger the better. The best ones among all the methods are bold.
6 Conclusion
In this paper, we focus on a new setting: feature evolvable streaming learning. Our key observation is that in learning with streaming data, old features could vanish and new ones could occur. To make the problem tractable, we assume there is an overlapping period that contains samples from both feature spaces. Then, we learn a mapping from new features to old features, and in this way both the new and old models can be used for prediction. In our first approach FESLc, we ensemble two predictions by learning weights adaptively. Theoretical results show that the assistance of the old feature space can improve the performance of learning with streaming data. Furthermore, we propose FESLs to dynamically select the best model with better performance guarantee.
Acknowledgement
This research was supported by NSFC (61333014, 61603177), JiangsuSF (BK20160658), Huawei Fund (YBN2017030027) and Collaborative Innovation Center of Novel Software Technology and Industrialization.
References
 Aggarwal et al. (2006) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for ondemand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18:577–589, 2006.
 Aggarwal (2010) C. C. Aggarwal. Data streams: An overview and scientific applications. In Scientific Data Mining and Knowledge Discovery  Principles and Foundations, pages 377–397. Springer, 2010.
 Amini et al. (2009) M.R. Amini, N. Usunier, and C. Goutte. Learning from multiple partially observed views  an application to multilingual text categorization. In Advances in Neural Information Processing Systems 22, pages 28–36, 2009.
 Bifet et al. (2010) A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. Journal of Machine Learning Research, 11:1601–1604, 2010.
 CesaBianchi and Lugosi (2006) N. CesaBianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
 de Andrade Silva et al. (2013) J. de Andrade Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. de Carvalho, and J. Gama. Data stream clustering: A survey. ACM Computing Surveys, 46:13:1–13:31, 2013.
 Domingos and Hulten (2000) P. M. Domingos and G. Hulten. Mining highspeed data streams. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2000.
 Freund and Schapire (1997) Y. Freund and R. E. Schapire. A decisiontheoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
 Gaber et al. (2005) M. M. Gaber, A. B. Zaslavsky, and S. Krishnaswamy. Mining data streams: A review. SIGMOD Record, 34:18–26, 2005.
 Gama and Rodrigues (2009) J. Gama and P. P. Rodrigues. An overview on mining data streams. In Foundations of Computational Intelligence, pages 29–45. Springer, 2009.
 Guan and Li (2001) S. U. Guan and S. Li. Incremental learning with respect to new incoming input attributes. Neural Processing Letters, 14:241–260, 2001.
 Hashemi et al. (2009) S. Hashemi, Y. Yang, Z. Mirzamomen, and M. R. Kangavari. Adapted oneversusall decision trees for data stream classification. IEEE Transactions on Knowledge and Data Engineering, 21:624–637, 2009.
 Hazan et al. (2007) E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Maching Learning, 69:169–192, 2007.

Hoeffding (1963)
W. Hoeffding.
Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association, 58:13–30, 1963.  Hoi et al. (2014) S. Hoi, J. Wang, and P. Zhao. LIBOL: A library for online learning algorithms. Journal of Machine Learning Research, 15:495–499, 2014.
 Hou and Zhou (2016) C. Hou and Z.H. Zhou. Onepass learning with incremental and decremental features. ArXiv eprints, arXiv:1605.09082, 2016.
 Kibria (2007) B. M. Golam Kibria. Bayesian statistics and marketing. Technometrics, 49:230, 2007.
 Leite et al. (2009) D. Leite, P. Costa Jr., and F. Gomide. Evolving granular classification neural networks. In Proceedings of International Joint Conference on Neural Networks 2009, pages 1736–1743, 2009.

Li et al. (2014)
S.Y. Li, Y. Jiang, and Z.H. Zhou.
Partial multiview clustering.
In
Proceedings of the 28th AAAI Conference on Artificial Intelligence
, pages 1968–1974, 2014. 
Muslea et al. (2002)
I. Muslea, S. Minton, and C. Knoblock.
Active + semisupervised learning = robust multiview learning.
In Proceedings of the 19th International Conference on Machine Learning, pages 435–442, 2002.  Nguyen et al. (2012) H.L. Nguyen, Y.K. Woon, W. K. Ng, and L. Wan. Heterogeneous ensemble for feature drifts in data streams. In Proceedings of the 16th PacificAsia Conference on Knowledge Discovery and Data Mining, pages 1–12, 2012.
 Nguyen et al. (2015) H.L. Nguyen, Y.K. Woon, and W. K. Ng. A survey on data stream clustering and classification. Knowledge and Information Systems, 45:535–569, 2015.
 Oza (2005) N. C. Oza. Online bagging and boosting. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics 2005, pages 2340–2345, 2005.
 Pan and Yang (2010) S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345–1359, 2010.
 Raina et al. (2007) R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. Selftaught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759–766, 2007.
 Read et al. (2011) J. Read, A. Bifet, G. Holmes, and B. Pfahringer. Streaming multilabel classification. In Proceedings of the 2nd Workshop on Applications of Pattern Analysis, pages 19–25, 2011.

Samina et al. (2014)
K. Samina, K. Tehmina, and N. Shamila.
A survey of feature selection and feature extraction techniques in machine learning.
In Proceedings of Science and Information Conference 2014, pages 372–378, 2014.  Seidl et al. (2009) T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann. Indexing density models for incremental learning and anytime classification on data streams. In Proceedings of the 12th International Conference on Extending Database Technology, pages 311–322, 2009.
 ShalevShwartz (2012) S. ShalevShwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4:107–194, 2012.
 Tsang et al. (2007) I. W. Tsang, A. Kocsor, and J. T. Kwok. Simpler core vector machines with enclosing balls. In Proceedings of the 24th International Conference on Machine Learning, pages 911–918, 2007.
 Wang et al. (2003) H. Wang, W. Fan, P. S. Yu, and J. Han. Mining conceptdrifting data streams using ensemble classifiers. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, 2003.
 Wang et al. (2016) C. Wang, L. Xie, W. Wang, T. Xue, and S. Lu. Moving tag detection via physical layer analysis for largescale RFID systems. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications, pages 1–9, 2016.
 Xu et al. (2013) C. Xu, D. Tao, and C. Xu. A survey on multiview learning. ArXiv eprints, arXiv:1304.5634, 2013.
 Zhang et al. (2011) P. Zhang, J. Li, P. Wang, B. J. Gao, X. Zhu, and L. Guo. Enabling fast prediction for ensemble models on data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 177–185, 2011.
 Zhao et al. (2014) P. Zhao, S. Hoi, J. Wang, and B. Li. Online transfer learning. Artificial Intelligence, 216:76–102, 2014.

Zhou et al. (2012)
G. Zhou, K. Sohn, and H. Lee.
Online incremental feature learning with denoising autoencoders.
In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 1453–1461, 2012.  Zhou (2012) Z.H. Zhou. Ensemble methods: Foundations and algorithms. CRC press, 2012.
 Zhou (2016) Z.H. Zhou. Learnware: On the future of machine learning. Frontiers of Computer Science, 10:589–590, 2016.
 Zinkevich (2003) M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.
Supplementary Material of “Learning with Feature Evolvable Streams”
In the Appendix, we will prove the two theorems in the section “Our Proposed Approaches", and give some additional experiment results.
Appendix A: Analysis
In this section, we will give the detailed proofs of the two theorems in “Our Proposed Approaches". The two theorems are the special cases of Theorem 2.2 and Corollary 5.1 respectively in CesaBianchi and Lugosi (2006).
A.1 Proof of Theorem 1
To prove Theorem 1, we propose to bound the related quantities where
for , and . is the cumulative loss at time of the th base learner, namely . In the proof we use the following classical inequality due to Hoeffding Hoeffding (1963).
Lemma 1.
Let be a random variable with . Then for any ,
The detailed proof of Lemma 1 can be found in Section A.1 of the Appendix in CesaBianchi and Lugosi (2006).
Proof of Theorem 1.
First observe that
(11) 
On the other hand, for each ,
Now using Lemma 1, we observe that the quantity above may be upper bounded by
where we used the convexity of the loss function in its first argument and the way how the weight updates. Summing over , we get
(12) 
Combining this with the lower bound (11) and solving for , we find that
as desired. In particular, with , the upper bound becomes . ∎
A.2 Proof of Theorem 2
To prove Theorem 2, we first give some definitions. Since we only choose one base learner’s prediction in FESLs as our final prediction in each round, we use to denote the index of the base learners in th round for . We call an action. So the loss in round can be denoted as . Thus, randomly choosing one base learner in each round is a randomized version of FESLc, so we call it randomized FESLc. Denote the distribution according to which the random action is drawn at time by , and is the expected loss of randomized FESLc at time . Then we have the following lemma:
Lemma 2.
Let and . The randomized FESLc with satisfies, with probability at least
Proof.
The random variables , for , form a sequence of bounded martingale differences. With a simple application of the HoeffdingAzuma inequality and combining the results of Theorem 1, we yield the result of this lemma. ∎
In addition, is defined as the sequence of the base learner’s index such that we can study a more ambitious goal where . It is not difficult to modify the randomized FESLc in order to achieve this goal. Specifically, we associate a compound action with each sequence which only switches once. Then we can run our randomized FESLc over the set of compound actions: at any time the randomized FESLc draws a compound action and plays action . Denote by the number of all compound actions. Then, in FESLc, we only have 2 base learners while in randomized FESLc, we have base learners. Then Lemma 2 implies that is bounded by . Hence, it suffices to count the number of compound actions: for each there are ways to pick time steps where a switch occurs, and there are ways to assign a distinct action to each of the resulting blocks. This gives
where is the binary entropy function defined for . Substituting this bound in , we find that satisfies
Comments
There are no comments yet.