Severin Gsponer
Learning prediction models for sequential data of all kinds, such as DNA sequences, amino acid sequences, music, and, more recently, time series, in order to perform either classification or regression is an important subfield of machine learning that faces very specific and hard problems. As a result, various classical machine learning methods have been adapted to work on such data, but the challenges of this setting have also given rise to entirely new and specialized techniques. All of these methods can be analyzed with regard to their accuracy, efficiency, and interpretability. In practice, these properties often stand in conflict with each other, and for a given application a suitable trade-off has to be found. Finding the right balance depends heavily on the application constraints but also on the characteristics of the data. Deep learning techniques, for example, are able to learn highly complex relationships between features and often yield good performance, but they come with high computational cost and are typically hard to train. Linear models, on the other hand, can be trained efficiently, but they cannot directly capture higher-order interactions between features, such as non-linear relationships, and may not produce accurate models in such cases. Nevertheless, linear models have been shown to perform well on very large feature sets and are widely used in industry. Moreover, with a linear model we can easily examine which features have the biggest impact on the prediction, which helps to explain predictions and simplifies debugging. This work aims to explore and push the limits of simple machine learning methods for sequence classification and regression with regard to accuracy, scalability, and interpretability.
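As a minimal sketch of the interpretability claim above (not taken from this work): fitting a linear model on k-mer count features of toy DNA sequences lets us rank features directly by their learned coefficients. The data, the 3-mer representation, and the use of scikit-learn are all illustrative assumptions, not the method proposed in this thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy DNA sequences with made-up binary labels (illustrative only).
sequences = ["ACGTACGT", "TTGACCA", "ACGTTTGA", "CCAACGT"]
labels = [1, 0, 1, 0]

# Represent each sequence by its 3-mer counts (character n-grams).
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(sequences)

model = LogisticRegression().fit(X, labels)

# Coefficient magnitude tells us which 3-mers drive the prediction,
# giving a direct, human-readable explanation of the model.
ranked = sorted(
    zip(vectorizer.get_feature_names_out(), model.coef_[0]),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for kmer, weight in ranked[:5]:
    print(f"{kmer}: {weight:+.3f}")
```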