Kalyan Veeramachaneni - IEEE Xplore Author Profile

Results

Time series anomaly detection is a vital task in many domains, including patient monitoring in healthcare, forecasting in finance, and predictive maintenance in energy industries. This has led to a proliferation of anomaly detection methods, including deep learning-based methods. Benchmarks are essential for comparing the performances of these models as they emerge, in a fair, rigorous, and reprod...
Explanations of machine learning (ML) model predictions generated by Explainable AI (XAI) techniques such as SHAP are essential for people using ML outputs for decision-making. We explore the potential of Large Language Models (LLMs) to transform these explanations into human-readable, narrative formats that align with natural communication. We address two key research questions: (1) Can LLMs reli...
The flexible nature of large language models allows them to be used for diverse applications. Recent studies have showcased numerous abilities of these models, including performing time series forecasting. In this paper, we present a novel study of large language models used for the challenging task of time series anomaly detection. This problem entails two novel aspects for LLMs specifically: fir...
Anomaly detection on time series data is increasingly common across various industrial domains that monitor metrics in order to prevent potential accidents and economic losses. However, a scarcity of labeled data and ambiguous definitions of anomalies can complicate these efforts. Recent unsupervised machine learning methods have made remarkable progress in tackling this problem using either singl...
Machine learning (ML) is increasingly applied to Electronic Health Records (EHRs) to solve clinical prediction tasks. Although many ML models perform promisingly, issues with model transparency and interpretability limit their adoption in clinical practice. Directly using existing explainable ML techniques in clinical settings can be challenging. Through literature surveys and collaborations with ...
Machine learning (ML) is being applied to a diverse and ever-growing set of domains. In many cases, domain experts - who often have no expertise in ML or data science - are asked to use ML predictions to make high-stakes decisions. Multiple ML usability challenges can appear as a result, such as lack of user trust in the model, inability to reconcile human-ML disagreement, and ethical concerns about...
Time series anomalies can offer information relevant to critical situations facing various fields, from finance and aerospace to the IT, security, and medical domains. However, detecting anomalies in time series data is particularly challenging due to the vague definition of anomalies and said data's frequent lack of labels and highly complex temporal correlations. Current state-of-the-art unsup...
An estimated 180 papers focusing on deep learning and EHR were published between 2010 and 2018. Despite the common workflow structure appearing in these publications, no trusted and verified software framework exists, forcing researchers to arduously repeat previous work. In this paper, we propose Cardea, an extensible open-source automated machine learning framework encapsulating common predictio...
We present an automated learning system that continuously gathers domain data from open repositories, develops a deep learning model, uses the model to make detections, publishes unreported malicious domains, leverages threat intelligence to label the detected domains, and periodically updates the detection models. The results presented in this paper show that the system not only extends the detec...
Many businesses ("") across industries hire technology service providers ("providers") to develop and maintain software applications. The provider in turn hires a team, distributed across the globe and filling out reports focused on what is happening locally. This results in hundreds of reports covering the provider's portfolio of projects, each with dozens of fields. The task of sorting throug...
We present AnonML, a system for privacy-preserving model generation over a network of peers. Our goal is to allow a group of users to combine enough data to generate useful machine learning models without revealing private information. In our setting, each peer has a single row of featurized data according to a shared schema, and an aggregator would like to train a binary classification model on t...
Feature engineering is a critical step in a successful data science pipeline. This step, in which raw variables are transformed into features ready for inclusion in a machine learning model, can be one of the most challenging aspects of a data science effort. We propose a new paradigm for feature engineering in a collaborative framework and instantiate this idea in a platform, FeatureHub. In our a...
In this paper, we describe a system for sequential hyperparameter optimization that scales to work with complex pipelines and large datasets. Currently, the state-of-the-art in hyperparameter optimization improves on randomized and grid search by using sequential Bayesian optimization to explore the space of hyperparameters in a more informed way. These methods, however, are not scalable, as the e...
In this paper, we present Auto-Tuned Models, or ATM, a distributed, collaborative, scalable system for automated machine learning. Users of ATM can simply upload a dataset, choose a subset of modeling methods, and choose to use ATM's hybrid Bayesian and multi-armed bandit optimization system. The distributed system works in a load-balanced fashion to quickly deliver results in the form of ready-to...
Aiming at massive participation and open access education, Massive Open Online Courses (MOOCs) have attracted millions of learners over the past few years. However, the high dropout rate of learners is considered to be one of the most crucial factors that may hinder the development of MOOCs. To tackle this problem, statistical models have been developed to predict dropout behavior based on learner...
In this paper we present a novel Markov Switching generative model for continuous multivariate time series and longitudinal data based on Gaussian copula functions. We assume that the values of the multivariate time series at every time slice are sampled out of a joint probability distribution that is selected by the latent state. The use of Gaussian copula functions gives the flexibility of indivi...
In this paper, we introduce "prediction engineering" as a formal step in the predictive modeling process. We define a generalizable 3 part framework - Label, Segment, Featurize (L-S-F) - to address the growing demand for predictive models. The framework provides abstractions for data scientists to customize the process to unique prediction problems. We describe how to apply the L-S-F framework to ...
The goal of this paper is to build a system that automatically creates synthetic data to enable data science endeavors. To achieve this, we present the Synthetic Data Vault (SDV), a system that builds generative models of relational databases. We are able to sample from the model and create synthetic data, hence the name SDV. When implementing the SDV, we also developed an algorithm that computes ...
In this paper, we designed a formal language, called Trane, for describing prediction problems over relational datasets, and implemented a system that allows data scientists to specify problems in that language. We show that this language is able to describe several prediction problems, even those on KAGGLE, a data science competition website. We express 29 different KAGGLE problems in this langu...
We present AI2, an analyst-in-the-loop security system where Analyst Intuition (AI) is put together with state-of-the-art machine learning to build a complete end-to-end Artificially Intelligent solution (AI). The system presents four key features: a big data behavioral analytics platform, an outlier detection system, a mechanism to obtain feedback from security analysts, and a supervised learning...
In this paper, we present the concept of data science foundry for data from Massive Open Online Courses. In the foundry we present a series of software modules that transform the data into different representations. Ultimately, each online learner is represented using a set of variables that capture his/her online behavior. These variables are captured longitudinally over an interval. Using this r...
In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions al...
Physiological signals such as blood pressure might contain key information to predict a medical condition, but are challenging to mine. Wavelets possess the ability to unveil location-specific features within signals but there exists no principled method to choose the optimal scales and time shifts. We present a scalable, robust system to find the best wavelet parameters using Gaussian processes (...
We introduce FCUBE, a cloud-based framework that enables machine learning researchers to contribute their learners to its community-shared repository. FCUBE exploits data parallelism in lieu of algorithmic parallelization to allow its users to efficiently tackle large data problems automatically. It passes random subsets of data generated via resampling to multiple learners that it executes simult...
Program autotuning has been shown to achieve better or more portable performance in a number of domains. However, autotuners themselves are rarely portable between projects, for a number of reasons: using a domain-informed search space representation is critical to achieving good results; search spaces can be intractably large and require advanced machine learning techniques; and the landscape of ...