Projects
1. Deep Learning for Identifying Important Motifs in DNA Sequences
The analysis of gene structure is an important task in bioinformatics. The sequential nature of genomic data, together with the syntactic structure that natural language and the genetic code share, makes the genomics domain well suited to NLP techniques.
We explore the structure of the human genome by training neural network architectures to distinguish between DNA sequences containing different types of genomic elements (introns, exons, splice junctions), as well as to predict the exact positions of splice junctions in DNA sequences. Subsequently, we identify important sequence motifs that code for exon location in the DNA sequences. We use deep learning architectures like Convolutional Neural Networks and Long Short-Term Memory networks, and introduce DeepDeCode, an attention-based deep learning model, to perform these tasks. We use various intrinsic and extrinsic visualization techniques to infer biologically relevant information learnt by these models.
We show that our models can successfully perform these tasks while also attending to biologically plausible regions of the input when making predictions. Given these results, we expect that our methodology can be extended to the discovery of splice sites not previously located through wet-lab methods, as well as to the structural patterns of other subtle genomic elements.
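For illustration, here is a minimal PyTorch sketch of the kind of sequence model described above, combining a convolutional motif detector, a bidirectional LSTM and additive attention. The layer sizes are illustrative and this is not the exact DeepDeCode configuration (see the repository for that):

```python
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (4, length) one-hot tensor."""
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        x[BASES[base], i] = 1.0
    return x

class AttnSequenceClassifier(nn.Module):
    """Illustrative CNN + BiLSTM + attention classifier for DNA sequences."""
    def __init__(self, n_classes=3, channels=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(4, channels, kernel_size=9, padding=4)  # motif detector
        self.lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)                          # additive attention scores
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, 4, seq_len)
        h = torch.relu(self.conv(x))             # (batch, channels, seq_len)
        h, _ = self.lstm(h.transpose(1, 2))      # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # per-position attention weights
        ctx = (w * h).sum(dim=1)                 # attention-weighted summary
        return self.out(ctx), w.squeeze(-1)      # logits + weights to visualize

model = AttnSequenceClassifier()
logits, attn = model(one_hot("ACGTACGTAG").unsqueeze(0))
```

Returning the attention weights alongside the logits is what makes it possible to check whether the model attends to biologically plausible regions, such as the neighbourhood of a splice junction.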
The implementation details can be found in my GitHub repository.
2. Unsupervised Named Entity Recognition for Electronic Health Records using Bidirectional LSTM-CNN
Electronic Health Records (EHRs) are an important part of the knowledge concerning individual health histories. Extracting valuable knowledge from these records is challenging because they are often composed of highly codified notes, diverse in language, acronyms and jargon.
As a Research Engineer at the National University of Singapore, I proposed a transfer learning approach to the medical concept extraction task from patient reports, as presented in the 2010 i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records, in an unsupervised setting. I used a Convolutional Neural Network (CNN) to extract character-level features by converting discrete features into continuous vector representations. I generated entity labels for the unlabelled dataset using a Long Short-Term Memory (LSTM) model trained in a supervised setting. The extracted features, along with the generated labels, are then used to train a hybrid Bidirectional LSTM-CNN model to perform Named Entity Recognition in an unsupervised setting.
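A minimal PyTorch sketch of the hybrid tagger described above; the layer sizes and the number of tags (BIO labels over the i2b2 concept types) are illustrative, not the exact configuration from the project:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level feature extractor: embed characters, convolve,
    then max-pool over positions to get one fixed-size vector per word."""
    def __init__(self, n_chars=128, char_dim=25, n_filters=30):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=3, padding=1)

    def forward(self, chars):                   # chars: (N, max_chars)
        e = self.emb(chars).transpose(1, 2)     # (N, char_dim, max_chars)
        return torch.relu(self.conv(e)).max(dim=2).values  # (N, n_filters)

class BiLSTMCNNTagger(nn.Module):
    """Word embeddings concatenated with char-CNN features, fed to a
    bidirectional LSTM that emits one tag distribution per token."""
    def __init__(self, vocab, word_dim=100, n_filters=30, hidden=100, n_tags=7):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, word_dim, padding_idx=0)
        self.char_cnn = CharCNN(n_filters=n_filters)
        self.lstm = nn.LSTM(word_dim + n_filters, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, words, chars):            # words: (B, T); chars: (B, T, C)
        B, T, C = chars.shape
        cf = self.char_cnn(chars.view(B * T, C)).view(B, T, -1)
        x = torch.cat([self.word_emb(words), cf], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)                      # per-token tag logits

tagger = BiLSTMCNNTagger(vocab=5000)
words = torch.randint(1, 5000, (2, 12))         # 2 sentences, 12 tokens each
chars = torch.randint(1, 128, (2, 12, 15))      # 15 characters per token
logits = tagger(words, chars)                   # (2, 12, n_tags)
```

In the unsupervised setting, the `words`/`chars` batches come from the unlabelled reports and the training targets are the pseudo-labels produced by the supervised LSTM.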
The preliminary implementation details can be found in my GitHub repository. My Poster was accepted to the 14th Women in Machine Learning (WiML) Workshop, held at NeurIPS, Vancouver, Canada, 2019.
3. Estimating Uncertainty of Neural Networks for Detection of Diabetic Retinopathy
Diabetic Retinopathy (DR) is a leading cause of vision loss globally. Of an estimated 500 million people with diabetes mellitus worldwide, approximately one third have signs of DR, and of these, a further one third of cases are vision-threatening.
As a Research Engineer at the National University of Singapore, I was part of the Singapore Eye Lesion Analyzer (SELENA) project, a collaboration with the Singapore Eye Research Institute (SERI) that examines retinal fundus photographs for eye conditions like diabetic retinopathy, with the goal of deploying it as the first nationally adopted automated screening solution. I estimated the uncertainty of neural networks for automated screening of DR using the PyTorch framework. I also generated visual explanations of the deep learning system, conveying which pixels in the image influence its decision, using the Integrated Gradients method. I designed and architected the complete workflow to input data, train models, estimate uncertainty and visualize the output, so that it can be used by end-users such as clinicians.
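Monte Carlo dropout is one common way to obtain such uncertainty estimates in PyTorch; the sketch below shows that idea as an illustration (not necessarily the exact method used in SELENA), plus the Captum call for Integrated Gradients attributions:

```python
import torch

def enable_dropout(model):
    """Keep only the dropout layers stochastic at inference time;
    batch norm and the rest of the network stay in eval mode."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

def mc_dropout_predict(model, x, n_samples=20):
    """Average repeated stochastic forward passes; the spread across
    passes serves as an estimate of predictive uncertainty."""
    enable_dropout(model)
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)   # prediction, uncertainty

# Pixel-level attributions with Integrated Gradients via the Captum library:
#   from captum.attr import IntegratedGradients
#   ig = IntegratedGradients(model)
#   attributions = ig.attribute(x, target=predicted_class)
```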
The implementation details can be found in my GitHub repository. My Poster was accepted to the 13th Women in Machine Learning (WiML) Workshop, held at NeurIPS, Montreal, Canada, 2018.
4. Building an R Package: BayesSentinel - Simulation and Classification of Multi-Dimensional Spectroscopic Data
The project aims to develop, study and implement supervised and unsupervised classification methods for heterogeneous data containing missing and/or aberrant values. The methods implemented are developed to process satellite and aerial data for ecology and cartography.
During my internship at INRIA Lille-Nord Europe, I built an R package end-to-end to simulate and classify high-volume, multi-dimensional, temporal spectroscopic data provided by the Sentinel-2 satellite covering the area of France. I built different statistical models to estimate the covariance of the spectroscopic data and implemented a Bayes probabilistic model to classify the data into different classes. I used the S4 framework to incorporate the object-oriented programming paradigm in the R package, wrote vignettes to explain the code, documented the code using Roxygen, and used Git with GitHub for version control.
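The package itself is written in R; as a language-neutral illustration of the underlying idea, the plug-in Bayes rule with Gaussian class-conditional densities reduces to the following Python sketch (a simplified stand-in for the package's covariance models):

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """Estimate a mean, covariance and prior per class: the ingredients
    of the plug-in Bayes classifier for Gaussian class densities."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def predict(params, X):
    """Assign each row to the class with the largest Gaussian log-posterior."""
    classes, scores = list(params), []
    for mu, cov, prior in params.values():
        d = X - mu
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(-0.5 * (np.einsum("ij,jk,ik->i", d, inv, d) + logdet)
                      + np.log(prior))
    return np.array(classes)[np.argmax(np.stack(scores), axis=0)]
```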
The implementation details can be found in my GitHub repository. My Poster was accepted to the 13th Women in Machine Learning (WiML) Workshop, held at NeurIPS, Montreal, Canada, 2018.
5. Context-Aware Music Recommendation using Factorization Machines
Today, we are facing a fundamental change in the way people consume music, as listeners switch from limited private music collections to public streaming services containing millions of tracks. Hence, we need efficient Music Information Retrieval practices to analyse the vast amounts of data generated.
During my internship at Academia Sinica, Taiwan, I introduced a novel dataset, #nowplaying-RS, containing 11.6 million listening events taken from Twitter along with audio features of the respective tracks from Spotify. I split the dataset into training and test sets based on different scenarios (warm users, cold-start users, etc.), and used Factorization Machines, a state-of-the-art machine learning algorithm, to make context-aware music recommendations. I used different user contexts and track content features, and also analysed hashtags to make sentiment-based track recommendations. I evaluated the Mean Average Precision and Mean Reciprocal Rank of each method and compared them against several baselines.
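For reference, a second-order Factorization Machine scores one interaction in linear time using the standard reformulation of the pairwise term; a NumPy sketch (the feature layout shown in the comment is illustrative):

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Factorization Machine prediction for a feature vector x:
    y = w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j,
    computed in O(k*n) as 0.5 * sum_f [(Vx)_f^2 - (V^2 x^2)_f]."""
    s = V.T @ x                     # (k,) per-factor weighted sums
    s2 = (V ** 2).T @ (x ** 2)      # (k,) per-factor sums of squares
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

# x concatenates one-hot user and track IDs with context features
# (hour of day, hashtag sentiment, audio features, ...); V holds one
# k-dimensional latent vector per feature, so the model generalizes
# to user-track-context combinations never seen during training.
```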
The implementation details can be found in my GitHub repository. My Paper was published in the Proceedings of the Sound & Music Computing Conference (SMC '18), Limassol, Cyprus. Dataset can be downloaded from this website.
6. Text and Natural Image Deblurring using Neural Networks
Inspired by biological neural networks, artificial neural networks have been used to achieve remarkable results across domains. It is no wonder that deep learning is a buzzword in today's computing world.
During my winter internship at IIT Kharagpur, I performed text and natural image deblurring using the Caffe deep learning framework. I applied non-blind deconvolution methods using PyCaffe and MATLAB (for testing) to achieve space-invariant deblurring of images. I used a 3-layer Convolutional Neural Network (a super-resolution architecture) to restore defocused, blurred images.
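A minimal PyTorch sketch of such a 3-layer network, following the SRCNN-style 9-1-5 layout (the project itself used Caffe; filter counts here are illustrative):

```python
import torch.nn as nn

class DeblurCNN(nn.Module):
    """Three convolutional stages in the SRCNN style: patch extraction,
    non-linear mapping, and reconstruction of the restored image."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):   # x: blurred image; output: same-size sharp estimate
        return self.net(x)
```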
The implementation details can be found in my GitHub repository.
7. Twitter Sentiment Analysis
Social Media Analytics and Sentiment Analysis can give crucial insight into a user's opinions and behaviour.
Using R, I imported live tweets using the Twitter API. After cleaning and pre-processing the data by removing emoticons, URLs and stopwords, I used lexical analysis as well as a Naive Bayes classifier to predict the sentiment of tweets for any given search hashtag. I then presented the results graphically through bar plots, histograms, pie charts, word clouds and timelines. I also found the top trending tweets and top tweeters for a hashtag, and created an interactive front-end using the R Shiny app.
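The project was implemented in R; for illustration, here is a compact Python analogue of the cleaning and Naive Bayes steps (toy data only; the real labels came from lexicons and the live Twitter stream):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def clean_tweet(text):
    """Strip URLs, mentions and non-alphabetic characters; lowercase the rest."""
    text = re.sub(r"http\S+|@\w+", " ", text)
    return re.sub(r"[^A-Za-z\s]", " ", text).lower()

tweets = ["I love this!", "Terrible service, never again", "pretty good day"]
labels = ["positive", "negative", "positive"]

model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit([clean_tweet(t) for t in tweets], labels)
print(model.predict([clean_tweet("what a great #day")]))
```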
A demo of the project can be viewed alongside, and the implementation details can be found in my GitHub repository.
8. Reversible Digital Watermarking
Digital watermarking has many use cases in security and image processing, and some sensitive applications, such as those in the military and medicine, require analysis of both the original image and the digital watermark.
During my summer internship at IIT Kharagpur, I coded in OpenCV using C++ to achieve increased embedding capacity using a colour-space transformation from RGB to YCoCg. I embedded a watermark and allowed full extraction of both the original image and the watermark data, ensuring 100% reversibility. I improved time efficiency by 10% by parallelizing the OpenCV code. I also worked on implementing multi-layer watermark embedding to further increase the embedding capacity of the image.
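The lifting-based YCoCg-R variant of this colour-space transform maps integers to integers and is exactly invertible, which is what makes it suitable for 100% reversible watermarking; a Python sketch (the project's C++ code may use a slightly different formulation):

```python
def rgb_to_ycocg_r(r, g, b):
    """Forward lifting steps of the reversible YCoCg-R transform."""
    co = r - b
    t = b + (co >> 1)
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg

def ycocg_r_to_rgb(y, co, cg):
    """Undo each lifting step in reverse order: because the same rounded
    value is subtracted back, the original RGB is recovered exactly."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b

# Sanity check: sample 8-bit triples round-trip with no loss.
for rgb in [(200, 120, 30), (0, 255, 7), (13, 13, 13)]:
    assert ycocg_r_to_rgb(*rgb_to_ycocg_r(*rgb)) == rgb
```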
The implementation details can be found in my GitHub repository.
9. Real-Time Network Analytics
Vehere specializes in state-of-the-art Communications Intelligence and Cyber Defence capabilities across the broad communications spectrum, and in mission-critical protection against advanced cyber threats and data breaches.
During my internship in the Product Management Team, I worked on a live project to predict the health status of a wireless network link. I performed time series analysis using the ARIMA model to profile and forecast various Key Performance Indicators (KPIs) of two devices in the network. I visualized the results using different graphs, and consolidated them to determine which device might fail in the future.
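A minimal Python sketch of the forecasting step using statsmodels (the file name is hypothetical and the ARIMA order is illustrative; in practice it would be chosen per KPI, e.g. by comparing AIC values):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical export of one device's KPI samples, indexed by timestamp
# (e.g. link throughput or error rate collected at fixed intervals).
series = pd.read_csv("kpi.csv", index_col="timestamp", parse_dates=True)["value"]

model = ARIMA(series, order=(2, 1, 2))   # (p, d, q) orders of the ARIMA model
fit = model.fit()
forecast = fit.forecast(steps=24)        # forecast the next 24 intervals
print(fit.summary())
```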
More details about the work I did can be found in my GitHub repository and this Report.
10. Credit Score Modelling
As an intern at Digital Cloud Tech, Kolkata, I developed an R-based Big Data solution to process numerous user records. A number of variables used to compute the credit scores of a large set of people were available; these had to be analysed and the creditworthiness of each person evaluated. I implemented a multi-parameter model to provide credit score ratings, and successfully predicted and identified probable loan defaulters from analytics on past cases.
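As a rough illustration of this kind of model (the original solution was in R, and the file and column names below are hypothetical), logistic regression on past cases yields a default probability that can be rescaled into a credit-score-like rating:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical extract of past cases; "default" is a 0/1 outcome column.
df = pd.read_csv("loans.csv")
X, y = df.drop(columns="default"), df["default"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
risk = clf.predict_proba(X_test)[:, 1]   # estimated probability of default
score = (1 - risk) * 550 + 300           # map onto a 300-850-style scale
print("held-out accuracy:", clf.score(X_test, y_test))
```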
More details about the work I did can be found in this Report.