POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Making video question answering possible on large context scenes with minimal input tokens.
Computer Vision and Multimodal AI Researcher
I'm an Undergraduate Research Assistant at The University of Southern Mississippi in Mississippi, USA. I work in the Cyber Innovations Lab, advised by Dr. Nick Rahimi, focusing mostly on modern deep learning techniques for computer vision and multimodal systems.
I was awarded a $5,500 summer research grant in 2025 by the Drapeau Center for Undergraduate Research (DCUR) for my research on Gaussian Splatting. I also drafted the NASA EPSCoR project (PI: Dr. Rahimi) that was funded at $51,000 for the 2025-2026 academic year.
I am also the Lead Organizer of the Google Developers Group (GDG) On Campus and the Research Liaison of the School of Computing Science and Computer Engineering's Student Ambassadors.
Trivia: I play the bansuri and read literature. I also won the Eagles Write Award (best essay award) in 2023/24 for an essay I wrote in 45 minutes. I actively share my tech journey with over 3,000 organic followers on LinkedIn and 90+ on GitHub.
Selected Publications
I'm interested in computer vision, 3D/4D vision, deep learning, generative AI, and robotics. Most of my research focuses on understanding scenes, analyzing deep learning techniques, and extracting meaningful information by tinkering with model architectures. Some of my computer vision work is highlighted below; hover a thumbnail for a larger preview, or click it for a full lightbox view.
Making video question answering possible on large context scenes with minimal input tokens.
We introduce an RL-based policy that selects anchor budgets and informative Gaussians for efficient 4D Gaussian streaming, improving the quality-runtime trade-off.
This paper proposes a new multilingual dataset and methodology for detecting cyber threats spread through posts on X.
A short paper on how CLIP representations shift under different augmentation strengths and types.
A comparison of ViTs and CNNs on iSAID segmentation, including a loss function that helps a smaller CNN compete with a much larger ViT.
A robust framework to evaluate image-text pairs under perceptual, semantic, pragmatic, and distributional alignment.
We propose R1 Translator, a new model that improves the performance of EEG-to-text decoding.
A study of convolutional Kolmogorov-Arnold networks across ImageNet, MNIST, and tabular classification settings.
Analyzing zero-day cyber attacks with an MLP and SHAP, using a weighted loss that prevents the model from overfitting to the majority class.
We fine-tuned a LeNet-style CNN to perform OCR on handwritten Devanagari characters, which are more structurally complex than Roman letters.
A survey of 300+ individuals about privacy, labor, mainstream adoption, and the features people expect before robots become household companions.
We created Jelly, the first Romanized Nepali chatbot using BlenderBot, and studied the efficacy of native language in mental health conversational systems.
Using a self-curated chicken recipe dataset, I trained a GRU-based generator and studied how different preprocessing choices changed the generated recipes.