Data Scientist · ML / NLP Research

Soumyajit Roy

I build ML and NLP systems for fintech — credit analytics, transaction intelligence, and LLM fine-tuning on noisy real-world financial data. Currently at Pine Labs, Bangalore.

About

I'm a data scientist with over a year of experience building ML and NLP systems for fintech at Pine Labs. My work spans credit analytics, transaction intelligence, and deploying models on large transactional datasets to improve decisioning quality, cost, and operational efficiency.

I have a strong computer science foundation and I'm happiest when I'm turning messy real-world data into production-ready systems — whether that's fine-tuning an LLM on a low-resource task, forecasting payment traffic with time-series models, or pulling structure out of PDF bank statements with vision-language models.

Research

I'm actively interested in research at the intersection of the following areas:

  • NLP for financial data. Extracting structured signal from noisy real-world transaction streams and bank statements using LLMs, BERT-family models, and vision-language models.
  • LLM fine-tuning on low-resource tasks. Parameter-efficient methods (LoRA / QLoRA) and iterative self-improvement pipelines for data-scarce domains and underrepresented languages.
Publications
  1. CodeAnubad at BLP-2025 Task 2: Efficient Bangla-to-Python Code Generation via Iterative LoRA Fine-Tuning of Gemma-2

    Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), ACL, 2025.

    Iterative LoRA fine-tuning of Gemma-2-9B-it for Bangla→Python code generation; outperforms translation-based and retrieval-augmented baselines despite only 74 training examples.

    Abstract:

    This paper presents our submission for Task 2 of the Bangla Language Processing (BLP) Workshop, which focuses on generating Python code from Bangla programming prompts in a low-resource setting. We address this challenge by fine-tuning the gemma-2-9b instruction-tuned model using parameter-efficient fine-tuning (PEFT) with QLoRA. We propose an iterative self-improvement strategy that augments the extremely limited training data (74 examples) by reusing verified correct predictions from the development set, alongside LoRA rank experiments (8, 16, 32), observing a clear correlation between rank and accuracy, with rank 32 delivering the best results. Compared to translation-based and retrieval-augmented baselines, our approach achieves significantly higher accuracy, with a pass rate of 47% on the development set and 37% on the hidden test set. These results highlight the effectiveness of combining iterative data augmentation with rank optimisation for specialised, low-resource code generation tasks.

    DOI: 10.18653/v1/2025.banglalp-1.53

  2. Unveiling Natural Based Underwater Image Colour Enhancement: a Paradigm Shift in Underwater Image Processing and Colour Enhancement

    15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 2024.

    Proposes NUCE, using swarm intelligence techniques to improve colour distribution and visual quality in underwater imagery.

    Abstract:

    In several fields, including marine research, environmental monitoring, and undersea exploration, underwater imaging is essential. Unfortunately, the complicated interactions between light and water molecules make it difficult to take high-quality underwater pictures, leading to problems including low contrast, colour distortion, and poor visibility. Even with improvements in imaging technology, these problems are frequently not adequately addressed by current enhancement methods, particularly in deeper water or under unfavorable circumstances. To address these shortcomings in existing enhancement strategies, this work presents the method of Natural-based Colour Enhancement of Underwater Images (NUCE). NUCE seeks to improve underwater photos' authenticity, boost contrast, and reduce the colour cast, by using a four-step technique that includes colour cast neutralization, fusion of dual-intensity images using the mean and median as averages, mean equalization based on swarm intelligence, and unsharp masking. NUCE's ability to enhance item visibility and detail in underwater scenes by restoring real colours, boosting contrast, and minimizing colour cast has been demonstrated empirically using the EUVP dataset.

    DOI: 10.1109/ICCCNT61001.2024.10725348

  3. PlantDiseaseNet-RT50: A Fine-tuned ResNet50 Architecture for High-Accuracy Plant Disease Detection Beyond Standard CNNs

    IEEE International Conference on Advances in Computing Research on Science Engineering and Technology (ACROSET), Indore, India, 2025.

    Fine-tuned ResNet50 with selective layer unfreezing and cosine LR scheduling reaches ~98% accuracy on plant disease detection with a deployment-ready footprint.

    Abstract:

    Plant diseases cause 70–80% of crop losses globally, making automated detection critical for food security. Traditional visual inspection methods are impractical for large-scale farming. PlantDiseaseNet-RT50 addresses this by employing a fine-tuned ResNet50 architecture with strategically unfrozen layers, custom classification components, and dynamic learning rate scheduling. The model achieves approximately 98% accuracy, precision, and recall across multiple crop species and disease categories. Key improvements include targeted layer unfreezing, batch normalization, dropout regularization, and advanced training techniques. This computationally efficient solution enables rapid, accurate plant disease diagnosis suitable for real-world farming applications.

    DOI: 10.1109/ACROSET66531.2025.11281259

Experience
  1. Data Scientist I · Pine Labs

    Aug 2025 — Present

    Leading fintech company in Asia · Bangalore, India

    • Credit & Transaction Intelligence: Architecting and deploying production-grade ML and NLP systems within the Credit and Insights team to enhance real-time transaction intelligence and credit analytics.
    • Clustering & Similarity Systems: Designed a high-precision similarity-based clustering module for recurring transaction detection, achieving a 15% improvement in precision and 30% in recall while cutting system latency by 50%.
    • Credit Intelligence (NLP): Developed an NLP pipeline utilising BERT-based architectures to identify salary patterns within noisy, real-world financial data, resulting in a 20% increase in recall.
    • Predictive Credit Modeling: Engineered income estimation models leveraging alternative financial data and advanced feature engineering, achieving ~75% accuracy/AUC to drive higher-quality downstream credit decisioning.
    • Time-Series Forecasting: Benchmarked and deployed gradient-boosting and Transformer-based forecasting models to predict payment traffic volume, reducing infrastructure provisioning costs by ~50%.
    • Generative AI & VLMs: Leveraged Large Language Models (LLMs) and Vision-Language Models (VLMs) for structured information extraction from complex financial documents, significantly improving the automation and reliability of internal ML workflows.
  2. Data Science Intern · Pine Labs

    Sep 2024 — Aug 2025

    Leading fintech company in Asia · Bangalore, India

    • LLM Fine-tuning: Leveraged Parameter-Efficient Fine-Tuning (PEFT) to adapt the Qwen 2.5 LLM for automated internal workflows, resulting in a 25% reduction in manual oversight.
    • Retail Underwriting Optimisation: Enhanced production-grade models for Retail Underwriting and Personal Finance Management (PFM) through rigorous error analysis, boosting both precision and recall by 15%.
    • Large-Scale Data Engineering: Processed and analysed 50M+ financial transactions to extract behavioral patterns and validate Proof-of-Concept (POC) initiatives for internal and external stakeholders.
    • Credit Risk Intelligence: Contributed to the design and evaluation of ML systems tailored for credit risk assessment and financial data intelligence platforms, ensuring model robustness on real-world transactional data.
Education

Kalinga Institute of Industrial Technology

B.Tech in Computer Science and Engineering · 2021 – 2025

Cumulative GPA: 9.50 / 10.0 · Bhubaneswar, India

Contact

I'm open to conversations about data science roles, research collaborations, and interesting problems in ML / NLP for finance. The fastest way to reach me is email:

roysoumyajit@icloud.com