Hello, I'm Harshit.

Data & ML Engineer with 5+ years of startup, corporate, and academic work experience.


About Me

I’m currently working as a Research Scientist for the Vera C. Rubin Observatory - Chile. I have an MS in Data Science from the University of Washington - Seattle. Previously, I worked as a Data & Machine Learning Engineer at Shell, where I developed and deployed data-powered products that help traders generate $500+ million/year in revenue. I have hands-on experience in taking products from ideation to production in both startup and corporate environments, and a strong foundation in data science, machine learning, data engineering, and operations. Let's connect and discuss how we can work together to build something awesome!


Career

DiRAC Institute - Rubin Observatory - Stanford - Princeton

Research Scientist - Machine Learning (Full-Time) March 2023 - Present

  • Architected, built, and deployed 3 Extract-Transform-Load (ETL) pipelines to fetch crowdsourced data and create ground truth for the training dataset.
  • Trained and deployed CNN models using Docker, enabling 1M+ near-real-time alerts for 1000+ astronomers and astrophysicists across the world.
  • Optimized the data pipeline, improving performance by 30% so that data is processed within the allocated time frame.

Virufy

Machine Learning Engineer (Capstone) September 2023 - March 2023

  • Pioneered the first scalable ML infrastructure for seamless development and deployment, saving 100+ hrs/month.
  • Implemented software development best practices, continuous monitoring, and model optimization on GCP using Kubeflow, Docker, and Vertex AI for a healthcare application to save $70k+/year in operational costs.

Extropolis AI

Machine Learning Engineer (Internship) July 2023 - December 2023

  • Introduced intelligent routing, load-balancing, and autoscaling that led to a 40% increase in availability and scalability SLAs, while reducing costs by 85%.
  • Integrated automatic model conversion and deployment, saving 30+ hours/month of manual work and improving rolling updates by 15%.

Royal Dutch Shell

Data Engineer (Full-Time) July 2019 - September 2022

  • Developed and implemented 6 terabyte-scale data pipelines leveraging Python, Spark, SQL, Databricks, and Microsoft Azure Data Platform, enabling Shell traders to enhance trading decisions, yielding $500+ million/year.
  • Spearheaded the training and deployment of LLMs utilizing Python, TensorFlow, HuggingFace, and Azure ML Studio to automate job description classification, resulting in savings of $500k+/year.
  • Engineered Data Quality Control solutions using Python and Alteryx, achieving an 85% reduction in business downtime.
  • Collaborated with Shell and Microsoft data scientists to develop and deploy interpretable NLP models utilizing the InterpretML library.
  • Led a Continuous Improvement initiative to optimize machine learning and data engineering solutions, delivering savings of $290k+/year.

Qustac Technologies

Co-founder & Chief Executive Officer (Full-Time) March 2017 - June 2019

  • Developed and deployed 4 image classification and object detection models, including Xception, EfficientNet, and YOLO, using Python and TensorFlow for a real-time content moderation application.
  • Implemented state-of-the-art research papers on text classification, data processing optimization, and model performance optimization, including Deep Compression and Knowledge Distillation.
  • Conducted data analysis utilizing Python, NumPy, Pandas, Matplotlib, and Tableau, and presented actionable insights from user data to board members to inform business decisions.
  • Designed and implemented the software architectures of two desktop and web-based applications powered by natural language processing (NLP) and computer vision technologies.

Education

University of Washington

Master of Science in Data Science | CGPA: 3.9

Subjects: Introduction to Statistics and Probability, Data Visualization, Software Design, Applied Statistics and Experimental Design, Data Management, Statistical Machine Learning, Human-Centered Data Science, Scalable Data Systems and Algorithms.
Co-curricular: Organizer at The RAISE Group, Graduate Research Assistant at the DiRAC Institute, Capstone with Virufy.

University of Mumbai

Bachelor of Engineering in Information Technology | CGPA: 3.6

Subjects: Data Structures & Algorithms, Object-Oriented Programming Methodology, Big Data, Open-Source Technologies, Soft Computing, Database Management System, Software Engineering, Data Mining & Business Intelligence, Distributed Systems, Cloud Computing, Software Project Management, Intelligent System.
Co-curricular: Co-founder of the Coders' Club.

Skills

Concepts & Technologies
Data Science, Data Engineering, Machine Learning, Natural Language Processing (NLP), Computer/Machine Vision, MLOps (Machine Learning Operations), ETL (Extract-Transform-Load), Data Visualization, A/B Testing, Data Modeling, Database Management, Data Analysis, Data Wrangling, Data Warehousing, RAG.

Programming & Scripting Languages
Regular: Python, SQL, JavaScript, HTML, CSS
Past Experience: C, C#, Java, PHP

Tools & Frameworks
Data Engineering: Databricks, Apache Spark, Git, Microsoft Azure, AWS, GCP, Microsoft SQL Server, Alteryx, Apache Airflow, Docker.
Data Science: Microsoft Azure ML Studio, HuggingFace, PyTorch, Scikit-learn, TensorFlow, Pandas, NumPy, Matplotlib, OpenCV, Tableau, Flask, Keras, NLTK, FastAPI, Streamlit, Seaborn, LangChain, Vector Database (Pinecone).

Research

  • On-Device ML: An Efficient Approach to Classify Large Number of Images Using Multi-threading in Android Devices (Springer).
  • Explicit Content Detection using Faster R-CNN and SSD MobileNet v2 (IRJET).
  • Explicit Text Classification for Hinglish Language (unpublished).
  • Dtoxd.ai: Content moderation for Android devices (unpublished).
  • Explicit Content Censor – Offline (ECC-O) (unpublished).
  • Balanced: An Application To Improve Mental Health (unpublished).

  NOTE: Some research papers are unpublished because they were written while I was working at my startup.

Projects

Realtime Stock Analyzer (2024) Python, Airflow, Postgres

• Designed and deployed an ETL pipeline using Airflow to pull data from the Alpha Vantage API.
• Hosted a PostgreSQL database on Aiven.io to store all the processed data.
• Created a Streamlit dashboard to visualize and display various stock data analyses.

English Premier League Pass Analysis (2022) Python, Pandas, Tableau

• Performed data collection and preprocessing: handled missing values, created derived features, parsed string coordinates into numeric latitude and longitude, and eliminated outliers.
• Performed join operations across 10+ denormalized data files to create the final dataset used in Tableau.
• Created interactive graphs using Tableau.

Cyberbullying Abusive Content Classifier (2020) Python, TensorFlow.js, Keras, JavaScript

• Performed data collection and text preprocessing, and trained LSTM and RNN models using TensorFlow.js to detect abusive text.
• Trained a lightweight MobileNet V2 model for detecting explicit images in real time.
• Developed a web-browser extension using JavaScript to filter abusive text and images in real time using the trained ML models.

Yoga Pose Detector (2019) Python, TensorFlow, Keras, Flask

• Collected data through web scraping, then performed data preprocessing and data augmentation using the Albumentations library.
• Trained YOLO v3 and MobileNet v2 models using Transfer Learning on Google Cloud Platform (GCP).
• Developed a full-stack web application using React and Flask for serving the ML model.

Agnel Online (2017) PHP, HTML, CSS, JavaScript

• Developed an online portal for ordering items from the college canteen and stationery store.
• Integrated credit/debit card payment options with the web application to enable online payment.
• Implemented the 'shopping cart' feature to enable ordering multiple items at once.

Say Hello

Have a great idea in mind? Let's collaborate and build something awesome. Let's turn that idea into an even greater product :)