Hi, I'm Harish
23 | Data Science | Software Engineering | AI & Machine Learning | Generative AI
Open My Resume | Contact

About
My Introduction
I am a final-year Integrated MTech student specializing in Data Science at VIT, with a strong foundation in Python, Machine Learning, NLP, Computer Vision, and Data Analytics. I have hands-on experience in developing and deploying AI/ML models, building data-driven solutions, and extracting meaningful insights from large-scale datasets. Passionate about leveraging cutting-edge AI frameworks and advanced analytics, I thrive on solving complex real-world problems and contributing to innovative R&D projects. If you're looking for someone to drive impactful AI initiatives and turn data into actionable intelligence, let's connect!
Experience
My journey on the academic & professional front

(Integrated) Master of Technology - Computer Science and Engineering with Specialization in Data Science
Vellore Institute of Technology, India | CGPA: 7.8

Class XII
Maths, Physics, Chemistry, Computer Science
State Board of Secondary Education, Tamil Nadu | THIRU G V C Higher Secondary School, India | Percentage: 79%

Class X

State Board of Secondary Education, Tamil Nadu | St Joseph Matriculation Higher Secondary School, India | Percentage: 87.8%

Software Developer Trainee
FactEntry Data Solutions (A SIX Company)

Associate Software Engineer - Trainee Intern
MIMASOFT Technologies Private Limited | Ref Link

Skills
My technical & other skills

Data Science & AI
Machine Learning
Deep Learning
Generative AI
Probability & Statistics
Computer Vision
Natural Language Processing
Data Mining
Configuring and Fine-tuning Ollama Models
AI/ML Frameworks & Libraries : TensorFlow, PyTorch, Scikit-learn
Programming & Backend Development
Python
SQL
C++/C
Python-Flask / Python-Django
FastAPI
Docker
Kubernetes
Computing & Cloud
GPU & Distributed Computing
Google Cloud Platform (GCP)
Microsoft Azure
Data Engineering & ETL
Data Preparation
ETL Pipelines
Data Warehousing
SQL (MySQL)
NoSQL (MongoDB)
Data Lakes
ETL Tools
Data Analytics & Visualization
Power BI
Excel
Business Analytics
Data Storytelling
DevOps & Version Control
Git / GitLab / GitHub
CI/CD
Automated Testing & Deployment
Projects
My projects and works

AI-Powered Document Q&A and Data Extraction System
FactEntry Data Solutions (A SIX Company)
Objective:
Developed an AI-driven RAG system for extracting structured data from scanned documents, including tables, forms, and text.
Key Contributions:
• Integrated Google Generative AI Embeddings (Google API) and LLaMA 3.3 (Groq API) for efficient retrieval and response generation.
• Implemented OCR-based text extraction using PyTesseract, enhanced with OpenCV preprocessing for improved accuracy.
• Leveraged Hugging Face LayoutLM for extracting data from complex layouts, including borderless tables and unstructured forms.
• Developed a Streamlit web interface for real-time document analysis and deployed it securely via ngrok.
Impact:
Automated document processing, reducing manual effort while improving data accuracy and retrieval speed.
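The retrieval step of a RAG pipeline like the one above can be sketched with plain cosine similarity; the embedding vectors and chunk texts here are toy stand-ins for the Google Generative AI embeddings used in the actual system:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, top_k=1):
    """Return the text of the top_k chunks most similar to the query."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in scored[:top_k]]

# Toy document chunks with pre-computed (illustrative) embeddings:
chunks = [
    {"text": "Invoice total: $1,200", "vec": [0.9, 0.1, 0.0]},
    {"text": "Shipping address form", "vec": [0.1, 0.8, 0.2]},
]
print(retrieve([0.85, 0.15, 0.0], chunks))  # ['Invoice total: $1,200']
```

In the full system, the retrieved chunks would be passed as context to the LLM (LLaMA 3.3 via the Groq API) for answer generation.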
Hate Speech and Harmful Content Classification Using LLM
Vellore Institute of Technology | Capstone
Objective:
Developed an AI-powered cybersecurity system for detecting hate speech and harmful content on the web and social media using NLP and LLMs.
Key Contributions:
• Designed a hybrid approach integrating ML/DL models with a fine-tuned LLM for advanced content moderation.
• Experimented with LLaMA 3.3, 3.2, DeepSeek 7B/14B, and WizardLM 7B via Ollama to optimize detection.
• Automated web scraping, text preprocessing, and structured content extraction.
• Deployed a real-time monitoring system for detecting and flagging harmful content.
Impact:
Enhanced cybersecurity and content moderation, reducing manual effort while improving detection accuracy.
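Classification through a locally served Ollama model can be sketched as below; the model name is one of those experimented with above, the endpoint is Ollama's default `/api/generate`, and the label set is illustrative (the call itself requires a running Ollama server, so it is defined but not invoked here):

```python
import json
import urllib.request

def build_prompt(text):
    # Ask for a single label so downstream parsing stays trivial.
    return (
        "Classify the following text as HATE, HARMFUL, or SAFE. "
        "Reply with the label only.\n\nText: " + text
    )

def classify(text, model="llama3.2", host="http://localhost:11434"):
    """POST to Ollama's /api/generate endpoint (needs a local Ollama server)."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(text), "stream": False}
    ).encode()
    req = urllib.request.Request(
        host + "/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"].strip()

# classify("example post")  # -> "SAFE" / "HATE" / "HARMFUL" from the model
```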
Language Distribution & Document Clustering Tool
FactEntry Data Solutions (A SIX Company)
Objective:
Developed and deployed Language Distribution and Document Layout Clustering tools, releasing basic versions on PyPI and advanced versions for FactEntry.
Key Contributions:
• Used Tesseract OCR, LangDetect, and StableLM to analyze PDFs, providing language distribution breakdowns (e.g., 60% English, 30% Spanish).
• Built a clustering system using VGG16, ResNet, and PCA-based fine-tuning for document classification.
• Applied OCR and deep learning for accurate data extraction from scanned financial documents.
• Integrated LLaMA Vision for enhanced image analysis, improving document ranking and content grouping.
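The language-distribution breakdown step can be sketched as a frequency count over per-page labels; in the actual tool the labels come from LangDetect on OCR'd text, while here they are hard-coded for illustration:

```python
from collections import Counter

def language_distribution(page_languages):
    """Turn per-page language labels (e.g. from langdetect) into percentages."""
    counts = Counter(page_languages)
    total = sum(counts.values())
    return {lang: round(100 * n / total, 1) for lang, n in counts.most_common()}

# Labels as langdetect would emit them, one per OCR'd page:
pages = ["en", "en", "en", "es", "es", "de"]
print(language_distribution(pages))  # {'en': 50.0, 'es': 33.3, 'de': 16.7}
```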
Disease Prediction using Ensemble Learning
SMART INTERNZ
Objective:
Engineered a disease prediction model using Python and machine learning algorithms such as Random Forest and SVM, achieving high accuracy in clinical trial predictions based on extensive patient historical data.
Key Contributions:
• Stored data in MongoDB Atlas, retrieved it with PyMongo, and transformed it into a binary dataset for analysis.
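The binary-dataset transformation mentioned above can be sketched as mapping each patient document to a 0/1 feature row; the field names and symptom vocabulary here are illustrative, and the document would come from a PyMongo query in the real pipeline:

```python
def to_binary_row(record, symptom_vocab):
    """Map a patient document (as fetched with PyMongo) to a 0/1 feature row."""
    present = set(record.get("symptoms", []))
    return [1 if s in present else 0 for s in symptom_vocab]

# Illustrative vocabulary and a document shaped like a MongoDB record:
vocab = ["fever", "cough", "fatigue", "headache"]
doc = {"patient_id": 7, "symptoms": ["cough", "headache"]}
print(to_binary_row(doc, vocab))  # [0, 1, 0, 1]
```

Rows built this way feed directly into scikit-learn classifiers such as Random Forest and SVM.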
Real-Time Room Temperature & Humidity Visualization (AR)
J Component for ARVR Course
• Engineered a real-time AR system integrating the MQTT protocol to capture and visualize environmental data (temperature and humidity) using Unity-based 3D models.
• Enhanced user engagement with actionable ML-driven insights and provided interactive climate reporting.
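The data-capture side of such a system boils down to decoding sensor messages as they arrive over MQTT; a paho-mqtt `on_message` callback would hand `msg.payload` to a parser like this (the JSON field names are illustrative and depend on the deployment):

```python
import json

def parse_telemetry(payload: bytes):
    """Decode a sensor reading as it might arrive on an MQTT topic."""
    data = json.loads(payload.decode())
    return float(data["temperature"]), float(data["humidity"])

# In a paho-mqtt on_message callback, msg.payload would be passed in here:
temp, hum = parse_telemetry(b'{"temperature": 24.5, "humidity": 61}')
print(temp, hum)  # 24.5 61.0
```

The decoded values would then drive the Unity 3D visualization on the AR side.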
Extractive Summarization & Information Retrieval
J Component for Web Mining Course
• Designed an advanced web scraping and information retrieval system leveraging natural language processing (NLP).
• Enabled the system to understand user queries in English, translate them into SQL, and extract targeted data with 95% accuracy from multiple websites.
• Stored extracted data in a relational database (MySQL/SQLite3) via Python.
• Automated user queries in English to retrieve relevant data efficiently.
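The English-to-SQL idea above can be sketched as a template mapper; real systems like the one described use NLP models, whereas this toy version only handles one "show <column> where <column> is <value>" phrasing, and the table name is a hypothetical default:

```python
import re

def query_to_sql(question, table="articles"):
    """Toy mapper from a narrow class of English questions to SQL."""
    m = re.match(r"show (\w+) where (\w+) is (\w+)", question.lower())
    if not m:
        return None  # phrasing not understood
    col, key, val = m.groups()
    return f"SELECT {col} FROM {table} WHERE {key} = '{val}'"

print(query_to_sql("show title where topic is sports"))
# SELECT title FROM articles WHERE topic = 'sports'
```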
Application Tracking System
JP INFOTECH
• Developed an NLP-based application tracking system using Python to analyze resumes, identify weak sentences, suggest improvements, and recommend keywords based on job descriptions.
• Implemented an n-gram model to predict the next word a user might type, providing a typing-assist feature.
• Developed an OCR-based system leveraging PyTesseract, combined with OpenCV preprocessing techniques, to process scanned documents with 95% accuracy.
• Built a pipeline for recognizing complex tabular structures, including borderless tables, by integrating Hugging Face models and custom algorithms.
• Automated document processing across various formats (PDF, images) by integrating multi-API workflows, reducing manual efforts.
• Optimized data extraction workflows for large-scale datasets, enhancing performance and reducing processing time by 30%.
• Collaborated on ML-driven NLP projects to generate automated suggestions for keyword optimization and grammar correction.
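The n-gram typing-assist idea above can be sketched as a bigram counter; the training corpus here is a toy stand-in:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count word-pair frequencies for next-word prediction."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            model[a][b] += 1
    return model

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

corpus = [
    "machine learning is fun",
    "machine learning models learn",
    "learning is iterative",
]
model = train_bigrams(corpus)
print(predict_next(model, "machine"))  # learning
```

A production typing assistant would train on a much larger corpus and back off to shorter n-grams for unseen contexts.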