Career Overview
Senior Data Engineer with 11+ years of hands-on experience building end-to-end ETL and lakehouse pipelines across AWS, Azure, and GCP for healthcare, telecom, and banking clients. Specializes in PySpark, Kafka, and Delta Lake for petabyte-scale data under strict compliance requirements, with a track record of cutting query runtimes by over 60%. Combines expertise in AI/ML integration, real-time streaming, and fraud detection with Microsoft Fabric, GenAI-powered workflows, and modern cloud-native tooling to build scalable platforms that turn raw data into reliable, decision-ready systems.
Skill set
Big Data Ecosystem
Hive, Apache Spark, PySpark, Spark SQL, Spark Streaming, Structured Streaming, Kafka, Kafka Streams, Confluent Kafka, Kafka Connect, NiFi, Sqoop, Flume, MapReduce, HDFS, YARN, Zookeeper, Apache Beam, Apache Druid, Apache Flink, Impala, HBase, Ambari, Airflow, Oozie, Cloud Composer.
ETL and Data Integration
AWS Glue, Azure Data Factory, Informatica PowerCenter, IICS, IDMC, Talend, SSIS, Oracle Data Integrator, Semarchy xDM, Reltio, MDM, CDC, ELT, Incremental Loading, Job Scheduling, SLAs, Airbyte, Soda, dbt.
Lakehouse and Query Engines
Delta Lake, Apache Iceberg, Trino, Presto, Athena.
Programming Languages
Python, SQL, PySpark, Scala, Unix, T-SQL, PL/SQL, Java, Spring Boot.
Cloud Environment - AWS
EMR, S3, Glue, Redshift, Redshift Spectrum, Lambda, Athena, EC2, RDS, DynamoDB, Kinesis Data Streams, Kinesis Firehose, EventBridge, SQS, VPC, IAM, CloudWatch, Step Functions.
Cloud Environment - Azure
Azure Databricks, ADLS Gen2, Data Lake, Blob Storage, Azure SQL, Cosmos DB, HDInsight, Azure Synapse Analytics, Azure Functions, Event Hubs, Azure Monitor, Log Analytics, RBAC, Managed Identities, Microsoft Fabric, Microsoft Purview.
Cloud Environment - GCP
BigQuery, Dataproc, Cloud Storage, GKE, Cloud Functions, Spanner, Pub/Sub, Dataflow, Bigtable, Cloud Composer, Cloud Monitoring, Vertex AI, Cloud AI Platform, KMS, IAM.
Databases and Tools
SQL Server, MySQL, PostgreSQL, Oracle, Teradata, MongoDB, DynamoDB, Cassandra, Cosmos DB, Erwin, Palantir Foundry.
Reporting and BI Tools
Power BI, Tableau, Looker Studio (formerly Google Data Studio), OBIEE, Microsoft Fabric.
Python Libraries and ML
NumPy, Pandas, scikit-learn, Matplotlib, TensorFlow, PyTorch, PySpark ML, Spark MLlib, BigQuery ML.
GenAI and AI/ML
RAG, LangChain, LlamaIndex, LangGraph, Agentic AI, Copilot, LLMs, Prompt Engineering, Microsoft Copilot Studio, FAISS, Pinecone, Vector DB, Azure OpenAI, OpenAI, Amazon Bedrock, Azure AI Search, Amazon OpenSearch, Amazon SageMaker, Azure AI, Microsoft Foundry, AI Agents, Bedrock Agents, Multi-Agent Orchestration.
Automation
Microsoft Power Platform, Power Automate, Power Apps, Power Pages, Dataverse, Custom Connectors.
Containerization and DevOps
Kubernetes, Docker, Jenkins, GitHub Actions, Azure DevOps, IaC, Terraform, Pulumi, AWS CDK.
Data Observability and Monitoring
Prometheus, Grafana, Datadog, Splunk, ELK Stack (Elasticsearch, Logstash, Kibana), OpenTelemetry, CloudWatch, Azure Monitor, Log Analytics, GCP Cloud Monitoring, PagerDuty.
Data Quality, Catalog, and Governance
Great Expectations, Monte Carlo, Bigeye, Soda, Apache Atlas, Apache Ranger, Microsoft Purview, Collibra, Alation, DataHub, Amundsen, Data Lineage, Data Contracts.
Software Life Cycle/Methodologies
Agile Models, Waterfall, SDLC, CI/CD, Infrastructure-as-Code, AWS CDK, GitHub Actions, Azure DevOps, Audit Trails, Data Lineage, Data Governance, Row-level Security, Column-level Encryption.
Work History
JOHNSON & JOHNSON | SR. BIG DATA ENGINEER
New Brunswick, NJ, US (Remote)
Summary
Built and scaled enterprise data platforms across Johnson & Johnson's healthcare, pharmaceutical, and retail operations, delivering lakehouse architectures, real-time pipelines, and GenAI-powered solutions while maintaining strict regulatory compliance standards.
Highlights
Built and managed bronze-silver-gold lakehouse platforms using Delta Lake and Microsoft Fabric, enforcing ACID guarantees, schema governance, and zone-based data quality across 15M daily patient records.
Designed and optimized multi-source ETL and ELT pipelines handling batch, incremental, and CDC-based data movement with watermarking, schema drift handling, and dependency scheduling at scale.
Integrated EHR systems using FHIR and HL7 standards, building ingestion pipelines with Apache Kafka and Apache NiFi to consolidate patient records across hospital endpoints and reduce physician data retrieval time by 40%.
Implemented HIPAA-compliant data security including patient de-identification, column-level encryption, and role-based access control across clinical data platforms.
Improved analytical query performance by over 45% through Delta Lake optimization techniques including Z-ordering, partition pruning, and automated table maintenance orchestrated via Airflow.
Built GenAI-powered data assistants using LLMs and RAG architectures, reducing report generation time from days to hours for clinical and business stakeholders.
Deployed Agentic AI workflows using LangGraph and Bedrock Agents to automate complex multi-step clinical data enrichment tasks, substantially reducing manual curation effort in each workflow cycle.
Delivered end-to-end retail analytics consolidating Salesforce, Shopify, and SAP data using PySpark across cloud environments, driving 20% better inventory distribution and 30% improvement in demand forecasting accuracy.
Built clinical and operational BI dashboards in Power BI with row-level security, connecting gold-layer lakehouse tables to support high-concurrency reporting across regulated cloud environments.
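The watermarking pattern behind the incremental ETL highlight above can be sketched in plain Python (the schema, IDs, and timestamps are illustrative; the production version ran in PySpark over Delta tables):

```python
from datetime import datetime

def incremental_extract(rows, last_watermark):
    """Return only rows modified after the stored watermark, plus the
    new watermark to persist for the next pipeline run."""
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    # If nothing changed, carry the previous watermark forward unchanged.
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

# Hypothetical source rows; a real pipeline would read these from a CDC feed.
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 3)},
]
new_rows, watermark = incremental_extract(rows, datetime(2024, 1, 2))
```

Persisting the returned watermark between runs is what makes the load incremental: a rerun with the new watermark picks up nothing until fresh changes arrive.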
AT&T | SR. DATA ENGINEER
Dallas, TX, US (Remote)
Summary
Designed and optimized terabyte-scale data pipelines for AT&T's telecom operations, processing CDRs and network usage logs across multi-cloud environments to enhance real-time analytics, reduce compute costs, and improve network performance visibility.
Highlights
Built and optimized ETL ingestion pipelines for CDRs and network logs at terabyte scale, applying watermarking, dependency scheduling, and adaptive query execution to meet strict processing SLAs across streaming and batch workflows.
Implemented event-driven ingestion using Apache Kafka for high-frequency network signaling events, utilizing Apache HBase for near real-time metric storage and Apache Druid for fast OLAP queries across 500 billion telecom events monthly.
Optimized cloud analytical datasets through partitioning, clustering, and materialized views, significantly reducing dashboard load times across multi-terabyte telecom workloads.
Orchestrated multi-cloud data workflows using Airflow DAGs coordinating streaming, batch processing, and query jobs, implementing retry logic and SLA sensors for reliable pipeline recovery.
Implemented CI/CD pipelines using GitHub Actions and Terraform alongside centralized monitoring with Grafana and Prometheus to automate deployments and track pipeline latency and Spark job performance across cloud-hosted workloads.
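The retry logic mentioned in the Airflow highlight boils down to exponential backoff between attempts; a minimal Python sketch (attempt counts and delays are illustrative, and Airflow itself expresses this via the `retries` and `retry_exponential_backoff` task parameters):

```python
import time

def retry(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a flaky task, backing off exponentially between attempts and
    re-raising only after the final attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            # Backoff doubles each retry: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Injecting `sleep` as a parameter keeps the helper testable without real waits; the same shape underlies most pipeline-recovery wrappers.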
BANK OF AMERICA | DATA ENGINEER
Charlotte, NC, US (Hybrid)
Summary
Engineered real-time fraud detection and batch ingestion pipelines for Bank of America's credit card and mobile banking operations, reducing manual investigation workload by over 40% and improving detection accuracy across high-volume transaction streams.
Highlights
Built real-time transaction ingestion pipelines using Apache Kafka and Confluent Kafka, configuring topic partitioning and replication for fault-tolerant delivery under peak transaction volumes.
Designed Apache Flink streaming jobs with Flink SQL and CEP patterns to detect complex fraud scenarios including location mismatches, velocity spikes, and blacklisted merchant interactions, enabling sub-second suspicious transaction flagging.
Implemented dynamic risk scoring combining rule-based fraud signals with ML model outputs managed via MLflow, integrating Redis for sub-5-millisecond lookups and Apache HBase for historical pattern retrieval, achieving 94% detection accuracy with minimal false positives.
Built batch ingestion pipelines using Apache Beam to move 5+ years of mainframe transaction records into cloud storage, sustaining over 10,000 messages per second.
Optimized BigQuery datasets through partitioning and clustering, reducing fraud trend query runtimes from hours to minutes for risk reporting.
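The velocity-spike rule from the Flink CEP highlight can be illustrated with a minimal sliding-window detector in plain Python (thresholds, window size, and card IDs are hypothetical; the production logic ran as Flink streaming jobs):

```python
from collections import deque

class VelocityDetector:
    """Flag a card when its transaction count inside a sliding time
    window exceeds a threshold (a simplified velocity-spike check)."""

    def __init__(self, max_txns=3, window_seconds=60):
        self.max_txns = max_txns
        self.window = window_seconds
        self.events = {}  # card_id -> deque of event timestamps

    def observe(self, card_id, ts):
        q = self.events.setdefault(card_id, deque())
        q.append(ts)
        # Evict events that have aged out of the sliding window.
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_txns  # True => suspicious burst
```

Per-card deques keep eviction O(1) amortized; a real CEP engine would combine this with the location-mismatch and blacklist rules described above.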
PAYCHEX | DATA ENGINEER
Rochester, NY, US (On-site)
Summary
Drove payroll data migration and ETL modernization for Paychex's enterprise platform, improving processing performance, compliance, and forecasting accuracy across 650,000+ client businesses.
Highlights
Built centralized payroll data migration pipelines using Talend ETL and Azure Data Factory, supporting full historical loads and incremental updates across multi-state tax jurisdictions for over 650,000 client businesses.
Migrated ETL orchestration from legacy cron jobs to Airflow DAGs with fault-tolerant retry logic and SLA sensors, reducing failed payroll runs by 78% during peak cycles.
Reduced end-to-end tax processing time from 18 hours to 4.5 hours using Apache Spark with ACID-compliant Delta Lake operations, maintaining complete audit trails and SOX- and PCI-compliant security controls throughout.
Developed ML models using PySpark ML and Scikit-Learn to improve payroll cash flow forecasting, achieving 89% precision in predicting client funding requirements 30 days ahead, with results surfaced through Power BI dashboards.
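The incremental-update behavior behind the migration pipelines above amounts to a keyed CDC merge; a plain-Python sketch (the column names, `version` field, and `op` flag are hypothetical; in Delta Lake the equivalent is a `MERGE INTO`):

```python
def merge_cdc(target, changes):
    """Apply CDC change records to a keyed target table, keeping the
    latest version per key and honoring deletes (simplified upsert)."""
    merged = {row["id"]: row for row in target}
    for change in changes:
        current = merged.get(change["id"])
        # Only apply a change that is at least as new as what we hold.
        if current is None or change["version"] >= current["version"]:
            if change.get("op") == "delete":
                merged.pop(change["id"], None)
            else:
                merged[change["id"]] = change
    return sorted(merged.values(), key=lambda r: r["id"])

# Hypothetical payroll rows and a batch of change records.
target = [{"id": 1, "version": 1, "net_pay": 100}]
changes = [
    {"id": 1, "version": 2, "net_pay": 150, "op": "upsert"},
    {"id": 2, "version": 1, "net_pay": 90, "op": "upsert"},
    {"id": 3, "version": 1, "op": "delete"},
]
result = merge_cdc(target, changes)
```

The version check is what makes the merge idempotent: replaying the same change batch leaves the target unchanged.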
RAMCO SYSTEMS | ETL DEVELOPER
Chennai, Tamil Nadu, India (On-site)
Summary
Developed ETL pipelines and data warehouse solutions for Ramco's aviation maintenance ERP systems, cutting nightly processing windows significantly and improving analytics across manufacturing and logistics operations.
Highlights
Built ETL pipelines using Oracle Data Integrator, Informatica PowerCenter, and Talend integrating data from Oracle, MySQL, MongoDB, and cloud storage across 12 interconnected aviation ERP systems, cutting nightly processing windows from 8 to 2.5 hours.
Automated ETL job scheduling using Oozie and Oracle Enterprise Manager, creating dependency-aware sequences for interdependent payroll and inventory processes across regions.
Built Oracle Business Intelligence reports sourcing curated data from data warehouses and cloud data marts using star schemas, with role-based access controls for executives, managers, and operational teams.
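The dependency-aware job sequencing noted above is, at its core, a topological sort over the job graph; a minimal Python sketch using the standard library (the job names and graph are hypothetical, not from an actual Ramco schedule):

```python
from graphlib import TopologicalSorter

def schedule(jobs):
    """Order ETL jobs so that every job runs after all of its
    upstream dependencies."""
    return list(TopologicalSorter(jobs).static_order())

# Hypothetical aviation-ERP job graph: each job maps to its dependencies.
jobs = {
    "extract_oracle": set(),
    "extract_mysql": set(),
    "load_inventory": {"extract_oracle"},
    "load_payroll": {"extract_oracle", "extract_mysql"},
    "build_marts": {"load_inventory", "load_payroll"},
}
```

`TopologicalSorter` also raises on cycles, which is exactly the failure mode a dependency-aware scheduler needs to surface before kicking off interdependent jobs.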
Education
Jawaharlal Nehru Technological University Hyderabad (JNTUH)
Bachelor of Technology (B.Tech), Computer Science
