We are looking for a Senior Lead Data Engineer to architect the highly scalable, cloud-native data platforms that power our Real-World Data (RWD) and DRG (Decision Resources Group) analytics solutions: critical tools that help researchers, clinicians, scientists, and business leaders make faster, more confident decisions. You will help build the data engine behind products used to accelerate drug discovery, evaluate treatment effectiveness, model patient journeys, and bring life-saving innovations to market.

This is an opportunity to build data systems that not only drive next-generation AI but also create measurable impact in healthcare and life sciences globally. If you are passionate about data engineering and excited to work on platforms that enable next-generation AI, this role is for you.

About You – Experience, Education, Skills, and Accomplishments
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Minimum 8 years of experience building scalable, production-grade data systems.
- Proven ability to design massively scalable distributed data processing pipelines.
- Strong background in database design, schema modelling, and performance tuning.
- Hands-on expertise building and optimizing complex ETL/ELT pipelines that power ML and analytics workloads.
- Ability to research and work independently, and to collaborate with remote teams across different time zones.
- Experience with interactive-speed query engines such as StarRocks, ClickHouse, or Druid.
- Experience designing resilient, fault-tolerant, cloud-native data platforms with automated disaster recovery.
- Hands-on background in Agile delivery, CI/CD, and containerized workflows.
- Strong understanding of data versioning, lineage, reproducibility, and metadata management (critical for AI governance).

Technical Skills
- Big Data, PySpark, Databricks, Snowflake
- Interactive query engines such as StarRocks, ClickHouse, or Druid
- Exposure to open-source technologies such as DuckDB and Polars
- Optimizing transformations: refining complex logic, often the most resource-intensive part of a pipeline, using efficient code and techniques
- AWS Glue, AWS EMR, Delta Lake, Iceberg
- Parquet, RDBMS (PostgreSQL)
- Experience designing data flows that serve AI, GenAI, and algorithmic workloads

Languages
- Proficient in Python, SQL, and PySpark
- Bonus: experience building data preparation scripts for ML model training

Cloud Technologies & Tools
- Strong experience with AWS: EMR, Glue, S3, EC2, RDS, Aurora PostgreSQL, Lambda
- Ability to evaluate and integrate AI-friendly tools (feature stores, vector databases, ML workflow orchestration, etc.)

It Would Be Great If You Also Have
- Exposure to GenAI technologies, LLM data pipelines, or vector embeddings
- Experience supporting data needs for ML, LLM, or analytics teams
- Experience collaborating with distributed, high-velocity global teams
- Experience building end-to-end RAG pipelines, including advanced patterns such as Fusion RAG, and applying query transformation to improve the retrieval process
- Experience with Python frameworks such as LangChain and LlamaIndex used to build GenAI applications
- Exposure to vector databases such as Chroma, Pinecone, Milvus, Weaviate, and LanceDB

What You Will Be Doing in This Role

AI-Ready Data Architecture & Technical Leadership
- Architect and deliver a future-proof data lake platform optimized for analytics, ML, and GenAI workloads.
- Design intelligent, automated, highly scalable data pipelines that support model training, inference, and continuous learning.
- Provide thought leadership on emerging AI-driven data patterns such as feature stores, vectorized pipelines, and streaming ingestion.
- Evaluate modern technologies (Delta Lake, Iceberg, Databricks ML, AWS AI services) to ensure the platform stays ahead of the curve.
- Own the end-to-end data lake solution design, ensuring scalability, reliability, and AI-readiness.
- Collaborate with colleagues and business stakeholders to define and execute the technical strategy.
- Be an active stakeholder throughout the software development life cycle, overseeing the software design and ensuring the project
maintains its technical direction, while adjusting the technical design to mitigate unexpected blockers during the project.

Data Engineering & Platform Delivery
- Build high-performance, cloud-native ETL and ELT pipelines using AWS Glue, EMR, and Databricks.
- Ensure data quality, lineage, auditability, and governance to support trustworthy AI and analytics.
- Embed standards for data observability, automated quality checks, and ML-ready feature transformations.
- Help implement robust SLAs for AI data services, ensuring fast, deterministic, and reliable data flows.
- Act as a key contributor to architectural decisions, data modelling, workflow optimization, and platform enhancements.

Innovation, GenAI Integration & Customer Impact
- Drive R&D explorations across new AI/GenAI enablers such as automated data labelling, embeddings, and intelligent data preparation.
- Partner with Product and Technology leaders to translate business problems into AI-ready data solutions.
- Lead initiatives to make the data platform more "AI-native," enabling advanced analytics, LLM-driven insights, and real-time intelligence.
- Continuously explore how emerging AI tools can reduce operational overhead and automate previously manual processes.
- Create technical documentation and knowledge assets to scale AI-ready engineering practices across the organization.

About the Team
You will join the RWD DRG Fusion team, a global engineering organization focused on powering the next generation of healthcare and life sciences insights. The team thrives on innovation, collaboration, diversity, and a strong sense of mission.
You will work with product owners, scientists, data scientists, ML engineers, and architects shaping the future of our AI-driven products.

Hours of Work
- Full-time (IST)
- 40 hours per week
- Hybrid working environment

At Clarivate, we are committed to providing equal employment opportunities for all qualified persons with respect to hiring, compensation, promotion, training, and other terms, conditions, and privileges of employment. We comply with applicable laws and regulations governing non-discrimination in all locations.