Job Description
Role Overview:
We are seeking a Data Engineer who can design, build, and optimize scalable ETL pipelines and real-time data processing systems that handle high-volume datasets in distributed environments.
Key Highlights:
• Expertise in PySpark and Apache Spark (Core, SQL, Structured Streaming)
• Strong experience with Kafka (Confluent/Apache) for real-time data ingestion
• Hands-on experience with Informatica or similar ETL tools
• Experience with the Cloudera Hadoop ecosystem and distributed databases (SingleStore preferred)
• Strong programming skills in Python / Scala / SQL
• Exposure to large-scale data processing (~40 TB of data, ~5 TB daily ingestion)
• Experience in batch and real-time architectures
• Knowledge of Medallion Architecture (Bronze, Silver, Gold layers) is a plus
• Familiarity with workflow orchestration tools such as Airflow or Oozie
• Experience in the BFSI (banking, financial services, and insurance) or Telecom domains preferred
Key Responsibilities:
• Build and optimize scalable ETL pipelines (batch + real-time)
• Develop streaming frameworks for low-latency analytics
• Ensure data quality, governance, and performance optimization
• Collaborate with cross-functional teams to deliver data-driven insights
• Improve system scalability, reliability, and cost efficiency