Asset & Wealth Management-Cloud SRE Engineer-Associate-Dallas
Goldman Sachs
Accounting & Finance
Dallas, TX, USA
Cloud SRE Engineer - Associate
Who We Look For:
Goldman Sachs Engineers are innovators and problem-solvers who thrive in fast-paced global environments. We are seeking a motivated Cloud Site Reliability Engineer (SRE) to support the WM Data Engineering ecosystem. In this role, you will apply software engineering principles to operational challenges, ensuring that our cloud-native services, which run primarily on AWS, are resilient, scalable, and cost-optimized. As we transition from on-premises legacy systems to AWS, you will be the guardian of system health, moving beyond traditional dashboards to implement predictive remediation and SLOs-as-Code.
Key Responsibilities:
- Reliability & Performance Engineering:
  - SLO Management: Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using OpenSLO or similar declarative frameworks. Manage error budgets to balance the pace of innovation with system stability.
  - Predictive Observability: Implement AI-driven observability stacks (e.g., Datadog, Amazon CloudWatch Container Insights, or OpenTelemetry) to detect p99 latency spikes and subtle configuration drift before they impact users.
  - Incident Response: Lead service restoration during high-severity incidents and conduct blameless post-mortems to identify root causes and automate future prevention.
- Cloud Migration & Orchestration:
  - Microservices Migration: Support the migration of on-premises microservices to Amazon ECS (Fargate/EC2). Design and maintain task definitions, service discovery via AWS Cloud Map, and inter-service communication using Amazon ECS Service Connect.
  - Infrastructure as Code (IaC): Develop and maintain modular, version-controlled infrastructure using Terraform or AWS CDK, ensuring that reliability guardrails are built into every deployment.
  - Automation of Toil: Identify and eliminate repetitive manual tasks ("toil") by developing custom automation tools in Python or Go.
- Modernization:
  - Migration Support: Contribute to the migration of on-premises data workloads to AWS.
Qualifications:
Technical Requirements
- Experience: 4+ years in SRE, DevOps, or Cloud Engineering roles, with a strong focus on production operations for distributed systems.
- Container Orchestration: Deep proficiency in Amazon ECS (Fargate and EC2 launch types). Experience with Docker containerization and managing service-to-service connectivity.
- Programming: Strong proficiency in Python or Java for automation and tool development. Expert-level SQL for data-driven reliability analysis.
- Cloud Platforms: Advanced knowledge of AWS core services (VPC, IAM, S3, Lambda) and networking (Transit Gateway, PrivateLink).
- Observability Tools: Hands-on experience with modern monitoring and tracing tools such as Prometheus, Grafana, AWS X-Ray, or Splunk.
- CI/CD for Containers: Proven ability to build automated deployment pipelines for ECS using AWS CodePipeline, GitHub Actions, or Terraform, incorporating blue/green or canary deployment strategies.
- Soft Skills: A strong problem-solving, "builder" mindset and the ability to communicate technical concepts clearly within a team environment.
Education
- Bachelor’s or Master’s degree in Computer Science, Engineering, Mathematics, or a related field.