Site Reliability Engineer

Location: Raleigh, NC

Country: United States

Salary: $50.00 - $55.00 / hr

Start Date:

Description:

Onsite contract position

We are looking for experienced Site Reliability Engineers to ensure the reliability, scalability, and performance of mission-critical enterprise platforms. This role is ideal for individuals with deep technical expertise in cloud infrastructure, operating systems, automation, and modern observability practices. The right candidate brings a strong engineering mindset, thrives in high-pressure environments, and drives operational excellence through metrics, automation, and continuous improvement. This is a hands-on engineering role that partners closely with cross-functional teams to maintain and enhance complex, high-availability systems.

Responsibilities

  • Design, implement, and maintain reliable, scalable, and secure systems across cloud and on-premises environments.
  • Manage and optimize distributed systems running on Azure, Linux (RHEL 7+), and Windows Server (2019+).
  • Build and enhance automation workflows using Python, Go, and Bash.
  • Develop Infrastructure-as-Code solutions with Terraform, Ansible, or similar tools.
  • Define, monitor, and improve SLIs, SLOs, and SLAs to maintain consistent service quality.
  • Reduce operational toil by implementing automation, improving tooling, and refining processes.
  • Integrate systems with modern observability platforms to improve visibility and proactively detect issues.
  • Troubleshoot complex incidents, lead structured incident response efforts, and deliver clear post-mortem analyses.
  • Work closely with software engineering, infrastructure, and business teams to support resilient and performant services.
  • Identify and execute improvements to system reliability, performance, and maintainability with full ownership of problem spaces.

Requirements

  • Proven experience as a Site Reliability Engineer, with a background in software engineering, infrastructure, or operations.
  • Hands-on experience with cloud platforms such as Azure and with enterprise operating systems such as Linux (RHEL 7+) and Windows Server (2019+).
  • Solid understanding of networking and storage technologies including NFS, SAN, and NAS.
  • Working knowledge of DNS, LDAP, Kerberos, Centrify, and other authentication or naming services.
  • Strong scripting and automation skills using Python, Go, and Bash.
  • Practical experience with Infrastructure-as-Code tools such as Terraform and Ansible.
  • Demonstrated ability to design and monitor SLIs, SLOs, and SLAs, and to drive reliability improvements using metrics and automation.
  • Experience integrating with observability platforms providing logs, metrics, and tracing.
  • Ability to stay calm, structured, and focused during high-pressure operational events.
  • Strong communication skills and the ability to collaborate effectively with cross-functional teams.
  • A proactive, ownership-driven mindset with a commitment to continuous improvement.

Skills

Site Reliability Engineering, Cloud Platforms, Azure, Linux RHEL 7+, Windows Server 2019+, Networking Fundamentals, NFS, SAN, NAS, DNS, LDAP, Kerberos, Centrify, Python, Go, Bash, Terraform, Ansible, Infrastructure as Code, Observability Platforms, SLIs, SLOs, SLAs, Toil Reduction, Incident Response, Post-Mortems, Automation, Metrics-Driven Engineering, System Reliability, Cross-Functional Collaboration, Communication Skills, Ownership Mindset