Site Reliability Engineer - Santa Monica, CA
The next generations of our products are delivering engaging, adaptive, and personalized learning experiences to optimally support every learner. We are hiring a Site Reliability Engineer who will work with system and software engineers to build reliable, high-capacity, high-performance systems in support of our mission to reimagine learning for millions of students and learners worldwide. This position is preferably based onsite at our Santa Monica, CA facility, but is also open to other Digital Platforms Group (DPG) hubs, including Seattle, WA; Boston, MA; Columbus, OH; Irvine, CA; East Windsor, NJ; and New York, NY.
Our Digital Platforms Group creates and builds data-driven digital products that enhance teaching and improve learning outcomes. In addition, the team conducts research and development targeted at new market opportunities in the quickly evolving education technology space.
Your contribution to the team includes:
- Hands-on design, analysis and troubleshooting of highly-distributed large-scale production systems
- Ownership of the reliability, uptime, capacity analysis and performance analysis of those systems
- Ensuring the repeatability, traceability, and transparency of our infrastructure automation
- Identifying highest-impact opportunities to optimize existing systems
- System design consulting for teams seeking to leverage or improve their production infrastructure
- Anticipating, planning and building capacity for upcoming product/feature launches
Qualifications:
- Strong skills in reading, understanding and writing code in the same
- Mastery of infrastructure automation technologies (like Terraform, CodeDeploy, Puppet, Ansible, Chef)
- Expertise in container/container-fleet-orchestration technologies (like Docker, Kubernetes, Vagrant, Mesosphere, etcd, ZooKeeper)
- Cloud- and container-native Linux administration/build/management skills (AWS AMIs, Packer, etc.)
- Significant experience troubleshooting concurrent and distributed system interactions
- Expertise with cloud-based, continuous-deployment software development lifecycles (e.g., CI/CD)
- Cloud database operations and deployment experience (RDS MySQL/Postgres/Aurora), and caching operations and deployment experience (Memcached, Redis)
- Expertise with Lean/Agile deployment processes (Blue/Green, ZDT, Canary, load balancers/DNS strategies)
- Familiarity with site and infrastructure monitoring systems (like Datadog, New Relic, Sumo Logic)
- Strong problem-solving, root-cause analysis and systems engineering skills
- Excellent presentation and communication skills
- Ability to design and manage monitoring-driven escalation response plans, and to react, respond, remediate and retrospect in culturally aligned (proactive, customer-focused, collaborative, data-driven) ways
- Demonstrated expertise building and managing highly scaled production infrastructure in the cloud (AWS required; GCP, OpenStack a plus)
- Expertise with SDLC branching, SCM, and code deployment systems (Git/gitflow, Jenkins, CircleCI, Travis CI, etc.)
- BS degree in Computer Science (or related technical field and/or equivalent industry experience)
As a Site Reliability Engineer, you will help design, analyze and resolve issues with infrastructure in collaboration with product development teams. You will also design, deploy and manage automation tools that increase predictability and shorten time to market while reducing cost.
Why work for McGraw-Hill Education? You’ll have the opportunity to unlock your potential, both professionally and personally. Visit http://bit.ly/2zcMQZn to learn more!