
Lead Site Reliability Engineer/DevOps - Roberts Recruiting - Boston, Massachusetts, United States

Full-time
Remote
Worldwide

We’re looking for a top-notch, hands-on SRE to lead our small, talented infrastructure engineering team and help us elevate our game in designing, building, and operating high-performance, highly available systems.

Every engineer is responsible for the software they build, and SREs play a critical part in providing the tools, practices, and expertise to support them.

Our production systems are hosted on AWS and run a large Ruby on Rails web application plus a handful of smaller services written in Ruby, Node.js, and Java. We currently deploy about 5 times a day on average. Our systems are stable and fire drills are rare.



Technologies we’re currently using include:


  • Amazon Web Services (EC2, ELB, S3, RDS, ElastiCache) and Ubuntu Linux



  • Postgres, Redis, Memcached, Elasticsearch



  • Chef, ServerSpec, Terraform, New Relic, Datadog, Sumo Logic and Test Kitchen


 

What you’ll do in this mission-critical role:



  • Design, build and maintain the core infrastructure

  • Actively manage the backlog for our infrastructure team and work closely with other SREs on the team to provide coaching and mentorship



  • Help us increase developer productivity and get to true continuous delivery



  • Develop operational and security standards and champion operational excellence and secure coding practices



  • Partner with engineering teams closely to educate and consult



  • Participate in solution design for new features, products, systems, and tooling



  • Debug complex problems across the whole stack



  • Continually monitor application/system performance and costs, generate actionable insights and either implement or advocate for them



  • Participate in on-call rotations, along with every member of the engineering team



  • Ruthlessly eliminate repetitive manual tasks and recurring errors



  • Ensure we are always employing best-of-breed tooling for all our infrastructure and automation needs



  • Collaboratively plot a course for the maturation and growth of our infrastructure



  • Participate (and sometimes run point) in handling production incidents



  • Work closely with engineering teams to conduct root cause analysis for production incidents, and evolve infrastructure and tooling


This role might be that rare opportunity if you:



  • Thrive in a highly collaborative, no red-tape, rapid-growth environment



  • Love building tooling and infrastructure to help developers be more productive



  • Love eliminating repetitive manual tasks through automation



  • Have a healthy appreciation of what it means to work in production



  • Have solid Unix command line and systems chops



  • Have experience with substantial, distributed SaaS or eCommerce systems



  • Can point to a solid track record of success leading small-to-medium infrastructure teams



  • Have vision and well-informed opinions about how to build infrastructure for a high-growth, technology-driven company that’s headed towards the $1B mark


 

What you’ll get from us: 


Importantly, you’ll get sane working hours and a huge amount of flexibility around work/life balance. Have people in your life – of any age – who always, often, or sometimes need your help? We make room for that. Have a bad thing or a good thing happen to you? We make room for that, too.

Oh, and here’s what else you’ll get: Market salary, stock options you’ll help make worth a lot, the usual holidays, all-you-can-eat vacation, 401(k), health/dental/FSA, long-term disability insurance, subsidized T-passes, a great office smack-dab in Boston’s Downtown Crossing, a tremendous amount of responsibility and autonomy, wicked awesome co-workers, cupcakes (and many more goodies), and knowing that you helped get this rocket ship to the moon.