
We’re seeking a seasoned Technical Operations Engineer to ensure the stability, reliability, and performance of our production systems. In this key role, you’ll leverage deep technical expertise, particularly in Web3/blockchain technologies, to manage, optimize, and enhance our platform infrastructure. You’ll drive operational excellence through proactive monitoring, meticulous incident management, innovative problem-solving, and collaborative cross-team initiatives.
What You’ll Do
- Blockchain Network Management: Lead the deployment, optimization, and operational management of new blockchain networks. Conduct thorough testing, benchmarking, and continuous improvement of chain reliability and performance.
- Complex Web3 Issue Resolution: Address high-impact Web3 incidents through rigorous troubleshooting, detailed log analysis, JSON-RPC response debugging, and direct coordination with blockchain foundations and ecosystem partners.
- Proactive System Monitoring: Develop and maintain comprehensive monitoring and alerting solutions using advanced dashboards (e.g., Grafana, DataDog), identifying trends, anomalies, and performance bottlenecks before they become critical.
- Incident & SLO Management: Define, implement, and enforce service-level objectives (SLOs) and agreements (SLAs), ensuring measurable standards of system reliability and performance are consistently met.
- Automation & Optimization: Implement and maintain automation solutions (Ansible, Terraform, Kubernetes) to streamline deployments, reduce manual tasks, and optimize cloud infrastructure cost and efficiency.
- Technical Collaboration: Actively collaborate with Tier-1 support, infrastructure, and development teams, ensuring alignment on system improvements, rapid issue resolution, and operational knowledge sharing.
- On-Call Support: Participate in a rotating 24/7 on-call schedule to swiftly address critical system incidents, maintain continuous service delivery, and uphold customer trust.
What You’ll Bring
- Minimum of 5 years in Technical Operations, Site Reliability Engineering (SRE), or related roles. Proven Linux/Unix system administration and advanced troubleshooting capabilities.
- Deep experience managing complex Web3 infrastructures (RPC services, validator setups, node operations). Skilled in interpreting blockchain logs, JSON-RPC responses, and debugging intricate Web3 protocol issues.
- Solid hands-on experience with configuration management and infrastructure automation tools (Helm, Terraform, Ansible, Consul), plus containerization expertise (Docker, Kubernetes) and experience managing and scaling services in cloud environments.
- Competency in scripting/programming languages (Python, Go, JavaScript).
- Advanced proficiency in monitoring and analytics platforms (Grafana, DataDog), enabling proactive and data-driven operational decision-making.
- Demonstrated ability to identify performance patterns, forecast potential issues, and implement preventive solutions.
- Strong track record of defining, measuring, and maintaining SLAs/SLOs, plus experience with incident response tooling and processes (PagerDuty) to ensure quick resolution and systematic root-cause analysis.
- Exceptional interpersonal and communication skills, with a proven ability to collaborate effectively across multiple teams and stakeholders.
- Self-motivated, solution-oriented, and consistently striving for operational improvements, quality enhancements, and reduced technical debt.
- Strong professional integrity, with a commitment to transparency, accountability, and ethical behavior. Able to manage complexity, stay adaptable under pressure, and keep learning in a rapidly changing technical landscape.
- Self-starter driven by curiosity and initiative, proactively identifying opportunities, addressing gaps, and implementing solutions autonomously.
- Thrives in dynamic environments and is committed to maintaining industry leadership through close collaboration with the most innovative and talented minds in Web3.
Level-Specific Expectations
P1 – Technical Operations Associate
- Execute documented playbooks (node deployment, DNS updates, incident triage) with close guidance.
- Monitor dashboards and PagerDuty; resolve known issues and escalate complex ones within the team.
- Shadow incident response and submit clear shift-handover notes.
P2 – Technical Operations Engineer
- Maintain two to three production chains or subsystems independently during your shift.
- Write or update small Ansible/Terraform modules and simple Bash/Python utilities.
- Act as first incident commander for SEV 2/3 events; publish concise post-incident notes.
- Tune alerts and dashboards to reduce false positives.
P3 – Technical Operations Engineer II
- Lead new chain launches from design review through canary, cut-over, and post-mortem.
- Command SEV 0/1 efforts and drive deep root-cause analysis.
- Define, track, and report SLOs; create capacity and cost models.
- Mentor P1/P2 engineers; perform peer reviews on IaC and observability changes.
- Join customer or partner calls for complex escalations.
P4 – Senior Technical Operations Engineer
- Architect region-wide failover, anycast, and multi-cloud safety controls.
- Build benchmarking harnesses that compare kernels, instance types, and storage back-ends.
- Lead fleet-scale initiatives (e.g., deployment stack updates, platform migrations) with minimal oversight.
- Establish reliability standards adopted by all Core TechOps engineers.
- Coach senior engineers and run design-review teams.