Description

CUJO AI is the leading provider of artificial intelligence solutions for network service providers. We use machine learning and real-world data to develop and deliver cutting-edge cybersecurity, device intelligence, and parental controls that enable network operators to offer better and safer connected experiences to millions of households.

We are seeking a highly skilled Ops Engineer with a software-oriented background and proven production experience managing large-scale systems. The ideal candidate will have a background in backend development, particularly in Python, and the ability to troubleshoot and resolve application bugs efficiently.

This role requires a proactive individual who can work collaboratively with cross-functional teams to ensure the reliability, performance, and scalability of our production environments. Additionally, the candidate should be an expert in AWS cloud, monitoring methods, and alerting systems, with knowledge of Tier 2 technical support.

 

Your responsibilities will be

  • Troubleshoot and resolve complex infrastructure and application issues.
  • Participate in the creation and implementation of operational policies and procedures.
  • Manage and maintain monitoring, logging, and alerting systems.
  • Implement automation and tooling to increase efficiency and reduce manual processes.
  • Conduct root cause analysis (RCA) for incidents and implement corrective and preventive measures.
  • Identify system weaknesses, fix bugs, and reduce system latency, leading to lower costs.
  • Troubleshoot and resolve application bugs, working closely with development teams.
  • Ensure code quality and performance through code reviews, testing, and optimization.
  • Implement and manage comprehensive monitoring and alerting solutions to ensure system health and performance.
  • Use tools such as Prometheus, Grafana, ELK stack, and CloudWatch to track metrics and respond to incidents.
  • Develop automated responses to common alerts and incidents to minimize downtime.
  • Work closely with development, QA, and operations teams to ensure seamless integration and delivery of software releases.
  • Document processes, procedures, and technical specifications to maintain a knowledge base.
  • Contribute to the development and enforcement of operational policies and procedures.
  • Continuously seek opportunities to optimize the production environment, focusing on both infrastructure and application code.
  • Ensure the serviceability, monitoring, and maintainability of the production environment, which is central to the Ops Engineer role.

Preferred competencies

  • Excellent problem-solving skills and attention to detail.
  • Strong communication and collaboration skills.
  • Knowledge of Tier 2 technical support processes and best practices.
  • Proficiency in backend development using Python.
  • Expert knowledge of AWS cloud services and infrastructure.
  • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, CloudWatch).
  • Proven experience as a Site Reliability Engineer (SRE) or in a similar role managing large-scale production systems.
  • Extensive experience with Kafka (MSK).

 

Benefits and Perks

  • Flexible working hours and your choice of location – home office or the CUJO AI office (in Kaunas or Vilnius)
  • Modern development equipment
  • Opportunity to learn from highly skilled colleagues
  • Ambitious projects and meaningful cause
  • Team building and company events
  • Conferences, training, books – anything for your development
  • 100 hours/year for training during paid business hours
  • Multiple bonus systems, such as performance, AWS certifications, inventions, and others
  • Benefits package that includes lunch in the office and Wolt coupons every month, recreational and health insurance benefits, and more!