Headquarters: Beaverton, Oregon
URL: https://www.discogs.com/about/careers
The Discogs Platform team is focused on several objectives: building and supporting performant, cost-effective, reliable infrastructure; developer experience tooling and mentorship; and creating "golden paths" for organization-wide standards and velocity. As a key member of the Platform team, the Senior Site Reliability Engineer - Data will work closely with other Discogs engineering squads to develop and optimize scalable, well-planned relational database architectures, drive best practices and stability for our use of Kafka and change data capture, and contribute to the Platform team’s operations.
Location
This is a remote position, open to candidates located in OR, WA, CA, CO, TX, and IL.
Compensation
Starting Base Salary Range: $130,000 - $140,000 yearly
What You’ll Accomplish
Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.
- Stewarding Discogs’ data stores as a key subject matter expert
- Leading efforts on the reliability and design patterns of our Kafka and Kafka Connect implementations
- Establishing data contracts and clear communication standards between CDC producers and consumers
- Working closely with engineering squads to refactor and re-architect MySQL database schema and indexing for long-term scalability, performance, and cost effectiveness
- Mentoring engineering squads on Platform best practices for MySQL, Kafka, and other software development lifecycle areas
- Writing documentation and runbooks that contribute to the engineering organization’s knowledge base
- Working in a containerized, orchestrated environment
- Contributing to the Platform team’s disciplines of site reliability and operations, supporting both our squads and Platform’s central infrastructure
- Participating in on-call rotation, responding to incidents, and troubleshooting data and other operations issues
What You’ll Contribute
Minimum Education and Experience
- A Bachelor's Degree in Computer Science or similar area of focus, or equivalent relevant work experience.
- 5+ years of experience working with Kafka and relational database management systems (RDBMS).
- 6+ years of experience in Ops, DevOps, Site Reliability, Platform, or other systems roles.
Required Skills & Abilities:
- Relational database schema design, query performance optimization, administration (MySQL, Percona Server, AWS RDS)
- Kafka: Cluster administration (Strimzi), Kafka Connect (Debezium, JDBC)
- CI/CD (GitHub Actions)
- GitOps (ArgoCD)
- Kubernetes (EKS, Kustomize, Karpenter, administration, application manifests)
- AWS and cloud development (VPC, EKS, RDS, S3)
- Observability (Datadog, Sentry)
- Scripting (Shell, Python)
- Track record of collaboration and mentorship
- Excellent written communication and documentation skills
- Continuous learning
- Ownership and proactive approach to solving large problems
Preferred:
- Infrastructure-as-code (Terraform)
- Elasticsearch (ECK administration, scaling, performance)
- Python (SQLAlchemy, FastAPI)
- GraphQL (schema design, Apollo Federation)
- REST API
- HashiCorp Vault
- Redis
- Memcached
- NoSQL Database
- Data Lake/Warehouse
- Data Governance
- Data Security
The Platform team covers a wide range of technical topics and we'd love to hear about your skills beyond this list!
To apply: https://weworkremotely.com/remote-jobs/discogs-inc-senior-site-reliability-engineer-data-remote