Senior Data Scientist

Python · SF · Senior · Seed

About Probably Genetic

Probably Genetic is changing the lives of patients living with severe, complex diseases. Our data platform is used by drug developers and patient advocacy groups to develop and launch treatments for these patients. Our technology discovers undiagnosed patients online, analyzes their disease state using machine learning and at-home testing, and enables compliant communication with patients. In doing so, we help patients access diagnoses, clinical trials, and treatments as early as possible.

We are a tight-knit group of hard-working, ambitious problem solvers united by a mission greater than ourselves. We do well by doing right by patients. We are developing some of the most cutting-edge solutions in healthcare, and our roadmap is packed with innovations in bioinformatics, AI, and drug development. We have built a lean, all-star team to help us bring our vision to life, and we want you to be a part of it.

Probably Genetic has raised multiple rounds of funding from Silicon Valley’s best investors, including Threshold, Khosla, and Y Combinator, and offers competitive salaries, comprehensive benefits, and meaningful early-stage equity.

About the role

We are looking for a Senior Data Scientist who will own some of the most consequential diagnostic AI in rare disease: building, validating, and operationalizing the models that help us find and diagnose patients who have never had a name for their disease, powering the analytical rigor behind our testing programs, and shaping how we use data to make smarter product decisions.

What you will do

Own the end-to-end development, validation, and operationalization of PG's predictive diagnostic AI models (from feature engineering through production deployment) that power program eligibility decisions and clinical decisions for patients
Run prospective testing experiments: apply diagnostic models to undiagnosed patients, coordinate testing, and track outcomes to continuously improve model performance
Build and maintain PG's synthetic patient data pipeline, a critical deliverable for our research programs and a key input to our own model development lifecycle
Optimize our patient intake experience using NLP and multimodal data analysis to determine which questions to ask, in what order, to maximize data quality and conversion
Own API usage and cost optimization across PG's AI stack, including prompt engineering, model evaluation, and ongoing performance monitoring
Conduct ad hoc strategic analyses that inform product prioritization, support causality assessment, and generate customer-facing program insights
Establish MLOps infrastructure: model monitoring, drift detection, API observability, and lightweight but durable operational processes
Have the freedom to conduct blue sky research initiatives aimed at creating value from our data
Work with Data Engineering to build a robust, scalable data foundation that supports all of the above

Who you are

We are looking for a few specific things that will help you succeed in this role:

7+ years of experience in data science, machine learning engineering, or a closely related field
Strong Python proficiency and fluency across the core data science stack: pandas, NumPy, scikit-learn, PySpark, and SQL
Demonstrated end-to-end ML experience: you have taken models from problem definition through feature engineering, validation, deployment, and monitoring in a production environment
Experience with NLP techniques and applying language models to real-world problems
Comfort with prompt engineering and evaluating external AI API performance (e.g., OpenAI)
A track record of operating with high ownership in lean, fast-moving environments where you have had to build structure as much as execute within it
Strong analytical communication skills — you can translate complex model outputs and data findings into clear, actionable narratives for technical and non-technical audiences alike

Some things that are not required, but that you will learn on the job:

Experience with Databricks or similar lakehouse/ML platform environments
Familiarity with synthetic data generation techniques
Domain knowledge in healthcare, rare disease, genomics, or clinical research
Experience with MLOps tooling and building observability infrastructure from scratch
Exposure to biopharma or insurance analytics use cases

As with all new hires at Probably Genetic, you will also need to be:

A good person. We work with some of the most marginalized populations on the planet and empathy is key
Patient-focused and motivated to have a lasting, positive impact on humanity
Comfortable in a fast-paced, often ambiguous environment with rapid change
Action-oriented and excited to build a company from the ground up

The salary range for this role is $180,000-$230,000 annually. Actual compensation offered will depend on several factors including but not limited to: work experience, education, skill level, and/or other business and organizational needs.

What we offer at Probably Genetic:

An engaging and supportive team all on a mission to improve lives
Fair and equitable compensation with competitive early-stage equity grants
Generous flexible time-off policy that we actually use
Parental leave benefits (12 weeks for both birthing and non-birthing parents)
Hybrid, flexible work with high-trust and autonomy
A bright, inviting, pet-friendly office in Downtown SF near transit
A “work from anywhere” policy, up to 4 weeks a year
Regular team retreats in exciting destinations
Health benefits including medical, dental, vision, and therapy, plus FSA and 401(k)
And so much more!

Probably Genetic is committed to fostering a welcoming and inclusive work environment for people of all genders, sexualities, ethnicities, socioeconomic backgrounds, and life experiences. We urge candidates of all backgrounds to apply. If you require specific accommodations as you interview or consider working with us, please let us know.
