Hybrid – Cambridge, MA
Through End of Year
Top 3–5 Skills Needed:
- Strong programming and data engineering skills (Python, SQL, R)
- Experience with large-scale omics data management and integration
- Knowledge of metadata standards and ontologies for biological data
- Experience designing or maintaining bioinformatics data pipelines or repositories
- Understanding of data governance, permissions, and FAIR data principles
Job Description:
- We are seeking a highly motivated Data Scientist to design and implement an internal GEO-like system for managing the Immune Discovery omics data assets.
- The successful candidate will build a centralized platform that integrates raw, processed, and metadata layers of multi-omics datasets (e.g., bulk and single-cell RNA-seq, spatial omics, CyTOF) and ensures that they are findable, accessible, well-documented, and permission-controlled.
- This role bridges bioinformatics, data engineering, and data governance, enabling researchers to efficiently submit, query, and reuse internal datasets while maintaining data quality and compliance.
Key Responsibilities:
- Design and implement scalable pipelines for ingestion, curation, and storage of raw and processed omics data.
- Build and maintain a searchable data catalog or portal to enable dataset discovery and visualization of metadata and QC metrics.
- Implement access controls and permission management systems to ensure appropriate data security and compliance.
- Work closely with the Immunology Discovery and IR teams to integrate the system with existing compute and storage infrastructure.
- Develop and enforce metadata standards, ontologies, and schemas to ensure consistency and interoperability across studies.
Impact:
- By developing this internal data platform, the candidate will transform how omics data are organized and shared across the client's organization.
- The system will improve data visibility and reuse, enhance reproducibility, and accelerate scientific insights by enabling streamlined access to all relevant data layers: raw, processed, and annotated.
Qualifications:
- BS with 5+ years of experience or MS with 0–3 years of experience in Bioinformatics, Computational Biology, Data Science, Computer Science, or a related field.
- Proficiency in Python and SQL, with experience in data wrangling, ETL pipelines, and automation.
- Hands-on experience managing large omics datasets.
- Strong understanding of metadata models, data provenance, and FAIR data principles.
- Excellent communication skills and ability to collaborate with cross-functional teams.
Preferred Technical Skills:
- Experience with cloud storage or compute environments (AWS, GCP, or on-prem HPC).
- Experience with workflow orchestration tools (Nextflow, Snakemake).
- Familiarity with relational (e.g., PostgreSQL) and NoSQL databases.
- Familiarity with public repositories such as GEO or SRA and their metadata standards.
- Proficiency with Git for version control and collaboration.
Additional Technical Skills (a plus):
- Experience with containerization (Docker/Singularity) and CI/CD workflows.
- Understanding of web application frameworks or dashboarding tools for data portals.
- Exposure to single-cell or multi-omics integration workflows.
- Experience implementing data access and permission systems integrated with organizational identity management.
Job Type: Contract
Pay: $88.00 - $90.00 per hour
Expected hours: 40 per week
Experience:
- Nextflow: 5 years (Preferred)
- Snakemake: 5 years (Preferred)
- Relational and NoSQL databases (e.g., PostgreSQL): 5 years (Preferred)
- Data wrangling, ETL pipelines, and automation: 4 years (Preferred)
- Managing large omics datasets: 5 years (Preferred)
- Python: 5 years (Preferred)
- SQL: 5 years (Preferred)
- R Programming: 5 years (Preferred)
Ability to Commute:
- Cambridge, MA 02139 (Preferred)
Work Location: In person