Research Tools Platform

Next-generation scientific computing infrastructure, experiment tracking, and collaboration platforms that accelerate AI research, with built-in reproducibility frameworks, distributed training orchestration, and knowledge management systems for academic and industry research teams.

10K+
Active Researchers
1M+
Experiments Tracked
500+
Institutions
99.99%
Uptime SLA

Platform Overview

Accelerating Scientific Discovery

Our Research Tools Platform provides comprehensive infrastructure for modern AI research, from experiment design to publication. Through integrated experiment tracking, hyperparameter optimization, distributed training orchestration, and collaborative notebooks, we eliminate infrastructure friction, enabling researchers to focus on innovation and discovery.

Built for academic institutions, corporate research labs, and AI startups, our platform ensures full reproducibility with versioned datasets, model checkpoints, code snapshots, and environment configurations. Every experiment is traceable, auditable, and shareable, supporting peer review, collaboration, and scientific transparency. Seamless integration with HPC clusters, cloud GPUs, and on-premise infrastructure ensures flexibility across diverse research environments.

Experiment Tracking

  • Automatic logging of hyperparameters, metrics, and artifacts
  • Interactive visualization of training curves and model performance
  • Experiment comparison, filtering, and search across projects
  • Model registry with versioning and deployment tracking
  • Provenance tracking for full reproducibility and auditability
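
For a concrete feel for the automatic logging described above, here is a minimal, self-contained Python sketch of an experiment-tracking client. It is illustrative only: the class, file layout, and method names are placeholders, not the platform's actual SDK.

```python
import json
import time
import uuid
from pathlib import Path

class ExperimentTracker:
    """Illustrative stand-in for an experiment-tracking client:
    records hyperparameters and step metrics to a local JSON file."""

    def __init__(self, project: str, run_dir: str = "runs"):
        self.run_id = uuid.uuid4().hex[:8]
        self.path = Path(run_dir) / project / self.run_id
        self.path.mkdir(parents=True, exist_ok=True)
        self.record = {"project": project, "run_id": self.run_id,
                       "started": time.time(), "params": {}, "metrics": []}

    def log_params(self, **params):
        # Hyperparameters are logged once per run.
        self.record["params"].update(params)
        self._flush()

    def log_metric(self, name: str, value: float, step: int):
        # Metrics are appended per training step.
        self.record["metrics"].append({"name": name, "value": value, "step": step})
        self._flush()

    def _flush(self):
        (self.path / "run.json").write_text(json.dumps(self.record, indent=2))

# Usage: log hyperparameters once, then per-step metrics during training.
tracker = ExperimentTracker(project="demo")
tracker.log_params(lr=3e-4, batch_size=64, optimizer="adamw")
for step in range(3):
    tracker.log_metric("train_loss", 1.0 / (step + 1), step=step)
```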

Distributed Training Orchestration

  • Multi-GPU and multi-node training with automatic scaling
  • Integration with SLURM, Kubernetes, and Ray for job scheduling
  • Spot instance management and fault-tolerant checkpointing
  • Hyperparameter sweeps with Bayesian optimization and PBT
  • Resource utilization monitoring and cost optimization
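
The fault-tolerant checkpointing mentioned above typically follows a resume-from-latest pattern. Below is a minimal sketch in plain PyTorch; the checkpoint path, epoch count, and toy model are illustrative assumptions, not platform code.

```python
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoints/latest.pt"  # illustrative location

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

# Resume if a previous run (e.g. a preempted spot instance) left a checkpoint.
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 5):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Persist state every epoch so a restarted job resumes instead of recomputing.
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT_PATH)
```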

Collaborative Research

  • Shared project workspaces with role-based access control
  • Collaborative Jupyter notebooks with real-time editing
  • Version control integration (Git, DVC) for code and data
  • Comment threads and annotations on experiments and results
  • Publication-ready plots, tables, and LaTeX export

Dataset Management

  • Versioned datasets with deduplication and compression
  • Efficient data loading with caching and prefetching
  • Data provenance tracking from raw to preprocessed
  • Support for petabyte-scale datasets across S3, HDFS, NFS
  • Privacy-preserving data sharing and access controls
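
As a rough illustration of the cached, prefetching data loading described above, the sketch below uses standard PyTorch DataLoader options (num_workers, prefetch_factor, pin_memory, persistent_workers); the synthetic dataset is a placeholder standing in for a real versioned dataset.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticDataset(Dataset):
    """Placeholder dataset; a real workload would stream versioned shards
    from object storage (e.g. S3) behind the same Dataset interface."""
    def __len__(self):
        return 10_000
    def __getitem__(self, idx):
        return torch.randn(128), torch.randint(0, 10, ()).item()

loader = DataLoader(
    SyntheticDataset(),
    batch_size=256,
    num_workers=4,           # parallel workers overlap I/O with compute
    prefetch_factor=2,       # each worker keeps 2 batches ready ahead of the GPU
    pin_memory=True,         # faster host-to-GPU transfer
    persistent_workers=True, # keep workers alive across epochs
)

for features, labels in loader:
    pass  # training step would go here
```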

Reproducibility and Provenance

  • One-click experiment reproduction with exact environments
  • Docker and Conda environment capture and versioning
  • Git commit linking and dependency freezing
  • Artifact storage for models, checkpoints, predictions
  • Audit trails for regulatory compliance and peer review
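
To make the environment-capture idea concrete, here is a small sketch of the kind of provenance manifest such tooling might record, using ordinary git and pip commands. It assumes the script runs inside a git repository with pip available, and the manifest format is invented for illustration.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone

def run(cmd):
    """Run a command and return its trimmed stdout."""
    return subprocess.check_output(cmd, text=True).strip()

manifest = {
    "captured_at": datetime.now(timezone.utc).isoformat(),
    "python": sys.version,
    "platform": platform.platform(),
    # Link the run to the exact code revision...
    "git_commit": run(["git", "rev-parse", "HEAD"]),
    "git_dirty": bool(run(["git", "status", "--porcelain"])),
    # ...and freeze the dependency set it executed with.
    "pip_freeze": run([sys.executable, "-m", "pip", "freeze"]).splitlines(),
}

with open("provenance.json", "w") as f:
    json.dump(manifest, f, indent=2)
```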

Integration Ecosystem

  • PyTorch, TensorFlow, JAX, scikit-learn integration
  • HuggingFace Transformers and Diffusers compatibility
  • Weights & Biases, MLflow, TensorBoard interoperability
  • GitHub Actions, GitLab CI/CD pipeline integration
  • Slack, email, and webhook notifications for job status
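
As an example of the webhook-style notifications listed above, the following sketch posts a job-status message to a Slack-style incoming webhook using only the Python standard library; the URL and message are placeholders, not real endpoints.

```python
import json
import urllib.request

def notify(webhook_url: str, text: str) -> None:
    """Post a small JSON payload to an incoming-webhook endpoint.
    Slack-style incoming webhooks accept a {"text": ...} body."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()

# Example (the URL is a placeholder, so the call is left commented out):
# notify("https://hooks.example.com/services/TOKEN",
#        "Run 1a2b3c finished: val_acc=0.91 after 12h on 8 GPUs")
```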

Join the Research Tools Team

Build infrastructure empowering researchers to make breakthrough discoveries in AI and scientific computing

Hiring planned → Coming soon

Principal Engineer, Research Infrastructure

Location: Philadelphia, PA or Remote
Compensation: $205,000 - $290,000 + equity
Type: Full-time

Architect and build scalable research infrastructure supporting distributed training, experiment tracking, and collaborative research at academic and industry scale. Design systems enabling 10K+ researchers to run millions of experiments across cloud and on-premise GPU clusters.

Core Responsibilities

  • Design and implement distributed training orchestration systems supporting multi-GPU, multi-node workloads across heterogeneous clusters
  • Build experiment tracking infrastructure handling millions of runs with sub-second query latency for metrics, hyperparameters, and artifacts
  • Architect storage systems for petabyte-scale datasets, model checkpoints, and training artifacts with deduplication and efficient retrieval
  • Develop job scheduling and resource allocation algorithms optimizing GPU utilization, cost efficiency, and researcher productivity
  • Implement fault-tolerant checkpointing, spot instance management, and automatic failure recovery for long-running training jobs
  • Build monitoring and observability systems tracking cluster health, job progress, resource utilization, and infrastructure costs
  • Collaborate with research teams to understand workflows and design APIs, SDKs, and CLI tools for seamless integration

Required Qualifications

  • BS/MS in Computer Science, Engineering, or related field with 8+ years of infrastructure engineering experience for ML/HPC workloads
  • Deep expertise in distributed systems, job scheduling, and resource orchestration using Kubernetes, SLURM, or Ray
  • Strong systems programming skills in Go, Rust, or C++ with experience building high-performance, scalable infrastructure
  • Proven track record designing and operating ML platforms supporting hundreds of researchers and thousands of experiments
  • Experience with cloud infrastructure (AWS, GCP, Azure) and GPU-accelerated computing (NVIDIA A100, H100, AMD MI300)
  • Understanding of ML training workflows, distributed training frameworks (PyTorch DDP, DeepSpeed, Horovod), and model serving
  • Familiarity with storage systems (S3, HDFS, Lustre, Ceph) and database technologies (PostgreSQL, TimescaleDB, ClickHouse)

Preferred Qualifications

  • PhD in Computer Science or experience as a staff engineer at leading ML infrastructure companies (Google, Meta, Microsoft)
  • Contributions to open-source ML infrastructure projects (Kubeflow, Ray, MLflow, Weights & Biases, Determined AI)
  • Experience building HPC clusters or working at national labs with supercomputing infrastructure
  • Background in hyperparameter optimization algorithms (Bayesian optimization, PBT, ASHA)
  • Understanding of GPU networking (InfiniBand, NVLink, RoCE) and high-performance interconnects
  • Track record optimizing infrastructure costs and improving GPU utilization at scale

Why This Role Matters

Your infrastructure will accelerate breakthrough discoveries in AI, healthcare, climate science, and fundamental research. Build systems empowering the next generation of researchers, eliminating infrastructure barriers to scientific innovation and discovery.

Express Interest

Senior Research Software Engineer

Location: Philadelphia, PA or Remote
Compensation: $165,000 - $230,000 + equity
Type: Full-time

Build research productivity tools enabling reproducible, collaborative, and efficient ML research. Design APIs, SDKs, and developer experiences that researchers love, integrating seamlessly with PyTorch, TensorFlow, and modern ML workflows.

Core Responsibilities

  • Design and implement Python SDKs and APIs for experiment tracking, hyperparameter logging, and artifact management
  • Build integrations with popular ML frameworks (PyTorch, TensorFlow, JAX, HuggingFace) ensuring zero-friction adoption
  • Develop collaborative Jupyter notebook environments with real-time editing, versioning, and sharing capabilities
  • Implement dataset versioning and data loading libraries optimizing I/O performance for large-scale training
  • Create visualization dashboards for training metrics, model comparisons, and hyperparameter analysis using React and D3.js
  • Build CLI tools and automation scripts for common research workflows (sweeps, reproducibility, model deployment)
  • Collaborate with research users to gather feedback, prioritize features, and improve developer experience

Required Qualifications

  • BS/MS in Computer Science or related field with 6+ years of software engineering experience building developer tools or ML infrastructure
  • Strong Python programming skills with experience designing APIs, SDKs, and libraries used by external developers
  • Deep understanding of ML research workflows, experiment tracking, and reproducibility challenges
  • Experience with PyTorch, TensorFlow, or JAX and familiarity with distributed training and model optimization
  • Proficiency in full-stack development with React, TypeScript, and modern web technologies
  • Understanding of data engineering, database systems, and efficient data loading for ML training
  • Excellent communication skills with the ability to engage researchers, gather requirements, and provide technical support

Preferred Qualifications

  • PhD in Computer Science or ML research background with first-hand experience in academic or industry research
  • Contributions to open-source ML libraries (PyTorch, TensorFlow, HuggingFace, scikit-learn)
  • Experience building experiment tracking tools (MLflow, Weights & Biases, TensorBoard, Neptune)
  • Background in data visualization, interactive dashboards, and user interface design
  • Familiarity with containerization (Docker), environment management (Conda), and reproducibility tools (DVC)
  • Track record engaging with research communities through documentation, tutorials, and conference talks

Why This Role Matters

Your tools will be used daily by thousands of researchers advancing the frontiers of AI, science, and technology. Build products that eliminate friction, improve reproducibility, and accelerate the pace of discovery in machine learning research.

Express Interest

Director of Product, Research Platform

Location: Philadelphia, PA
Compensation: $185,000 - $255,000 + equity
Type: Full-time

Own product strategy for the Research Tools Platform, driving adoption among academic institutions, corporate research labs, and AI startups. Translate researcher needs into product features enabling reproducible, collaborative, and efficient ML research at scale.

Core Responsibilities

  • Define and execute product strategy for the Research Tools Platform, balancing academic research needs with enterprise requirements
  • Own the P&L for the Research Platform, including revenue targets, pricing strategy, customer acquisition, and retention metrics
  • Lead customer discovery with academic labs (MIT, Stanford, Berkeley), corporate research (Google AI, Meta FAIR), and AI startups
  • Collaborate with engineering teams to prioritize features including experiment tracking, distributed training, collaboration tools, and integrations
  • Build product roadmap addressing reproducibility, scalability, cost optimization, and researcher productivity challenges
  • Drive go-to-market execution partnering with academic outreach, developer relations, and enterprise sales teams
  • Establish product metrics measuring adoption, engagement, experiment volume, research productivity, and customer satisfaction
  • Represent the platform at ML conferences (NeurIPS, ICML, MLSys), publish blog posts, and build relationships with the research community

Required Qualifications

  • 8+ years of product management experience with 5+ years in developer tools, ML infrastructure, or research platforms
  • Proven track record owning a P&L and driving adoption for technical products serving researchers and data scientists
  • Deep understanding of ML research workflows, experiment tracking, distributed training, and reproducibility challenges
  • Technical fluency to engage with researchers and engineers, understanding ML frameworks and infrastructure requirements
  • Experience with both academic (universities, national labs) and industry research (tech companies, AI labs) customers
  • Strong quantitative skills with expertise in product analytics, cohort analysis, and researcher productivity metrics
  • Executive presence to engage principal investigators and research directors, and to present at academic conferences

Preferred Qualifications

  • PhD in Computer Science, ML, or quantitative field with first-hand research experience
  • Background in ML infrastructure companies (Weights & Biases, Determined AI, Comet, Neptune) or cloud ML platforms
  • Experience with academic partnerships, grant programs, and research community engagement
  • Technical degree with hands-on ML research or publications at top-tier conferences
  • Track record building 0-to-1 developer tools or research platforms with strong community adoption
  • Network in ML research community through conferences, workshops, or academic collaborations

Why This Role Matters

Shape how researchers conduct ML experiments, ensuring reproducibility, collaboration, and accelerated discovery. Build products supporting breakthrough research in AI, medicine, climate science, and fundamental physics at leading institutions worldwide.

Express Interest