HPC System Administrator

Full-time @Vector Institute in Information Technology
  • Post Date : January 10, 2023
  • Apply Before : January 20, 2023

Job Detail

  • Job ID 20034

Job Description

POSITION SUMMARY

The Vector Institute is seeking an HPC System Administrator to join our growing team in Toronto as we continue the work of making Canada a centre of expertise for AI in the world.

The incumbent in this role will participate in building and maintaining a High-Performance Computing environment for world-class research in Machine Learning.

As a member of the Scientific Computing team, the incumbent will share responsibility for managing servers, networking, storage and security for the High-Performance Computing infrastructure, and will provide support for the office local area network, servers and scientific computing workstations. The role also involves installing and maintaining server software and layered AI and machine learning software to support our 1000+ researchers and affiliates.

We are seeking a highly motivated System Administrator with a hands-on, problem-solving approach to managing and troubleshooting high-tech environments. The role will combine remote work with on-site work at the office and at our co-location facility, as required.

Here’s What You’ll Get To Do:

  • Support the Vector HPC systems, a growing compute cluster of 180+ nodes, 6,000+ cores and 1,000+ GPUs with 100GE networking
  • Support our 100+ GPU-enabled workstation office environment
  • Provide guidance and support to our research community
  • Develop and maintain solutions for automatic installation and configuration of infrastructure
  • Perform hardware and software system upgrades and maintenance
  • Install new scientific software and libraries on servers, workstations or laptops, across a variety of operating systems (Linux, Mac OS, Windows)
  • Support researchers in all their computing needs
  • Maintain network infrastructure and assist users
  • Maintain system security: firewall, IPS, system logs
  • General enterprise IT operations

KEY SUCCESS MEASURES

  • Ensures the smooth functioning of the research systems, by undertaking troubleshooting, maintenance and installation tasks.
  • Researchers and the enterprise operations feel supported in all other computing needs.
  • Builds and maintains tools that facilitate the automated or direct administration of network and computing infrastructure, both locally and on the cloud.

Here’s What You’ll Need:

Degree or diploma in computer science or engineering, or equivalent experience gained through more than three years of systems administration in a UNIX/Linux or other complex computing environment.

  • More than three years of proven, hands-on Linux/UNIX systems administration experience (Ubuntu, Red Hat, CentOS), preferably in a research environment
  • Hands-on experience in managing an HPC grid with Slurm or an equivalent scheduler
  • Proven programming/scripting skills as they pertain to systems administration
  • Managing and troubleshooting environments using mostly open-source software
  • Demonstrated ability to learn quickly
  • Demonstrated ability to prioritize tasks and resolve problems in a timely manner
  • Ability to work autonomously, multi-task and work in a fast-paced and stressful environment
  • Be proactive, addressing potential problems before they occur
  • Strong attention to detail
  • Problem-solving outlook
  • Excellent verbal and written communication skills

Qualifications and Experiences below are considered an asset:

  • Hands-on experience in managing HPC workload management systems such as Slurm, SGE, Moab/Torque or an equivalent scheduler
  • Experience supporting large storage infrastructure devices (SAN/NAS) and a good understanding of file systems such as ZFS and GPFS
  • Good understanding of high-speed networking technologies such as 100GE, InfiniBand, link aggregation, etc.
  • Good understanding of and experience with data management at scale, including performance, backups, archive, and monitoring
  • Experience maintaining application tools and databases such as MySQL and PostgreSQL
  • Experience with open-source infrastructure systems such as OpenLDAP, NFS, OpenZFS and 2FA systems

Please address applications (cover letter and resume) to Sameera Ali, Talent Acquisition Specialist, using the link provided. Review of applications will begin on January 16, 2023. We thank all applicants for their interest in this exciting opportunity and will be in touch with those whose qualifications most closely match our needs.

Please note that all interviews are currently being held remotely due to the ongoing COVID-19 pandemic.

At the Vector Institute, we are committed to driving excellence and leadership in Canada’s knowledge, creation, and use of AI to foster economic growth and improve the lives of Canadians. We strive for greater inclusion in the programs and culture that we build by welcoming and encouraging applications from all qualified candidates. This includes but is not limited to applicants who are Indigenous, 2SLGBTQIA+, racialized persons/visible minorities, women, and people with disabilities.

If you require an accommodation at any point throughout the recruitment and selection process, please contact hr@vectorinstitute.ai and we will happily work with you to meet your needs.
