James Salzman

Software Engineer | Site Reliability Engineer

About

Software Engineer with a strong background in Site Reliability Engineering (SRE) principles and a focus on cloud-based infrastructure. Proven experience designing and implementing observability platforms (OpenTelemetry and Grafana's LGTM stack) to gain deep insight into application health and performance. Expertise in automating scalable infrastructure with Terraform, streamlining deployments with CI/CD pipelines (GitHub Actions), and developing high-quality, maintainable software in Go, Python, and Elixir. Seeking a challenging role that leverages both my software development and SRE skill sets to contribute to a high-performing team.

For more information or to discuss potential opportunities, please visit my contact page. I look forward to connecting!

Skills

  • Programming Languages: Go, Python, Elixir, Java, C/C++/C#, React, JavaScript, HTML, CSS, Bash, PowerShell, Assembly, EasyLanguage, thinkScript, Pine Script
  • Frameworks/Tools: OpenTelemetry, Terraform, GitHub Actions, Kubernetes, Ansible, Docker Swarm/Compose, Git
  • Services: Grafana, Mimir, Tempo, Loki, TimescaleDB/Postgres, Prometheus, Aerospike, Kafka, Splunk, Dynatrace
  • Operating Systems: Linux (Red Hat, Ubuntu, Fedora), macOS, Windows 11
  • Cloud Providers: AWS, Azure, GCP

Work Experience

Omni Logistics / Forward Air - Site Reliability Engineer

September 2023 - June 2024

  • Developed an organization-wide monitoring platform based on the LGTM stack and OpenTelemetry, giving teams visibility into infrastructure and applications through logs, metrics, and traces.
  • Defined and led the implementation of an organization-wide Incident Management and Postmortem Process, creating a robust framework for incident identification, resolution, and improvement.
  • Served as the sole maintainer of the in-house Omni TMS application and its infrastructure, implementing new features and maintaining service performance.
  • Enhanced the OpenTelemetry collector to streamline log collection from over 50 AWS accounts, significantly reducing monthly cloud bills.
  • Authored custom Prometheus Exporters in Go, enabling data collection from new sources and consolidating application telemetry into a single pane of glass.
  • Championed infrastructure automation and provisioning with Terraform, facilitating scalable, repeatable, and version-controlled infrastructure definitions.
  • Built and maintained CI/CD pipelines for application teams using GitHub Actions, boosting developer efficiency by automating application provisioning and deployment.
  • Rearchitected the company's DNS topology to enable seamless public and private DNS resolution across on-premises and multi-cloud environments, enhancing network security and efficiency.

Charles Schwab - Software Engineer II

Software Engineer II

February 2023 - September 2023

  • Developed and maintained a custom-built monitoring platform with Grafana, TimescaleDB, Kafka, and OpenTelemetry, giving the Trading organization real-time visibility and alerting across hundreds of application servers.
  • Served as Database Administrator and Support for the company's mission-critical NoSQL Trading database clusters. This included typical DBA work, management of low-latency bare-metal servers, monitoring, alerting, and incident response.
  • Led the development of an automation framework in Python, which was importable as a library and had a custom command line interface, enabling teams to quickly perform critical tasks across multiple servers.
  • Developed custom Ansible modules in Python, along with roles and playbooks, for automating application deployments and server provisioning. This reduced the time to perform maintenance from hours to minutes.
  • Developed CI/CD pipelines in Bamboo and Harness for application teams, including automated testing, deployment, and rollback strategies.

Software Engineer I

January 2022 - February 2023

  • Served as Database Administrator and Support for the company's mission-critical NoSQL Trading database clusters. This included typical DBA work, management of low-latency bare-metal servers, monitoring, alerting, and incident response.
  • Served as the team's Observability SME, developing custom Grafana dashboards, Prometheus exporters, and alerting rules for monitoring our NoSQL Trading database.
  • Developed an internal full-stack web application in Python's Django framework for managing database clusters, users, and permissions. This application was used by multiple teams and increased efficiency and consistency in database management.
  • Developed several Ansible roles and playbooks that automated software deployments across our fleet of servers. This included deploying and configuring monitoring agents, databases, and custom applications.
  • Served as the go-to automation expert for the team, developing custom tooling and scripts in Python for automating common tasks and reducing manual work.

Software Engineer Intern

Summer 2021

  • Authored additions to the company's internal monitoring platform, including custom Grafana dashboards.
  • Contributed code to a Java application built with technologies such as JUnit, Maven, Kafka, and InfluxDB.
  • Developed tools for automating database deployments in Python, reducing time to deploy new instances.

Freelance - Software Engineer

Fall 2020 - January 2022

  • Developed automated trading systems for clients on platforms like ThinkorSwim, TradeStation, and TradingView, turning clients' ideas into robust automated trading strategies.
  • Maintained a 100% job success rate with an average rating of 5 stars.

Personal Projects

Personal Website

Spring 2024

You're looking at it! jfs-web was developed with Go, Templ, and HTMX. Its infrastructure was provisioned in Google Cloud Platform with Terraform, and it is automatically built, tested, and deployed using a CI/CD pipeline in GitHub Actions.

Want to know more? Check out the demos page of this website, which showcases some of my skills in action.

TimescaleDB Distributed High Availability

Fall 2022

Designed a custom solution for multi-node TimescaleDB High Availability with etcd and a custom management agent. TimescaleDB multi-node shipped with no out-of-the-box automatic failover capabilities. This agent addressed the issue by automatically removing unhealthy nodes from the cluster, allowing reads and writes to continue on node failure. It also rebalanced chunks when the nodes rejoined the cluster.

Prometheus Exporters

Fall 2022

Developed Prometheus Exporters for applications running on my home network, including TimescaleDB and hardware metrics for water-cooled computers. These metrics were ingested by Prometheus and displayed in my home network's Grafana instance.

Ansible Web Interface

Spring 2022

Developed a web interface for Ansible that allows users to launch and monitor playbooks from their browser. The interface included an inventory of Ansible playbooks that could be run ad hoc, with live log tailing in the UI.

Education

The University of Texas at Austin

M.S. in Computer Science

Expected Graduation: Fall 2024

The University of Texas at Dallas

B.S. in Computer Science with a minor in Business Administration

Graduated: Fall 2021