Software

For better or worse, most of the software I write I can’t make publicly available. But some I can, or at least do and hope for forgiveness later. If I can share a repo below but access needs to be restricted (say, to Stanford employees only), I’ll note that after the name. Some things I can only make available to my current team for now, but I’ll work to change that. If you want specific examples, I can probably provide something on request.

Jump to: javascript | python | bash | FORTRAN/C/C++

javascript

Coordinator: A DAG-like asynchronous operation processor for node.js, for when you need a sequence of operations to either all succeed or all fail if any one fails. Easily add operational “stages”, with or without prerequisites, rollback rules, and result transformations. Supports repeated execution, as many times as needed. Published as daat-coordinator on npm, where “DAAT” stands for “Directed Acyclic Asynchronous Task”.
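To make the idea concrete, here is a minimal Python sketch of staged execution with prerequisites and rollback. It is not the daat-coordinator API: the `Stage` class, the `execute` function, and the rollback ordering (reverse order of completion) are all assumptions made for illustration, and the stage graph is assumed to be acyclic.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional, Sequence


@dataclass
class Stage:
    """Hypothetical stage description (not the daat-coordinator interface)."""
    name: str
    run: Callable[[dict], Awaitable[Any]]  # receives prerequisite results keyed by stage name
    requires: Sequence[str] = ()
    rollback: Optional[Callable[[Any], Awaitable[None]]] = None


async def execute(stages):
    """Run every stage after its prerequisites; undo completed stages if any stage fails."""
    tasks = {}
    completed = []  # (stage, result) pairs in order of completion

    async def run_stage(stage):
        inputs = {}
        for dep in stage.requires:
            inputs[dep] = await tasks[dep]  # re-raises if the prerequisite failed
        result = await stage.run(inputs)
        completed.append((stage, result))
        return result

    # Create all tasks first so stages can be listed in any order.
    for stage in stages:
        tasks[stage.name] = asyncio.create_task(run_stage(stage))
    outcomes = await asyncio.gather(*tasks.values(), return_exceptions=True)

    failures = [o for o in outcomes if isinstance(o, BaseException)]
    if failures:
        # Assumed policy: roll back in reverse order of completion, then re-raise.
        for stage, result in reversed(completed):
            if stage.rollback is not None:
                await stage.rollback(result)
        raise failures[0]
    return dict(zip(tasks, outcomes))
```

Calling `asyncio.run(execute([...]))` either returns a dictionary of results keyed by stage name or, after running the rollbacks, raises the first failure. Because nothing is stored between calls, the same list of stages can be executed repeatedly.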

statesampler: A simple node.js server for state-dependent sampling of data for custom javascript surveys, such as in Qualtrics. For when you need to sample data for custom questions, but your sampling strategy is not iid but rather Markov or something similar. Repo includes a node.js server and systemd service setup. This tool has been used in several experiments at Stanford, involving many thousands of experimental observations. I’ve also used real-world experimental data to show that properly randomized sampling achieves nearly optimal distributional performance for experimental task distribution in the presence of participant dropout, and better performance than that achievable with a priori experimental plans.

statesamples.png

Normalized counts (~ histograms) for item observations in actual experiments (grey) and in simulations of those observations under different sampling methods, using detailed logs collected during an experimental run. Near-uniformity in the distribution of counts (black) is impossible to obtain in the presence of (stochastic) dropout. “Smart” sampling methods using a global state (top, green; enabled by statesampler) come the closest to optimal, followed by experimental plans (bottom, blue; with and without randomness) and finally by uniformly random task choices (both, red), which are a poor choice for attaining distributional uniformity in experiments with dropout.
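The comparison can be illustrated with a toy simulation in Python. This is a sketch under invented assumptions, not the statesampler algorithm or the analysis behind the figure: participants arrive in batches, each drops out independently with a fixed probability, and dropout is only revealed when a batch finishes. The function names and all parameter values are hypothetical.

```python
import random


def uniform_choice(completed, in_flight, rng):
    """Ignore all state: pick any item with equal probability."""
    return rng.randrange(len(completed))


def state_aware_choice(completed, in_flight, rng):
    """Use global state: pick among the items with the fewest completed-plus-assigned observations."""
    load = [c + f for c, f in zip(completed, in_flight)]
    fewest = min(load)
    return rng.choice([i for i, n in enumerate(load) if n == fewest])


def simulate(choose, n_items=10, n_batches=200, batch_size=20, dropout=0.3, seed=0):
    """Return completed-observation counts per item under the given choice rule."""
    rng = random.Random(seed)
    completed = [0] * n_items
    for _ in range(n_batches):
        in_flight = [0] * n_items
        for _ in range(batch_size):
            in_flight[choose(completed, in_flight, rng)] += 1
        # Dropout is only observed once the batch has finished.
        for item, assigned in enumerate(in_flight):
            completed[item] += sum(rng.random() >= dropout for _ in range(assigned))
    return completed


def spread(counts):
    """Gap between the most- and least-observed items (0 would be perfectly uniform)."""
    return max(counts) - min(counts)
```

Comparing `spread(simulate(state_aware_choice))` with `spread(simulate(uniform_choice))` shows the state-aware rule staying much closer to uniform counts, because each new batch compensates for the dropout observed so far.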

Contact me if you want to chat about this.

yenaccess (Stanford only): A full-stack javascript project (mongoDB, node.js, and react/redux) for managing sponsored access to Stanford GSB’s “yen” research computing servers. A node.js backend serves an API accessible from a react/redux SPA that furnishes a request form, approval/rejection/renewal forms, and an administration form for tool admins. The project logs all requests for access and request activity, sends notifications and reminders via email and Slack messages, automatically expires requests that get too old, and automatically notifies sponsors of possible renewals (on a timeline consistent with Stanford ISO minimum security guidelines).
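The expiry and renewal logic might look something like the Python sketch below. The project itself is javascript, and the one-year lifetime, the 30-day renewal window, and the `access_status` function are hypothetical values and names chosen for illustration, not the actual yenaccess rules.

```python
from datetime import datetime, timedelta

LIFETIME = timedelta(days=365)       # hypothetical: sponsored access lasts one year
RENEWAL_WINDOW = timedelta(days=30)  # hypothetical: notify sponsors 30 days before expiry


def access_status(approved_at, now):
    """Classify a sponsored-access request by its age."""
    expires_at = approved_at + LIFETIME
    if now >= expires_at:
        return "expired"
    if now >= expires_at - RENEWAL_WINDOW:
        return "renewal_due"
    return "active"
```

A scheduled job could run this over all approved requests, expiring those that return `"expired"` and notifying sponsors for those that return `"renewal_due"`.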

cloudforest (in progress, public fall 2019): A full-stack javascript project (mongoDB, node.js, and react/redux) for managed on-demand AWS EC2 cloud computing services. Also includes monitoring and automation via InfluxData’s TICK stack.

cfoverview

Currently in use at Stanford GSB by at least 5 research groups to enable self-service, efficient, and secure cloud computing for research purposes. Our installation has managed tens of thousands of dollars worth of computing resources in 7-8 months of beta program use.
cfserver (Team only): a node.js server that interacts with mongoDB and AWS, providing the basic infrastructure for on-demand cloud computing.
cfmetrics (Team only): a node.js server that provides an interface to the TICK stack for instance monitoring.
cfdashbd (Team only): a react/redux app providing dashboard functionality for cloudforest including instance management (creation, starting/stopping, deletion), activity monitoring (CPU/memory usage), group management (viewing, adding, removing users), access to jupyter notebooks, and more:

cloudforest

cfsetup (Team only): a structured repository of bash scripts defining how to set instances up after they launch. Uses bitbucket pipelines and AWS CodeDeploy to consistently deploy updates to instances.
cfsite (Team only): repo for a jekyll site, with build and deploy pipelines, containing basic information about cloudforest, complete documentation, and articles about use of the platform.

python

idlogit (in progress): A python package for estimating (binary outcome) “Idiosyncratic Deviations Logit” models using ECOS. idLogit models are Logit models for heterogeneous observations with a non-parametric portrait of response heterogeneity and a convex maximum likelihood estimation problem. You can review the slides for an academic talk at Stanford’s ICME about this method here.

Serverless Code: I’ve written a few serverless functions in python. Some are for fun; for example, we have a few Slack apps that call python Lambdas to transform or manipulate text:

rossputer

A more useful example comes from our monitoring systems. We have a Slack app built on top of a Lambda that allows us to access summary metrics data about our machines; for example,

/yen 3 publish

displays

yenstats

for the entire channel to see and

/yen 3 users publish

displays

yenusers

for the entire channel to see. Actually, this user-specific data comes from a Lambda pipeline too: code on the actual servers sends process data to S3, and a python Lambda executes a rollup operation on the data to get and store user statistics. I’ve also helped teams set up complicated serverless applications, including machine learning model evaluations in python (using both AWS Lambda with Layers and Google Cloud Functions).
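A rollup of this kind can be sketched as follows. The record fields (`user`, `cpu_percent`, `memory_mb`) and the event shape are assumptions for illustration; the real pipeline reads its input from S3 and stores the result, which this sketch leaves out so that it stays self-contained.

```python
from collections import defaultdict


def rollup(process_records):
    """Aggregate per-process records into per-user totals (hypothetical field names)."""
    totals = defaultdict(lambda: {"processes": 0, "cpu_percent": 0.0, "memory_mb": 0.0})
    for record in process_records:
        stats = totals[record["user"]]
        stats["processes"] += 1
        stats["cpu_percent"] += record["cpu_percent"]
        stats["memory_mb"] += record["memory_mb"]
    return dict(totals)


def handler(event, context):
    """Lambda-style entry point; assumes the records arrive inline under event["records"]."""
    return rollup(event["records"])
```

Keeping the aggregation in a plain function like `rollup` makes it easy to test locally, with the handler acting only as a thin adapter around it.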

Multilevel Optimization with Integrals (packaging for release): I have helped a GSB professor with code for solving a certain multi-level optimization problem (an optimization that depends on another optimization) whose objectives can also only be determined approximately, because they involve integrals. I wrote quite a bit of python code for exploring and solving this subtly difficult numerical task, which I’ll try to package for some kind of demo or release during summer 2019.
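To show the structure of such a problem, here is a toy two-level example in Python. It is not the professor's problem or my code for it: the objectives, the midpoint-rule quadrature, and the golden-section search are all stand-ins chosen so the answer can be checked by hand. The inner problem minimizes the integral over [0, 1] of (y - x t)^2 in y, whose solution is y = x/2; the outer problem then minimizes (x - 1)^2 + y(x)^2, whose solution is x = 0.8.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # inverse golden ratio


def golden_section(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return (a + b) / 2


def integral(g, n=200):
    """Midpoint-rule approximation of the integral of g over [0, 1]."""
    h = 1.0 / n
    return h * sum(g((k + 0.5) * h) for k in range(n))


def inner_solution(x):
    """Inner problem: argmin over y of the (approximate) integral of (y - x*t)^2."""
    return golden_section(lambda y: integral(lambda t: (y - x * t) ** 2), -5.0, 5.0)


def outer_objective(x):
    """Outer objective, which can only be evaluated by solving the inner problem."""
    y = inner_solution(x)
    return (x - 1.0) ** 2 + y ** 2


def solve_bilevel():
    """Return the outer minimizer and the matching inner solution."""
    x = golden_section(outer_objective, -5.0, 5.0, tol=1e-5)
    return x, inner_solution(x)
```

Even in this toy, the outer objective inherits error from both the quadrature and the inner solve, so the outer tolerance is deliberately looser than the inner one. Asking the outer search for more precision than the inner solve can deliver is one way these problems become subtly difficult.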

bash

superserver: A bash script (basically) that can help load balance horizontally-scaled arbitrary servers (like from python, node.js, or go code). You specify the source, start/stop actions, and any install commands; superserver then sets up all the (restarting) systemd services needed to run an arbitrary number of instances of your server, load balanced by your Apache web server.
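The kind of configuration involved can be sketched in Python (superserver itself is bash, and its actual templates may differ). The function names, the port-per-instance scheme, and the unit naming below are assumptions; the directives themselves (`Restart=`, `BalancerMember`, `ProxyPass`) are standard systemd and Apache mod_proxy_balancer settings.

```python
def systemd_unit(name, command, port):
    """Text of a restarting systemd service for one server instance."""
    return "\n".join([
        "[Unit]",
        f"Description={name} instance on port {port}",
        "After=network.target",
        "",
        "[Service]",
        f"Environment=PORT={port}",  # assumes the server reads its port from $PORT
        f"ExecStart={command}",
        "Restart=always",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
        "",
    ])


def apache_balancer(name, ports, path="/"):
    """Apache configuration that balances requests across the local instances."""
    members = [f'    BalancerMember "http://127.0.0.1:{port}"' for port in ports]
    return "\n".join([
        f'<Proxy "balancer://{name}">',
        *members,
        "</Proxy>",
        f'ProxyPass "{path}" "balancer://{name}/"',
        f'ProxyPassReverse "{path}" "balancer://{name}/"',
        "",
    ])


def plan(name, command, base_port, instances):
    """Return ({unit filename: unit text}, apache config) for the requested instance count."""
    ports = [base_port + i for i in range(instances)]
    units = {f"{name}-{port}.service": systemd_unit(name, command, port) for port in ports}
    return units, apache_balancer(name, ports)
```

For example, `plan("api", "/usr/bin/node /srv/api/server.js", 4000, 3)` (a made-up server) yields three unit files for ports 4000 to 4002 and one balancer block listing all three as members.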

stanshib (Stanford only): A set of scripts and templates for making Stanford Shibboleth setup easy on new machines/addresses.

FORTRAN/C/C++

pthreader: A lightweight C++ class for executing arbitrary code in parallel using pthreads, requiring only a definition of (a) setup, (b) evaluation, and (c) cleanup for each thread. Set up and clean up once, but evaluate as many times as needed. Extends the condition variable mutual exclusion method from Divakar Viswanath’s book.
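The set-up-once, evaluate-many pattern can be sketched in Python with `threading.Condition`. This is an analogue for illustration only, not pthreader's C++ interface; the class name, method names, and callback signatures are invented here.

```python
import threading


class PersistentWorkers:
    """Threads that run setup once, evaluate on demand, and clean up once."""

    def __init__(self, n, setup, evaluate, cleanup):
        self._cond = threading.Condition()
        self._generation = 0  # incremented once per evaluate_all() call
        self._pending = 0     # workers still busy in the current round
        self._stop = False
        self._arg = None
        self._results = [None] * n
        self._threads = [
            threading.Thread(target=self._run, args=(i, setup, evaluate, cleanup))
            for i in range(n)
        ]
        for thread in self._threads:
            thread.start()

    def _run(self, i, setup, evaluate, cleanup):
        state = setup(i)  # once per thread
        seen = 0
        while True:
            with self._cond:
                # Sleep until there is a new round of work or a stop request.
                self._cond.wait_for(lambda: self._stop or self._generation > seen)
                if self._stop:
                    break
                seen = self._generation
                arg = self._arg
            result = evaluate(i, state, arg)  # outside the lock, so workers run concurrently
            with self._cond:
                self._results[i] = result
                self._pending -= 1
                if self._pending == 0:
                    self._cond.notify_all()  # wake the caller waiting in evaluate_all()
        cleanup(i, state)  # once per thread

    def evaluate_all(self, arg):
        """Run evaluate on every worker with the same argument; return results by worker index."""
        with self._cond:
            self._arg = arg
            self._pending = len(self._threads)
            self._generation += 1
            self._cond.notify_all()
            self._cond.wait_for(lambda: self._pending == 0)
            return list(self._results)

    def close(self):
        """Ask every worker to clean up and exit, then wait for them."""
        with self._cond:
            self._stop = True
            self._cond.notify_all()
        for thread in self._threads:
            thread.join()
```

The generation counter is what lets `evaluate_all` be called repeatedly without restarting threads. The sketch does not handle exceptions raised inside `evaluate`; a worker that fails would leave the caller waiting.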

gslregressmpi: A simple example in C I wrote for a GSB professor interested in using MPI on our clusters to parallelize function/derivative calls when solving an optimization problem. This example uses OLS regression because it is trivial to implement in other ways and so verify results. Their real problem was much more complex, of course. This example also uses the GNU Scientific Library because it provided the professor’s preferred solver. The basic outline should work with any similar solver.
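The basic outline can be sketched in Python, with a thread pool standing in for MPI ranks. This is not the gslregressmpi code: the function names and the one-regressor model are assumptions. Each worker computes the sum of squared residuals and its gradient over its own slice of the data, and the partial results are summed (the role a reduction plays in MPI) before being handed to the solver.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_terms(beta, chunk):
    """Sum of squared residuals and its gradient over one chunk of (x, y) pairs."""
    b0, b1 = beta
    loss = g0 = g1 = 0.0
    for x, y in chunk:
        r = b0 + b1 * x - y
        loss += r * r
        g0 += 2.0 * r
        g1 += 2.0 * r * x
    return loss, g0, g1


def split(data, n_workers):
    """Deal the observations out to workers round-robin."""
    return [data[i::n_workers] for i in range(n_workers)]


def objective_and_gradient(beta, chunks, pool):
    """Evaluate every chunk in parallel, then reduce the partial sums."""
    loss = g0 = g1 = 0.0
    for part_loss, part_g0, part_g1 in pool.map(lambda c: partial_terms(beta, c), chunks):
        loss += part_loss
        g0 += part_g0
        g1 += part_g1
    return loss, (g0, g1)
```

With `chunks = split(data, 4)` and `pool = ThreadPoolExecutor(4)`, a solver would call `objective_and_gradient(beta, chunks, pool)` at each iterate. Because OLS has a closed-form solution, the parallel result is easy to verify against a serial evaluation.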

Code Optimization: I do a decent amount of low-level code optimization when people need it, but most of these cases are specific to a particular researcher and project and are not shareable. In one particularly tangible case, I rewrote some FORTRAN code that was multithreaded with OpenMP as simpler, optimized serial code, which took a 35-day multi-task runtime down to 1 1/2 days. In another, I optimized and rewrote matlab scripts both in matlab and in C using the Intel IPP and MKL, improving speed by a factor of 5. This is a pretty small gain, actually, and is small due to the density of linear algebraic operations in the original matlab code (operations which matlab inherently does well already). Another case involved data extraction from ~ 50GB of detailed XML datafiles from a financial firm; using the Xerces library in C++, I was able to tackle the task in an afternoon, whereas trying the same task with python (for fun) crashed a 32-core, 256GB memory AWS EC2 instance.