William Zhengwei Ma

Jersey City, New Jersey williammaucla@gmail.com

Hello there! Thanks for checking out my profile! I'm excited about all things ML and engineering, with experience in project management, MLOps, Deep Learning / Machine Learning, and software engineering. Please take a look at some of the following links to see my expertise, or scroll down!


Work Experience

Senior Data Scientist

Office of Data Science - Liberty Mutual Insurance

I am a Senior Data Scientist in the Office of Data Science, a centralized DS team that promotes DS excellence across the enterprise.

Model Deployment - MLOps

I am heavily involved in helping engineers create an enterprise standard for DS model deployment. I have in-depth experience with the end-to-end (E2E) MLOps cycle: Airflow/Luigi for ETL, MLflow for model training, Flask apps and Docker containers for packaging, Bamboo and Artifactory for builds, AWS Fargate / SageMaker for deployment, and Datadog and Splunk for monitoring. At the same time, I am exploring more innovative solutions like Seldon to solve challenges such as inference graphs and A/B deployments. The models I am responsible for helping deploy, in partnership with various business teams and data scientists, create an estimated $10M+ in annual value.

Fraud Modeling

I'm currently working on an initiative for claims fraud modeling using Neo4j.

Enterprise NLP

I supported NLP initiatives to develop models that assist in extracting information from emails, using techniques like Named Entity Recognition, Question Answering, and Natural Language Inference based heavily on models from Hugging Face Transformers. I also evaluated various tools for extracting data from forms and documents.

Aerial Imagery

For various computer vision initiatives working with aerial imagery, I am focused on creating a sustainable and efficient pipeline for tile architecture. This includes pipeline design with async calls for fast inference, as well as image processing for thousands of addresses. The project is worth ~$1M annually.

Additional

I managed co-ops and led the deployment of an NLP model for email triage (estimated $100k+ annual value).

2020 - Current

Graduate Teaching Assistant

Harvard Extension School

I was a TA for Advanced Python for Data Science, a course in the Data Science Master's degree program at Harvard Extension School. I taught software best practices leveraging open-source tools to aspiring data scientists.

In weekly virtual sections and office hours, I covered topics like CI/CD, unit testing, task workflows, Git, containerization, APIs, and skeletons (you can find the syllabus here for more topics). I also helped improve our homework grading process by automating more of the grading with unit tests.

2019 - 2020

Senior Data Scientist

Munich Reinsurance America
MAPS - Machine Learning for Attending Physician Statements

MAPS is a project that creates underwriting predictions from physician statements. I leveraged fine-tuned question-answering BERT models from Hugging Face Transformers (BERT, BioBERT, SciBERT) for NLP, and pretrained PyTorch image models (ResNet) with Mask RCNN for computer vision. The entire project is dockerized and pipelined with Airflow (previously Metaflow) for reproducibility.

For OCR, we worked with Tesseract and OpenCV to produce higher-quality output, and experimented with Calamari and ABBYY.

Deepmatcher

I built an end-to-end deep learning entity resolution application based on the Deepmatcher project. Pipelined with Luigi, it is deployed with Flask in a Docker container, uses Azure single sign-on for security, and uses Azure Pipelines for CI/CD.

2019 - 2020

Data Scientist II

Solaria Labs - Liberty Mutual Insurance
Driver Risk Score for Ridesharing Companies

I developed a machine learning pipeline (> 100GB of data) with Dask, Luigi, and Dask-ML to model the riskiness of drivers using a tree-based XGBoost Tweedie regression model. A Bamboo CI/CD pipeline ran the unit tests and deployed the Sphinx documentation.

Alternative Data via Telematics - Measuring Car Traffic at POIs

I built an end-to-end Python ETL pipeline for processing and analyzing telematics data using Dask, Luigi, and Spark. It incorporates geohashed Hilbert points for spatial joins, density-based clustering, and geofences built from web scraping.

Other Projects

I prototyped an implementation of Mask RCNN for detecting license plates and pools from satellite imagery, and developed an experimental design framework for testing extractive summaries on diarized Amazon audio transcripts.

2017 - 2019

Actuarial Analyst

Liberty Mutual Insurance

Prior to becoming a data scientist, I was an actuarial analyst focused on Solvency II regulation.

2016 - 2017

Skills

Cloud
DevOps
  • GitHub
  • TravisCI / Bamboo
  • Docker
  • Splunk

Workflow Management
  • Luigi
  • MLflow
  • Metaflow
  • Airflow


Deep Learning
  • Hugging Face Transformers
  • Matterport's Mask RCNN
  • PyTorch
  • TensorFlow

Machine Learning
  • XGBoost
  • H2O
  • Scikit-learn

Data Analysis
  • Pandas
  • NumPy
  • Matplotlib / Seaborn
  • SQL
  • R

Data Visualization
  • Folium
  • Reveal JS
  • Power BI


Platform as a Service
  • Heroku

Graph Databases
  • Neo4j / Cypher

Big Data
  • Dask
  • Scala/Spark

Web Development
  • FastAPI
  • Flask
  • Django
  • Streamlit
  • React JS (beginner)

Education

University of California, Los Angeles

Bachelor of Science
Financial Actuarial Science Major, Statistics Minor and Specialization in Computing

Graduated as an honors student, cum laude

2013 - 2016

I have completed many LinkedIn Learning courses, such as Microservices Foundations, DevOps for Data Scientists, and Learning AWS for Developers. Please view more on my LinkedIn page. In addition, please check out my Coursera courses, also found on my LinkedIn.

Personal Projects

Here is a selection of some cool projects I've worked on. Please contact me if you want to see additional projects, and check out my GitHub.


Readings

I enjoy reading about different topics in AI, ML, and engineering. See the following for a glimpse into what I'm looking at!