Joshua Banks Mailman, Ph.D.

Data Scientist & ML Engineer - Language A.I., NLP / LLMs / Generative AI / AI Safety

New York, New York, United States

About

I am applying the constructive, analytical, and presentational skills I honed for years in academia to practical Data Science problems in business and beyond. Currently I work for the vendor Grid Dynamics, embedded in an AI/ML team within a FAANG company as an AI researcher / Staff ML Engineer / Staff Data Scientist, working with LLMs. Prior to that I worked for Northwestern Mutual as a Senior Data Scientist. I joined Northwestern Mutual as a Data Scientist in 2021 after completing the highly rated and selective 3-month full-time immersive bootcamp in Data Science at Metis.

Computer-assisted data analysis was part of my routine throughout my academic career, which includes a PhD in music theory and analysis as well as teaching music at Columbia University. My numerous solo-authored, blind-peer-reviewed articles and book chapters focus on the music-analysis equivalent of feature engineering (presented at the Institut de Recherche et Coordination Acoustique/Musique, IRCAM, in Paris in 2019) and on inventing and engineering interactive audio-visual algorithmic music systems (using wireless sensors and infrared video), which I demonstrated in a performance at the New York Philharmonic Biennial / NYC Electroacoustic Music Festival in 2016 and in the Leonardo Electronic Almanac vol. 19(2): Live Visuals (https://tinyurl.com/4hyvxwhx).

My most up-to-date technical skills center on Artificial Intelligence, LLMs/NLP/Generative AI (Hugging Face Transformers library, Retrieval Augmented Generation, aka RAG) as well as Machine Learning in Python, including the following packages and tools: Pandas, NumPy, Matplotlib, Scikit-learn, SciPy, Statsmodels, Seaborn, NLTK, SpaCy, Gensim, TensorFlow, Beautiful Soup, Jupyter notebooks, SQL, Excel, Dask, Streamlit, Tableau.

Concepts/techniques applied in Data Science projects in bootcamp: Linear regression, Logistic regression, Feature engineering (polynomial, log, etc.), Cross-validation, Regularization, Lowess/loess, B-spline, OHE, Random forest, KNN, XGBoost, PCA/SVD, Oversampling, TF-IDF, Levenshtein distance, LDA/NMF, Word2vec (word embedding), Scattertext, Wordcloud, Computer vision, Neural networks.

Other languages and technologies used in previous projects: C/C++, Java, JavaScript, Max/MSP, OSC, OSCeleton, OpenNI.

Experience

Staff Machine Learning Engineer / Data Scientist at Grid Dynamics
Apr 2024 - Present · 2 yrs 3 mos
Consultant of Generative AI at BNY Mellon
Feb 2024 - Apr 2024 · 3 mos
Part-time engagement:
• Coaching the editorial team to develop and optimize prompting strategies for generating promotional texts for social media and cover images for articles
• Developing and refining POC prototypes of RAG-based chatbots (using external documents) to demonstrate crowdsourcing and other possibilities for long-term wins for the editorial team
Northwestern Mutual (Full-time · 2 yrs 6 mos)
- Sr. Data Scientist (Strategy, Partnership, & Innovation: Innovation & Discovery)
  Oct 2022 - Jan 2024 · 1 yr 4 mos
- Data Scientist
  Aug 2021 - Oct 2022 · 1 yr 3 mos
  Innovation & Discovery Team, in NM's New York Corporate Office
Data Scientist in training (12-week immersive bootcamp in data science) at Metis
Jan 2021 - Mar 2021 · 3 mos
Experiments in captioning, 'Wit a twist: The Amusemater Captioner Web app lets ML dip its toes into waters of witty wordplay, inventing image captions to make you smile. https://tinyurl.com/6wumzunt
Used a Neural Network image classifier joining hands with NLP to algorithmically dance right to the verbographic punchline.
Data: 5500 images from ImageNet, 1500 English idioms from https://7esl.com, and corpora: SentiWordNet, Gensim Text8, and GoogleNews.
Pretrained Neural Network (Xception) used for image classification; Gensim Word2Vec word embedding used to identify words related to the image label; Scikit-learn to make a TF-IDF correlation matrix between the SentiWordNet corpus and ImageNet labels to identify words related to the image label; NLTK and other NLP packages (Pronouncing, Phonetics, and English-to-IPA) to identify similar-sounding words (rhymes, assonances, etc.).
Result: A Streamlit app that takes any image, classifies its content, and algorithmically invents a (witty) caption indirectly relating to the image’s content.
New ways to read 'Language on a Holiday' (2/21)
Unsupervised Natural Language Processing (NLP) machine learning to distinguish individualized prose style & varying topics in philosophical texts.
Texts: three complete books drawn from Gutenberg.org: Hume's Enquiry Concerning Human Understanding, James's Pluralist Universe, and Whitehead's Process and Reality.
Leveraged NLTK, Scikit-learn, SpaCy, NumPy, Pandas, and Matplotlib. The algorithms deployed include Non-negative Matrix Factorization (NMF) topic modeling, a feature-engineered (modified) TF-IDF (term frequency - inverse document frequency) matrix to model prose style, and Scattertext and spline graph visualizations.
Result: A computational method to distinguish prose styles and to isolate self-explanatory & idiosyncratic prose passages.
To cocktail or not to (1/21-2/21)
Classification of food ingredients to be or not to be in a cocktail.
Logistic regression, XGBoost, Oversampling, PCA, Dask, Beautiful Soup
Freelance at Freelance
Sep 2019 - Dec 2020 · 1 yr 4 mos
• Private lessons in music composition & theory
• Lectures in music theory, composition, & interactive music technology for university/college seminars & academic conferences held online
• Private coaching in Python programming and basic data manipulation technologies