Training a Toxic Comment Detector

I’m writing learning-notes from implementing a “toxic comment” detector using a convolutional neural network (CNN). This is a common project across the interwebs, however, the articles I’ve seen on the matter leave a few bits out. So, I’m attempting to augment public knowledge–not write a comprehensive tutorial.

A common omission is what the data look like as they travel through pre-processing. I’ll try to show how the data look before falling into the neural-net black-hole. However, I’ll stop short before reviewing the CNN setup, as this is explained much better elsewhere. Though, I’ve put all the original code, relevant project links, tutorial links, and other resources towards the bottom.

The Code

Code: Imports

from __future__ import print_function

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, GlobalMaxPooling1D, Conv1D, Embedding, MaxPooling1D
from keras.models import Model
from keras.initializers import Constant
import gensim.downloader as api
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score

The above code includes several packages which would need to be downloaded. The easiest way is to use pip.

pip install keras
pip install gensim
pip install pandas

Code: Variables

BASE_DIR = 'your project directory'

The above variables define the preprocessing actions and the neural-network.


The directory containing the data file train.csv


The toxic_comment data set contains comments collected from Wikipedia. MAX_SEQUENCE_LENGTH is used in the preprocessing stages to truncate a comment if too long. That is, greater than MAX_SEQUENCE_LENGTH. For example, a comment like:

You neeed to @#$ you mother!$@#$&...

Probably doesn’t need much more for the network to discern it’s a toxic comment. Also, if we create the network based around the longest comment, it will become unnecessarily large and slow. Much like the human brain (See Overchoice), we need to provide as little information as needed to make a good decision.


This variable is the maximum number of words to include–or, vocabulary size.

Much like truncating the sequence length, the maximum vocabulary should not be overly inclusive. The number 20,000 comes from a “study” stating an average person only uses 20,000 words. Of course, I’ve not found a primary source stating this–not saying it’s not out there, but I’ve not found it yet. (Halfhearted search results in the appendix.)

Regardless, it seems to help us justify keeping the NN nimble.


In my code, I’ve used gensim to download pre-trained word embeddings. But beware, not all pre-trained embeddings have the same number of dimensions. This variables defines the size of the embeddings used. Please note, if you use embeddings other than glove-wiki-gigaword-300 you will need to change this variable to match.


A helper function in Keras will split our data into a test and validation. This percentage represents how much of the data to hold back for validation.

Code: Load Embeddings

print('Loading word vectors.')
# Load embeddings
info =
embedding_model = api.load("glove-wiki-gigaword-300")

The info object is a list of gensim embeddings available. You can use any of the listed embeddings in the format api.load('name-of-desired-embedding'). One nice feature of gensim’s api.load is it will automatically download the embeddings from the Internet and load them into Python. Of course, once they’ve been downloaded, gensim will load the local copy. This makes it easy to experiment with different embedding layers.

Code: Process Embeddings

index2word = embedding_model.index2word
vocab_size = len(embedding_model.vocab)
word2idx = {}
for index in range(vocab_size):
    word2idx[index2word[index]] = index

The two dictionaries index2word and word2idx are key to embeddings.

The word2idx is a dictionary where the keys are the words contained in the embedding and the values are the integers they represent.

word2idx = {
    "the": 0,
    ",": 1,
    ".": 2,
    "of": 3,
    "to": 4,
    "and": 5,
    "blah": 12984,

index2word is a list where the the values are the words and the word’s position in the string represents it’s index in the word2idx.

index2word = ["the", ",", ".", "of", "to", "and", ...]

These will be used to turn our comment strings into integer vectors.

After this bit of code we should have three objects.

  1. embedding_model – Pre-trained relationships between words, which is a matrix 300 x 400,000.
  2. index2word – A dictionary containing key-value pairs, the key being the word as a string and value being the integer representing the word. Note, these integers correspond with the index in the embedding_model.
  3. word2idx – A list containing all the words. The index corresponds to the word’s position in the word embeddings. Essentially, the reverse of the index2word.

Code: Get Toxic Comments Labels

print('Loading Toxic Comments data.')
with open(TRAIN_TEXT_DATA_DIR) as f:
    toxic_comments = pd.read_csv(TRAIN_TEXT_DATA_DIR)

print('Getting Comment Labels.')
prediction_labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
labels = toxic_comments[prediction_labels].values

This loads the toxic_comment.csv as a Pandas dataframe called toxic_comments. We then grab all of the comment labels using their column names. This becomes a second a numpy matrix called labels.

We will use the text in the toxic_comments dataframe to predict the data found in the labels matrix. That is, toxic_comments will be our x_train and labels our y_train.

You may notice, the labels are also included in our toxic_comments. But they will not be used, as we will only be taking the comment_text column to become our sequences here in a moment.

toxic_comments dataframe

  id comment_text toxic severe_toxic obscene threat insult identity_hate
5 00025465d4725e87 Congratulations from me as well, use the tools well. · talk 0 0 0 0 0 0
6 0002bcb3da6cb337 COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK 1 1 1 0 1 0
7 00031b1e95af7921 Your vandalism to the Matt Shirvington article has been reverted. Please don’t do it again, or you will be banned. 0 0 0 0 0 0

labels (y_train) numpy matrix

0 1 2 3 4 5
0 0 0 0 0 0
1 1 1 0 1 0
0 0 0 0 0 0
0 0 0 0 0 0

Code: Convert Comments to Sequences

print('Tokenizing and sequencing text.')

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
sequences = tokenizer.texts_to_sequences(toxic_comments['comment_text'].fillna("<DT>").values)
word_index = tokenizer.word_index

print('Found %s sequences.' % len(sequences))

The Tokenizer object comes from the Keras API. It takes chunks of texts cleans them and then converts them to unique integer values.

The num_words argument tells the Tokenizer to only preserve the word frequencies higher than this threshold. This makes it necessary to run the fit() on the targeted texts before using the Tokenizer. The fit function will determine the number of occurrences each word has throughout all the texts provided, then, it will order these by frequency. This frequency rank can be found in the tokenizer.word_index property.

For example, looking at the dictionary below, if num_words = 7 all words after “i” would be excluded.

    "the": 1,
    "to": 2,
    "of": 3,
    "and": 4,
    "a": 5,
    "you": 6,
    "i": 7,
    "is": 8,
    "hanumakonda": 210334,
    "956ce": 210335,
    "automakers": 210336,
    "ciu": 210337

Also, as we are loading the data, we are filling any missing values with a dummy token (i.e., “<DT>”). This probably isn’t the best way to handle missing values, however, given the amount of data, it’s probably best to try and train the network using this method. Then, come back and handle na values more strategically. Diminishing returns and all that.

Code: Padding

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

This is an easy one. It pads our sequences so they are all the same length. The pad_sequences function is part of the Keras library. A couple of important arguments have default values: padding and truncating.

Here’s the Keras docs explanation:

padding: String, ‘pre’ or ‘post’: pad either before or after each sequence.

truncating: String, ‘pre’ or ‘post’: remove values from sequences larger than maxlen, either at the beginning or at the end of the sequences.

Both arguments default to pre.

Lastly, the maxlen argument controls where padding and truncation happen. And we are setting it with our MAX_SEQUENCE_LENGTH variable.

Code: Applying Embeddings

num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
        embedding_vector = embedding_model.get_vector(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

Here’s where stuff gets good. The code above will take all the words from our tokenizer, look up the word-embedding (vector) for each word, then add this to the embedding matrix. The embedding_matrix will be converted into a keras.layer.Embeddings object.

I think of an Embedding layer as a transformation tool sitting at the top of our neural-network. It takes the integer representing a word and outputs its word-embedding vector. It then passes the vector into the neural-network. Simples!

Probably best to visually walk through what’s going on. But first, let’s talk about the code before the for-loop.

num_words = min(MAX_NUM_WORDS, len(word_index)) + 1

This gets the maximum number of words to be addeded in our embedding layer. If it is less than our “average English speaker’s vocabulary”–20,000–we’ll use all of the words found in our tokenizer. Otherwise, the for-loop will stop after num_words is met. And remember, the tokenizer has kept the words in order of their frequency–so, the words which are lost aren’t as critical.

embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))

This initializes our embedding_matrix, which is a numpy object with all values set to zero. Note, if the EMBEDDING_DIM size does not match the size of the word-embeddings loaded, the code will execute, but you will get a bad embedding matrix. Further, you might not notice until your network isn’t training. I mean, not that this happened to me–I’m just guessing it could happen to someone.

for word, i in word_index.items():
        embedding_vector = embedding_model.get_vector(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

Here’s where the magic happens. The for-loop iterates over the words in the tokenizer object word_index. It attempts to find the word in word-embeddings, and if it does, it adds the vector to the embedding matrix at a row respective to its index in the word_index object.

Confused? Me too. Let’s visualize it.

Let’s walk through the code with a word in mind: “of”.

for word, i in word_index.items():

By now the for-loop is two words in. The words “the” and “to” have already been added. Therefore, for this iteration word = ‘of’ and i = 2.

embedding_vector = embedding_model.get_vector(word)

The the word-embedding for the word “of” is

-0.076947, -0.021211, 0.21271, -0.72232, -0.13988, -0.12234, ...

This list is contained in a numpy.array object.

embedding_matrix[i] = embedding_vector

Lastly, the word-embedding vector representing “of” gets added to the third row of the embedding matrix (the matrix index starts at 0).

Here’s how the embedding matrix should look after the word “of” is added. (The first column added for readability.)

word 1 2 3 4
the 0 0 0 0
to 0.04656 0.21318 -0.0074364 -0.45854
of -0.25756 -0.057132 -0.6719 -0.38082

Also, for a deep visualization, check the image above. The picture labeled “word embeddings” is actually the output of our embedding_matrix. The big difference? The word vectors in the gensim embedding_model which are not found anywhere in our corpus (all the text contained in the toxic_comments column) have been replaced with all zeroes.

Code: Creating Embedding Layer

embedding_layer = Embedding(len(word2idx),

Here we are creating the first layer of our NN. The primary parameter passed into the Keras Embedding class is the embedding_matrix, which we created above. However, there are several other attributes of the embedding_layer we must define. Keep in mind our embedding_layer will take an integer representing a word as input and output a vector, which is the word-embedding.

First, the embedding_layers needs to know the input dimensions. The input dimension is the number of words we are considering for this training session. This can be found by taking the length of our word2idx object. So, the len(word2idx) returns the total number of words to consider.

One note on the layer’s input, there are two “input” arguments for keras.layers.Embedding class initializer, which can be confusing. They are input and input_length. The input is the number of possible values provided to the layer. The input_length is how many values will be passed in a sequence.

Here are the descriptions from the Keras documentation:


int > 0. Size of the vocabulary, i.e. maximum integer index + 1.


Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).

In our case, the input will be the vocabulary size and input_length is the number of words in a sequence, which should be MAX_SEQUENCE_LENGTH. This is also why we padded comments shorter than MAX_SEQUENCE_LENGTH, as the embedding layer will expect a consistent size.

Next, the embedding_layers needs to know the dimensions of the output. The output is going to be a word-embedding vector, which should be the same size as the word embeddings loaded from the gensim library.
We defined this size with the EMBEDDING_DIM variable.

Lastly, the training option is set to False so the word-embedding relationships are not updated as we train our toxic_comment detector. You could set it to True, but come on, let’s be honest, are we going to be doing better than Google?

Code: Splitting the Data

nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

Here we are forming our data as inputs. We convert the data into x_train and x_val. The labels dataframe becomes y_train and y_val. And here marks the end of pre-processing.

But! Let’s recap before you click away:

  1. Load the word-embeddings. These are pre-trained word relationships. It is a matrix 300 x 400,000.
  2. Create two look up objects: index2word and word2idx
  3. Get our toxic_comment and labels data.
  4. Convert the comments column from toxic_comments dataframe into the sequences list.
  5. Create a tokenizer object and fit it to the sequences text
  6. Pad all the sequences so they are the same size.
  7. Look up the word-embedding vector for each unique word in sequences. Store the word-embedding vector in thembedding_matrix. If the word is not found in the embeddings, then leave the index all zeroes. Also, limit the embedding-matrix to the 20,000 most used words.
  8. Create a Keras Embedding layer from the embedding_matrix
  9. Split the data for training and validation.

And that’s it. The the prepared embedding_layer will become the first layer in the network.

Code: Training

Like I stated at the beginning, I’m not going to review training the network, as there are many better explanations–and I’ll link them in the Appendix. However, for those interested, here’s the rest of the code.

input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))
x = embedding_layer(input_)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 3, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
output = Dense(len(prediction_labels), activation='sigmoid')(x)
model = Model(input_, output)

print('Training model.')
# happy learning!
history =, y_train, epochs=2, batch_size=512, validation_data=(x_val, y_val))

Oh! There’s one more bit I’d like to go over, which most other articles have left out. Prediction.

Code: Predictions

I mean, training a CNN is fun and all, but how does one use it? Essentially, it comes down to repeating the steps above, but with with less data.

def create_prediction(model, sequence, tokenizer, max_length, prediction_labels):
    # Convert the sequence to tokens and pad it.
    sequence = tokenizer.texts_to_sequences(sequence)
    sequence = pad_sequences(sequence, maxlen=max_length)

    # Make a prediction
    sequence_prediction = model.predict(sequence, verbose=1)

    # Take only the first of the batch of predictions
    sequence_prediction = pd.DataFrame(sequence_prediction).round(0)

    # Label the predictions
    sequence_prediction.columns = prediction_labels
    return sequence_prediction

# Create a test sequence
sequence = ["""
            Put your test sentence here.
prediction = create_prediction(model, sequence, tokenizer, MAX_SEQUENCE_LENGTH, prediction_labels)

The function above needs the following arguments:

  • The pre-trained model. This is the Keras model we just trained.
  • A sequence you’d like to determine whether it is “toxic”.
  • The tokenizer, which is used to encode the prediction sequence the same way as the training sequences.
  • max_length must be the same as the maximum size of the training sequences
  • The prediction_labels are a list of strings containing the human readable labels for the predicted tags (e.g. “toxic”, “severe_toxic”, “insult”, etc.)

Really, the function takes all the important parts of our pre-processing and reuses them on the prediction sequence.

One piece of the function you might tweak is the .round(0). I’ve put this there to convert the predictions into binary. That is, if prediction for a sequence is .78 it is rounded up to 1. This is do to the binary nature of the prediction. Either a comment is toxic or it is not. Either 0 or 1.

Well, that’s what I got. Thanks for sticking it out. Let me know if you have any questions.


Full Code


If you want to know more about gensim and how it can be used with Keras.


The data are hosted by Kaggle.

Please note, you will have to sign-up for a Kaggle account.

Average Person’s Vocabulary Size

Primary sources on vocabulary size:

Distributing Machine Learning Jobs


A human sends machine learning job to the Boss. A Job is JSON object containing the the desired machine learning script and the parameters needed for successful execution. The Boss stores the Job and Creates an Order. The Order is another JSON object representing the state of a requested Job.

         Job #4
 0                        Boss
/|\ +----------------->   ____
/ \                       +""+
                         [ ==.]`)
                   +----+====== 0 +--+
                   +                 |
                Order #3           Job #3
                   |                 |
                Order #2           Job #2
                   |                 |
                Order #1           Job #1


The Worker uses node-scheduler to fire an HTTP request to the Boss letting it know the Worker is “bored.” The Boss will then search through the Orders for the oldest unassigned Order, if it finds one, it will return this Order to the Worker as a JSON object. At this point, the Boss updates the Order’s status to “assigned.”

The Worker sends another HTTP request, this time requesting the Job information associated with the Order the Boss had assigned.

         [ ==.]`)
   +----+====== 0 +--+
   +                 +            If the Boss finds an unassigned
Order #3           Job #3         Order it is returned. The worker requests the
   +                 +            related Job. The Boss updates the
Order #2           Job #2         the Order status to "assigned"
   +                 +                   Worker
Order #1           Job #1<-+              ____
  ^                        +----------->  +""+
  |                                       +__+
  +------------------------------------+ [ ==.]`)
          The worker checks with
          the boss periodically
          for the oldest submitted

The worker passes the Job information into the appropriate machine learning Python script via stdout. The script is executed and whether successful or not, an Outcome object is passed back to the Worker Node through stdout.

 +""+     Job #1
 +__+ +--------------->  Python Script
[ ==.]                         +
  ^                            |
  |                            |
  |                            v
  +------------------------ Outcome #1

The Worker then makes a callback API call and passes the Outcome object to the Boss to be stored in the database

          Boss                                Worker
          ____                                 ____
          +""+                                 +""+
          +__+                                 +__+
         [ ==.]`)                             [ ==.]`)
   +----+====== 0 +------+                       +
   |         |           |                       |
Order #3   Job #3     Outcome #1 <---------------+
   |         |
Order #2   Job #2
   |         |
Order #1   Job #1

MongoDB on Mac

brew install mongodb
nano /usr/local/etc/mongod.conf

Your file should look something like this

  destination: file
  path: usr/local/var/log/mongodb/mongo.log
  logAppend: true
  dbPath: /usr/local/var/mongodb

Change the dbPath to where you’d like Mongo to store your databases. Then, start and enable Mongo with brew’s services.

brew services start mongo

Sample Objects


    "_id" : "5bcc93d67f0b3f4844c87c7a",
    "jobId" : "5bcc93d67f0b3f4844c87c79",
    "createdDate" : ISODate("2018-10-21T14:57:26.980Z"),
    "status" : "unassigned",


    "_id" : ObjectId("5bcc93d67f0b3f4844c87c79"),
    "hiddenLayers" : [ 
            "activation" : "relu",
            "widthModifier" : 4,
            "dropout" : 0.2
            "activation" : "relu",
            "widthModifier" : 2.3,
            "dropout" : 0.2
            "activation" : "relu",
            "widthModifier" : 1.3,
            "dropout" : 0.2
    "dataFileName" : "wine_data.csv",
    "scriptName" : "",
    "projectName" : "wine_data",
    "depedentVariable" : "con_lot",
    "crossValidateOnly" : true,
    "crossValidationCrossingType" : "neg_mean_squared_error",
    "batchSize" : 100000,
    "epochs" : 3000,
    "patienceRate" : 0.05,
    "slowLearningRate" : 0.01,
    "loss" : "mse",
    "pcaComponents" : -1,
    "extraTreesKeepThreshd" : 0,
    "saveWeightsOnlyAtEnd" : false,
    "optimizer" : "rmsprop",
    "lastLayerActivator" : "",
    "learningRate" : 0.05,
    "l1" : 0.1,
    "l2" : 0.1,
    "minDependentVarValue" : 0,
    "maxDependentVarValue" : 1500,
    "scalerType" : "standard",


    "_id" : ObjectId("5bcc88fa7f0b3f4844c87c78"),
    "status" : 200,
    "jobId" : "5bcc724d7449f746b5aa6fe8",
    "loss" : 15109.168650257,
    "metric" : 14281.4453526111,




var express = require('express');
var bodyParser = require('body-parser');
var pythonRunner = require('./preprocessing-services/python-runner');
var schedule = require('node-schedule');
var axios = require('axios');
var fs = require('fs');
var {Worker} = require('./worker/worker');

// Get Worker Node configuration
var fs = require('fs');
var config = JSON.parse(fs.readFileSync('./python-scripts/worker-node-configure.json', 'utf8'));

if(!config) { 
    console.log('No configuration file found.')

// Boss' address
bossAddress = config.bossAddress;
nodeName = config.nodeName;
console.log(`Boss's address is ${bossAddress}`);
console.log(`This worker's name is ${nodeName}`);

var worker = new Worker('bored');

// Start server and add Middleware
var app = express();
const port = 3000;

// Start checking for Boredom
var j = schedule.scheduleJob('*/1 * * * *', function(){
    if (worker.status === 'bored') {
        console.log('Worker is bored.');
            method: 'post',
            url: bossAddress + `/bored/${nodeName}`
        }).then((response) => {
            let orderId =
            let jobId =;
            console.log(`Boss provided jobID #${jobId}`);
                method: 'get',
                url: bossAddress + `/retrieve/job/${jobId}`
            }).then((response) => {
                let job =;
                console.log(`Worker found the details for jobID #${jobId}`);
                job.callbackAddress = bossAddress;
                job.assignmentId = orderId;
                pythonRunner.scriptRun(job, worker)
                .then((response) => {
                    console.log('Worker started job, will let Boss know when finished.');
            }).catch((error) => {
        }).catch((error) => {
            console.log('Failed to find new job.')

// Python script runner interface'/scripts/run', (req, res) => {
    try {
        let pythonJob = req.body;
        .then((response) => {
    } catch (err) {

app.listen(port, () => {
    console.log(`Started on port ${port}`);


let {PythonShell} = require('python-shell')
var fs = require('fs');
var path = require('path');
var axios = require('axios');

var scriptRun = function(pythonJob, worker){
    worker.status = 'busy';
    return new Promise((resolve, reject) => {
        try {
            let callbackAddress = pythonJob.callbackAddress;
            let options = {
                mode: 'text',
                pythonOptions: ['-u'], // get print results in real-time
                scriptPath: path.relative(process.cwd(), 'python-scripts/'),
                args: [JSON.stringify(pythonJob)]
  , options, function (err, results) {
                if (err) throw err;
                try {
                    result = JSON.parse(results.pop());
                    if(result) {
                        console.log(callbackAddress + '/callback')
                            method: 'post',
                            url: callbackAddress + '/callback',
                            data: result
                        }).then((response) => {
                            console.log(`Worker let let the Boss know job is complete.`);
                            worker.status = 'bored';
                        }).catch((error) => {
                            worker.status = 'bored'
                    } else {
                        worker.status = 'bored'
                } catch (err) {
                   worker.status = 'bored'
            resolve({'message': 'job started'});
        } catch (err) {
            worker.status = 'bored'
module.exports = {scriptRun}



const express = require('express');
const bodyParser = require('body-parser');
const axios = require('axios');
var timeout = require('connect-timeout')

const {mongoose} = require('./backend/database-services/dl-mongo');
const workerNode = require('./backend/services/worker-node');
const work = require('./backend/services/work');

// Database collection
var {Job} = require('./backend/database-services/models/job');
var {Order} = require('./backend/database-services/models/order');

const bossAddress = ''

// Server setup.
var app = express();
const port = 3000;

// Add request parameters.
app.use((req, res, next) => {
    res.setHeader('Access-Control-Allow-Origin', '*');
                  'Origin, X-Requested-With, Content-Type, Accept'); 
    res.setHeader('Access-Control-Allow-Methods', 'GET, POST, PUT, PATCH, DELETE, OPTIONS');

// Add the middleware.

This route is for creating new Jobs on the queue
*/'/job/:method', (req, res) => {
    if (!req.body) { return { 'message': 'No request provided.' }};
    try {
        switch (req.params.method) {
            case 'create':
                .then((response) =>{
                }).catch((error) => {
                    res.send({'error': error })
                res.send({'error': 'No method selected.'})
    } catch (err) {
        res.send({'error': 'Error with request shape.', err})

This route is for adding new WorkerNodes to the database.
*/'/worker-node/:method', (req, res) => {
    if (!req.body) { return { 'message': 'No request provided.' }};
    try {
        switch (req.params.method) {
            case 'create':
                .then((response) =>{
                }).catch((error) =>{
                    res.send({'error': error.message});
                throw err;
    } catch (err) {
        res.send({'error': 'Error with request shape.', err })
});'/callback', (req, res) => {
    if (!req.body) { return { 'message': 'No request provided.' }};
    let outcome = req.body;
    try {
        .then((response) =>{
    } catch (err) {
        res.send({'error': 'Error with request shape.', err })

Route for Worker Node to let the Boss know it needs a Job.
The oldest Job which is unassigned is provided.
*/'/bored/:id', (req, res) => {
    if (!req.body) { return { 'message': 'No request provided.' }};
    try {
        let workerNodeId =;
        console.log(`${workerNodeId} said it's bored.`);
        if (!workerNodeId) { throw {'error': 'No id provided.'}}
        Order.findOne({ status: 'unassigned' }, {}, { sort: { 'created_at' : -1 } }, (err, order) => {
            console.log(`Found a work order, #${order._id}`)
            order.status = 'assigned';
            console.log(`Provided ${workerNodeId} with ${order.jobId}`);
            .then((doc) => {
                console.log(`Updated the Order #${}'s status to ${order.status}`);
        .catch((err) => {
            res.send({'message': `No work to do.  Don't get used to it.`})
    } catch (err) {
        res.send({'error': 'Error with request shape.', err })

Retrieve Orders or Job
app.get('/retrieve/:type/:id?/:param1?', (req, res) => {
    if (!req.body) { return { 'message': 'No request provided.' }};
    try {
        let type = req.params.type;
        let id =;
        let param1 = req.params.param1;
        switch(type) {
            case 'order':
                Order.find().then((response) => {
            case 'job':
                if (!id)  { throw {'error': 'Missing Id'} }
                Job.findOne({'_id': id })
                .then((response) => {
                throw error
    } catch (err) {
        res.send({'error': 'Error with request shape.', err })

app.listen(port, () => {
    console.log(`Started on port ${port}`);
Using Python, NodeJS, Angular, and MongoDB to Create a Machine Learning System

I’ve started designing a system to manage data analysis tools I build.

  1. An illegitimate REST interface
  2. Interface for existing Python scripts
  3. Process for creating micro-services from Python scripts
  4. Interface for creating machine learning jobs to be picked up my free machines.
  5. Manage a job queue for work machines to systematically tackle machine learning jobs
  6. Data storage and access
  7. Results access and job meta data
  8. A way to visualize results

I’ve landed on a fairly complicated process of handling the above. I’ve tried cutting frameworks, as I know it’ll be a nightmare to maintain, but I’m not seeing it.

  • Node for creating RESTful interfaces between the HQ Machine and the Worker Nodes
  • Node on the workers to ping the HQ machine periodically to see if their are jobs to run
  • MongoDB on the HQ Machine to store the job results data, paths to datasets, and possibly primary data
  • Angular to interact with the HQ Node for creating job creation and results viewing UI.
  • ngx-datatables for viewing tabular results.
  • ngx-charts for viewing job results (e.g., visualizing variance and linearity )
  • Python for access to all the latest awesome ML frameworks
  • python-shell (npm) for creating an interface between Node and Python.

Utilizing all Machines in the House

Machine learning is a new world for me. But, it’s pretty dern cool. I like making machines do the hard stuff while I’m off doing other work. It makes me feel extra productive–like, “I created that machine, so any work it does I get credit for. And! The work I did while it as doing its work.” This is the reason I own two 3D-printers.

I’m noticing there is a possibility of utilizing old computers I’ve lying around the house for the same effect. The plan is to abstract a neural network script, install it on all the computers lying about, and create a HQ Computer where I can create a sets of hyperparameters passed to the Worker Nodes throughout the house.

Why? Glad I asked for you. I feel guilty there are computers used. There’s an old AMD desktop with a GFX1060 in it, a 2013 MacBook Pro (my son’s), and my 2015 MacBook Pro. These don’t see much use anymore, since my employer has provided an iMac to work on. They need to earn their keep.

How? Again, glad to ask for you. I’ll create a system to make deep-learning jobs from hyperparameter sets and send them to these idle machines, thus, trying to get them to solve problems while I’m working on paying the bills. This comes from the power of neural networks. They need little manual tweaking. You simply provide them with hyperparameters and let them run.

Here are the napkin-doodles:

|                                                            |
|        ____                   ____      Each machine runs  |
|        |""|                   |""|      Node and Express   |
|  HQ    |__|             #1    |__|      server, creating   |
|       [ ==.]`)               [ ==.]`)   routes to Python   |
|       ====== 0               ====== 0   scripts using      |
|  The HQ machine runs          ____      stdin and stdout   |
|  Node and Express, but        |""|                         |
|  the routes are for     #2    |__|                         |
|  storing results in a        [ ==.]`)                      |
|  database.                   ====== 0                      |
|                               ____                         |
|                               |""|                         |
|                         #3    |__|        Worker           |
|                              [ ==.]`)     Nodes            |
|                              ====== 0                      |
|                                                            |
|                 Each worker Node checks         Workers    |
|        ____    with HQ on a set interval         ____      |
|        |""|       for jobs to run                |""|      |
|  HQ    |__|   <--------------------------+ #1    |__|      |
|       [ ==.]`)                                  [ ==.]`)   |
|       ====== 0                                  ====== 0   |
|       ^ |                                        ____      |
|       | |                                  #2    |""|      |
|       | +--------------------------------------->|__|      |
|       |             If there is a job, the      [ ==.]`)   |
|       |             Worker will send a GET      ====== 0   |
|       |              request for the job         ____      |
|       |                  parameters              |""|      |
|       |                                    #3    |__|      |
|       +-----------------------------------------[ ==.]`)   |
|         Once completed, the Worker updates HQ   ====== 0   |
|              with the job results.                         |

Worker Nodes

The Worker Nodes code is pretty straightforward. It uses Node, Express, and python-shell to create a bastardized REST interface to create simple interactions between the HQ Node controlling the job queue.

Node Side

Here’s the proof-of-concept NodeJS code.

var express = require('express');
var bodyParser = require('body-parser');
var pythonRunner = require('./preprocessing-services/python-runner');

var app = express();
const port = 3000;


// Python script runner interface'/scripts/run', (req, res) => {
    try {
        let pythonJob = req.body;
        .then((response, rejection) => {
    } catch (err) {

app.listen(port, () => {
    console.log(`Started on port ${port}`);

The above code is a dead simple NodeJS server using Express. It is using body-parser middleware to shape JSON objects. The pythonJob object looks something like this (real paths names have been changed to help protect their anonymity).

    "scriptsPath": "/Users/hinky-dink/dl-principal/python-scripts/",
    "scriptName": "",
    "jobParameters": {
    	"dataFileName": "",
        "dataPath": "/Users/hinky-dink/bit-dl/data/lot-data/wine_encoded/",
        "writePath": "/Users/hinky-dink/bit-dl/data/lot-data/wine_encoded/",
        "execution": {
        	"dataFileOne": "wine_2017_encoded.csv",
        	"dataFileTwo": "wine_2018_encoded.csv",
        	"outputFilename": "wine_17-18.csv"

Each of these attributes will be passed to the Python shell in order to execute They are passed to the shell as system arguments.

Here’s the python-runner.js

let {PythonShell} = require('python-shell')
var scriptRun = function(pythonJob){    
    return new Promise((resolve, reject) => {
        try {
            let options = {
                mode: 'text',
                pythonOptions: ['-u'], // get print results in real-time
                scriptPath: pythonJob.scriptsPath,
                args: [pythonJob.jobParameters.dataFileName, 
  , options, function (err, results) {
                if (err) throw err;
                try {
                    result = JSON.parse(results.pop());
                    if(result) {
                    } else {
                        reject({'err': ''})
                } catch (err) {
                    reject({'error': 'Failed to parse Python script return object.'})
        } catch (err) {
module.exports = {scriptRun}

Python Side

Here’s the Python script in the above example. It is meant to detect what type of data is in a table. If it’s is continuous it leaves it alone (I’ll probably add normalization option as some point), if it is categorical, it converts it to a dummy variable. It then saves this encoded data on the Worker Node side (right now). Lastly, it returns a JSON string back to the node side.

Created on Mon Jun 11 21:12:10 2018
@author: cthomasbrittain

import sys
import json
filename = sys.argv[1]
filepath = sys.argv[2]
pathToWriteProcessedFile = sys.argv[3]

request = sys.argv[4]
request = json.loads(request)

    cols_to_remove = request['columnsToRemove']
    unreasonable_increase = request['unreasonableIncreaseThreshold']
    # If columns aren't contained or no columns, exit nicely
    result = {'status': 400, 'message': 'Expected script parameters not found.'}

pathToData = filepath + filename

# Clean Data --------------------------------------------------------------------
# -------------------------------------------------------------------------------

# Importing data transformation libraries
import pandas as pd

# The following method will do the following:a
#   1. Add a prefix to columns based upon datatypes (cat and con)
#   2. Convert all continuous variables to numeric (float64)
#   3. Convert all categorical variables to objects
#   4. Rename all columns with prefixes, convert to lower-case, and replace
#      spaces with underscores.
#   5. Continuous blanks are replaced with 0 and categorical 'not collected'
# This method will also detect manually assigned prefixes and adjust the 
# columns and data appropriately.  
# Prefix key:
# a) con = continuous
# b) cat = categorical
# c) rem = removal (discards entire column)

def add_datatype_prefix(df, date_to_cont = True):    
    import pandas as pd
    # Get a list of current column names.
    column_names = list(df.columns.values)
    # Encode each column based with a three letter prefix based upon assigned datatype.
    # 1. con = continuous
    # 2. cat = categorical
    for name in column_names:
        if df[name].dtype == 'object':
                df[name] = pd.to_datetime(df[name])
                    new_col_names = "con_" + name.lower().replace(" ", "_").replace("/", "_")
                    df = df.rename(columns={name: new_col_names})
                    new_col_names = "date_" + name.lower().replace(" ", "_").replace("/", "_")
                    df = df.rename(columns={name: new_col_names})                    
            except ValueError:
    column_names = list(df.columns.values)
    for name in column_names:
        if name[0:3] == "rem" or "con" or "cat" or "date":
        if df[name].dtype == 'object':
            new_col_names = "cat_" + name.lower().replace(" ", "_").replace("/", "_")
            df = df.rename(columns={name: new_col_names})
        elif df[name].dtype == 'float64' or df[name].dtype == 'int64' or df[name].dtype == 'datetime64[ns]':
            new_col_names = "con_" + name.lower().replace(" ", "_").replace("/", "_")
            df = df.rename(columns={name: new_col_names})
    column_names = list(df.columns.values)
    # Get lists of coolumns for conversion
    con_column_names = []
    cat_column_names = []
    rem_column_names = []
    date_column_names = []
    for name in column_names:
        if name[0:3] == "cat":
        elif name[0:3] == "con":
        elif name[0:3] == "rem":
        elif name[0:4] == "date":
    # Make sure continuous variables are correct datatype. (Otherwise, they'll be dummied).
    for name in con_column_names:
        df[name] = pd.to_numeric(df[name], errors='coerce')
        df[name] = df[name].fillna(value=0)
    for name in cat_column_names:
        df[name] = df[name].apply(str)
        df[name] = df[name].fillna(value='not_collected')
    # Remove unwanted columns    
    df = df.drop(columns=rem_column_names, axis=1)
    return df

# ------------------------------------------------------
# Encoding Categorical variables
# ------------------------------------------------------

# The method below creates dummy variables from columns with
# the prefix "cat".  There is the argument to drop the first column
# to avoid the Dummy Variable Trap.
def dummy_categorical(df, drop_first = True):
    # Get categorical data columns.
    columns = list(df.columns.values)
    columnsToEncode = columns.copy() 

    for name in columns:
        if name[0:3] != 'cat':          

    # if there are no columns to encode, return unmutated.
    if not columnsToEncode:
        return df

    # Encode categories
    for name in columnsToEncode:

        if name[0:3] != 'cat':

        tmp = pd.get_dummies(df[name], drop_first = drop_first)
        names = {}
        # Get a clean column name.
        clean_name = name.replace(" ", "_").replace("/", "_").lower()
        # Get a dictionary for renaming the dummay variables in the scheme of old_col_name + response_string
        if clean_name[0:3] == "cat":
            for tmp_name in tmp:
                tmp_name = str(tmp_name)
                new_tmp_name = tmp_name.replace(" ", "_").replace("/", "_").lower()
                new_tmp_name = clean_name + "_" + new_tmp_name
                names[tmp_name] = new_tmp_name
        # Rename the dummy variable dataframe
        tmp = tmp.rename(columns=names)
        # join the dummy variable back to original dataframe.
        df = df.join(tmp)
    # Drop all old categorical columns
    df = df.drop(columns=columnsToEncode, axis=1)
    return df

# Read the file
df = pd.read_csv(pathToData)

# Drop columns such as unique IDs
    df = df.drop(cols_to_remove, axis=1)
    # If columns aren't contained or no columns, exit nicely
    result = {'status': 404, 'message': 'Problem with columns to remove.'}
# Get the number of columns before hot encoding
num_cols_before = df.shape[1]

# Encode the data.
df = add_datatype_prefix(df)
df = dummy_categorical(df)

# Get the new dataframe shape.
num_cols_after = df.shape[1]

percentage_increase = num_cols_after / num_cols_before

result = ""

if percentage_increase > unreasonable_increase:
    message = "\"error\": \"Feature increase is greater than unreasonableIncreaseThreshold, most likely a unique id was included."
    result = {'status': 400, 'message': message}
    filename = filename.replace(".csv", "")
    import os
    if not os.path.exists(pathToWriteProcessedFile):
    writeFile = pathToWriteProcessedFile + filename + "_encoded.csv"
    df.to_csv(path_or_buf=writeFile, sep=',')
    # Process the results and return JSON results object
    result = {'status': 200, 'message': 'encoded data', 'path': writeFile}

That’s the premise. I’ll be adding more services to as a series of articles.

Recording Brain Waves -- Mongo Database with a NodeJS API

Saving Brain Waves to Remote MongoDB by way of Node REST API

In this section I’m going to focus getting a remote Linux server setup with MongoDB and NodeJS. This will allow us to make POST requests to our Linux server, saving the EEG data.

I’m going to assume you are able to SSH into your Ubuntu 16 LTS server for this guide. You don’t have a server? No sweat. I wrote a guide on setting up a blog post which explains how to get a cheap Linux server setup.

1. Install MongoDB

SSH into your server. I’m assume this is a fresh new Linux install. Let’s start with upgrading the packages.

sudo apt-get update -y

I’ll be following the Mongo website for instructions on installing MonogoDB Community version on Ubuntu.

Let’s get started. Add the Debian package key.

sudo apt-key adv --keyserver hkp:// --recv 9DA31620334BD75D9DCB49F368818C72E52529D4

We need to create a list file.

echo "deb [ arch=amd64,arm64 ] xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list

Now reload the database

sudo apt-get update

If you try to update and run into this error

E: The method driver /usr/lib/apt/methods/https could not be found.
N: Is the package apt-transport-https installed?
E: Failed to fetch  
E: Some index files failed to download. They have been ignored, or old ones used instead.

Then install apt-transport-https

sudo apt-get install apt-transport-https

Now, let’s install MongoDB.

sudo apt-get install -y mongodb-org


2. Setup MongoDB

We still need to do a bit of setup. First, let’s check and make sure Mongo is fully installed.

sudo service mongod start

This starts the MongoDB daemon, the program which runs in the background and waits for someone to make connection with the database.

Speaking of which, let’s try to connect to the database


You should get the following:

root@localhost:~# mongo
MongoDB shell version v4.0.2
connecting to: mongodb://
MongoDB server version: 4.0.2
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
Questions? Try the support group
Server has startup warnings:
2018-09-02T03:52:18.996+0000 I STORAGE  [initandlisten]
2018-09-02T03:52:18.996+0000 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2018-09-02T03:52:18.996+0000 I STORAGE  [initandlisten] **          See
2018-09-02T03:52:19.820+0000 I CONTROL  [initandlisten]
2018-09-02T03:52:19.820+0000 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2018-09-02T03:52:19.820+0000 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2018-09-02T03:52:19.820+0000 I CONTROL  [initandlisten]
Enable MongoDB's free cloud-based monitoring service, which will then receive and display
metrics about your deployment (disk utilization, CPU, operation statistics, etc).

The monitoring data will be available on a MongoDB website with a unique URL accessible to you
and anyone you share the URL with. MongoDB may use this information to make product
improvements and to suggest MongoDB products and deployment options to you.

To enable free monitoring, run the following command: db.enableFreeMonitoring()
To permanently disable this reminder, run the following command: db.disableFreeMonitoring()

This is good. It means Mongo is up and running. Notice, it is listening on If you try to access the database from any network, other than locally, it will refuse. The plan, to have NodeJS connect to the MongoDB database locally. Then, will send all of our data to Node and let it handle security.

In the Mongo command line type:


And hit enter. This should bring you back to the Linux command prompt.

A few notes on MongoDB on Ubuntu.

  • The congfiguration file is located at /etc/mongod.conf
  • Log file is at /var/log/mongodb/mongod.log
  • The database is stored at /var/lib/mongodb, but this can be changed in the config file.

Oh, and one last bit. Still at the Linux command prompt type:

sudo systemctl enable mongod

You should get back

Created symlink from /etc/systemd/system/ to /lib/systemd/system/mongod.service.

This setup a symlink which will cause Linux to load mongod every time it boots–you won’t need to manually start it.

Next, NodeJS.

3. Install NodeJS and npm


sudo apt-get install nodejs -y

This should install NodeJS, but we also need the Node Package Managers npm.

sudo apt-get install npm -y

Let’s upgrade npm. This is important, as the mind-wave-journal-server depends on recent versions of several packages that are not accessible to earlier versions of npm.

The following commands should prepare npm for upgrading, then upgrade.

sudo npm cache clean -f
sudo npm install -g n
sudo n stable
sudo n latest

Let’s reboot the server to make sure all of the upgrades are in place.

sudo reboot now

When the server boots back up, ssh back in.

Check and make sure your mongod is still running


If mongo doesn’t start, then revisit step 2.

Let’s check our node and npm versions.

node -v

I’m running node v10.9.0

npm -v

I’m running npm v6.2.0

4. Clone, Install, and Run the mind-wave-journal-server

I’ve already created a basic Node project, which we’ll be able to grab from my Github account.

If you don’t already have git installed, let’s do it now.

sudo apt-get install git -y

Now, grab the Noder server I built.

git clone
cd mind-wave-journal-server/

Install all the needed Node packages.

npm install

This should download all the packages needed to run the little server program I wrote to store the EEG data into the Mongo database.

Let’s run the mind-wave-journal-server.

node server/server.js

This should be followed with:

root@localhost:~/mind-wave-journal-server# node server/server.js
(node:1443) DeprecationWarning: current URL string parser is deprecated, and will be removed in a future version. To use the new parser, pass option { useNewUrlParser: true } to MongoClient.connect.
Started on port 8080

5. Testing mind-wave-journal-server with Postman

Now, we are going to use Postman to test our new API.

For this next part you’ll need either a Mac or Chrome, as Postman has a native Mac app or a Chrome app.

I’m going to show the Chrome application.

Head over to the Chrome app store:

After you add the Postman app it should redirect you to your Chrome applications. Click on the Postman icon.

Your choice, but I skipped the sign-up option for now.

Select Create a Request

The purpose of Postman, in a nutshell, we are going to use it to create POST requests and send them to the mind-wave-journal-server to make sure it’s ready for the iOS app to start making POST requests, saving the EEG data to our Mongo server.

Let’s create our first test POST request. Start by naming the request Test eegsamples. Create a folder to put the new request in, I named it mind-wave-journal-server. Then click

You will need to set the type as POST. The url will be


No select the Headers section and add the Content Type: application/json

Lastly, select Body, then raw and enter the following JSON into the text area:


And then! Hit Send

If all goes well, then you should get a similar response in the Postman response section

Notice, the response is similar to what we sent. However, there is the additional _id. This is great. It is the id assigned to the by MongoDB when the data is entered. In short, it means it successfully saved to the database.

6. Now What?

Several caveats.

First, each time you restart your server you will manually need to start your mind-waver-journal-server. You can turn it into a Linux service and enable it. If this interests anyone, let me know in the comments and I’ll add it.

Second, notice I don’t currently have a way to retrieve data from the MongDB. The easiest way will probably be using Robot 3T. Like the first caveat, if anyone is interested let me know and I’ll add instructions. Otherwise, this series will stay on track to setup a Mongo BI connection to the database for viewing in Tableau (eh, gross).

Your Node server is ready to be called by the iOS app. In the next article I’ll return to building the MindWaveJournal app in iOS.