Setup a Local MySQL Database

The last two articles have been getting oriented to SQL, however, the information in them will disappear quickly if we don’t give you a way to practice on data meaningful to you. Let’s face it, as much fun as it is to find out random employees salaries, those don’t mean anything to you.

This article will show you how to setup a copy of MySQL Server on your PC, connect to it, load data from a CSV, and query those data. There’s a lot to get done, so let’s get started.

Local MySQL Server Setup

Each of the three operating systems are a little different on how you must go about setting up a local copy of MySQL Server. Unfortunately, Windows is the most complex. Anyway, feel free to skip to the appropriate section


First, download the MySQL MSI Installer.


After you’ve downloaded it, open the the file.


If you are prompted to “Upgrade” go ahead and say “Yes”


The installer is a bit confusing, but don’t worry, most everything is fine left on its default.

Click on MySQL Server then the Add button. Add “MySQL Server” and “Connector/OBDC x64.” Then click “Next.” You will see a Installation summary, click on “Execute” and wait for the download to finish and then install wizard to begin.

As I stated, most of the install wizard questions we will leave as default.





On the “Accounts and Roles” section you will need to decide on your password for the SQL Server on your local PC. I obviously didn’t pick a great one. MySQL Server will automatically setup a user with the name of root and a password you set here. This root user will permissions to do anything to the server.



Execute the installer and let it finish. mysql-windows-installer-download

Once it finishes you should now have MySQL Server installed on your local PC. Skip to the last section to test it out.


Mac’s a bit simpler.

Download the .dmg installer. mysql-windows-installer-download

Click on “No thanks, just start my download” and when the download is finished, double click on it. mysql-windows-installer-download

Double click on the installer. You will need to enter your system password for the installer to setup MySQL Server, but you will also need to provide the MySQL Server root user a password. Don’t consfuse the two, unless you plan for them to be the same.


Once it finishes you should now have MySQL Server installed on your Mac. Skip to the last section to test it out.

Testing your Local SQL Server

Go ahead and open MySQL Workbench and let’s connect to this new local server.

Click on the “New Connection” icon and leave everything default, except the “Connection Name,” here enter localhost. mysql-windows-installer-download

Double click on the new connection and enter the password you created during installation. Voila!

Let’s run a command to make sure everything is working.

SHOW databases;

You should see: mysql-windows-installer-download

Loading CSV

Create a database

CREATE DATABASE name_of_your_database;

Before we create a table, make sure we are using the created datebase.

USE name_of_your_database

Now, we need to go over a bit of boring stuff before we get to loading the CSV. Sorry, I’ll try to keep it brief.


In SQL, every field has something called a “datatype.” You can think of a datatype as a tag on your data tell the computer how to read them.

Ultimately, a computer can’t make sense of any human-words. It has to convert everything into 0 and 1 before it understand its. If this conversion was left up to the computer entirely, it might see a word and say, “Oh, yah, this is one of those French words,” when it is actually English, thus, the conversion to 0 and 1s are incorrect.

You may have encountered this in a spreadsheet. If you open a spreadsheet and see something like


The data I actually provided the spreadsheet were:


Notice the zero in front of 6579, this was due to the computer saying, “Oh, these data look like numbers–and since the human didn’t tell me otherwise, I’m going to treat them like numbers. And, well, it is perfectly valid to drop the leading zero of a number.”

I wish all datatypes were this simple, however, the above example is about is simple as it gets. We can try to skip over a lot of nuances of datatypes and focus on the three we will probably see the most:

  • DATE
  • TIME
  • INT (short of integer)
  • CHAR (short for character)

Here are what samples of the above data would look like in a spreadsheet:

2019-10-01 2019-10-01 12:01:22 42 42.4 The answer to it all.


Dates are pretty straightforward, they store a year, month, and day as a number. However, when we retrieve this number it is put in the human readable format listed above.


Time is exactly like DATE, but it also includes hours, minutes, and seconds (sometimes milliseconds).


An INT stores a number no bigger than 2,147,483,647. However, one thing an INT cannot do is store a partial numbers. For example, if we try to store 0.5 in an INT field it will probably get converted to 1.


FLOATs fill in where INTS fail. That is, a FLOAT store only up to the precision you specifiy. For example, if we tried to store a 0.5 in a FLOAT with two precision points we’d be fine. However, if we tried to store 0.4567 in a FLOAT with only two precision points, then it would be converted to 0.46, or rounded up.


CHAR is meant to store human readable text. When you put data into a CHAR field, the SQL program knows this is human readable information and doesn’t try to figure it out at all. It leaves it literally as it is. This is why CHARS are known as “literals.” They are also called “strings,” because the computer seems them as a bunch of characters strung together.

SQL Datatypes

In SQL there are a lot of datatypes, however, some you may never need to use. One way SQL is a bit different than a spreadsheet is it wants to know ahead of time the size it needs to make the field.

CHAR Revisited

This will mainly impact us when dealing with CHAR. When the SQL program creates a CHAR field it wants to know the maximum number of characters which will ever go into the field.

For example:

  • CHAR(19) could hold the following: <-------19-------->
  • CHAR(5) could hold the following: <-5->

One important note, if you put a single character in a CHAR(5) field, then the SQL program will fill in the other four characters with a NULL. In short, a CHAR field will always be full.


There is another type of character field which allows you to put more or less data than was decided at the beginning. The VARCHAR datatype stands for “variable character” field. It will allow you to store up to 65,535 characters on MySQL. This is around 3 pages of text.


Why have CHAR at all? Shouldn’t we always use VARCHAR for everything just in case? Well, usually, but not always.

Often, when you design a database you want to make it as efficient as possible (I mean, it’s going to be successful business product, right?). The maximum size of the data a human will try to stuff in the field is important to the SQL program, as it tries to store data in such a way it minimizes space used and maximizes efficiency in retrieving the data.

In short, CHAR has a few advantages for your database. And take Social Security Numbers, if your database has to store these data then it should probably be a CHAR as these data have historically been 9 characters (11 if you include dashes).

Pop quiz, why don’t we store a Social Security Number as an INT?

Creating the Table

Ok, I’ve put you through a crash course of datatypes to get you to this point.

We are going to:

  1. Create a database called tasksDB
  2. Active tasksDB
  3. Create a table on tasksDB, setting the fields datatype
  4. Then import a CSV into this table
  5. Lastly, we will write a query against the table

Ready!? Let’s do it!

Creating Database

Open Workbench, type, and run the following:


SHOW databases;

USE tasksDB;

SELECT * FROM tasks;

LOAD DATA INFILE './task.csv'  INTO TABLE tasks

SELECT * FROM tasks;

    title VARCHAR(255) NOT NULL,
    start_date DATE,
    due_date DATE,
    status TINYINT NOT NULL,
    priority TINYINT NOT NULL,
    description TEXT,
    PRIMARY KEY (task_id)

Understanding the MySQL Query

Welcome back! Alright, now we know how to connect to a remote server from within MySQL Workbench, let’s start writing some queries.

Here’s a common SQL query:

    SELECT e.emp_no,
      FROM employees AS e
 LEFT JOIN titles 	 AS t
        ON e.emp_no = t.emp_no
     WHERE e.hire_date > '1999-12-31'
  ORDER BY e.last_name DESC;

This query produces the following table when run on our employees database.

row_num emp_no last_name title
0 47291 Flexer Staff
1 60134 Rathonyi Staff
2 72329 Luit Staff
3 108201 Boreale Senior Engineer
4 205048 Alblas Senior Staff
5 222965 Perko Senior Staff
6 226633 Benzmuller Staff
7 227544 Demeyer Senior Staff
8 422990 Verspoor Engineer
9 424445 Boreale Engineer
10 428377 Gerlach Engineer
11 463807 Covnot Engineer
12 499553 Delgrande Engineer

When the data are presented like this, it appear similar to our traditional Excel spreadsheet, right?

Let’s compare the SQL query and a spreadsheet.


Now, here in a few weeks when you are SQL-writing-machine you’ll notice this analogy between Excel and a SQL query breaks down. But for now, let the above image comfort you in knowing the core functions of SQL are similar to those of a spreadsheet. And you know these spreadsheet functions well.

  • Selecting columns
  • Filtering columns and rows
  • Ordering rows
  • Combining data sets

However, SQL has a lot of superpowers an Excel spreadsheets doesn’t. Of course, the tradeoff is you must leave behind the comfort of a graphical user interface. But don’t let it scare you off–it only takes a 3-4 months to get used to, but then you’ll realize how much those graphical interfaces have been chaining you down.

Alright, back to the queries. Let’s take a look at the different parts of the query above.


The SELECT statement is how you choose what turns up in the results section. If don’t put something in the SELECT area, then you will not get anything. It is often used to retrieve data, called fields, from one or more tables within a database.

Select Area

You may ask, what is the “SELECT area.” It is everything between the word SELECT until FROM.

SELECT -------------------------------

Select Fields

There are two different ways to SELECT fields you want to get results from. You can use the *, which means “everything.” Or you can list the field names you want returned. Each item you put in the SELECT area should be followed by a comma, unless it is the last item.

For example:

    SELECT  emp_no,

The code above requests three different fields be returned in the result set: emp_no, last_name, and title.


    SELECT  *

Returns every field, in every table listed.

I should point out, if you forget a comma it can get messy. Often, the SQL server will send an error message, but not always. As we will see in a moment.

Select Calculations

The SELECT does more than retrieve data from tables within a database. It can also perform on-the-fly calculations, such as

SELECT 1 + 1,
       2 *25,
       55 / 75,

This should return the following:

  1 + 1 2 *25 55 / 75
0 2 50 0.7333


A field in SQL is similar to the column in a spreadsheet. It contains data of the same type on every row (more on datatypes later). Fields may be referenced throughout a SQL query, but for them to show in the query results they must be included in the SELECT area–as we went over in the “SELECT” section above.

SELECT emp_no,
FROM employees

Ambiguous Field List

The above query works. However, try running the following query, which includes two tables.

SELECT emp_no,
FROM employees
LEFT JOIN titles
    ON employees.emp_no = titles.emp_no

You get any results? Me either. Only an error message from the database stating:

Error Code: 1052. Column 'emp_no' in field list is ambiguous	

This is because both the employees and titles table have a field named emp_no and the SQL program can’t figure out which you want.

To solve this, we add the table name plus . to the front of each field name. This will tell the SQL program from which tables we would like to field to come from–leaving no ambiguity. Computers hate ambiguity.

Let’s run the query again with table names.

SELECT employees.emp_no,
FROM employees
LEFT JOIN titles
    ON employees.emp_no = titles.emp_no

This time we get the results we expected, without error.

Building on this, a good SQL coder will always prepend the table name to the front of the query, whether it’s required or not. This prevents future mistakes.

For example, let’s say you wrote this code:

SELECT emp_no,
FROM salaries

And your code was put into production (a term meaning put to use by your business) then a year later another coder added a second table to the query without critically looking at the query as a whole (something a bad SQL coder forgets to do).

The new query looks like this:

SELECT emp_no,

FROM salaries
LEFT JOIN employees
    ON salaries.emp_no = employees.emp_no;

Try to run this query. You will find the same field list is ambiguous error as we saw earlier.

The deeper lesson here is: A good coder is like a defensive driver; they code in a way it expects others to be reckless.

Back to the example above, if we include the table in the field names, then it doesn’t matter if a reckless coworker adds another table.

SELECT salaries.emp_no,

FROM salaries
LEFT JOIN employees
    ON salaries.emp_no = employees.emp_no;

Field Aliases

Often you will want to export your results into a CSV to send to someone. You may have noticed when you execute a query SQL returns the results in a neat spreadsheet. I don’t know if I’ve mentioned it, but you can export these results in a CSV by hitting the little disk button above the results.


However, you may not like the machine formatted column names. I mean, don’t get us wrong, we’re nerds! We read machine friendly words fine, but our bosses don’t.

Well, MySQL has a built in command allowing you to rename fields (and more) on the fly. This command is AS and is seen in the query below written to rename the column names.

SELECT salaries.emp_no 		AS Id,
       salaries.salary		AS Salary,
       employees.first_name	AS "First Name",
       employees.last_name	AS "Last Name"

FROM salaries
LEFT JOIN employees
    ON salaries.emp_no = employees.emp_no;

Now the column headers have “boss-friendly” names.


You’ve probably noticed the first two aliases are written without quotation marks and the second two are surrounded by them. The SQL program can get confused by spaces, so we wrap the new name in " marks. When the SQL program sees these marks, it says to itself, “Oh, I bet the user is going to have one of those fancy human names, I’m going to assume everything between the first quotation mark and the next one I find is all one fancy human word. Silly humans.”

A more technical term for someone inside quotations marks is a literal constant. However, programmers know them as “strings.” It’s probably b

Don’t Lose Your AS

Go ahead and try to run this query:

SELECT emp_no
FROM employees;

Did you run it? Anything jump out as weird? You don’t really run it did you? Go run it, I’ll wait.

Ok, you’ll see something like this: | first_name | last_name | |:———–|:———-| | 10001 | Facello | | 10002 | Simmel | | 10003 | Bamford | | … | … | Super weird right? There are only two columns and it seems like the column names are jumbled up. That’s exactly what’s happened. It’s due to a missing , right after the emp_no. This is a result of something in SQL I think is silly–you can omit the AS keyword between a field and its alias.

Meaning, we could rewrite the query from earlier where we showed alias use like this:

SELECT salaries.emp_no 		Id,
       salaries.salary		Salary,
       employees.first_name	"First Name",
       employees.last_name	"Last Name"

FROM salaries
LEFT JOIN employees
    ON salaries.emp_no = employees.emp_no;

But, the first time you miss a comma you’ll be asking, “Why!? Why does MySQL allow this!” I’m not sure, but we have to deal with it. This is why I ask you always include the AS keyword. Again, you are helping prevent bugs before they happen.


As you’ve already seen, the FROM command tells SQL where on the database it should look for data. If you don’t specify a table in the FROM clause, then the SQL program acts if it doesn’t exist, and will not be able to find the fields you request.

FROM employees
LEFT JOIN departments
    ON employees.emp_no = departments.emp_no

In the next article we are going to talk about JOINS, they are an extension to the FROM clause of a query, but, they deserve their own article. Right now, look at the LEFT JOIN as an extension of the FROM clause. A join tells the SQL program, “First look in the employees table, then, check in the departments table, if there is a relationship with the employees table.”

Like I said, we will review JOINS thoroughly in the next article.

Table Aliases

Like we could give fields nicknames, called aliases, we can do the same with table names. However, this is usually done for a different reason: To save on typing.

One of the primary reason bad coders don’t write out the table names (not you, you’re going to be a good coder) is it adds a lot more to type. You may say, “Well, that’s just lazy.” It is, but it’s smart-lazy–also know as efficient. And efficiency is something you want to strive for in your code and coding.

Let’s look at an example from earlier.

SELECT salaries.emp_no 		AS Id,
       salaries.salary		AS Salary,
       employees.first_name	AS "First Name",
       employees.last_name	AS "Last Name"

FROM salaries
LEFT JOIN employees
    ON salaries.emp_no = employees.emp_no;

This query could be rewritten by using table aliases and save a lot of typing. It’s probably best to show you.

SELECT s.emp_no 	AS Id,
       s.salary		AS Salary,
       e.first_name	AS "First Name",
       e.last_name	AS "Last Name"

FROM salaries       AS s
LEFT JOIN employees AS e
    ON s.emp_no = e.emp_no;

Execute this query and compare its results to the query without table aliases. You will find the results are exactly the same. Moreover, this rewrite has saved 45 keystrokes. You may think, “Eh, not much.” Well, this is a small query. Imagine writing queries twice this size all day long. Your savings are worth it–may the time for an extra cup of coffee (or pot, in my case).

It is also easier for the human brain to comprehend–at least, once you’ve been reading SQL for awhile. Your brain will understand e and employees the same, but it doesn’t have to work as hard to understand e.

In short, good coders use table aliases.


In spreadsheets there will usually be a way to sort your data. Often your options will be based on a column’s contextual order. If the data are numbers it will be low-to-high, or high-to-low, respectively. If it’s text then your choice will probably be alphabetical, either A-Z to Z-A. And if it’s a date, then it will be first-to-last, or last-to-first. Each of these order types share a commonality, they value either goes to a low-values to high-values, or high-values to low-values. These types of ordering are known as ascending and descending, respectively.

In SQL, there are two types of ORDER BYs, ASC and DESC, for ascending and descending. They operate a bit different than most spreadsheet applications. They still order data by low-to-high or high-to-low, however, when you apply an ORDER BY it affects the entire result set. When a field is targeted by an ORDER BY all other fields on the same row are ordered along with the targeted field.

Enough words. Let’s take a look at some examples:


SELECT employees.emp_no,
FROM employees
ORDER BY employees.emp_no DESC



SELECT employees.emp_no,
FROM employees
ORDER BY employees.emp_no ASC


One note about ASC, if you do not specifcy what type of ORDER BY then it will default to ASC.

For example, this query will provide the exact same results as the one above:

SELECT employees.emp_no,
FROM employees
ORDER BY employees.emp_no

Most of ORDER BY is used for humans, making it easier to find whether your data were returned correctly. However, there are instances where ORDER BY will actually change the results of your queries, but it will be awhile before we get into those sorts of queries.

Later, we’re going to start working on making our queries efficient and fast, but now I’ll state: Make sure you need your results ordered before you ORDER BY.

It can be hard work for SQL program to order your results, which translates to longer execution times. Something you will want to avoid if you are trying to write a query for speed (which you will when writing code for production software).

Multiple Column Sort

SQL can also do multiple-field sorts. This works by sorting by the first field in the ORDER BY and where there are ties, then sort by the second field.

For example:

SELECT employees.emp_no,
FROM employees
ORDER BY employees.last_name ASC, employees.emp_no DESC 


“Aamodt” is the first employee in the last_name field when the ORDER BY is set to ASC, however, there are many “Aamodt”s in this table. This is where the second ORDER BY comes in. The second ORDER BY is set on the emp_no field and is DESC, this is why all the numbers start at the highest values and move towards the lowest. Of course, when the the last_name value changes the emp_no order will restart, still moving from highest to lowest.


Alright, let’s move on. Just remember, ORDER BY is extremely useful for humans, but it makes it slower for computers to process. Therefore, when you write a query, consider your audience.


The WHERE clause of a SQL query is a filter. Simple as that. It further limits your results. And it is probably the second most important portion of a query, next to the FROM clause. Reducing your results not only help you find what you need, it also makes it easier on the computer to find the results.

Though, before we get into more detail let’s take a look at an example:

SELECT employees.emp_no         AS Id,
       employees.first_name     AS "First Name",
       employees.last_name      AS "Last Name"
FROM employees
WHERE employees.emp_no = 10006
ORDER BY employees.emp_no, employees.first_name

This returns a single record, which makes sense. We told the SQL program we want emp_no, first_name, last_name from the employees table where the emp_no is equal to 10006.


But, let’s also look at the Database Message

Time Action Message Duration / Fetch
07:35:17 SELECT employees.emp_no, employees.first_name, employees.last_name FROM employees ORDER BY employees.last_name ASC, employees.emp_no DESC LIMIT 0, 1000 1000 row(s) returned 0.152 sec / 0.0035 sec
07:48:56 SELECT employees.emp_no AS Id, employees.first_name AS “First Name”, employees.last_name AS “Last Name” FROM employees WHERE employees.emp_no = 10006 ORDER BY employees.emp_no, employees.first_name LIMIT 0, 1000 1 row(s) returned 0.0036 sec / 0.0000072 sec

Notice how our query for one result took much less time than the query for a 1,000 results? I’ll cover this more later, but felt it was import to point out now. Using the WHERE clause to limit the data to only what you need will greatly increase the efficiency of your query.

Ever been to a cheap buffet with the sign posted on the sneeze-guard reading: “Take only what you will eat!!!” Well, imagine your SQL database has the same sign–you choose what you need with the WHERE clause.

Ok, enough on efficiency for now, let’s focus on how the WHERE clause will allow you to get the results you are after.

In queries we’ve written earlier, we’ve received every row on the database, from every table included in the FROM clause. Now, we are narrowing the results down to those of interest.

This can also be done with strings (text inside of " marks).

SELECT employees.emp_no         AS Id,
       employees.first_name     AS "First Name",
       employees.last_name      AS "Last Name"
FROM employees
WHERE employees.first_name = "Ramzi"
ORDER BY employees.emp_no, employees.first_name


But what if we want to include multiple different employees, but not all? That’s where IN comes…in.


The WHERE clause can be followed by the IN keyword, which is immediately followed by a set of parentheses; inside the parentheses you may put list of values you want to filter on. Each value must be separated by a comma.

For example:

SELECT employees.emp_no         AS Id,
       employees.first_name     AS "First Name",
       employees.last_name      AS "Last Name"
FROM employees
WHERE employees.last_name IN ("Bamford", "Casley", "Benveniste")
ORDER BY employees.last_name ASC, employees.first_name ASC;


This can also be done with numbers

SELECT employees.emp_no         AS Id,
       employees.first_name     AS "First Name",
       employees.last_name      AS "Last Name"
FROM employees
WHERE employees.emp_no IN (422990, 428377)
ORDER BY employees.last_name ASC, employees.first_name ASC;

Greater and Less Than

If the field you are using is numeric data, then you can also use the >, <, <=, and >= comparisons.

SELECT employees.emp_no         AS Id,
       employees.first_name     AS "First Name",
       employees.last_name      AS "Last Name"
FROM employees
WHERE employees.emp_no > 40000
ORDER BY employees.emp_no, employees.first_name;


If you aren’t familiar with the equalities, here’s a breakdown.

  • ”> 5000” will find all values which come after 5000, but does not include 5000 itself
  • ”< 5000” will find all values which come before 5000, but does not include 5000 itself
  • ”>= 5000” will find all values which come after 5000 including 5000 itself
  • ”<= 5000” will find all values which come before 5000 including 5000 itself

Closing Whew, these are the basic of a SQL query, but, it’s just the beginning. There are many more parts to SQL queries, such as AND, OR, <>, !=, JOIN, functions, UNION, DISTINCT–we’ve got a lot more to do. But! No worries, you’ve totally got this.

Don’t believe me? Don’t worry, I’m going to let you prove it to yourself. Let’s do some homework! :)

Homework #1

The following homework will have you take the query provided and modify it to return the described result. Once all queries are completed, fill free to email the queries to me and I’ll “grade” them for you.

For questions #1-6 use the following query:

FROM employees
LEFT JOIN dept_emp
	ON employees.emp_no = dept_emp.emp_no
LEFT JOIN departments
	ON dept_emp.dept_no = departments.dept_no
LEFT JOIN titles
	ON employees.emp_no = titles.emp_no
LEFT JOIN salaries
	ON employees.emp_no = salaries.emp_no;
  • Question #1 – Modify the above query to use table aliases instead of full table names.
  • Question #2 – Modify resulting query to only return results for emp_no, first_name, last_name, dept_name, salary.
  • Question #3 –Modify resulting query to *rename the fields to the following “Employee #”, “First Name”, “Last Name”, “Department #”, and “Salary”.
  • Question #4 –Modify resulting query to list employees by their salaries; order them lowest salary to the highest.
  • Question #5 –While keeping the lowest-to-highest salary order, modify resulting query to list the employees in alphabetical order by their last name where their salaries are tied.
  • Question #6 – Modify resulting query to only provide clients who have make over 50,000

For questions #7-10 use the following query:

FROM employees 			AS e
LEFT JOIN dept_emp		AS de
	ON e.emp_no = de.emp_no
LEFT JOIN departments	AS d
	ON de.dept_no = d.dept_no
LEFT JOIN titles 		AS t
	ON e.emp_no = t.emp_no
LEFT JOIN salaries 		AS s
	ON e.emp_no = s.emp_no
  • Question #7 – Modify the above query to only return results for those with the first name “Yishay”, “Huan”, or “Otmar”
  • Question #8 – Modify resulting query to to also show only their first_name, last_name, and salary.
  • Question #9 – Modify resulting query to to also show what departments they work in.
  • Question #10 – Modify resulting query to also show their hire date.
Beginning MySQL for Data Analysts

I’m usually writing about hacking, robotics, or machine learning, but I thought I’d start journaling thoughts on data analytics, which is how I pay the bills these days. I wanted to begin with a series on MySQL, as I’ve some friends I feel it’d help enter the field. But, I’ll eventually expand the series to include visualizations, analysis, and maybe machine learning. And I hope these articles help someone move from manually generating reports in Excel to writing scripts that’ll automate the boring stuff. As I like to say, “knowing to code gives you data superpowers!”

I’m a professional data analyst, but, if I’m confident of anything, it’s I’ve holes in my understanding. That stated, these articles may contain mistakes. If you spot one, let me know in the comments and I’ll get it fixed quick.

Also, I’m pretty opinionated. I’m sure these opinions will find their way into my writings. When I notice them, I’ll provide a caveat and reasoning for why I hold the opinion.

One last thing, these articles will focus on immediately usable techniques. Honestly, I believe I’ve failed you if you finish an article without a new skill–or, at least an affirmation of existing skill. Don’t get me wrong, I plan to do deep-dives into needed skills, but I believe those are only useful if you have a mental framework to hang them on.

Ok! Let’s do this!


When getting started in data analytics Structured Query Language (SQL) is a great place to begin. It is a well established data language, having been around since the 70s. The intent of SQL is to empower an individual to retrieve data from a database in an efficient and predictable manner. However, nowadays SQL is used for lots more, such as abstraction, analysis, and semantic changes.

What does it look like? Here’s a example of a SQL query:

FROM employees AS e
LEFT JOIN salaries AS s
	ON e.emp_no = s.emp_no
WHERE e.emp_no = 10004;

The above code is referred to as a query. It’s a question we’d like to get an answer to, written in a language a machine understands. In such, running this query should return all the data needed to answer the question. That’s really what SQL’s about. Writing out a question and getting an answer from the database.

Though! We’re not going to go into those details yet. Right now, let’s setup a practice environment where we can learn to apply concepts along with the concepts themselves.

Sooo Many SQLs

I’d love to tell you SQL is simple. It’s not, well, at least not simple to master. It’s complex–every day I learn something new (one reason I enjoy it). One of its complexities is there are different versions of SQL dialects. Here, we refer to “dialect” as slightly different ways of coding the same thing.

Some of the most common are:

Source / Vendor Common name (Dialectic)
Microsoft / Sybase T-SQL
Oracle PL/SQL
PostgreSQL PL/pgSQL

Let’s make it a bit more confusing. SQL refers to the language, but we often refer to a SQL dialect by it’s vendor or source. Thus, even though MySQL and MariaDB largely speak the same dialect, “SQL / PSM,” we refer to them not by their common name, but by the source name. Thus, “I write MySQL queries.” Or, “At work I use PostgresSQL.”

So which one do you focus on?

Well, we have to start somewhere. I’ve picked MySQL because I use it’s identical twin, MariaDB, at work. It’s a great SQL dialect to begin with, as it’s used by many potential employers.

Source Companies Use
MySQL 58.7%
SQL Server 41.2%
PostgreSQL 32.9%
MongoDB 25.9%
SQLite 19.7%
Redis 18.0%
Elasticsearch 14.1%

Source: Stackoverflow 2018 Developer Survey.

At this point you might be saying, “That’s great? I’ve no idea what any of this means.” No worries! Bookmark this page and come back later. For now, let’s move into setting up a practice MySQL environment.

  • One last note, if you’re going into a job interview it’s a good trick to wait until you hear how they pronounce “SQL” and then say it how they do. As the “correct” pronunciation is “Ess-cue-ell,” however, most professionals I know pronounce it “sequel” (as do I).


Setting up MySQL

These instructions assume you are using Windows. If not, don’t worry, most of them still apply, but you get to skip some steps!

Ok, were are going to install MySQL Workbench. This program will allow us to write SQL queries, send them to a database, get back and view the results.

Preparing to Install MySQL Workbench (Windows Only)

If you are using Windows you need to install software MySQL Workbench uses on Windows.


Click on the link above. Select the vc_redist_x64.exe file and click “Download.” Once the file has finished downloading, install it.


MySQL Workbench

Ok! Now we are ready to download and install MySQL. Visit the link below, select your operating system, and choose “Download.”

Select your operating system and hit “Download” download-mysql-workbench

Once the file has finished downloading, run it and follow the install prompts. All choices are fine left on default.

Connecting to the Server

Once you’ve installed MySQL Workbench, open it. When it comes up you should see the main screen, which looks something like: mysql-workbench-welcome-screen

Before we can start querying a database we need to create a database connection. A “connection” here is all the information MySQL Workbench needs to find the database and what permissions you have regarding data access.


We will be connecting to a database I’ve setup on a remote computer. Connecting to a remote computers is the most common way to interact with a SQL database, however, later I’ll show you how to build your own database using CSVs. This will be hosted on your local PC.

Ok, back to setting up the remote connection. Click on the circle and plus icon next to “MySQL Connections.” This will cause a screen to pop up for connection information.

Enter the following:

Connection name:
Username: the username I've provided you

Please don’t be shy, if you need a username email me at cthomasbrittain at yahoo dot com. I’ll gladly make you one.

Once you’ve entered the connection information hit “Ok”. You should be brought back to the “Welcome” screen, but now, there will be a connection listed called “”. our-sql-connection

Double click on it. You will most likely get the following warning. sql-connection-warning Click “Continue Anyway” (and if there’s an option, check “Dont Show this Message Again”).

If the connection was successful you should see a screen like: our-sql-connection

Show / Use Databases

Alright! Let’s get into the action. Before we start executing queries let me point out a few things in the user interface: mysql-workbench-interface

Write Query

This area is where you will write queries. Each query should end with a ;, or the MySQL Workbench will get confused and try to jumble two queries together.

View Results

This is the area where the result of whatever command you send the SQL will server will be shown. Often, it will be a table containing data you requested in your query

Database Messages

Here is where you can spot if you written a query incorrectly, as the database will send a message letting you know. Also, the database will tell you when it has successfully returned results from a query, how many results, and how long they took to retrieve. Useful for when you are trying to make a query fast.

Getting Around in MySQL

Let’s send a query to the database. In the query area type:

SHOW databases;

Now, select those text with your mouse and hit the lighting (execute) icon above it. show-databases-command

This will return a list of all the databases found on this server. You should see this in the View Results area. Each SQL server can have multiple databases on it, and they often do. For right now we want to focus on the employees database. show-databases-command

To select a database type USE and then the name of the database. In our case it will be:

USE employees;

Now, highlight the text and hit the execute button.


This will show the following in the database messages:

13:21:55	USE employees	0 row(s) affected	0.0031 sec
13:21:55	Error loading schema content	Error Code: 1146 Table 'performance_schema.user_variables_by_thread' doesn't exist	

Don’t worry about the error, that’s a product of my hasty setup. The important message is the USE employees message. This means you are now connected to the employees database. Any query you write in this session will now be sent to this specific database.

But, now what? We’ve no idea of what’s on the database. No worries, we’ve a command to see the tables found on this database. If you are not familiar with the term “table,” don’t worry. Just think of a table as a single spreadsheet. It’s a bit more complicated and we will investigate their structure further in a bit. But, right now, the spreadsheet analogy works.

To see all the tables this database contains execute the command:

SHOW tables;

This should return the following table names show-databases-command

By now, you know the next question, “But how do I know what’s in a table?”

You can use the DESCRIBE command to get more information about a table. Let’s take a look at the departments tables.

Type and execute:

DESCRIBE departments;

This should return:

show-databases-command The Field column here gives you the names of all the fields in the departments table. What’s a field? As with table, we will go into them with more depth later. But for now, think of a field as a named column in a spreadsheet.

Our First Query!

Now we know the database, table, and field names, let’s write our first query!

Still in the query area type and execute:

SELECT departments.dept_no, departments.dept_name
FROM departments

This will return all the entries for the fields (columns) dept_no and dept_name for the table (spreadsheet) called departments. You did it! You’re a SQL’er. show-databases-command

What Comes Next?

Lot’s to come! We will learn a bit more about SQL, it’s parts, their proper names. We’ll also dive into the “proper” SQL names for different data parts. And we’ll write tons more queries.

Please feel free to ask any questions in the comments. I’ll answer them ASAP.

Creating a Neural Network Webservice

We’re almost done. In the previous articles we’ve used a local machine to train a CNN to detect toxic sentiment in text. Also, we prepared a small (1GB RAM) server to use this pre-trained network to make predictions. Now, let’s finish it and create a webservice where anyone can access our awesome magical algorithm.

Prediction Service

On your remote server, navigate to your flask_app folder and create a file called The following code creates an HTTP request endpoint /detect-toxic and it exposes to other programs running on the server. A bit more explanation after the code.

cd /home/my_user/flask_app

Enter the following:

from flask import Flask, request
application = Flask(__name__)

from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
import numpy as np
import pymongo
import json

# Parameters
mongo_port = 27017
embedding_collection = 'word_embeddings'
word_embedding_name = 'glove-wiki-gigaword-50'
pad_length = 100

# Globals
global model, graph

# Connection to Mongo DB
    mong = pymongo.MongoClient('', mongo_port)
    print('Connected successfully.')
except pymongo.errors.ConnectionFailure:
    print('Could not connect to MongoDB: ' + e)

db = mong[embedding_collection]
coll = db[word_embedding_name]

# Load Keras Model
model = load_model('/home/my_user/flask_app/models/tox_com_det.h5')

# Start flask
if __name__ == '__main__':'')

@application.route('/detect-toxic', methods=['POST'])
def sequence_to_indexes():
    with open('nn_service.log', 'w+') as file:
    if request.method == 'POST':
            sequence = request.json['sequence']
            return get_error('missing parameters')
        response = {
            'prediction': prediction_from_sequence(sequence, pad_length)
        return str(response)

def get_word_index(word):
    index = ''
        index = coll.posts.find_one({'word': word})['index']
    return index

def get_error(message):
    return json.dumps({'error': message})

def prediction_from_sequence(sequence, pad_length):
    sequence = sequence.lower()
    sequence_indexes = []
    for word in sequence.split():
            index = int(get_word_index(word.strip()))
            index = 0
        if index is not None:
    sequence_indexes = pad_sequences([sequence_indexes], maxlen=pad_length)
    sample = np.array(sequence_indexes)
    prediction = model.predict(sample, verbose = 1)
    prediction_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    prediction_results = str({prediction_labels[0]: prediction[0][0],
                              prediction_labels[1]: prediction[0][1],
                              prediction_labels[2]: prediction[0][2],
                              prediction_labels[3]: prediction[0][3],
                              prediction_labels[4]: prediction[0][4],
                              prediction_labels[5]: prediction[0][5]
    return prediction_results

What’s going on? Well, it’s an extension of code I’ve detailed in earlier parts of this series. However, there are a couple of new pieces.

First, we are connecting to our MongoDB database containing the contextual word-embeddings. This database is used to look up words, which have been sent to our service endpoint.

The only route in this server is a POST service. It takes one argument: sequence. The sequence is the text the webservice consumer would like to have analyzed for toxic content. The endpoint calls the prediction_from_sequence(). Inside the function, the word indexes are pulled from the word_embeddings database. After, the newly converted sequence is padded to the needed 100 dimensions. Then, this sequence is passed to our CNN, which makes the prediction. Lastly, the prediction is converted to JSON and returned to the user.

Before we go much further, let’s test the script to make sure it actually works. Still in the flask_app directory type, replacing my_user with your user name and with the name of your Flask app:

echo "# Flask variables" &>> /home/my_user/.bashrc
echo "export" &>> /home/my_user/.bashrc

This sets FLASK_APP variable, which is used when executing the Flask webservice.

Ok, we should be able to test the app fully now:

flask run

You should be greeted with something similar to:

 * Serving Flask app ""
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
Using TensorFlow backend.
Connected successfully.
2019-02-03 15:53:26.391389: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-03 15:53:26.398145: I tensorflow/core/common_runtime/] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
 * Running on (Press CTRL+C to quit)

Great! We’re on the home stretch.

I’ve prepared a curl statement to test the server. You will need to leave the Flask program running and open a second terminal to your server. When the second terminal is up paste in the following, replacing the “sequence” with something nasty or nice.

curl -X POST \
  http://localhost:5000/detect-toxic \
  -H 'Content-Type: application/json' \
  -d '{"sequence":"im pretty sure you are a super nice guy.","padding": 100}'

You should get back an appropriate response: local-curl-test-neural-net-webservice

NodeJS and node-http-proxy

It gets a bit weird here. Usually, one will setup a Flask server with uwsgi or gunicorn combined with nginx. However, I found the uwsgi middle-ware was creating two instances of my project, which would not fit in the microserver’s RAM. I spent a lot of time creating a server the proper only to be disheartened when I discovered uwsgi was creating two instances of the, thereby attempting to load two of the CNNs into memory. Our poor server. I gave up on “proper” and went with what I describe below. However, I’ve created a bash script to completely setup a server for you the “proper” way. I’ve added it to the Appendix.

I’ve opted to run Flask and serve it with a nodejs server as a proxy.

The nodejs is atypical, but I found it probably the most simple to setup. So, eh.

Let’s install NodeJS on the server.

sudo yum install -y nodejs

Now move to the directory containing your flask_app and initialize a node project.

cd /home/my_user/flask_app
npm init

You will be prompted to enter the project–take your time to fill it out or skip it by hitting return repeatedly.

Once the project has been setup, let’s install the node-http-proxy package. It will allow us to create a proxy server sitting on top of our Flask service in a couple of lines of code.

Still in your project directory:

npm install node-http-proxy
nano server.js

Inside the server file place:

var http = require('http'),
    httpProxy = require('http-proxy');

Alright, before testing our Flask webservice we need to allow 8000 port access and allow HTTP / HTTPS request on the firewall.

firewall-cmd --permanent --zone=public --add-service=http
firewall-cmd --permanent --zone=public --add-service=https
sudo firewall-cmd --zone=public --add-port=8000/tcp --permanent
sudo firewall-cmd --reload

You can test the whole proxy setup by opening two terminals to your server. In one, navigate to your Flask app and run it:

cd /home/my_user/flask_app
flask run

In the other navigate to the node proxy file and run it:

cd /home/my_user/flask_app/proxy
node server.js

Now, you should be able to make a call against the server. This time, run the curl command from your local machine–replacing the my_server_ip with your server’s IP address:

curl -X POST \
  http://my_server_ip:8000/detect-toxic \
  -H 'Content-Type: application/json' \
  -d '{"sequence":"im pretty sure you are a super nice guy.","padding": 100}'

You should get a response exactly like we saw from running the curl command locally.

Daemonize It

The last bit of work to do is create two daemons. One will keep the Flask app running in the background. The other, will keep the proxy between the web and the Flask app going.

One caveat before starting, because daemons are loaded without the PATH variable all file references must use absolute paths.

At the server’s command prompt type:

sudo nano /etc/systemd/system/nn_service.service

And add the following replacing my_user with your user name:

Description=Flask instance to serve nn_service

ExecStart=/usr/local/miniconda/bin/flask run


This will create a service. It will run the program pointed to by ExecStart, in our case flask run, inside the directory pointed by WorkingDirectory.

Save and exit.

Now, let’s create the nn_service_proxy.service daemon:

sudo nano /etc/systemd/system/nn_service_proxy.service

And enter the following replacing my_user with your user name:

Description=Proxy to Flask instance to serve nn_service

ExecStart=/usr/bin/node /home/my_user/flask_app/node/nn_service_proxy.js


Great! We’re ready to enable and start them.

sudo systemctl enable nn_service.service
sudo systemctl enable nn_service_proxy.service
sudo systemctl start nn_service.service
sudo systemctl start nn_service_proxy.service

Alright, you can now check the system journal to make sure they loaded correctly:

sudo journalctl -xe

But, it should be good. If something goes wrong, definitely ask questions in the comments. Otherwise, we should be ready to test our full functioning toxic text detection webservice!

curl -X POST \
  http://my_server_ip:8000/detect-toxic \
  -H 'Content-Type: application/json' \
  -d '{"sequence":"im pretty sure you are a super nice guy.","padding": 100}'

Wow! What a journey right. But pretty damn cool. We now have a webservice which can be called by anyone who wants to check text to see if it contains toxic sentiment. I didn’t have an application when starting this project, but I’m learning webscraping with a friend, and I think it’ll be great to pass text off to this webservice and have it flagged if contains nasty content.

“Proper” Flask Webservice Setup

I’ve written a script to setup the webservice for you. First, you will need to be logged into your Centos 7 server as root.

Then type:

yum install -y wget
chmod +x

What this script does:

  1. Sets up a new user
  2. Adds Miniconda to the PATH variable.
  3. Adds Flask environment variables (needed to run app).
  4. Updates the server.
  5. Creates the flask_app directories
  6. Opens the needed ports
  7. Installs nginx
  8. Creates a nginx .conf file with information to proxy uwsgi service.
  9. Installs uwsgi creates a .ini file for wrapping the Flask app.
  10. Creates and enables a uwsgi daemon.
  11. Creates and enables a Flask daemon.
  12. Installs Miniconda, tensorflow, and sets Python to 3.6.8.
  13. Installs MongoDB
  14. Enables remote editing from VSCode (info)

We’re about to execute the script, but there’s a critical step I wanted to explain first. The script is going to take several commandline arguments. If these are wrong, it’ll royally jake up your server.

./ user_name user_password flask_app_name flask_port
  • user_name This will be the user who provides the webservice
  • user_password The user’s password. You’ll need this to ssh into the server as this user.
  • flask_app_name This is the name of your app. Everything from the Python script to the daemon will be labeled with this name.
  • flask_port This is the port which will be exposed to the web.

Ok, replace all of the above commandline arguments with the ones you prefer and execute it. Cross your fingers or yell at me in the comments.

Preparing a Small Server for a Neural Network Webservice

Previously, I wrote about training a CNN to detect toxic comments from text alone. But, I realized, even if one has a nice little NN to solve all the world’s problems it doesn’t help unless it is in production.

This article is going to cover how to prepare a server and needed word embeddings to mechanize the NN in a Flask webservice.

Server Setup: Preamble

For this project I’m using a small server from Linode–called a “Nanode.” At the time of writing these servers are only $5 a month. The catch? They only have 1GB of RAM. It’s definitely going to be tricky to deploy our CNN there, but let’s see it through.


As for setting up the server, I’ve written about it elsewhere:

For this particular project, I decided to go with a CentOS 7 distribution.

For those of you who know me; I’m not betraying Arch Linxu, however, this project will be using MongoDB and there’s a bit of drama going on. I will leave some Arch Linux instructions in the Appendix, in case it is ever resolved.

I chose CentOS because it is the distro we use at work and I hoped to get some experience using it.

Setup User on Centos

Login as root and update the system

yum update -y

Let’s add another user; setting up the system as root is not a best practice.

useradd my_user
passwd my_user

Set the password for the my_user

Now, let’s give the my_user sudo powers

EDITOR=nano visudo

Find line with:

root    ALL=(ALL)    ALL

And add the exact same entry for my_user. It should look like this when done

root    ALL=(ALL)    ALL
my_user    ALL=(ALL)    ALL

Save the file and exit.

Let’s login as our new user. Exit your shell and login back in as the my_user. It should look something like this, typed on your local computer command line.

ssh my_user@erver_ip_address

Once logged in, let’s test the my_user’s sudo powers

sudo ls

If you are greeted with:

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for my_user: 

Then task complete! Otherwise, feel free to ask questions in the comments.

Setup Miniconda on Centos

Anaconda is a great package system for Python data analyst tools. It takes care of a lot of silly stuff. Miniconda is the commandline version fo Anaconda, which we will be using.

Install it by entering the following and agreeing to the terms.

sudo yum install -y wget bzip2
chmod +x
source .bashrc

Side note here, if you install Miniconda and have trouble executing conda, most likely it didn’t add the executable path to your PATH variables.

This should add the path for both your user and root:

echo "export PATH='/usr/local/miniconda/bin:$PATH'" &>> /home/my_user/.bashrc
echo "export PATH='/usr/local/miniconda/bin:$PATH'" &>> /root/.bashrc

You will need to make sure to reload your shell (log out and back in or run source .bashrc) after adding the conda path.

As of this writing Tensorflow only supports Python as late as 3.6, while Miniconda sets up your environment to use 3.7. To rectify this we can set Python to 3.6.8 by using the Miniconda installer conda.

conda install -y -vv python=3.6.8

Also, we need to install a few Python packages.

conda install -y -vv tensorflow scikit-learn keras pandas

Ok, one last important step: Reboot and log back in.

sudo reboot now

Create MongoDB Tokenizer Collection

Here’s where we get clever. We are trying to fit our model into less than 1GB of RAM, to do this, we are going to need to find a way to access the word-embeddings’ index2word and word2index lookup objects without loading them in RAM, like we did in training. We are going to shove them into a database to be loaded into RAM only when a specific word is needed.

Disk access is slower, but hey! I don’t want to pay $40 a month for a hobby server, do you?

To move the word-embeddings will take a few steps. First, we’ll run a Python script to save the embeddings matching the context of our original training. Then, we will export those embeddings from our local MongoDB. Next, we’ll move them to the remote server and import them into the MongoDB there. Simple!

Install MongoDB Locally

To create the local word-embedding databases we will need to install MongoDB locally. This could vary based upon your OS. I’ve used homebrew to install on the Mac.


Here are instructions on installing MongoDB on the Mac:

Don’t forget you’ll need to start the MonogDB service before starting the next step.

On the Mac, using Homebrew, it can be started with:

brew services start mongodb

Create a Word Embedding Database

Once you’ve installed it locally, here’s the script I used to convert the word_embeddings into a MongoDB database. It loads the word-embeddings using gensim, tokenizes them.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
Created on Tue Jan 22 05:19:35 2019
@author: cthomasbrittain
import pymongo
import gensim.downloader as api
import pandas as pd
from keras.preprocessing.text import Tokenizer

# Convenience Macros
word_embedding_name = "glove-wiki-gigaword-50"

BASE_DIR = '/path/to/embeddings'

# Load embeddings
info = # show info about available models/datasets
embedding_model = api.load(word_embedding_name) # download the model and return as object ready for use

vocab_size = len(embedding_model.vocab)

index2word = embedding_model.index2word
word2idx = {}
for index in range(vocab_size):
    word2idx[embedding_model.index2word[index]] = index
# Get labels
print('Loading Toxic Comments data.')
with open(TRAIN_TEXT_DATA_DIR) as f:
    toxic_comments = pd.read_csv(TRAIN_TEXT_DATA_DIR)

# Convert Toxic Comments to Sequences
print('Processing text dataset')

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
sequences = tokenizer.texts_to_sequences(toxic_comments['comment_text'].fillna("DUMMY_VALUE").values)
word_index = tokenizer.word_index

# Save Embeddings to MongoDB
mong = pymongo.MongoClient('', 27017)

# Create collection database
mongdb = mong["word_embeddings"]

# Create this word_embeddings 
coll = mongdb[word_embedding_name]

for i, word in enumerate(index2word):
    if i % 1000 == 0:
        print('Saved: ' + str(i) + ' out of ' + str(len(index2word)))
        embedding_vector = list(map(str, embedding_model.get_vector(word)))
        post = {
                'word': word,
                'index': word_index[word],
                'vector': list(embedding_vector)
        posts = coll.posts
        post_id = posts.insert_one(post).inserted_id

One note here, you could set the database directly to your remote. However, I found saving the >2 GB enteries one at a time across a 38.8bps SSH connection took most of the day. So, I’ve opted to create them locally and then copy them in bulk.

Install MongoDB Remote Server

MongoDB has license with some strict redistribution clauses. Most distros no longer include it in the package repos. However, MongoDB has several distro repos of their own–luckily, REHL and Centos are included. But not Arch Linux? Really? :|

Ok, to install MongoDB from the private repo we need to add it to the local repo addresses.

We can create the file by typing:

sudo nano /etc/yum.repos.d/mongodb-org-4.0.repo

One word of caution, the following text was copied from the MongoDB website.

It’s probably best to copy the repo information directly from the link above, in case there is a newer version.

Or, here’s what I put in the file:

name=MongoDB Repository

Save the file.


sudo yum install -y mongodb-org

Yum should now find the private repo and install MongoDB.

Setup MongoDB

We need to enable the mongod.service.

sudo systemctl enable mongod.service

And reboot

sudo reboot now

I’ll be setting up MongoDB to only for local access. This enables it to be accessed by our Flask program, but not remotely. This is a best practice in securing your server. However, if you’d like to enable remote access to the MongoDB I’ve included instructions in the Appendix.

Move the Model to Server

Since we trained the model locally, let’s move it to the server. Open your terminal in the directory where the model was stored.

scp toxic_comment_detector.h5 my_user@my_server_ip:/home/my_user

Replace my_user with the user name we created earlier and my_server_ip with the address of your server. It should then prompt you to enter the server password, as if you were ssh’ing into the server. Once entered, the model should be copied to the server.

Move word_embeddings Database to Server

Once ou’ve created the local word_embeddings DB, at local the terminal type the following to make a copy:

mongodump --out /directory_to_save

Now, copy this DB backup to your remote server

scp -r /directory_to_save/name_of_output_folder user_name@remote_ip_address:/home/user_name/

Now, log in to your remote server and create a DB from the data dumps.

mkdir /home/user_name/word_embeddings
mongorestore --db word_embeddings /home/user_name/word_embeddings

We also need to restart the MongoDB service

sudo systemctl restart mongod.service

If you would like to enable access to the database remotely (see instructions in Appendix) you could use Robo3T to make sure everything is in place. But if you didn’t get any errors, we’re probably good to go.

Test the Model

Log into your server. We are going to test the model, since it needs to fit in the RAM available. The my_user in the script should be replaced with the user name you created while setting up your server and proejct.



Now, enter the following into the Python interpreter.

from keras.models import load_model
model = load_model('/home/my_user/toxic_comment_detector.h5')

If all goes well it will mention it’s using the Tensorflow backend and return you to the interpreter prompt.

If you trained your network like me, then the following will allow you to fully test the model deployed remotely.

import numpy as np
test_prediction = np.array([[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 1873,147,6, 3476,324, 15, 29,141]])

If you get back something similar to:

array([[0.97645617, 0.21598859, 0.92201746, 0.01897666, 0.7753273,
0.11565485]], dtype=float32)

We’re in good shape. These are the predictions for the the following respectively:

["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

The test_prediction was the following text sequence pre-encoded.


So, the toxic and obscene label should definitely be close to 1. Looks like we’re in good shape.

In the next article I’ll show how to create a Flask webservice to access the model. Well, at least I hope, not sure how to do that yet.


Arch Linux Miniconda Setup

sudo pacman -Syu
sudo pacman -S git wget tk valgrind gcc make
adduser -m user_name
passwd user_name
EDITOR=nano visudo
(add user_name to sudo)
su user_name

source .bashrc
conda install keras h5py pillow flask numpy gensim pandas scikit-learn matplotlib
conda install tensorflow=1.8

Setup MongoDB on Arch Linux

Apparently MongoDB’s license change means the Arch Linux official repos cannot distribute it. So, we have to compile from source. Waaawaaah.

Note, it took more than 1GB of RAM to compile from source.

sudo pacman -S fakeroots automake autoconf gcc make snappy \ 
            yaml-cpp lsb-release  gperftools \
            libstemmer scons python2-setuptools python2-regex \
            python2-cheetah python2-typing python2-requests \
            python2-yaml python2-pymongo 
git clone
cd wiredtiger
makepkg -i
git clone
cd mongodb
makepkg -i

Enabling Remote Access to MongoDB

To enable remote connections edit the mongod.conf file:

sudo nano /etc/mongod.conf

Find the following lines in the file and comment out bindIp.

Your file should look like this:

# network interfaces
  port: 27017
  #bindIp:  # Enter,:: to bind to all IPv4 and IPv6 addresses or, alternatively, us$

This allows us to connect to the MongoDB from any IP address. If we’d left this line, then we could only connect to the database from within the server itself ( = local).

Monitoring System Resources

I like using htop for this, but you’ve gotta build it from source on Centos

sudo rpm -ihv epel-release-7-11.noarch.rpm
sudo yum install -y htop