Building a Language Toxicity Classification – Cambridge ML Summit ‘19

EKABA BISONG: Good afternoon. My name is Ekaba Bisong, and I’m a data science lead at T4G. T4G is a technology consulting company up north, in Canada. I came in from Calgary earlier this morning. And so today, briefly,
we’ll be talking about using machine
learning to assist in detecting toxic
language on online forums. And I think when I came
in, I heard someone also from the Google Brain team
talk about a Jigsaw data set. So that is the exact type of data set that was used in this project. Or, as I call it, this is a weekend
project anyway. But the cool thing here
is to kind of promote how we can leverage
AutoML on GCP for building actual
data products that can make an impact in society. So there is a challenge in moderating online communities. One is that it can be time consuming, looking at chats, or images, or videos, or whatnot. And also there is an issue of consistency in judgment. If we have two people, even two twins, they would make two different judgments. [INAUDIBLE] of having 1,000 people or tens of thousands of people having to make a judgment of what is toxic and what is not toxic. And another challenge is
the psychological harm to the moderators. Now, imagine for a
second that you’re looking at tons of
beheading videos every day, or tons of obscene photographs. It does a lot of
damage to the mind. I mean, even taking a look at Jigsaw– the Jigsaw data set is a text data set– just looking at some of what was labeled as toxic, reading some of those sentences, my mind took a beating. Some of them are a bit intense. You have to do a mental cleansing after. So these are some
of the challenges in this sort of moderating. So then I’ll go and talk
about the case for AutoML. Now what they call AutoML, it
encapsulates three concepts, broadly speaking. And the first is the concept
of transfer learning. Now, as we all know, building machine learning models, especially large-scale models, can be extremely computationally intensive. And so it makes a lot of sense to leverage pre-built models and then use them in solving your own domain use case. So that’s the moonshot
idea of transfer learning. So how it works is it takes just the layer before the output layer, which is called the bottleneck, and then you train that layer on your own data set. So you’re leveraging the weights, the already-learned parameters, of a large-scale network on your own data set. And that’s how it works at a high level; a rough sketch of the idea is below.
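To make the bottleneck idea concrete, here is a minimal sketch in Keras, assuming a frozen, publicly available TensorFlow Hub text embedding as the pre-trained base and hypothetical toxic/clean training data– an illustration of the technique, not the exact setup used in this project.

    import tensorflow as tf
    import tensorflow_hub as hub

    # Frozen pre-trained embedding acts as the "bottleneck" feature extractor
    # (the Hub URL is one public text embedding; any similar module works).
    embed = hub.KerasLayer(
        "https://tfhub.dev/google/nnlm-en-dim50/2",
        input_shape=[], dtype=tf.string, trainable=False)

    # New head trained on your own data set, reusing the already-learned weights.
    model = tf.keras.Sequential([
        embed,
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # toxic vs. clean
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

    # model.fit(train_texts, train_labels, epochs=5)  # hypothetical training data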
And the other concept is what is called Neural Architecture Search. It’s a fancy term, but it’s
really very interesting. And that has been around for a
couple of years now, probably two or three. I can’t remember when. But the idea is
how can you build– because part of the challenge of building a large-scale network in production is finding out what’s the
best architecture for my use case. In line with the idea
of no free lunch, there is no silver
bullet ML architecture. So Neural Architecture Search uses the idea of a reinforcement learning agent, where the action it takes is proposing a certain set of layers for the network. And then the idea is you want to optimize some utility– that could be the performance index of the model– to decide which is the best architecture for your problem space or for your use case. So that’s what Neural Architecture Search is; a toy sketch of the search loop is below.
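As a rough illustration only: real NAS uses a learned controller– the reinforcement learning agent described above– to propose architectures, but the same propose-train-score loop can be sketched with random proposals. The layer sizes, data arguments, and trial count here are hypothetical stand-ins.

    import random
    import tensorflow as tf

    def build_candidate(layer_sizes):
        # Assemble a small network from a proposed list of hidden-layer sizes.
        layers = [tf.keras.layers.Dense(n, activation="relu") for n in layer_sizes]
        layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))
        model = tf.keras.Sequential(layers)
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model

    def architecture_search(x_train, y_train, x_val, y_val, trials=10):
        best_acc, best_arch = 0.0, None
        for _ in range(trials):
            # "Action": propose a set of layers (sampled at random here;
            # real NAS uses a trained controller to make the proposal).
            arch = [random.choice([32, 64, 128])
                    for _ in range(random.randint(1, 3))]
            model = build_candidate(arch)
            model.fit(x_train, y_train, epochs=3, verbose=0)
            # "Utility": validation accuracy acts as the reward signal
            # that decides which architecture wins.
            _, acc = model.evaluate(x_val, y_val, verbose=0)
            if acc > best_acc:
                best_acc, best_arch = acc, arch
        return best_arch, best_acc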
And then the third concept is hyper-parameter tuning. So take any method– a neural network, or let’s say a linear regression method. There is a set of parameters that you may need to tune. A common one would be [INAUDIBLE] or some metric, some method, some parameter you want to tune. How can we tune this at scale? So you’re handling all of these things at once: you’re trying to find the best hyper-parameters, you’re trying to find the best architecture– a lot of things going on. On a small scale, a hyper-parameter search looks like the sketch below.
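For a sense of what tuning these parameters involves, here is a small sketch using scikit-learn’s grid search over a simple text classifier; the pipeline, the parameter grid, and the texts/labels variables are all hypothetical choices made for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    # A simple toxic-vs-clean text classifier with two tunable stages.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # Candidate hyper-parameter values to search over.
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }

    search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
    # search.fit(texts, labels)       # texts, labels: your own data set
    # print(search.best_params_)      # the best combination found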
Doing all of this at scale is where AutoML comes in. And then GCP– that’s Google Cloud Platform– they have a product called AutoML Natural Language, for natural language processing. So it packages these concerns into a product and trains at scale. So you have Google’s massively parallel, distributed
machines available for you to load
your own data set, process it first, load
it, and then train. So that’s what we did
with the Jigsaw data set. So there’s a method for preparing it; I will just gloss over that and skip the details for this talk. So I’ll just say that we upload the data set to AutoML Natural Language, but it has to be processed– arranged in a particular format, roughly like the sketch below.
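As a hypothetical example of that formatting step, assuming the Kaggle Jigsaw CSV with comment_text and toxic columns (check your own copy of the data), preparing it for AutoML Natural Language might look something like this:

    import pandas as pd

    # Hypothetical Jigsaw export; column names are assumptions here.
    df = pd.read_csv("train.csv")
    df["label"] = df["toxic"].map({0: "clean", 1: "toxic"})

    # AutoML Natural Language ingests headerless rows of content,label
    # (optionally with a leading TRAIN/VALIDATION/TEST split column).
    df[["comment_text", "label"]].to_csv(
        "automl_toxicity.csv", index=False, header=False)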
And as you see on the screen, the next thing is that you train it. And the longer you train, of course, the better it can get, depending on the size of your data set anyway. So you do the training, and
then the model is deployed. Now, in the deployed model,
you have a user interface where you can kind of
validate/evaluate your model. And I can’t see what I put there, but there is an example of a toxic or a clean sentence. And it did pretty well. The deployed model also exposes an endpoint you can call from your own code, roughly as sketched below.
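Here is a minimal sketch of calling a deployed AutoML Natural Language model with the google-cloud-automl Python client; the project and model IDs are hypothetical, and exact class names vary with the client library version.

    from google.cloud import automl

    client = automl.PredictionServiceClient()
    # Hypothetical project and model IDs; substitute your own deployment’s values.
    model_name = client.model_path("my-project", "us-central1", "TCN1234567890")

    snippet = automl.TextSnippet(content="have a wonderful day",
                                 mime_type="text/plain")
    payload = automl.ExamplePayload(text_snippet=snippet)
    response = client.predict(name=model_name, payload=payload)

    for result in response.payload:
        # Each predicted label with its confidence score, e.g. "clean 0.98".
        print(result.display_name, result.classification.score)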
But of course, there are obvious limitations. It’s one narrow thing you’re asking for when you talk about toxic and clean; it doesn’t understand things. There are a lot of issues
when we talk about context. And when we play around
with a funny context– like someone can use the F-word
on the current president– many people would
think it’s toxic. Others would think–
a lot of others would say that’s not toxic. Yeah. [AUDIENCE CHUCKLING] Now, whichever side of the divide you lie on, that’s up for the jury. That’s part of the issues
of this kind of stuff. [INAUDIBLE] used the F-word
on the previous president. A lot of people would think
that’s toxic, and a lot more say, that’s not toxic. But that’s where you are, and
that’s the world, and it’s OK. It’s OK. There are differences of opinion; it cannot be otherwise. But this is tricky. I would say there is sense in moderating what is clear and obvious. And I think this is part of the fine line we need to walk as AI researchers and developers. I think my friend Wally talked
about AI for social good. These are some of
the things that– or AI ethics–
some of the things that we should look
at in terms of how we draw the line between
having our online community safe for us and then going to the other extreme of what we would call intellectual or speech fascism, groupthink, of how you should think or see. So freedom is important,
but being cordial, congenial– and these are
some of the trade-offs that, I think, are more in the arena of wisdom. So those are some of the limitations of these models. But what is clear and obvious,
I think it does pretty well. Now, the goal of this is to see
that there are opportunities for downstream software products or applications that leverage this infrastructure
on Google Cloud Platform. There’s an object detection– AutoML for object detection
that just came out, I think, the early
part of this year. I’m not too sure. But the image classification
one has been there. These things can tie together. And it exposes
endpoints which you can infuse into your software
pipeline– product pipeline. So there are a lot of
opportunities here, and we could talk about them. And these things can be leveraged to create products that improve society. And the meat and bones
of how I built this is part of a chapter in that book you see up there. The book is available for preordering; it will come out in October– mid-October, I guess: “Building Machine
Learning and Deep Learning Models on GCP.” And thank you so
much for your time, and it’s a pleasure to be here. [APPLAUSE] [TITLE MUSIC]
