Building a Fake News Detector

In this tutorial we’ll add a voice of reason to the chorus of people crying FAKE NEWS all day long. Let’s build a fake news detector that actually works with some degree of reliability!


Important: The code in this tutorial is licensed under the GNU 3.0 open source license and you are free to modify and redistribute the code, given that you give others you share the code with the same right, and cite my name (use citation format below). You are not free to redistribute or modify the tutorial itself in any way. By reading on you agree to these terms. If you disagree, please navigate away from this page.
Citation format
van Gent, P. (2018). Building a Fake News Detector. A tech blog about fun things with Python and embedded electronics. Retrieved from: http://www.paulvangent.com/2018/08/31/building-a-fake-news-detector/


Introduction
‘Fake News’ is an umbrella term for an epidemic that is running rampant. At its core lie disinformation and distrust in experts. On the funny end of the spectrum dangle conspiracy nuts claiming lizard people rule the world, which would actually be flat, and which we never really left for the (strangely round) moon. Our neighbouring planet Mars is also round according to these people, it’s just that earth is flat because [reasons]. All in all it has pretty good entertainment value, and if anything it gives a sobering insight into the limitless nature of human ignorance and willful stupidity.

On the other end of the spectrum lie dangerous ideas though. Vaccines have never caused autism; there is not a single shred of evidence that they do. Yet easily preventable disease is on the rise, because parents trust random people sharing dubious links on Facebook more than they do actual doctors.

Politics isn’t immune either. As I’m writing this, more and more links between Russia, Trump and alt-right media outlets are appearing. All the while more disinformation leaks into Facebook and Twitter feeds through what appear to be organised misinformation campaigns from various origins. People are being influenced in insidious ways to vote one way or the other, something that will have real implications for the next years to decades.

It’s happening everywhere, including in my country. A growing party in the Netherlands, for example, claims on their website as a major talking point that there are over 150.000 illegal immigrants living in the country. That sure sounds bad. In reality, however, the exact number is unknown. The most recent estimates come from a 2015 report and put it at around 35.000, down from roughly 41.000 in 2009; note that the numbers have been declining since at least 1997, when estimates put the number between 112.000 and 163.000. But that doesn’t fit the party’s narrative, because foreigners=bad, and shouting that things are getting worse fits well with the cynical “back in my day it used to be so much better” sentiment in the aging voter base. Numbers? Meh, just write down something big (citing high, fictional numbers appears to be somewhat of a track record). Also omit the reference to the source of the data while you’re at it. It’s part of a larger trend: a toxic mix of voters lacking critical thinking skills and politicians for whom only the number of votes matters.

Now I’m a trained data scientist that works with lots of data daily, so it’s become a habit to always go for the numbers and sources. If an article makes a grand claim but the source reference is missing (it generally is), that is suspect right away. If a quick internet search doesn’t produce comparable estimates it is even more suspect. I appreciate that most people don’t do this. It takes time and effort to do so and our lives are busy enough as it is. If anything, this shows how flippin’ easy it is to slip falsehoods into a narrative. In the age of (dis)information excess it’s easy to find articles supporting whatever viewpoint is desired. ‘Personalised content feeds’ only serve to reinforce the viewer’s current beliefs by showing articles and videos similar to what’s been viewed before. It’s a sure way of never having to encounter something critical of held ideas or beliefs. It’s a comfortable, brainless confirmation bias in action. This promotes closed-mindedness, and is incredibly dangerous as it can lead to polarisation easily. We need a fix.

Maybe deep learning can help us assimilate the data more effectively? Let’s take a look! In this post we’ll explore one way of building a fake news detector, as well as the caveats it brings. The main problem from the outset is that the data sets out there are not very big, but the classification task we want to perform relies on language which is very complex. Generally, getting a deep learning net to learn more complicated patterns means you need to give it more examples: you’d need a lot of data. We can either spend months and a lot of money to make our own dataset, or be smart about it: transfer learning with word embeddings!


Getting Started
In this tutorial we will:

  • Go over the theory of how to make a word embedding matrix;
  • Discuss why using pre-trained word embeddings is helpful;
  • Load up a fake news dataset;
  • Build two network architectures and evaluate;
  • Discuss the subtleties of fake news detection.

 

To follow along with the code, you’ll need:

  • Python 3+ (Anaconda recommended);
  • Tensorflow (or Theano);
  • Keras;
  • A reasonable GPU to speed up training. Not necessary but highly recommended.

Background: Word Embeddings Encode Semantic Relations
For those familiar with principal component analysis (PCA), word embeddings will make intuitive sense. Using word embeddings means creating an n-dimensional vector space where each word is associated with its own vector. Each dimension in this vector space, then, serves a role similar to a ‘component’ from PCA: words that load similarly on it share some relationship. If this doesn’t make a whole lot of sense to you, don’t worry about it, read on below.

Building a word embedding matrix is done using a shallow neural network with a bottleneck in the middle. Data generation for this task is done by mining a large body of text for word-target pairs. For each word in the text, a series of word pairs are generated by sampling a random target word within a window. Let’s visualise how it works to make it a bit more clear.

Consider the sentence “The general lack of critical thinking ability helps fake news spread like cancer“. Generating word-pairs is usually done with a window function, meaning for each word in the body of text one or several target words are selected from the surrounding words (let’s say 5). Generating word pairs could then look something like this:
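
To make this concrete, here’s a rough Python sketch of the idea. For brevity it uses a window of 2 on each side, pairs every word with every other word inside the window instead of randomly sampling targets, and does no stop-word filtering; all of these are choices the creator of the embedding is free to make:

#rough sketch of skip-gram style word-pair generation (illustrative only)
sentence = "the general lack of critical thinking ability helps fake news spread like cancer"
words = sentence.split()

window = 2 #number of words to consider on each side of the input word
word_pairs = []
for i, word in enumerate(words):
    #pair the current word with every target word inside the window
    for j in range(max(0, i - window), min(len(words), i + window + 1)):
        if j != i:
            word_pairs.append((word, words[j]))

print(word_pairs[:5])
#[('the', 'general'), ('the', 'lack'), ('general', 'the'), ('general', 'lack'), ('general', 'of')]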

Words like “the”, “of”, “and” are essentially meaningless words in this context and can be excluded. Whether this is done is up to the creator of the embedding matrix. The job of the network is to associate input words with target words. What the model essentially will learn to predict are the probabilities that certain words co-occur nearby each other. Words that co-occur do so because they share some relationship, which is exactly what we’re interested in because this carries semantic meaning.


Background: The Making of an Embedding Matrix
So how do we go from word pairs to an embedding matrix? Easy, actually! We need to make two decisions: how many words are in our dictionary, and how many dimensions the embedding vectors will have. We could then train a shallow neural network with these properties:

To train the network we create a keyed dictionary. The input is a one-hot vector the same size as the dictionary, with all zeros except for the index location of the input word in the dictionary. For example if you encode “Aardvark”, which happens to be the second word in this example dictionary, the input vector looks like [0, 1, 0, .., 0]. The output layer has a shape equal to the input vector (size of dictionary) and needs to have a softmax activation function. Softmax scales the entire output vector so that the total sum is 1. We need this because here the output classes are mutually exclusive (each time we have one input word and one target word). Using softmax means the network outputs probabilities for each word given an input. The network also has a single hidden layer of the size we want the embedding layer dimensionality to be. It is trained on the word pairs.

Once training is complete the output layer is removed. We’re not interested in predicting a target word from an input word at all. We’re interested in the weights of the hidden layer, because from these we can make an embedding matrix.
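
As a minimal, purely illustrative Keras sketch of such a shallow network (assuming a 10.000-word dictionary and 300-dimensional embeddings; sizes and layer names are my own choices here): after fitting on the one-hot encoded word pairs, the hidden layer’s weight matrix is all we keep.

#minimal sketch of the shallow embedding network (illustrative, not tuned)
from keras.models import Sequential
from keras.layers import Dense

DICT_SIZE = 10000 #size of the dictionary (assumption for this sketch)
EMBED_DIM = 300   #dimensionality of the embedding vectors

skipgram = Sequential()
#hidden 'bottleneck' layer: its weight matrix becomes the embedding matrix
skipgram.add(Dense(EMBED_DIM, input_shape=(DICT_SIZE,), use_bias=False, name='embedding'))
#softmax output: probability of each dictionary word being the target word
skipgram.add(Dense(DICT_SIZE, activation='softmax'))
skipgram.compile(loss='categorical_crossentropy', optimizer='adam')

#skipgram.fit(...) would then be run on the one-hot encoded word pairs

#once trained, drop the output layer and keep the hidden weights:
embedding_matrix = skipgram.get_layer('embedding').get_weights()[0] #shape: (DICT_SIZE, EMBED_DIM)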

Doing something simple in an incredibly convoluted and dumb way
After training we have a network that encodes the individual word vectors. Each time we present the neural net with an input vector encoding a word, the corresponding weights in the hidden layer are selected (the word’s feature vector).

There’s a catch though. For those with some background in neural nets: because the input vector is all zeros except for one position, we waste tremendous resources. In every hidden neuron, almost every weight is multiplied by zero; only the single weight corresponding to the input word’s position in the dictionary contributes anything. Why is this dumb? Because we’re just emulating a look-up table in an overly complicated manner! By multiplying with all zeros except for the input word entry, the network is effectively selecting one weight per neuron (multiplied by 1) and zeroing out the rest. A hidden layer with 300 units and a dictionary of 10.000 words entails 3.000.000 multiplications, while we only use the output of 0.1% of them (300 weights, one per neuron).

We can do better, and linguistic scientists certainly have. To make this process less idiotic, the hidden layer weights are extracted and encoded into a look-up table of shape (size_of_dictionary, n_dimensions). This way, each word in the dictionary has an associated n-dimensional vector attached. To get a word vector, we only need to find its index in the dictionary, which is computationally a very cheap operation.
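
In code, that look-up is nothing more than indexing into the matrix (the word_index dict here is a hypothetical mapping of words to their dictionary positions):

#getting a word vector is now a cheap look-up instead of millions of multiplications
word_index = {'the': 0, 'aardvark': 1} #illustrative dictionary
word_vector = embedding_matrix[word_index['aardvark']] #the n-dimensional vector for 'aardvark'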

Each vector describes that word’s position and orientation in n-dimensional space, which encodes the information on word relations into the spatial relations between the vectors. Consider this 3-dimensional example:

The above figure shows a hypothetical visualisation of a 3-dimensional embedding where Cats and Dogs share a dimension (e.g. pets), lions and wolves share a dimension (e.g. wild animals), but there is also a relationship among them. Here you can visually see how an algorithm would solve the problem “Cats are to Lions, as Dogs are to ____”. It needs to find a vector with a similar length and orientation going from Dog to the target word! In mathematical terms, it needs to find the vector pair with the highest cosine similarity.
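
As a quick numpy sketch of that reasoning step (the vectors are made up purely to illustrate the arithmetic, they are not real embeddings):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

#made-up 3-dimensional 'embeddings', purely illustrative
vectors = {'cat':  np.array([0.9, 0.1, 0.2]),
           'lion': np.array([0.1, 0.9, 0.3]),
           'dog':  np.array([0.8, 0.1, 0.7]),
           'wolf': np.array([0.1, 0.8, 0.8])}

#"Cats are to Lions, as Dogs are to ____":
#take the cat->lion vector, apply it to dog, and find the closest word
target = vectors['dog'] + (vectors['lion'] - vectors['cat'])
similarities = {w: cosine_similarity(target, v) for w, v in vectors.items() if w != 'dog'}
print(max(similarities, key=similarities.get)) #'wolf' with these made-up vectors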

So what good does it do the fake news detector?
The word embedding process thus produces a numerically efficient way of representing semantic relations. That is great: since they’re numbers, we can do all sorts of cool math with them, like the reasoning task mentioned above. But perhaps the most important feature is that they encode a lot of linguistic knowledge.

Incorporating such an embedding matrix in a deep learning architecture for other tasks is a form of transfer learning. Transfer learning means that instead of starting from scratch when training the network, we inject it with knowledge learned from another, related task. This will help our fake news detector. Instead of presenting it with plain text and having it learn everything from scratch, presenting it with each word’s embedding vector offers it a much more information-rich diet. In practice this translates to much lower time and data requirements to fit the network properly. In essence, the network can now focus solely on learning to discriminate fake from real news, without having to first learn what the hell language is.


Getting the Data
First we need data. In this tutorial we’ll use the ‘train.csv’ dataset from here. Download it and extract it. The dataset is nicely balanced, with 10.387 real news articles, and 10.413 biased/fake articles. Now let’s write a function to load up the data.

#LOAD THE DATA
import pandas as pd
import numpy as np
import random

def load_kagglefakenews():
    #load training data and put into arrays
    df = pd.read_csv('Kaggle_FakeNews/train.csv', encoding='utf8') # be sure to point to wherever you put your file
    train_data = df['text'].values.tolist() #'text' column contains articles
    train_labels = df['label'].values.tolist() #'label' column contains labels

    #Randomly shuffle data and labels together
    combo = list(zip(train_data, train_labels))
    random.shuffle(combo)
    train_data, train_labels = zip(*combo)
    del df #clear up memory

    return np.asarray(train_data).tolist(), np.asarray(train_labels).tolist()

Call the function to load the data:

train_data, train_labels = load_kagglefakenews()

After loading the data we need to tokenize it, meaning we’ll split the text into separate words and remove punctuation and other unwanted characters. We then assign each word a unique numerical identifier that corresponds to the word’s position in our dictionary.

We also set a few constants with values that will help construct the training sets and embedding layer, then tokenize our loaded data:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from keras.utils import to_categorical
import pickle

MAX_NB_WORDS=50000 #dictionary size
MAX_SEQUENCE_LENGTH=1500 #max word length of each individual article
EMBEDDING_DIM=300 #dimensionality of the embedding vector (50, 100, 200, 300)
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~')

def tokenize_trainingdata(texts, labels):
    tokenizer.fit_on_texts(texts)
    pickle.dump(tokenizer, open('Models/tokenizer.p', 'wb'))

    sequences = tokenizer.texts_to_sequences(texts)

    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))

    data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

    labels = to_categorical(labels, num_classes=len(set(labels)))

    return data, labels, word_index

#and run it
X, Y, word_index = tokenize_trainingdata(train_data, train_labels)    

And split the data into training, validation and test sets:

#split the data (90% train, 5% test, 5% validation)
train_data = X[:int(len(X)*0.9)]
train_labels = Y[:int(len(X)*0.9)]
test_data = X[int(len(X)*0.9):int(len(X)*0.95)]
test_labels = Y[int(len(X)*0.9):int(len(X)*0.95)]
valid_data = X[int(len(X)*0.95):]
valid_labels = Y[int(len(X)*0.95):]
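
A quick, optional sanity check on the split never hurts: the shapes should add up, and the class balance in the training portion should still be roughly 50/50 after shuffling.

#sanity check: shapes of the splits and class balance in the training portion
print(train_data.shape, test_data.shape, valid_data.shape)
print('fraction fake/biased in training set: %.3f' %train_labels[:, 1].mean())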

Labeling is done so that 0 = real news and 1 = fake or biased news.


Getting the Embeddings
We’ll be using a 300-dimensional embedding matrix, trained on a full Wikipedia dump from 2014 plus the Gigaword corpus (a large collection of news article text). It has a vocabulary of 400.000 words, so it’s also a lot smarter than I am. Grab the dataset from the Stanford GloVe project page: http://nlp.stanford.edu/data/glove.6B.zip (heads up: 822MB).

The official Keras blog has a nice article about loading up word embeddings and using them, so let’s adapt the code to our needs. Extract the downloaded ZIP file, and consider the following block of code to build our embedding layer:

def load_embeddings(word_index, embeddingsfile='wordEmbeddings/glove.6B.%id.txt' %EMBEDDING_DIM):
    embeddings_index = {}
    f = open(embeddingsfile, 'r', encoding='utf8')
    for line in f:
        #here we parse the data from the file
        values = line.split(' ') #split the line by spaces
        word = values[0] #each line starts with the word
        coefs = np.asarray(values[1:], dtype='float32') #the rest of the line is the vector
        embeddings_index[word] = coefs #put into embedding dictionary
    f.close()

    print('Found %s word vectors.' % len(embeddings_index))

    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    
    embedding_layer = Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=False)
    return embedding_layer
    
#and build the embedding layer
embedding_layer = load_embeddings(word_index)

With all that squared away we can go on to train a model!


Baseline Performance
Let’s set up a simple convnet architecture as a baseline model. It will train fast and show us the performance a simple model can achieve. Remember that deep learning is a very empirical discipline: prototype quickly and refine along the way.

Let’s perform some more imports and set the model architecture:

from keras import Sequential, Model, Input
from keras.layers import (Conv1D, MaxPooling1D, AveragePooling1D, Flatten, Dense, GlobalAveragePooling1D, Dropout,
                          LSTM, CuDNNLSTM, RNN, SimpleRNN, Conv2D, GlobalMaxPooling1D)
from keras import callbacks

def baseline_model(sequence_input, embedded_sequences, classes=2):
    x = Conv1D(64, 5, activation='relu')(embedded_sequences)
    x = MaxPooling1D(5)(x)
    x = Conv1D(128, 3, activation='relu')(x)
    x = MaxPooling1D(5)(x)
    x = Conv1D(256, 2, activation='relu')(x)
    x = GlobalAveragePooling1D()(x)
    x = Dense(2048, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(512, activation='relu')(x)
    x = Dropout(0.5)(x)
    preds = Dense(classes, activation='softmax')(x)

    model = Model(sequence_input, preds)
    return model

Now let’s train it on the loaded Kaggle set and monitor for signs of overfitting. The telltale sign is that validation loss will start increasing steadily, and often validation accuracy will decrease. We will use ‘early stopping’ when we observe this. Early stopping simply means aborting training before the set number of epochs is reached.
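
Keras can also handle this automatically through the callbacks module we imported above. A minimal sketch (the patience value is an arbitrary choice):

#optional: stop training automatically once validation loss stops improving
early_stopping = callbacks.EarlyStopping(monitor='val_loss', patience=2)

#pass it to model.fit(..., callbacks=[early_stopping]) to enable it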

#put embedding layer into input of the model
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

model = baseline_model(sequence_input, embedded_sequences, classes=2)

model.compile(loss='categorical_crossentropy',
              optimizer='adamax',
              metrics=['acc'])

print(model.summary())

model.fit(train_data, train_labels,
          validation_data=(valid_data, valid_labels),
          epochs=25, batch_size=64)

For me the model converged very quickly (thanks word embeddings!), and overfitting started occurring after a few epochs:

Train on 18720 samples, validate on 1040 samples
Epoch 1/25
18720/18720 [==============================] - 8s 420us/step - loss: 0.2927 - acc: 0.8676 - val_loss: 0.1233 - val_acc: 0.9452
Epoch 2/25
18720/18720 [==============================] - 7s 390us/step - loss: 0.0876 - acc: 0.9679 - val_loss: 0.0902 - val_acc: 0.9644
Epoch 3/25
18720/18720 [==============================] - 7s 392us/step - loss: 0.0462 - acc: 0.9850 - val_loss: 0.0875 - val_acc: 0.9692
Epoch 4/25
18720/18720 [==============================] - 7s 392us/step - loss: 0.0230 - acc: 0.9923 - val_loss: 0.0913 - val_acc: 0.9692
Epoch 5/25
18720/18720 [==============================] - 7s 392us/step - loss: 0.0157 - acc: 0.9950 - val_loss: 0.1004 - val_acc: 0.9712

Great! In just a few epochs the model reaches very impressive performance statistics. We end up choosing the model from epoch 3 because afterwards, validation loss increases steadily. Time to test it on the held out set:

model.evaluate(test_data, test_labels)

1040/1040 [==============================] - 0s 228us/step
[0.11665208717593206, 0.9634615384615385]

Looking good! The test results don’t deviate much from the validation statistics, which indicates good performance. Now on to see what it thinks of a few real-world examples. Luckily there’s a site that maintains a database of sites known for fake news content: http://opensources.co.

Now let’s test the model out on an obviously fake article that came out a day or so ago. I’ve simply taken the first front page article (at the time of writing this) from the first site mentioned on opensources.co. First let’s write a function to tokenize the text of any article:

def tokenize_text(text):
    sequences = tokenizer.texts_to_sequences(text)
    data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

    return data

Then paste the text into a string and tokenize it (be careful not to paste any ads in as well):

text = """ """ #put the article text here

#tokenize
tok = tokenize_text([faketext])

#ask the model that it thinks
model.predict(tok)

#out comes
array([[0.04552918, 0.9544708 ]], dtype=float32)

Remember that 0=real, 1=fake. This indicates that the model gives it a 95.45% chance of being fake (or biased) news. That is correct, dear model!


A less obvious example
Sweden has been a recent target of a stream of fake news in Europe. According to some less-than-reputable sources, it’s on the brink of civil war, overrun by migrants, and contains the rape-capital of the world. While Sweden has problems, like any other country does, what is presented in biased articles is often a gross exaggeration of the existing situation or factually incorrect. Let’s see if the model can spot some things!

We’ll take a look at this Daily Telegraph article and see what the model thinks of it:

array([[1.7819532e-06, 9.9999821e-01]], dtype=float32)

A 99.99% chance of being fake or biased. That certainly is interesting! Going through the article there is some loaded language and hyperbole, but without detailed knowledge of the situation it is difficult to spot what’s out of place. However, keep in mind that:

  • The Telegraph receives lots of money (£900.000 a year) from Russia in exchange for reproducing content from the Russian state-controlled newspaper Rossiyskaya Gazeta.
  • Its chief political commentator resigned in 2015 in protest at the paper publishing articles heavily influenced by the interests and wishes of its large advertisers.
  • It has the highest number of upheld complaints regarding factual inaccuracies of all UK newspapers.
  • It has been implicated in regularly reproducing Chinese communist party propaganda as part of a payment deal with China that brings in £800.000 a year.

 

See the Wikipedia section here for more info and references. I’ll leave it up to you to decide whether you think it’s a trustworthy source. I don’t think it is, and frankly I found it impressive that the model seems to pick this up as well.

The Wikipedia section linked, by the way, is marked real by the model with 99.27% certainty:

wikitext = """Accusation of news coverage influence by advertisers
In July 2014, the Daily Telegraph was criticised for carrying links on its website to pro-Kremlin articles supplied by a Russian state-funded publication that downplayed any Russian involvement in the downing of the passenger jet Malaysia Airlines Flight 17.[58] These had featured on its website as part of a commercial deal, but were later removed.[59] The paper is paid £900,000 a year to include the supplement Russia Beyond the Headlines, a publication sponsored by the Rossiyskaya Gazeta, the Russian government's official newspaper. It is paid a further £750,000 a year for a similar arrangement with the Chinese state in relation to the pro-Beijing China Watch advertising supplement.[60][61]

In February 2015 the chief political commentator of the Daily Telegraph, Peter Oborne resigned. Oborne accused the paper of a "form of fraud on its readers"[11] for its coverage of the bank HSBC in relation to a Swiss tax-dodging scandal that was widely covered by other news media. He alleged that editorial decisions about news content had been heavily influenced by the advertising arm of the newspaper because of commercial interests.[12] Professor Jay Rosen at New York University stated that Oborne's resignation statement was "one of the most important things a journalist has written about journalism lately".[12]

Oborne cited other instances of advertising strategy influencing the content of articles, linking the refusal to take an editorial stance on the repression of democratic demonstrations in Hong Kong to the Telegraph's support from China. Additionally, he said that favourable reviews of the Cunard cruise liner Queen Mary II appeared in the Telegraph, noting: "On 10 May last year The Telegraph ran a long feature on Cunard's Queen Mary II liner on the news review page. This episode looked to many like a plug for an advertiser on a page normally dedicated to serious news analysis. I again checked and certainly Telegraph competitors did not view Cunard's liner as a major news story. Cunard is an important Telegraph advertiser."[11] In response, the Telegraph called Oborne's statement an "astonishing and unfounded attack, full of inaccuracy and innuendo".[12]

In January 2017, the Telegraph Media Group had a higher number of upheld complaints than any other UK newspaper by its regulator IPSO.[62] Most of these findings pertained to inaccuracy, as with other UK newspapers.[63]

In October 2017, a number of major western news organisations whose coverage has irked Beijing were excluded from Xi Jinping's speech event launching new politburo. However, the Daily Telegraph, which regularly publishes Communist party propaganda in the UK in an advert section as part of a reported £800,000 annual contract with Beijing’s China Daily, has been granted an invitation to the event.[64]
"""

tok = tokenize_text([wikitext])

model.predict(tok)

array([[0.99272037, 0.0072796 ]], dtype=float32)

Now let’s look at reporting on the same topic (hand grenade possessions and misuse by migrants in Sweden) as reported by a reputable source: the BBC, the article here. Pasting the text and running the model, my version gives me:

array([[0.99580574, 0.0041943 ]], dtype=float32)

The model would classify this article as real with a 99.58% certainty.

Let’s do one last random article, which happened to be on the CNN front page when I was writing this. It is rated real with 89.01% certainty.


Improving Performance
Before wrapping up, let’s take a look at a somewhat more complicated architecture using ‘smart’ neurons. LSTM layers implement a learnable forgetting mechanism that allows them to spot relationships between different elements in a series quite efficiently. They are widely used in time-series analysis, but can of course also be applied to language, since language too exhibits strong relationships between its individual elements.

Consider this architecture including LSTM layers:

def LSTM_model(sequence_input, embedded_sequences, classes=2):
    x = CuDNNLSTM(32,
                  return_sequences=True)(embedded_sequences)
    x = CuDNNLSTM(64,
                  return_sequences=True)(x)
    x = CuDNNLSTM(128)(x)
    x = Dense(4096,
              activation='relu')(x)
    x = Dense(1024,
              activation='relu')(x)
    preds = Dense(classes,
              activation='softmax')(x)

    model = Model(sequence_input, preds)
    return model
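
Compiling and training it goes the same way as with the baseline model. A sketch (the hyperparameters here simply mirror the baseline run, not necessarily what was used for the combined datasets below):

#build and train the LSTM model, analogous to the baseline
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

model = LSTM_model(sequence_input, embedded_sequences, classes=2)

model.compile(loss='categorical_crossentropy',
              optimizer='adamax',
              metrics=['acc'])

model.fit(train_data, train_labels,
          validation_data=(valid_data, valid_labels),
          epochs=25, batch_size=64)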

I’ve trained the model on a synthesis of five large datasets:

In the end it reached an aggregate test accuracy of about 95% (the baseline model got stuck at 90%). Be cautious when combining these datasets: the labels are reversed for some.


Concluding – What does this all mean, really?
We’ve trained two models on a single dataset, and then expanded the LSTM model training to a synthesis of five available datasets to make it more versatile. Despite the impressive accuracies of the models in this tutorial, they do come with some important disclaimers: the datasets the models have been trained on contain mostly articles of a specific type, namely politically themed pieces reporting on societal issues. Deep learning models are pretty bad at doing tasks outside what they’ve been trained for. As such, it is difficult for the models to generalise to news articles that fall outside the categories they’re trained on. Take for example a random BBC article about cats: https://www.bbc.com/news/world-asia-45347136

The baseline model classifies it as fake with 65.23% certainty, most likely because of the hyperbole and strongly worded segments throughout the article. Here they serve as a writing device that makes the article funny to read, but many fake news writers employ the same devices to stir up emotion and make grand claims. The more complex LSTM was trained on a more diverse dataset and didn’t make the same mistake, even though the narrowly trained baseline got a higher test accuracy. This highlights the importance of being cautious when deploying these types of systems in the real world: even a high model accuracy doesn’t necessarily translate into high real-world performance if the training data differ from the data encountered in practice. Performance cannot be guaranteed on just any text the fake news detector is presented with; it may be compromised if writing styles change, or if the detector has to judge a topic it is unfamiliar with.

A word of caution
Is a fake news detector with about 95% accuracy usable? After all, on the test set it will get about one in every 20 articles wrong. I think we would need to be very cautious when implementing it. One possibility is that false negatives (fake articles labeled real) can be misused by politicians to further their agenda (see! This fake article is not fake after all!). Do we really want that? Our model certainly hasn’t reached 100% accuracy on a diverse dataset, so this remains a very real danger.

Another serious issue is that publicly available fake news detectors open up the possibility of false negative mining. Imagine you are writing an intentionally fake or biased news article and have a fake news detector available. What do you do? Bingo! You just keep re-writing and adjusting the wording of your article until it fools the detector. Now the detector has become useless. This underscores another important point: you cannot just implement a fake news detector on a website and be done with it. More harm than good will likely come from it that way. You also need to be aware of how people will abuse it (no, not might: people will abuse it), and need to implement some form of protection against this and other kinds of abuse.

But, perhaps the biggest hidden danger lies in the power vested in whomever hosts a robust fake news detector. If it works well enough and gains mainstream adoption, and if the wrong people gain control of it, there will be a tremendous opportunity to expand the spread of disinformation exponentially. If the general population trusts the machine and someone nefarious sabotages it, will we notice? Probably not for at least a while. The biggest and best fake news detector is still your brain. Remember to check your sources, be critical of what you read, and be vigilant. This book is an excellent resource on learning to be better at critical thinking.

In the end, who is to say what is fake and what is real? The LSTM model certainly isn’t perfect, especially when you show it a text far outside of its expertise: it rated this blog post FAKE NEWS with a certainty of 99.91%.

But don’t tell anyone, ok?

 

