Sequential Sentence Classification in Medical Abstracts
PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts.
- Overview:
- Dataset Used:
- Prepare Data For Deep Neural Network Models
- Create text vectorizer
- Create custom text embedding
- Model 1: Conv1D with token embedding
- Model 2: Feature extraction with pretrained token embedding (USE)
- Model 3: Conv1D with character embedding
- Model 4: Combining pretrained token embeddings + character embeddings (hybrid embedding layer)
- Model 5: Transfer Learning with pretrained token embeddings + character embeddings + positional embeddings
- Evaluate model on test dataset
- Future Work
Overview:
- Classify the sentences of a randomized clinical trial (RCT) abstract into subclasses so the abstract is easier to read and understand.
- In other words, convert a medical abstract into chunks of sentences labelled with particular classes like "Background", "Methods", "Results" and "Conclusion".
- It's a many-to-one text classification problem, where we assign each sequence (sentence) to one particular class.
Dataset Used:
PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts
- PubMed 20k is a subset of PubMed 200k, i.e., any abstract present in PubMed 20k is also present in PubMed 200k.
- PubMed_200k_RCT is the same as PubMed_200k_RCT_numbers_replaced_with_at_sign, except that in the latter all numbers have been replaced by @ (the same holds for PubMed_20k_RCT vs. PubMed_20k_RCT_numbers_replaced_with_at_sign).
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct.git
!ls pubmed-rct
!ls pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign
- train.txt - training samples.
- dev.txt - dev is short for development set, which is another name for validation set (in our case, we'll be using and referring to this file as our validation set).
- test.txt - test samples.
data_dir = "/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"
import os
filenames = [data_dir +filename for filename in os.listdir(data_dir)]
filenames
def get_lines(filename):
    """
    Reads filename (a text file) and returns its lines as a list of strings.
    """
    with open(filename, "r") as f:
        return f.readlines()
train_lines = get_lines('/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt')
train_lines[:10]
Return all of the lines in the target text file as a list of dictionaries containing the key/value pairs:
- "line_number" - the position of the line in the abstract (e.g. 3).
- "target" - the role of the line in the abstract (e.g. OBJECTIVE).
- "text" - the text of the line in the abstract.
- "total_lines" - the total lines in an abstract sample (e.g. 14).
Example returned preprocessed sample (a single line from an abstract):
[{'line_number': 0,
'target': 'OBJECTIVE',
'text': 'to investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( oa ) .',
'total_lines': 11},
...]
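To make the target format concrete, here is a minimal sketch that parses a tiny in-memory abstract instead of a file. The abstract ID, the two sentences, and the helper name `parse_lines` are hypothetical and only assume the layout described above (an ID line starting with `###`, then `LABEL<TAB>sentence` lines, then a blank line).

```python
# Hypothetical two-sentence abstract in the same layout as train.txt.
raw_lines = [
    "###12345678\n",
    "OBJECTIVE\tTo test the parser .\n",
    "METHODS\tWe parsed @ lines .\n",
    "\n",
]

def parse_lines(lines):
    """Turn raw dataset lines into one dictionary per abstract sentence."""
    samples, current = [], []
    for line in lines:
        if line.startswith("###"):   # ID line: start collecting a new abstract
            current = []
        elif line.isspace():         # blank line: the abstract is complete
            for i, entry in enumerate(current):
                target, text = entry.split("\t")
                samples.append({"line_number": i,
                                "target": target,
                                "text": text.strip().lower(),
                                "total_lines": len(current) - 1})
        else:                        # a labelled sentence of the current abstract
            current.append(line)
    return samples

toy_samples = parse_lines(raw_lines)
print(toy_samples)
```

Note that "total_lines" is counted from zero here, in the same way as "line_number".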
def preprocess_text_with_line_numbers(filename):
    """
    Returns a list of dictionaries, one per abstract line, with the keys
    "target", "text", "line_number" and "total_lines".
    """
    input_lines = get_lines(filename) # get all lines from filename
    abstract_lines = "" # create an empty abstract
    abstract_samples = [] # create an empty list of abstracts
    for line in input_lines:
        if line.startswith("###"): # check to see if line is an ID line
            abstract_id = line
            abstract_lines = "" # reset the abstract string
        elif line.isspace(): # a blank line marks the end of an abstract
            abstract_line_split = abstract_lines.splitlines() # split the abstract into separate lines
            for abstract_line_number, abstract_line in enumerate(abstract_line_split):
                line_data = {}
                target_text_split = abstract_line.split("\t") # split the label from the text
                line_data["target"] = target_text_split[0]
                line_data["text"] = target_text_split[1].lower()
                line_data["line_number"] = abstract_line_number
                line_data["total_lines"] = len(abstract_line_split) - 1 # counted from 0, like line_number
                abstract_samples.append(line_data)
        else: # the line is a labelled sentence, add it to the current abstract
            abstract_lines += line
    return abstract_samples
%%time
train_samples = preprocess_text_with_line_numbers(data_dir + "train.txt")
val_samples = preprocess_text_with_line_numbers(data_dir + "dev.txt") # dev is another name for validation set
test_samples = preprocess_text_with_line_numbers(data_dir + "test.txt")
len(train_samples), len(val_samples), len(test_samples)
Since we are still experimenting, some text preprocessing steps (such as URL and special-character removal) have been left out for now. We'll add them in the future and see how the accuracy changes.
train_samples[:10]
import pandas as pd
train_df = pd.DataFrame(train_samples)
val_df = pd.DataFrame(val_samples)
test_df = pd.DataFrame(test_samples)
train_df.head(14)
train_df["target"].value_counts()
train_df["target"].value_counts().plot(kind = 'bar')
train_df.total_lines.plot(kind= "hist")
train_sentences = train_df["text"].tolist()
val_sentences = val_df["text"].tolist()
test_sentences = test_df["text"].tolist()
len(train_sentences), len(val_sentences), len(test_sentences)
train_sentences[:10]
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse_output=False) # scikit-learn >= 1.2; older versions use sparse=False
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1, 1))
val_labels_one_hot = one_hot_encoder.transform(val_df["target"].to_numpy().reshape(-1, 1))
test_labels_one_hot = one_hot_encoder.transform(test_df["target"].to_numpy().reshape(-1, 1))
# Check what training labels look like
train_labels_one_hot
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_df["target"].to_numpy())
val_labels_encoded = label_encoder.transform(val_df["target"].to_numpy())
test_labels_encoded = label_encoder.transform(test_df["target"].to_numpy())
# Check what training labels look like
train_labels_encoded
num_classes = len(label_encoder.classes_)
class_names = label_encoder.classes_
num_classes, class_names
Our first model will be a TF-IDF Multinomial Naive Bayes, as recommended by Scikit-Learn's machine learning map.
We'll create a Scikit-Learn Pipeline which uses the TfidfVectorizer class to convert our abstract sentences to numbers using the TF-IDF (term frequency-inverse document frequency) algorithm and then learns to classify our sentences using the MultinomialNB algorithm.
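As a rough illustration of what TF-IDF weighting does, the sketch below computes the weights by hand for a hypothetical two-sentence corpus. It assumes the smoothed IDF formula and L2 normalisation that TfidfVectorizer applies by default, and it splits on whitespace only, which is a simplification of the real tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the dataset).
docs = ["the drug reduced pain", "the placebo reduced nothing"]

def tfidf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

toy_vectors = tfidf(docs)
print(toy_vectors[0])
```

Terms that appear in every document ("the", "reduced") end up with a lower weight than terms specific to one document ("drug", "pain"), which is the signal the Naive Bayes classifier then learns from.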
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Create a pipeline
model_0 = Pipeline([
("tf-idf", TfidfVectorizer()),
("clf", MultinomialNB())
])
# Fit the pipeline to the training data
model_0.fit(X=train_sentences,
y=train_labels_encoded);
# Evaluate baseline on validation dataset
model_0.score(X=val_sentences,
y=val_labels_encoded)
baseline_preds = model_0.predict(val_sentences)
baseline_preds
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
from helper_functions import calculate_results
# Calculate baseline results
baseline_results = calculate_results(y_true=val_labels_encoded,
y_pred=baseline_preds)
baseline_results
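The `calculate_results` helper comes from an external file, so its exact behaviour is not shown here. The sketch below is an assumption about what such a helper typically reports: accuracy as a percentage (the later division of the accuracy column by 100 suggests this) and an F1 score averaged over classes, weighted by class support. The function name `accuracy_and_weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter

def accuracy_and_weighted_f1(y_true, y_pred):
    """Accuracy in percent and support-weighted F1 for integer labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = 100 * sum(t == p for t, p in pairs) / len(pairs)
    support = Counter(y_true)  # number of true samples per class
    weighted_f1 = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weighted_f1 += f1 * n / len(pairs)  # weight by class frequency
    return {"accuracy": accuracy, "f1": weighted_f1}

toy_results = accuracy_and_weighted_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(toy_results)
```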
When our model goes through our sentences, it works best when they're all the same length (this is important for creating batches of the same size tensors)
- Finding the average sentence length in the Dataset.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
sen_len = [len(sentences.split()) for sentences in train_sentences]
avg_sen_len = np.mean(sen_len)
avg_sen_len
import matplotlib.pyplot as plt
plt.hist(sen_len, bins=21) # check the distribution of sequence lengths and see which lengths occur most often
Looks like the vast majority of sentences are between 0 and 50 tokens in length.
How long does a sentence need to be to cover the majority (95%) of the data? We can use NumPy's percentile to find the value which covers 95% of the sentence lengths.
np.percentile(sen_len,95)
max(sen_len) # max length sentence in training set
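The following plain-Python sketch shows the idea behind picking a fixed sequence length from a percentile and then padding or truncating every sentence to it. The `percentile` function uses linear interpolation, which is the default method of np.percentile; the helper names and toy values are hypothetical.

```python
def percentile(values, q):
    """q-th percentile with linear interpolation between neighbouring values."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])

def pad_or_truncate(tokens, length, pad=""):
    """Cut a token list to `length`, or fill it up with padding tokens."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))

toy_lengths = [5, 8, 12, 20, 55]          # hypothetical sentence lengths
toy_cutoff = percentile(toy_lengths, 95)  # length covering 95% of the toy data

short = pad_or_truncate(["a", "b"], 4)    # padded up to 4 tokens
long = pad_or_truncate(list("abcdef"), 4) # truncated down to 4 tokens
print(toy_cutoff, short, long)
```

Choosing the 95th percentile rather than the maximum keeps the tensors small: only the longest 5% of sentences lose tokens, while every other sentence is padded far less than it would be with the maximum length.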
Creating a text vectorizer layer
Create text vectorizer
Section 3.2 of the PubMed 200k RCT paper states the vocabulary size of the PubMed 20k dataset as 68,000, so we'll use that as our max_tokens parameter.
max_tokens = 68000
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
text_vectorizer = TextVectorization(max_tokens=max_tokens,standardize='lower_and_strip_punctuation',
output_sequence_length=55)
text_vectorizer.adapt(train_sentences)
import random
target_sentence = random.choice(train_sentences)
print(f"Text:\n{target_sentence}")
print(f"\nLength of text: {len(target_sentence.split())}")
print(f"\nVectorized text:\n{text_vectorizer([target_sentence])}")
rct_20k_text_vocab = text_vectorizer.get_vocabulary()
most_common = rct_20k_text_vocab[:5]
least_common = rct_20k_text_vocab[-5:]
print(f"Number of words in vocabulary: {len(rct_20k_text_vocab)}")
print(f"Most common words in the vocabulary: {most_common}")
print(f"Least common words in the vocabulary: {least_common}")
text_vectorizer.get_config()
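For intuition, the vectorization step can be approximated in plain Python. This is only a sketch of the behaviour, not the layer's implementation: it assumes lowercasing and punctuation stripping, index 0 reserved for padding, index 1 reserved for out-of-vocabulary tokens, and the remaining vocabulary ordered by frequency. The function names and toy sentences are hypothetical.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens):
    """Padding token, OOV token, then the most frequent words."""
    counts = Counter(tok for s in sentences for tok in standardize(s).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, output_len):
    """Map words to integer ids, then truncate or zero-pad to output_len."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:output_len]
    return ids + [0] * (output_len - len(ids))

toy_vocab = build_vocab(["the drug reduced pain .",
                         "the placebo did nothing ."], max_tokens=10)
toy_ids = vectorize("The drug cured pain !", toy_vocab, output_len=8)
print(toy_vocab)
print(toy_ids)
```

The unseen word "cured" maps to the out-of-vocabulary index 1, and the trailing zeros are padding up to the fixed output length.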
Create custom text embedding
To create a richer numerical representation of our text, we can use an embedding.
The input_dim parameter defines the size of our vocabulary, and the output_dim parameter defines the dimension of the embedding output.
Once created, our embedding layer will take the integer outputs of our text_vectorizer layer as inputs and convert them to feature vectors of size output_dim.
token_embed = layers.Embedding(input_dim=len(rct_20k_text_vocab),
output_dim= 128,
mask_zero=True,
input_length=55)
print(f"Sentence before Vectorization : \n{target_sentence}\n")
vec_sentence = text_vectorizer([target_sentence])
print(f"Sentence After vectorization :\n {vec_sentence}\n")
embed_sentence = token_embed(vec_sentence)
print(f"Embedding Sentence :\n{embed_sentence}\n")
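Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The sketch below illustrates that lookup with a hypothetical tiny table of random values; in the real layer the rows are trainable weights that are updated during training.

```python
import random

random.seed(0)
vocab_size, embed_dim = 6, 4  # hypothetical tiny sizes (input_dim, output_dim)

# One row of embed_dim numbers per token id.
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Replace every integer token id with its row of the table."""
    return [embedding_table[i] for i in token_ids]

embedded = embed([2, 3, 1, 0])  # a vectorized sentence of length 4
print(len(embedded), len(embedded[0]))
```

A sentence of 4 token ids becomes a 4 x 4 matrix here. With the layer above, a vectorized sentence of length 55 becomes a 55 x 128 matrix.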
train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels_one_hot))
valid_dataset = tf.data.Dataset.from_tensor_slices((val_sentences, val_labels_one_hot))
test_dataset = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels_one_hot))
len(train_dataset) , train_dataset
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
valid_dataset = valid_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
train_dataset
inputs = layers.Input(shape = (1,),dtype = tf.string)
text_vector = text_vectorizer(inputs)
embed = token_embed(text_vector)
x = layers.Conv1D(filters = 64, kernel_size= 5, padding="same",activation="relu",kernel_regularizer=tf.keras.regularizers.L2(0.01))(embed)
x = layers.GlobalMaxPool1D()(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(num_classes,activation="softmax")(x)
model = tf.keras.Model(inputs,outputs)
model.summary()
len(train_dataset)
model.compile(optimizer='Adam', loss="categorical_crossentropy", metrics=["accuracy"])
model_1_history = model.fit(train_dataset,
steps_per_epoch=int(0.1 * len(train_dataset)),
epochs = 10,
validation_data = valid_dataset,
validation_steps=int(0.1 * len(valid_dataset)),)
model.evaluate(valid_dataset)
model_1_pred_probs = model.predict(valid_dataset)
model_1_pred_probs
model_1_preds = tf.argmax(model_1_pred_probs, axis=1)
model_1_preds
model_1_results = calculate_results(y_true=val_labels_encoded,
y_pred=model_1_preds)
model_1_results
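Model 1 relies on a Conv1D layer followed by GlobalMaxPool1D. As a rough sketch of what these two layers compute, the code below slides a single filter over a sequence of embedding vectors with zero "same" padding, applies ReLU, and keeps the maximum activation. The sequence and kernel values are hypothetical, the bias term is omitted, and the sketch assumes an odd kernel size.

```python
def conv1d_same(sequence, kernel):
    """Slide one filter over a sequence of vectors (zero 'same' padding, ReLU)."""
    k = len(kernel)        # kernel size (assumed odd)
    dim = len(sequence[0]) # embedding dimension
    pad = k // 2
    padded = [[0.0] * dim] * pad + sequence + [[0.0] * dim] * pad
    outputs = []
    for t in range(len(sequence)):
        window = padded[t:t + k]
        total = sum(w * x
                    for w_vec, x_vec in zip(kernel, window)
                    for w, x in zip(w_vec, x_vec))
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs

toy_sequence = [[1, 0], [0, 1], [1, 1]]  # 3 tokens, embedding size 2
toy_kernel = [[1, 0], [0, 1], [1, 0]]    # kernel_size 3, one filter

feature_map = conv1d_same(toy_sequence, toy_kernel)  # one value per position
pooled = max(feature_map)                            # global max pooling
print(feature_map, pooled)
```

With 64 filters, as in the model, each sentence would produce 64 such pooled values, which the final Dense layer turns into class probabilities.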
Model 2: Feature extraction with pretrained token embedding (USE)
Here we use the Universal Sentence Encoder (USE) from TensorFlow Hub.
We're moving towards replicating the model architecture in Neural Networks for Joint Sentence Classification in Medical Paper Abstracts, which mentions using a pretrained GloVe embedding to initialise the token embeddings. In this notebook we use the pretrained USE embeddings instead.
The model structure will look like:
Inputs (string) -> Pretrained embeddings from TensorFlow Hub (Universal Sentence Encoder) -> Layers -> Output (prediction probabilities)
import tensorflow_hub as hub
tf_hub_embedding_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
trainable=False,
name="universal_sentence_encoder")
Beautiful, now that our pretrained USE is downloaded and instantiated as a hub.KerasLayer instance, let's test it out on a random sentence.
random_training_sentence = random.choice(train_sentences)
print(f"Random training sentence:\n{random_training_sentence}\n")
use_embedded_sentence = tf_hub_embedding_layer([random_training_sentence])
print(f"Sentence after embedding:\n{use_embedded_sentence[0][:30]} (truncated output)...\n")
print(f"Length of sentence embedding:\n{len(use_embedded_sentence[0])}")
inputs = layers.Input(shape=[], dtype=tf.string)
pretrained_embedding = tf_hub_embedding_layer(inputs) # tokenize text and create embedding
x = layers.Dense(128, activation="relu")(pretrained_embedding) # add a fully connected layer on top of the embedding
# x = layers.Dropout(0.2)(x)
outputs = layers.Dense(5, activation="softmax",kernel_regularizer=None)(x) # create the output layer
model_2 = tf.keras.Model(inputs=inputs,
outputs=outputs)
# Compile the model
model_2.compile(loss="categorical_crossentropy",
optimizer=tf.keras.optimizers.Adam(),
metrics=["accuracy"])
model_2.summary()
model_2_history = model_2.fit(train_dataset,
steps_per_epoch=int(0.1 * len(train_dataset)),
epochs = 10,
validation_data = valid_dataset,
validation_steps=int(0.1 * len(valid_dataset)))
model_2.evaluate(valid_dataset)
model_2_pred_probs = model_2.predict(valid_dataset)
model_2_pred_probs
model_2_preds = tf.argmax(model_2_pred_probs, axis=1)
model_2_preds
model_2_results = calculate_results(y_true=val_labels_encoded,
y_pred=model_2_preds)
model_2_results
Creating a character-level tokenizer
The Neural Networks for Joint Sentence Classification in Medical Paper Abstracts paper mentions their model uses a hybrid of token and character embeddings.
The difference between a character and token embedding is that the character embedding is created using sequences split into characters (e.g. hello -> [h, e, l, l, o]) whereas a token embedding is created on sequences split into tokens.
Token-level embeddings split sequences into tokens (words) and embed each of them; character embeddings split sequences into characters and create a feature vector for each.
Before we can vectorize our sequences on a character level we'll need to split them into characters. Let's write a function to do so:
" ".join(list(train_sentences[0]))
def split_chars(text):
    return " ".join(list(text))
split_chars(random_training_sentence)
train_chars = [split_chars(sentence) for sentence in train_sentences]
val_chars = [split_chars(sentence) for sentence in val_sentences]
test_chars = [split_chars(sentence) for sentence in test_sentences]
print(train_chars[0])
train_chars[:5]
char_lens = [len(sentence) for sentence in train_sentences]
avg_char_lens = sum(char_lens)/len(char_lens)
avg_char_lens
import matplotlib.pyplot as plt
plt.hist(char_lens,bins =25)
Okay, looks like most of our sequences are between 0 and 200 characters long.
Let's use NumPy's percentile to figure out what length covers 95% of our sequences
output_seq_char_len = int(np.percentile(char_lens, 95))
output_seq_char_len
random.choice(train_sentences)
import string
alphabet = string.ascii_lowercase + string.digits + string.punctuation
alphabet
NUM_CHAR_TOKENS = len(alphabet) + 2 # num characters in alphabet + space + OOV token
char_vectorizer = TextVectorization(max_tokens=NUM_CHAR_TOKENS,
output_sequence_length=output_seq_char_len,
standardize="lower_and_strip_punctuation",
name="char_vectorizer")
# Adapt character vectorizer to training characters
char_vectorizer.adapt(train_chars)
char_vocab = char_vectorizer.get_vocabulary()
print(f"Number of different characters in character vocab: {len(char_vocab)}")
print(f"5 most common characters: {char_vocab[:5]}")
print(f"5 least common characters: {char_vocab[-5:]}")
random_train_chars = random.choice(train_chars)
print(f"Charified text:\n{random_train_chars}")
print(f"\nLength of chars: {len(random_train_chars.split())}")
vectorized_chars = char_vectorizer([random_train_chars])
print(f"\nVectorized chars:\n{vectorized_chars}")
print(f"\nLength of vectorized chars: {len(vectorized_chars[0])}")
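A small check of the numbers involved may help here. The sketch below counts the characters in the alphabet used for NUM_CHAR_TOKENS and shows which characters survive lowercasing and punctuation stripping in a sample sentence. It is an assumption, not something verified against the adapted layer, that this is why the character vocabulary ends up much smaller than NUM_CHAR_TOKENS: punctuation is stripped and numbers in this dataset were already replaced by @.

```python
import string

alphabet = string.ascii_lowercase + string.digits + string.punctuation
num_char_tokens = len(alphabet) + 2  # + space and OOV token, as in NUM_CHAR_TOKENS

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

sample = "to investigate the efficacy of @ weeks of daily low-dose oral prednisolone ( oa ) ."
seen_chars = sorted(set(standardize(sample)) - {" "})

print(len(alphabet), num_char_tokens)
print(seen_chars)
```

Only lowercase letters remain after standardization, so at most 26 letters plus the padding and OOV entries can appear in the vocabulary.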
Creating a character-level embedding
We've got a way to vectorize our character-level sequences, now's time to create a character-level embedding.
The input dimension (input_dim) needs to cover the number of different characters in our char_vocab (28); the code below passes NUM_CHAR_TOKENS, the maximum vocabulary size given to char_vectorizer, which is at least that large. And since we're following the structure of the model in Figure 1 of Neural Networks for Joint Sentence Classification in Medical Paper Abstracts, the output dimension of the character embedding (output_dim) will be 25.
char_embed = layers.Embedding(input_dim=NUM_CHAR_TOKENS,
output_dim= 25,
mask_zero= True,
name= "char_embed")
# Test out character embedding layer
print(f"Charified text (before vectorization and embedding):\n{random_train_chars}\n")
char_embed_example = char_embed(char_vectorizer([random_train_chars]))
print(f"Embedded chars (after vectorization and embedding):\n{char_embed_example}\n")
print(f"Character embedding shape: {char_embed_example.shape}")
Before fitting our model on the data, we'll create char-level datasets that are batched and prefetched.
train_char_dataset = tf.data.Dataset.from_tensor_slices((train_chars, train_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)
val_char_dataset = tf.data.Dataset.from_tensor_slices((val_chars, val_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)
train_char_dataset
Building a Conv1D model to fit on character embeddings
Now we've got a way to turn our character-level sequences into numbers (char_vectorizer) as well as numerically represent them as an embedding (char_embed), let's test how effective they are at encoding the information in our sequences by creating a character-level sequence model.
The model will have the same structure as our custom token embedding model (model_1) except it'll take character-level sequences as input instead of token-level sequences.
Input (character-level text) -> Tokenize -> Embedding -> Layers (Conv1D, GlobalMaxPool1D) -> Output (label probability)
inputs = layers.Input(shape=(1,), dtype="string")
char_vectors = char_vectorizer(inputs)
char_embeddings = char_embed(char_vectors)
x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu",kernel_regularizer=tf.keras.regularizers.L2(0.01))(char_embeddings)
x = layers.GlobalMaxPool1D()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model_3 = tf.keras.Model(inputs=inputs,
outputs=outputs,
name="model_3_conv1D_char_embedding")
# Compile model
model_3.compile(loss="categorical_crossentropy",
optimizer=tf.keras.optimizers.Adam(),
metrics=["accuracy"])
model_3.summary()
model_3_history = model_3.fit(train_char_dataset,
steps_per_epoch=int(0.1 * len(train_char_dataset)),
epochs=10,
validation_data=val_char_dataset,
validation_steps=int(0.1 * len(val_char_dataset)))
model_3.evaluate(val_char_dataset)
model_3_pred_probs = model_3.predict(val_char_dataset)
model_3_pred_probs
model_3_preds = tf.argmax(model_3_pred_probs, axis=1)
model_3_preds
model_3_results = calculate_results(y_true=val_labels_encoded,
y_pred=model_3_preds)
model_3_results
Model 4: Combining pretrained token embeddings + character embeddings (hybrid embedding layer)
In moving closer to building a model similar to the one in Figure 1 of Neural Networks for Joint Sentence Classification in Medical Paper Abstracts, it's time we tackled the hybrid token embedding layer they speak of.
- This hybrid token embedding layer is a combination of token embeddings and character embeddings. In other words, they create a stacked embedding to represent sequences before passing them to the sequence label prediction layer.
To start replicating (or getting close to replicating) the model in Figure 1, we're going to go through the following steps:
1. Create a token-level model (similar to model_1)
2. Create a character-level model (similar to model_3 with a slight modification to reflect the paper)
3. Combine (using layers.Concatenate) the outputs of 1 and 2
4. Build a series of output layers on top of 3 similar to Figure 1 and section 4.2 of Neural Networks for Joint Sentence Classification in Medical Paper Abstracts
5. Construct a model which takes token and character-level sequences as input and produces sequence label probabilities as output
# 1. Token-level model (using the pretrained Universal Sentence Encoder)
token_inputs = layers.Input(shape = [], dtype= tf.string, name = "token_input")
token_embedding = tf_hub_embedding_layer(token_inputs)
token_dense = layers.Dense(128,activation="relu")(token_embedding)
token_model = tf.keras.Model(inputs = token_inputs,
outputs = token_dense)
# 2. Character-level model
char_inputs = layers.Input(shape=(1,), dtype= tf.string, name="char_input")
char_vectors = char_vectorizer(char_inputs)
char_embedding = char_embed(char_vectors)
char_bi_lstm = layers.Bidirectional(layers.LSTM(25,activation="relu"))(char_embedding)
char_model = tf.keras.Model(inputs= char_inputs, # char_dense = layers.Dense(128,activation="relu")(char_bilstm)
outputs =char_bi_lstm)
# 3. Concatenate the outputs of token_model and char_model
concat_layer = layers.Concatenate(name = "token_char_hybrid")([token_model.output,
char_model.output])
# 4. Add output layers on top of concat_layer
concat_dropout = layers.Dropout(0.5)(concat_layer)
concat_dense = layers.Dense(256,activation="relu")(concat_dropout)
final_dropout = layers.Dropout(0.2)(concat_dense)
output_layer = layers.Dense(num_classes,activation="softmax")(final_dropout)
model_4 = tf.keras.Model(inputs = [token_model.input, char_model.input],
outputs = output_layer,
name="model_4_token_and_char_embeddings")
model_4.summary()
tf.keras.utils.plot_model(
model_4, to_file='model.png', show_shapes=False, show_dtype=False,
show_layer_names=True, rankdir='TB', expand_nested=False, dpi=96
)
model_4.compile(loss="categorical_crossentropy",
optimizer=tf.keras.optimizers.Adam(), # section 4.2 of https://arxiv.org/pdf/1612.05251.pdf mentions using SGD but we'll stick with Adam
metrics=["accuracy"])
train_char_token_data = tf.data.Dataset.from_tensor_slices((train_sentences, train_chars)) # make data
train_char_token_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot) # make labels
train_char_token_dataset = tf.data.Dataset.zip((train_char_token_data, train_char_token_labels)) # combine data and labels
# Prefetch and batch train data
train_char_token_dataset = train_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
# Repeat same steps validation data
val_char_token_data = tf.data.Dataset.from_tensor_slices((val_sentences, val_chars))
val_char_token_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_char_token_dataset = tf.data.Dataset.zip((val_char_token_data, val_char_token_labels))
val_char_token_dataset = val_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
train_char_token_dataset, val_char_token_dataset
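The nesting of the two-input dataset can be hard to picture, so here is a plain-Python sketch of the same structure: each element is `((token_input, char_input), label)`, and elements are then grouped into batches. The toy sentences, labels, and the `batch` helper are hypothetical; the real pipeline uses tf.data for this.

```python
def batch(items, batch_size):
    """Group a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

toy_sentences = ["the drug worked", "no effect seen", "more trials needed"]
toy_chars = [" ".join(s) for s in toy_sentences]  # character-level version
toy_labels = [[0, 1], [1, 0], [0, 1]]             # hypothetical one-hot labels

# Pair the two inputs first, then pair the inputs with their label.
paired = list(zip(zip(toy_sentences, toy_chars), toy_labels))
toy_batches = batch(paired, batch_size=2)

(first_tokens, first_chars), first_label = toy_batches[0][0]
print(first_tokens, "|", first_chars, "|", first_label)
```

The order of the inner tuple matters: it has to match the order of the inputs passed to the multi-input model.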
model_4_history = model_4.fit(train_char_token_dataset, # train on dataset of token and characters
steps_per_epoch=int(0.1 * len(train_char_token_dataset)),
epochs=10,
validation_data=val_char_token_dataset,
validation_steps=int(0.1 * len(val_char_token_dataset)))
model_4.evaluate(val_char_token_dataset)
model_4_pred_probs = model_4.predict(val_char_token_dataset)
model_4_pred_probs
model_4_preds = tf.argmax(model_4_pred_probs, axis=1)
model_4_preds
model_4_results = calculate_results(y_true=val_labels_encoded,
y_pred=model_4_preds)
model_4_results
Since this is a sequential classification problem, the sentences come in a particular order. For example, OBJECTIVE comes before CONCLUSIONS.
Abstracts typically come in a sequential order, such as:
- OBJECTIVE ...
- METHODS ...
- METHODS ...
- METHODS ...
- RESULTS ...
- CONCLUSIONS ...
Or
- BACKGROUND ...
- OBJECTIVE ...
- METHODS ...
- METHODS ...
- RESULTS ...
- RESULTS ...
- CONCLUSIONS ...
Here we do some feature engineering so that our model can learn the order of the sentences in the abstract and know where each sentence appears. The "line_number" and "total_lines" columns are features which didn't necessarily come with the training data but can be passed to our model as a positional embedding.
But to avoid our model thinking a line with "line_number"=5 is five times greater than a line with "line_number"=1, we'll use one-hot encoding to encode our "line_number" and "total_lines" features, using tf.one_hot.
train_df["line_number"].value_counts()
train_df.line_number.plot.hist()
Looking at the distribution of the "line_number" column, it looks like the majority of lines have a position of 15 or less.
Knowing this, let's set the depth parameter of tf.one_hot to 15.
train_line_numbers_one_hot = tf.one_hot(train_df["line_number"].to_numpy(),depth= 15)
val_line_numbers_one_hot = tf.one_hot(val_df["line_number"].to_numpy(),depth= 15)
test_line_numbers_one_hot = tf.one_hot(test_df["line_number"].to_numpy(),depth= 15)
train_line_numbers_one_hot.shape, train_line_numbers_one_hot[:20]
We could create a one-hot tensor which has room for all of the potential values of "line_number" (depth=30). However, this would end up in a tensor of double the size of our current one (depth=15) where the vast majority of values are 0. Plus, only ~2,000/180,000 samples have a "line_number" value of over 15, so we would not be gaining much information about our data for doubling our feature space. This kind of problem is called the curse of dimensionality. However, since we're working with deep models, it might be worth trying to throw as much information at the model as possible and seeing what happens. I'll leave exploring values of the depth parameter as an extension.
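To see what limiting the depth means for the rare long abstracts, here is a plain-Python sketch of one-hot encoding with a fixed depth. It mirrors the behaviour of tf.one_hot, where an index outside the range 0 to depth-1 produces a row of all zeros; the helper name `one_hot` is hypothetical.

```python
def one_hot(index, depth):
    """Return a list of length depth with a 1.0 at position index, if in range."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

third_line = one_hot(2, depth=5)       # in range: a single 1.0 at position 2
late_line = one_hot(17, depth=15)      # out of range: all zeros
print(third_line)
print(sum(late_line))
```

So with depth=15, every line at position 15 or later is encoded identically as an all-zero vector, which is the information given up in exchange for the smaller feature space.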
We can repeat the same process for the "total_lines" column.
train_df["total_lines"].value_counts()
train_df.total_lines.plot.hist();
The histogram shows that the majority of abstracts have fewer than 20 total lines. We can use NumPy's percentile to check this.
np.percentile(train_df.total_lines, 98) # a value of 20 covers 98% of samples
train_total_lines_one_hot = tf.one_hot(train_df["total_lines"].to_numpy(), depth=20)
val_total_lines_one_hot = tf.one_hot(val_df["total_lines"].to_numpy(), depth=20)
test_total_lines_one_hot = tf.one_hot(test_df["total_lines"].to_numpy(), depth=20)
# Check shape and samples of total lines one-hot tensor
train_total_lines_one_hot.shape, train_total_lines_one_hot[:10]
Creating The Beast (tribrid embedding model)
Steps for creating the model:
1. Create a token-level model (similar to model_1)
2. Create a character-level model (similar to model_3 with a slight modification to reflect the paper)
3. Create a "line_number" model (takes in the one-hot-encoded "line_number" tensor and passes it through a non-linear layer)
4. Create a "total_lines" model (takes in the one-hot-encoded "total_lines" tensor and passes it through a non-linear layer)
5. Combine (using layers.Concatenate) the outputs of 1 and 2 into a token-character-hybrid embedding and pass it through a series of output layers similar to Figure 1 and section 4.2 of Neural Networks for Joint Sentence Classification in Medical Paper Abstracts
6. Combine (using layers.Concatenate) the outputs of 3, 4 and 5 into a token-character-positional tribrid embedding
7. Create an output layer to accept the tribrid embedding and output predicted label probabilities
8. Combine the inputs of 1, 2, 3, 4 and outputs of 7 into a tf.keras.Model
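As a sanity check on these steps, the feature sizes flowing through the model can be tracked with simple arithmetic. The sketch below uses the layer sizes from this notebook's code and assumes that the bidirectional LSTM concatenates its forward and backward outputs (the Keras default). The variable names are hypothetical.

```python
token_dim = 128         # step 1: Dense(128) on top of the USE embedding
char_dim = 2 * 25       # step 2: Bidirectional LSTM(25), forward + backward
line_number_dim = 32    # step 3: Dense(32) on the one-hot line number
total_lines_dim = 32    # step 4: Dense(32) on the one-hot total lines

hybrid_dim = token_dim + char_dim  # step 5: Concatenate token and char outputs
hybrid_dense_dim = 256             # step 5: Dense(256) on the hybrid embedding

# Step 6: Concatenate both positional features with the hybrid features.
tribrid_dim = line_number_dim + total_lines_dim + hybrid_dense_dim

output_dim = 5                     # step 7: one probability per class
print(hybrid_dim, tribrid_dim, output_dim)
```

These numbers can be compared against the output shapes printed by model_5.summary().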
# 1. Token Model
token_inputs = layers.Input(shape=[], dtype="string", name="token_inputs")
token_embeddings = tf_hub_embedding_layer(token_inputs)
token_outputs = layers.Dense(128, activation="relu")(token_embeddings)
token_model = tf.keras.Model(inputs=token_inputs,
outputs=token_outputs)
# 2. Char Model
char_inputs = layers.Input(shape=(1,), dtype= tf.string, name="char_input")
char_vectors = char_vectorizer(char_inputs)
char_embedding = char_embed(char_vectors)
char_bi_lstm = layers.Bidirectional(layers.LSTM(25,activation="relu"))(char_embedding)
char_model = tf.keras.Model(inputs= char_inputs, # char_dense = layers.Dense(128,activation="relu")(char_bilstm)
outputs =char_bi_lstm)
# 3. Line numbers inputs
line_number_inputs = layers.Input(shape=(15,), dtype=tf.int32, name="line_number_input")
x = layers.Dense(32, activation="relu")(line_number_inputs)
line_number_model = tf.keras.Model(inputs=line_number_inputs,
outputs=x)
# 4. Total lines inputs
total_lines_inputs = layers.Input(shape=(20,), dtype=tf.int32, name="total_lines_input")
y = layers.Dense(32, activation="relu")(total_lines_inputs)
total_line_model = tf.keras.Model(inputs=total_lines_inputs,
outputs=y)
# 5. Combine token and char embeddings into a hybrid embedding
combined_embeddings = layers.Concatenate(name="token_char_hybrid_embedding")([token_model.output,
char_model.output])
z = layers.Dense(256, activation="relu")(combined_embeddings)
z = layers.Dropout(0.5)(z)
# 6. Combine positional embeddings with combined token and char embeddings into a tribrid embedding
z = layers.Concatenate(name="token_char_positional_embedding")([line_number_model.output,
total_line_model.output,
z])
# 7. Create output layer
output_layer = layers.Dense(5, activation="softmax", name="output_layer")(z)
# 8. Put together model
model_5 = tf.keras.Model(inputs=[line_number_model.input,
total_line_model.input,
token_model.input,
char_model.input],
outputs=output_layer)
model_5.summary()
from tensorflow.keras.utils import plot_model
plot_model(model_5)
model_5.compile(loss =tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2), optimizer="Adam", metrics= ["accuracy"])
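The compile call above sets label_smoothing=0.2. As a small worked example of what that does to the targets, the sketch below applies the usual smoothing formula, `y * (1 - smoothing) + smoothing / num_classes`, to a one-hot label for five classes. The helper name `smooth_labels` is hypothetical.

```python
def smooth_labels(one_hot, label_smoothing):
    """Move a little probability mass from the true class to all classes."""
    num_classes = len(one_hot)
    return [y * (1 - label_smoothing) + label_smoothing / num_classes
            for y in one_hot]

smoothed = smooth_labels([0, 0, 1, 0, 0], label_smoothing=0.2)
print(smoothed)
```

The true class target drops from 1 to about 0.84 and every other class rises from 0 to about 0.04, which discourages the model from becoming overconfident in a single class.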
train_pos_char_token_data = tf.data.Dataset.from_tensor_slices((train_line_numbers_one_hot, # line numbers
train_total_lines_one_hot, # total lines
train_sentences, # train tokens
train_chars)) # train chars
train_pos_char_token_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot) # train labels
train_pos_char_token_dataset = tf.data.Dataset.zip((train_pos_char_token_data, train_pos_char_token_labels)) # combine data and labels
train_pos_char_token_dataset = train_pos_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE) # turn into batches and prefetch appropriately
# Validation dataset
val_pos_char_token_data = tf.data.Dataset.from_tensor_slices((val_line_numbers_one_hot,
val_total_lines_one_hot,
val_sentences,
val_chars))
val_pos_char_token_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_pos_char_token_dataset = tf.data.Dataset.zip((val_pos_char_token_data, val_pos_char_token_labels))
val_pos_char_token_dataset = val_pos_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE) # turn into batches and prefetch appropriately
# Check input shapes
train_pos_char_token_dataset, val_pos_char_token_dataset
history_model_5 = model_5.fit(train_pos_char_token_dataset,
steps_per_epoch=int(0.1 * len(train_pos_char_token_dataset)),
epochs=10,
validation_data=val_pos_char_token_dataset,
validation_steps=int(0.1 * len(val_pos_char_token_dataset)))
model_5_pred_probs = model_5.predict(val_pos_char_token_dataset, verbose=1)
model_5_pred_probs
model_5_preds = tf.argmax(model_5_pred_probs, axis=1)
model_5_preds
model_5_results = calculate_results(y_true=val_labels_encoded,
y_pred=model_5_preds)
model_5_results
all_model_results = pd.DataFrame({"baseline": baseline_results,
"custom_token_embed_conv1d": model_1_results,
"pretrained_token_embed": model_2_results,
"custom_char_embed_conv1d": model_3_results,
"hybrid_char_token_embed": model_4_results,
"tribrid_pos_char_token_embed": model_5_results})
all_model_results = all_model_results.transpose()
all_model_results
all_model_results["accuracy"] = all_model_results["accuracy"]/100
all_model_results.plot(kind="bar", figsize=(10, 7)).legend(bbox_to_anchor=(1.0, 1.0));
all_model_results.sort_values("f1", ascending=False)["f1"].plot(kind="bar", figsize=(10, 7));
model_5.save("tribrid_model")
model_path = "/content/tribrid_model"
loaded_model = tf.keras.models.load_model(model_path)
loaded_pred_probs = loaded_model.predict(val_pos_char_token_dataset, verbose=1)
loaded_preds = tf.argmax(loaded_pred_probs, axis=1)
loaded_preds[:10]
loaded_model_results = calculate_results(val_labels_encoded,
loaded_preds)
loaded_model_results
Evaluate model on test dataset
To make our model's performance more comparable with the results reported in Table 3 of the PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts paper, let's make predictions on the test dataset and evaluate them.
test_pos_char_token_data = tf.data.Dataset.from_tensor_slices((test_line_numbers_one_hot,
test_total_lines_one_hot,
test_sentences,
test_chars))
test_pos_char_token_labels = tf.data.Dataset.from_tensor_slices(test_labels_one_hot)
test_pos_char_token_dataset = tf.data.Dataset.zip((test_pos_char_token_data, test_pos_char_token_labels))
test_pos_char_token_dataset = test_pos_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
# Check shapes
test_pos_char_token_dataset
test_pred_probs = loaded_model.predict(test_pos_char_token_dataset,
verbose=1)
test_preds = tf.argmax(test_pred_probs, axis=1)
test_preds[:10]
loaded_model_test_results = calculate_results(y_true=test_labels_encoded,
y_pred=test_preds)
loaded_model_test_results
%%time
# Get list of class names of test predictions
test_pred_classes = [label_encoder.classes_[pred] for pred in test_preds]
test_pred_classes
test_df["prediction"] = test_pred_classes # create column with test prediction class names
test_df["pred_prob"] = tf.reduce_max(test_pred_probs, axis=1).numpy() # get the maximum prediction probability
test_df["correct"] = test_df["prediction"] == test_df["target"] # create binary column for whether the prediction is right or not
test_df.head(20)
Future Work
- We trained the models above on a subset of the data (PubMed 20k); training the same models on the larger dataset (PubMed 200k) may increase accuracy.
- Besides the Universal Sentence Encoder, try replacing the embedding layers with pretrained context-independent embeddings such as Word2Vec, GloVe and FastText, and compare them.
- Try replacing the TensorFlow Hub Universal Sentence Encoder pretrained embedding with the TensorFlow Hub BERT PubMed expert (a language model pretrained on PubMed texts) pretrained embedding. Does this affect results? Note: using the BERT PubMed expert pretrained embedding requires an extra preprocessing step for sequences (as detailed in the TensorFlow Hub guide). Does the BERT model beat the results mentioned in this paper? https://arxiv.org/pdf/1710.06071.pdf
- What happens if you merge our line_number and total_lines features for each sequence, for example by creating an X_of_Y feature instead? Does this affect model performance?