import random
import string
import re
import os
# Init and datafile
corpus_file = "./the_art_of_war.txt"
text = None
# reading the datafile
with open(corpus_file,'r') as f:
text = f.read()
# extracting the words
words_order = [s.lower() for s in re.compile('\w+').findall(text)]
words = list(set(words_order))
# print a random sequence from the corpus
print("Random sentence of size 10:")
print(' '.join( [words[random.randint(0,len(words))] for i in range(10)] ))
Each word being independent of each other, the generation is simply done by sampling words one by one from the corpus. The constraint that we just satisfied was a simple unary constraint applied to each variable, restricting their values to a set of allowed ones.
Since we are doing a standalone process of sampling here, we could extract the frequency of each word in the text and use them in our sampling. While this would not change the set of possible solutions, it biases the generation and allows us to reach solutions whose word frequency is closer to the corpus more frequently. Imagine the impact if you apply then NLP using bags of words to your generated sequences.
The ability to generate text with rhymes is an important question in lyrics and poetry generation. In poetry, a couplet is a set of two successive lines that rhyme and have the same meter. Thus, our next constraint is to be able to generate a sequence of 4 lines, such that the endings of the first two and last two lines rhyme.
There exists a large number of different rhymes. In order to simplify the extraction of the rhyme information, we will consider that two words rhyme if their last four letters are the same. Even if that is not necessarily true ('equally', 'ally'), the exactness of the group of rhyming words does not change the mathematical model used to resolve it and an exact extractor of rhymes is not one of the goals of this tutorial.
# define rhymes set
rhyme_groups = {}
for w in words:
if len(w) < 4:
continue
end_of_word = w[-4:]
if end_of_word in rhyme_groups:
rhyme_groups[end_of_word].append(w)
else:
rhyme_groups[end_of_word] = [w]
# Remove small sets
key_to_del = []
for key in rhyme_groups:
value = rhyme_groups[key]
if len(value) <= 3:
key_to_del.append(key)
for key in key_to_del:
del rhyme_groups[key]
# 1) How to get a rhyme
rhyme1 = random.sample([*rhyme_groups],1)[0]
print(rhyme1 + "+>" + str(rhyme_groups[rhyme1]))
We now have at our disposition a rhyme dictionary. Use this dictionary to generate a sequence of 4 lines of 5 words with words from our corpus with the contraint that the endings of the first two and last two lines rhyme.
# 2) generate
n_random_words = 4
n_sentences = 2
sequence = []
# TODO
#Generate a sequence with rhymes and store it into the sequence list
print("Final text")
for i, w in enumerate(sequence):
print(w + " ", end='' if i%(n_random_words+1) != n_random_words else '\n')
Generating sequence with rhymes was not that hard either. But we now have the capability of generating a sequence of values such that some of them are related.
High school students are often asked to write stories or poems or other kinds of creative documents in the same style as the authors that they are currently studying. Most of the time this task consists of extracting characteristics of a certain text, like the special use of some hyperbole or even the family of the objectives themselves. When listening to music the style may mean something different, and so too with paintings.
Interest in learning and/or using the style of an artist is one of the fundamental questions of art based generative methods. Recently impressive results have been accomplished, for example with deep-learning-based style transfer methods, which can recreate paintings with the style of another.
In this tutorial, we consider the style of a body of text (or music to be the succession of words (or musical notes) used in the given corpus. So, if we are able to generate new content using the already seen successions of pairs of words, then the style will be the same.
Our first step is to extract the transitions between words from our corpus.
# Reminder: words_order was storing all the text, word by word.
# extracting the transitions
transition = {}
for i, w in enumerate(words_order[:-1]):
if w in transition:
transition[w].append(words_order[i+1])
else:
transition[w] = [words_order[i+1]]
# just removing duplicates
transition = {word: list(set(nexts)) for word, nexts in transition.items()}
# Example of transition
word = random.sample([*transition],1)[0]
print("the word '"+word+ "' can be followed by "+ str(transition[word]))
Now that we have extracted the style from our corpus, we can use it to generate some sequences. Generate a sequence of 10 words with the same style of our corpus.
sequence_length = 10
# TODO
# Generate a sequence with style and store it into the sequence list
print(' '.join(sequence))
The first two steps were not really hard to implement and a simple sampling algorithms was able to handle them.
What do you think about the complexity of sampling a sequence subject to the combination of the two constraints?
More precisely: What is the complexity of generating a sequence, following a transition function, and by putting dependency (here equality => rhyme) between variables?
The answer to this question has been established by several different works, and is that it is hard, like NP-complete hard just to prove that there exists a solution. It has also been proven that sampling sequences with exact probability (using the transition frequency for example) is #P-complete.
While we can easily find solutions with the two constraints independently, finding a solution that respect both of them simulaneously is hard. While this statement may seem like bad news, its counterpart is the core of model-based problem solving: hard problem can often be defined by the intersection of some easy problems.
Many problems can be defined by the intersection of several subproblems. We create a model by defining a finite set of variables, their domains of definition (which is typically a finite set of values), and their constraints. Constraints, relationships between variables, are a way of expressing subproblems, just like the rhyme and style constraints were able to define our problem. They usually consist of a subset of the Cartesian product of the variables' domain. The intersection of the subsets of all the constraints of a problem is thus the set of valid solutions.
Many different frameworks exist for modeling problems, here we are going to use one of the most flexible one, in which we will be able to define our model easily. Many constraints, like the transition constraints or the equality constraints, are native to this framework.
We use the solver OR-tools made by google because of its efficient python binding. Another advantage is that it includes both a CP and a CP/SAT solver with good performance.
The installation of this solver can be done using pip:
pip install ortools
Here follows the definition of the basic requirement for our model, its variables and their associated domain: here, these are the indices of the words in the list word.
#import the solver
from ortools.constraint_solver import pywrapcp
# Create the solver.
solver = pywrapcp.Solver("one-over-f")
sequence_length = 20
min_dom = 0
max_dom = len(words)-1
# sequence variables
seq = [solver.IntVar(min_dom, max_dom, "seq(%i)" % i) for i in range(sequence_length)]
Transitions between variables is a well-studied problem in constraint programming. In this example, the transition is even simpler since it only depends on the value of the previous variable. Thus using the transition dictionary, we can create a binary table constraint (i.e. constraints applied to two variables), which is a constraint defined by the list of allowed combination (constraints named AllowedAssignement).
solver.Add(solver.AllowedAssignments(variable_list, tuple_list))
# Extract the conversion between words and indices
word_to_i = {w:i for i, w in enumerate(words)}
# TODO
# Build the table
transition_table = []
# TODO
# create the constraints
Now that the transition constraint is done, we can create our rhyme constraint. From our previous definition, we choose to say that the words that must rhyme must belong to the same bag of rhyming words.
Once again, this can be easily done using our rhyme_groups dictionary and table constraint. Here we just want to enforce that any combination of the two words from the same list is allowed.
# TODO
# Build the rhyme table
# TODO
# Create the constraints
Finally, once we have defined all our constraints, we need to ask the solver to find a (or many) solution for us. This is an example of black box optimization, we do not necessarily know how the solver works, but only how to define a model.
The following code set the search for a solution to a random selection of the values.
db = solver.Phase(seq,
solver.CHOOSE_FIRST_UNBOUND,
solver.ASSIGN_RANDOM_VALUE)
solver.NewSearch(db)
The search for a solution is done by calling the NextSolution method from the solver object.
num_solution=0
while solver.NextSolution():
num_solution +=1
print([v.Value() for v in seq])
break
for i, w in enumerate(seq):
print(words[w.Value()] + " ", end='' if i%(5) != 4 else '\n')
Here is our solution, with rhymes and style from our corpus. Even if the problem is hard, we get a solution in an instant. This seems fast so let's try to generate a bunch of solutions, say 50 thousand.
while solver.NextSolution():
num_solution +=1
if num_solution > 50000:
break
print(num_solution)
for i, w in enumerate(seq):
print(words[w.Value()] + " ", end='' if i%(5) != 4 else '\n')
Let's recall what we just did. First, we defined a set of variables with domain the interval [0,D] where D is the maximum index in the words list. Second, we have defined the constraints. One constraint uses the transition list and ensures that each two consecutive values appear consecutively in the text. Another ensures that the words at the end of each 5-word sentence rhymes. Finally, we ran the solver and obtained a solution of our model.
The CP solver API doesn't seem to have the forbidden table constraint, here follows the model considering that the constraint is available. In the next section, this problem is used using the CP/SAT solver.
# Create the solver.
solver = pywrapcp.Solver("one-over-f")
# sequence variables (use previous init variables)
seq = [solver.IntVar(min_dom, max_dom, "seq(%i)" % i) for i in range(sequence_length)]
# create the transition constraints
for i, v in enumerate(seq[:-1]):
solver.Add(solver.AllowedAssignments( # also named table constraint, or extensional constraint
(seq[i], seq[i + 1]), # variables
transition_table)) # table
# create the rhymes constraints
solver.Add(solver.AllowedAssignments((seq[4], seq[9]), rhymes_table)) # First rhyme
solver.Add(solver.AllowedAssignments((seq[14], seq[19]), rhymes_table)) # second one
plagiarism_table = [[word_to_i[words_order[i]],word_to_i[words_order[i+1]],word_to_i[words_order[i+2]],word_to_i[words_order[i+3]]]
for i, w in enumerate(words_order[:-3])]
# not accessible
#for i, w in enumerate(seq[:-3]):
# solver.Add(solver.ForbiddenAssignments( # also named table constraint, or extensional constraint
# (seq[i], seq[i + 1], seq[i + 2], seq[i + 3]), # variables
# plagiarism_table)) # table
Since the forbidden table constraint is not easily available in the ortools solver, we are using the CP/SAT solver that they provide. Here is the new model.
from ortools.sat.python import cp_model
# Create the solver.
model = cp_model.CpModel()
# sequence variables (use previous init variables)
seq = [model.NewIntVar(min_dom, max_dom, "seq(%i)" % i) for i in range(sequence_length)]
# create the transition constraints
for i, v in enumerate(seq[:-1]):
model.AddAllowedAssignments( # also named table constraint, or extensional constraint
(seq[i], seq[i + 1]), # variables
transition_table) # table
# create the rhymes constraints
model.AddAllowedAssignments((seq[4], seq[9]), rhymes_table) # First rhyme
model.AddAllowedAssignments((seq[14], seq[19]), rhymes_table) # second one
# create plagiarism constraint
for i, w in enumerate(seq[:-3]):
model.AddForbiddenAssignments( # also named table constraint, or extensional constraint
(seq[i], seq[i + 1], seq[i + 2], seq[i + 3]), # variables
plagiarism_table) # table
model.AddAllDifferent(seq) # All the values are pairwise different.
solver = cp_model.CpSolver()
solver.Solve(model)
for i, w in enumerate(seq):
print(words[solver.Value(w)] + " ", end='' if i%(5) != 4 else '\n')