Sequence Generation Under Constraints

Generating a sequence of values respecting some constraints can be easy for some constraints, and very hard for others. These exercises aim at showing examples of sequence generation and their complexity. This tutorial complements these slides.

1) Random Sequence

Let's start with a simple constraint. Let's generate a sequence of 10 words, but all the words must come from a corpus. Here our corpus is the book made by Sun Tzu the art of war. The text is available here. You may need to remove few of the first lines.

In [ ]:
import random 
import string
import re
import os

# Init and datafile
corpus_file = "./the_art_of_war.txt"
text = None

# reading the datafile
with open(corpus_file,'r') as f:
    text = f.read()

# extracting the words
words_order = [s.lower() for s in re.compile('\w+').findall(text)]
words = list(set(words_order))

# print a random sequence from the corpus
print("Random sentence of size 10:")
print(' '.join(  [words[random.randint(0,len(words))] for i in range(10)]  ))

Each word being independent of each other, the generation is simply done by sampling words one by one from the corpus. The constraint that we just satisfied was a simple unary constraint applied to each variable, restricting their values to a set of allowed ones.

Since we are doing a standalone process of sampling here, we could extract the frequency of each word in the text and use them in our sampling. While this would not change the set of possible solutions, it biases the generation and allows us to reach solutions whose word frequency is closer to the corpus more frequently. Imagine the impact if you apply then NLP using bags of words to your generated sequences.

2) Sequence with Rhymes

The ability to generate text with rhymes is an important question in lyrics and poetry generation. In poetry, a couplet is a set of two successive lines that rhyme and have the same meter. Thus, our next constraint is to be able to generate a sequence of 4 lines, such that the endings of the first two and last two lines rhyme.

How to know if two words rhymes

There exists a large number of different rhymes. In order to simplify the extraction of the rhyme information, we will consider that two words rhyme if their last four letters are the same. Even if that is not necessarily true ('equally', 'ally'), the exactness of the group of rhyming words does not change the mathematical model used to resolve it and an exact extractor of rhymes is not one of the goals of this tutorial.

In [ ]:
# define rhymes set
rhyme_groups = {}
for w in words:
    if len(w) < 4:
        continue
    end_of_word = w[-4:]
    if end_of_word in rhyme_groups:
        rhyme_groups[end_of_word].append(w)
    else:
        rhyme_groups[end_of_word] = [w]
            
# Remove small sets
key_to_del = []
for key in rhyme_groups:
    value = rhyme_groups[key]
    if len(value) <= 3:
        key_to_del.append(key)
for key in key_to_del:
    del rhyme_groups[key]
    
# 1) How to get a rhyme
rhyme1 = random.sample([*rhyme_groups],1)[0]
print(rhyme1 + "+>" + str(rhyme_groups[rhyme1]))

Generate a sequence with a rhyme

We now have at our disposition a rhyme dictionary. Use this dictionary to generate a sequence of 4 lines of 5 words with words from our corpus with the contraint that the endings of the first two and last two lines rhyme.

In [ ]:
# 2) generate
n_random_words = 4
n_sentences = 2

sequence = []
In [ ]:
# TODO
#Generate a sequence with rhymes and store it into the sequence list
In [ ]:
print("Final text")
for i, w in enumerate(sequence):
    print(w + " ", end='' if i%(n_random_words+1) != n_random_words else '\n')

Generating sequence with rhymes was not that hard either. But we now have the capability of generating a sequence of values such that some of them are related.

3) Sequence with style

High school students are often asked to write stories or poems or other kinds of creative documents in the same style as the authors that they are currently studying. Most of the time this task consists of extracting characteristics of a certain text, like the special use of some hyperbole or even the family of the objectives themselves. When listening to music the style may mean something different, and so too with paintings.

Interest in learning and/or using the style of an artist is one of the fundamental questions of art based generative methods. Recently impressive results have been accomplished, for example with deep-learning-based style transfer methods, which can recreate paintings with the style of another.

What do we use as style

In this tutorial, we consider the style of a body of text (or music to be the succession of words (or musical notes) used in the given corpus. So, if we are able to generate new content using the already seen successions of pairs of words, then the style will be the same.

Extracting this Style

Our first step is to extract the transitions between words from our corpus.

In [ ]:
# Reminder: words_order was storing all the text, word by word.
# extracting the transitions
transition = {}
for i, w in enumerate(words_order[:-1]):
    if w in transition:
        transition[w].append(words_order[i+1])
    else:
        transition[w] = [words_order[i+1]]
# just removing duplicates
transition = {word: list(set(nexts)) for word, nexts in transition.items()}

# Example of transition
word = random.sample([*transition],1)[0]
print("the word '"+word+ "' can be followed by "+ str(transition[word]))

Generating using the transition function

Now that we have extracted the style from our corpus, we can use it to generate some sequences. Generate a sequence of 10 words with the same style of our corpus.

In [ ]:
sequence_length = 10
# TODO
# Generate a sequence with style and store it into the sequence list
In [ ]:
print(' '.join(sequence))

4) Generating sequence with Style and Rhymes

The first two steps were not really hard to implement and a simple sampling algorithms was able to handle them.

What do you think about the complexity of sampling a sequence subject to the combination of the two constraints?

More precisely: What is the complexity of generating a sequence, following a transition function, and by putting dependency (here equality => rhyme) between variables?

The answer to this question has been established by several different works, and is that it is hard, like NP-complete hard just to prove that there exists a solution. It has also been proven that sampling sequences with exact probability (using the transition frequency for example) is #P-complete.

While we can easily find solutions with the two constraints independently, finding a solution that respect both of them simulaneously is hard. While this statement may seem like bad news, its counterpart is the core of model-based problem solving: hard problem can often be defined by the intersection of some easy problems.

5) Model based problem definition

Many problems can be defined by the intersection of several subproblems. We create a model by defining a finite set of variables, their domains of definition (which is typically a finite set of values), and their constraints. Constraints, relationships between variables, are a way of expressing subproblems, just like the rhyme and style constraints were able to define our problem. They usually consist of a subset of the Cartesian product of the variables' domain. The intersection of the subsets of all the constraints of a problem is thus the set of valid solutions.

Constraint Programming model

Many different frameworks exist for modeling problems, here we are going to use one of the most flexible one, in which we will be able to define our model easily. Many constraints, like the transition constraints or the equality constraints, are native to this framework.

We use the solver OR-tools made by google because of its efficient python binding. Another advantage is that it includes both a CP and a CP/SAT solver with good performance.

The installation of this solver can be done using pip:

pip install ortools

Here follows the definition of the basic requirement for our model, its variables and their associated domain: here, these are the indices of the words in the list word.

In [ ]:
#import the solver
from ortools.constraint_solver import pywrapcp

# Create the solver.
solver = pywrapcp.Solver("one-over-f")

sequence_length = 20
min_dom = 0
max_dom = len(words)-1

# sequence variables
seq = [solver.IntVar(min_dom, max_dom, "seq(%i)" % i) for i in range(sequence_length)]

The transition constraint

Transitions between variables is a well-studied problem in constraint programming. In this example, the transition is even simpler since it only depends on the value of the previous variable. Thus using the transition dictionary, we can create a binary table constraint (i.e. constraints applied to two variables), which is a constraint defined by the list of allowed combination (constraints named AllowedAssignement).

solver.Add(solver.AllowedAssignments(variable_list, tuple_list))
In [ ]:
# Extract the conversion between words and indices
word_to_i = {w:i for i, w in enumerate(words)}
In [ ]:
# TODO 
# Build the table
transition_table = []
In [ ]:
# TODO
# create the constraints
    

The Rhyme constraint

Now that the transition constraint is done, we can create our rhyme constraint. From our previous definition, we choose to say that the words that must rhyme must belong to the same bag of rhyming words.

Once again, this can be easily done using our rhyme_groups dictionary and table constraint. Here we just want to enforce that any combination of the two words from the same list is allowed.

In [ ]:
# TODO
# Build the rhyme table
In [ ]:
# TODO
# Create the constraints

Running the solver and extracting solutions

Finally, once we have defined all our constraints, we need to ask the solver to find a (or many) solution for us. This is an example of black box optimization, we do not necessarily know how the solver works, but only how to define a model.

The following code set the search for a solution to a random selection of the values.

In [ ]:
db = solver.Phase(seq,
                  solver.CHOOSE_FIRST_UNBOUND,
                  solver.ASSIGN_RANDOM_VALUE)

solver.NewSearch(db)

The search for a solution is done by calling the NextSolution method from the solver object.

In [ ]:
num_solution=0
while solver.NextSolution():
    num_solution +=1
    print([v.Value() for v in seq])
    break
for i, w in enumerate(seq):
    print(words[w.Value()] + " ", end='' if i%(5) != 4 else '\n')

Here is our solution, with rhymes and style from our corpus. Even if the problem is hard, we get a solution in an instant. This seems fast so let's try to generate a bunch of solutions, say 50 thousand.

In [ ]:
while solver.NextSolution():
    num_solution +=1
    if num_solution > 50000:
        break
print(num_solution)
for i, w in enumerate(seq):
    print(words[w.Value()] + " ", end='' if i%(5) != 4 else '\n')

Results

Let's recall what we just did. First, we defined a set of variables with domain the interval [0,D] where D is the maximum index in the words list. Second, we have defined the constraints. One constraint uses the transition list and ensures that each two consecutive values appear consecutively in the text. Another ensures that the words at the end of each 5-word sentence rhymes. Finally, we ran the solver and obtained a solution of our model.

6) Generation without plagiarism

The CP solver API doesn't seem to have the forbidden table constraint, here follows the model considering that the constraint is available. In the next section, this problem is used using the CP/SAT solver.

In [ ]:
# Create the solver.
solver = pywrapcp.Solver("one-over-f")

# sequence variables (use previous init variables)
seq = [solver.IntVar(min_dom, max_dom, "seq(%i)" % i) for i in range(sequence_length)]

# create the transition constraints
for i, v in enumerate(seq[:-1]):
    solver.Add(solver.AllowedAssignments( # also named table constraint, or extensional constraint
        (seq[i], seq[i + 1]),  # variables
        transition_table))     # table


# create the rhymes constraints
solver.Add(solver.AllowedAssignments((seq[4], seq[9]), rhymes_table)) # First rhyme
solver.Add(solver.AllowedAssignments((seq[14], seq[19]), rhymes_table)) # second one
In [ ]:
plagiarism_table = [[word_to_i[words_order[i]],word_to_i[words_order[i+1]],word_to_i[words_order[i+2]],word_to_i[words_order[i+3]]] 
                    for i, w in enumerate(words_order[:-3])]

# not accessible
#for i, w in enumerate(seq[:-3]):
#    solver.Add(solver.ForbiddenAssignments( # also named table constraint, or extensional constraint
#        (seq[i], seq[i + 1], seq[i + 2], seq[i + 3]),  # variables
#        plagiarism_table))     # table

CP / Sat solver model

Since the forbidden table constraint is not easily available in the ortools solver, we are using the CP/SAT solver that they provide. Here is the new model.

In [ ]:
from ortools.sat.python import cp_model
# Create the solver.
model = cp_model.CpModel()

# sequence variables (use previous init variables)
seq = [model.NewIntVar(min_dom, max_dom, "seq(%i)" % i) for i in range(sequence_length)]

# create the transition constraints
for i, v in enumerate(seq[:-1]):
    model.AddAllowedAssignments( # also named table constraint, or extensional constraint
        (seq[i], seq[i + 1]),  # variables
        transition_table)     # table
    
# create the rhymes constraints
model.AddAllowedAssignments((seq[4], seq[9]), rhymes_table) # First rhyme
model.AddAllowedAssignments((seq[14], seq[19]), rhymes_table) # second one

# create plagiarism constraint
for i, w in enumerate(seq[:-3]):
    model.AddForbiddenAssignments( # also named table constraint, or extensional constraint
        (seq[i], seq[i + 1], seq[i + 2], seq[i + 3]),  # variables
        plagiarism_table)     # table

model.AddAllDifferent(seq) # All the values are pairwise different.
    
solver = cp_model.CpSolver()

solver.Solve(model)
In [ ]:
for i, w in enumerate(seq):
    print(words[solver.Value(w)] + " ", end='' if i%(5) != 4 else '\n')
In [ ]: