<!--
Copyright 2011 The Go Authors. All rights reserved.
Use of this source code is governed by a BSD-style
license that can be found in the LICENSE file.
-->

<codewalk title="Generating arbitrary text: a Markov chain algorithm">

<step title="Introduction" src="doc/codewalk/markov.go:/Generating/,/line\./">
This codewalk describes a program that generates random text using a Markov chain algorithm. The package comment describes the algorithm and the operation of the program. Please read it before continuing.
</step>

<step title="Modeling Markov chains" src="doc/codewalk/markov.go:/ chain/">
A chain consists of a prefix and a suffix. Each prefix is a set number of words, while a suffix is a single word. A prefix can have an arbitrary number of suffixes. To model this data, we use a <code>map[string][]string</code>. Each map key is a prefix (a <code>string</code>) and its values are lists of suffixes (a slice of strings, <code>[]string</code>).
<br/><br/>
Here is the example table from the package comment as modeled by this data structure:
<pre>
map[string][]string{
	" ":          {"I"},
	" I":         {"am"},
	"I am":       {"a", "not"},
	"a free":     {"man!"},
	"am a":       {"free"},
	"am not":     {"a"},
	"a number!":  {"I"},
	"number! I":  {"am"},
	"not a":      {"number!"},
}</pre>
While each prefix consists of multiple words, we store prefixes in the map as a single <code>string</code>. It would seem more natural to store the prefix as a <code>[]string</code>, but we can't do this with a map because the key type of a map must implement equality (and slices do not).
<br/><br/>
Therefore, in most of our code we will model prefixes as a <code>[]string</code> and join the strings together with a space to generate the map key:
<pre>
Prefix               Map key

[]string{"", ""}     " "
[]string{"", "I"}    " I"
[]string{"I", "am"}  "I am"
</pre>
</step>

<step title="The Chain struct" src="doc/codewalk/markov.go:/type Chain/,/}/">
The complete state of the chain table consists of the table itself and the word length of the prefixes.
The <code>Chain</code> struct stores this data.
</step>

<step title="The NewChain constructor function" src="doc/codewalk/markov.go:/func New/,/\n}/">
The <code>Chain</code> struct has two unexported fields (those that do not begin with an uppercase letter), and so we write a <code>NewChain</code> constructor function that initializes the <code>chain</code> map with <code>make</code> and sets the <code>prefixLen</code> field.
<br/><br/>
This constructor function is not strictly necessary, as this entire program is within a single package (<code>main</code>) and therefore there is little practical difference between exported and unexported fields. We could just as easily write out the contents of this function when we want to construct a new <code>Chain</code>. But using these unexported fields is good practice; it clearly denotes that only methods of <code>Chain</code> and its constructor function should access those fields. Also, structuring <code>Chain</code> like this means we could easily move it into its own package at some later date.
</step>

<step title="The Prefix type" src="doc/codewalk/markov.go:/type Prefix/">
Since we'll be working with prefixes often, we define a <code>Prefix</code> type with the concrete type <code>[]string</code>. Defining a named type allows us to be explicit when we are working with a prefix instead of just a <code>[]string</code>. Also, in Go we can define methods on any named type (not just structs), so we can add methods that operate on <code>Prefix</code> if we need to.
</step>

<step title="The String method" src="doc/codewalk/markov.go:/func[^\n]+String/,/}/">
The first method we define on <code>Prefix</code> is <code>String</code>. It returns a <code>string</code> representation of a <code>Prefix</code> by joining the slice elements together with spaces. We will use this method to generate keys when working with the chain map.
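<br/><br/>
As a rough illustration of how a named slice type and its <code>String</code> method produce map keys, here is a minimal, self-contained sketch. It follows the description above rather than quoting <code>markov.go</code>; the <code>main</code> function and the one-entry table are assumptions added for demonstration.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix is a chain prefix of one or more words.
type Prefix []string

// String joins the words with single spaces, giving a value that can
// be used as a map key.
func (p Prefix) String() string {
	return strings.Join(p, " ")
}

func main() {
	// Illustrative one-entry table; the entry comes from the example
	// table shown earlier in this codewalk.
	chain := map[string][]string{
		"I am": {"a", "not"},
	}

	p := Prefix{"I", "am"}
	fmt.Printf("%q -> %q\n", p.String(), chain[p.String()])

	// A prefix of two empty words joins to a single space.
	fmt.Printf("%q\n", Prefix{"", ""}.String())
}
```

Because <code>String</code> has a value receiver, it can be called on any <code>Prefix</code> value, including a composite literal as in the last line of <code>main</code>.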
</step> <step title="Building the chain" src="doc/codewalk/markov.go:/func[^\n]+Build/,/\n}/"> The <code>Build</code> method reads text from an <code>io.Reader</code> and parses it into prefixes and suffixes that are stored in the <code>Chain</code>. <br/><br/> The <code><a href="/pkg/io/#Reader">io.Reader</a></code> is an interface type that is widely used by the standard library and other Go code. Our code uses the <code><a href="/pkg/fmt/#Fscan">fmt.Fscan</a></code> function, which reads space-separated values from an <code>io.Reader</code>. <br/><br/> The <code>Build</code> method returns once the <code>Reader</code>'s <code>Read</code> method returns <code>io.EOF</code> (end of file) or some other read error occurs. </step> <step title="Buffering the input" src="doc/codewalk/markov.go:/bufio\.NewReader/"> This function does many small reads, which can be inefficient for some <code>Readers</code>. For efficiency we wrap the provided <code>io.Reader</code> with <code><a href="/pkg/bufio/">bufio.NewReader</a></code> to create a new <code>io.Reader</code> that provides buffering. </step> <step title="The Prefix variable" src="doc/codewalk/markov.go:/make\(Prefix/"> At the top of the function we make a <code>Prefix</code> slice <code>p</code> using the <code>Chain</code>'s <code>prefixLen</code> field as its length. We'll use this variable to hold the current prefix and mutate it with each new word we encounter. </step> <step title="Scanning words" src="doc/codewalk/markov.go:/var s string/,/\n }/"> In our loop we read words from the <code>Reader</code> into a <code>string</code> variable <code>s</code> using <code>fmt.Fscan</code>. Since <code>Fscan</code> uses space to separate each input value, each call will yield just one word (including punctuation), which is exactly what we need. 
<br/><br/>
<code>Fscan</code> returns an error if it encounters a read error (<code>io.EOF</code>, for example) or if it can't scan the requested value (in our case, a single string). In either case we just want to stop scanning, so we <code>break</code> out of the loop.
</step>

<step title="Adding a prefix and suffix to the chain" src="doc/codewalk/markov.go:/ key/,/key\], s\)">
The word stored in <code>s</code> is a new suffix. We add the new prefix/suffix combination to the <code>chain</code> map by computing the map key with <code>p.String</code> and appending the suffix to the slice stored under that key.
<br/><br/>
The built-in <code>append</code> function appends elements to a slice and allocates new storage when necessary. When the provided slice is <code>nil</code>, <code>append</code> allocates a new slice. This behavior conveniently ties in with the semantics of our map: retrieving an unset key returns the zero value of the value type and the zero value of <code>[]string</code> is <code>nil</code>. When our program encounters a new prefix (yielding a <code>nil</code> value in the map) <code>append</code> will allocate a new slice.
<br/><br/>
For more information about the <code>append</code> function and slices in general see the <a href="/doc/articles/slices_usage_and_internals.html">Slices: usage and internals</a> article.
</step>

<step title="Pushing the suffix onto the prefix" src="doc/codewalk/markov.go:/p\.Shift/">
Before reading the next word our algorithm requires us to drop the first word from the prefix and push the current suffix onto the prefix.
<br/><br/>
When in this state
<pre>
p == Prefix{"I", "am"}
s == "not" </pre>
the new value for <code>p</code> would be
<pre>
p == Prefix{"am", "not"}</pre>
This operation is also required during text generation so we put the code to perform this mutation of the slice inside a method on <code>Prefix</code> named <code>Shift</code>.
</step>

<step title="The Shift method" src="doc/codewalk/markov.go:/func[^\n]+Shift/,/\n}/">
The <code>Shift</code> method uses the built-in <code>copy</code> function to copy the last <code>len(p)-1</code> elements of <code>p</code> to the start of the slice, effectively moving the elements one index to the left (if you consider zero as the leftmost index).
<pre>
p := Prefix{"I", "am"}
copy(p, p[1:])
// p == Prefix{"am", "am"}</pre>
We then assign the provided <code>word</code> to the last index of the slice:
<pre>
// word == "not"
p[len(p)-1] = word
// p == Prefix{"am", "not"}</pre>
</step>

<step title="Generating text" src="doc/codewalk/markov.go:/func[^\n]+Generate/,/\n}/">
The <code>Generate</code> method is similar to <code>Build</code> except that instead of reading words from a <code>Reader</code> and storing them in a map, it reads words from the map and appends them to a slice (<code>words</code>).
<br/><br/>
<code>Generate</code> uses a conditional for loop to generate up to <code>n</code> words.
</step>

<step title="Getting potential suffixes" src="doc/codewalk/markov.go:/choices/,/}\n/">
At each iteration of the loop we retrieve a list of potential suffixes for the current prefix. We access the <code>chain</code> map at key <code>p.String()</code> and assign its contents to <code>choices</code>.
<br/><br/>
If <code>len(choices)</code> is zero we break out of the loop as there are no potential suffixes for that prefix. This test also works if the key isn't present in the map at all: in that case, <code>choices</code> will be <code>nil</code> and the length of a <code>nil</code> slice is zero.
</step>

<step title="Choosing a suffix at random" src="doc/codewalk/markov.go:/next := choices/,/Shift/">
To choose a suffix we use the <code><a href="/pkg/math/rand/#Intn">rand.Intn</a></code> function. It returns a non-negative random integer less than the provided value. Passing in <code>len(choices)</code> gives us a random index into the full length of the list.
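<br/><br/>
The lookup and the random choice can be sketched together as below. The helper name <code>pick</code> is hypothetical and is not taken from <code>markov.go</code>; it combines the empty-list test from the previous step with the call to <code>rand.Intn</code>.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pick is a hypothetical helper (not part of markov.go) that returns
// a random element of choices. It reports false when there is
// nothing to choose from.
func pick(choices []string) (string, bool) {
	if len(choices) == 0 {
		return "", false // also covers a nil slice
	}
	// rand.Intn(n) returns an int in the half-open range [0, n).
	return choices[rand.Intn(len(choices))], true
}

func main() {
	chain := map[string][]string{"I am": {"a", "not"}}

	next, ok := pick(chain["I am"])
	fmt.Println(next, ok)

	// A missing key yields a nil slice, so pick reports false.
	_, ok = pick(chain["no such prefix"])
	fmt.Println(ok)
}
```

The length test has to come first, because <code>rand.Intn</code> panics when its argument is not positive.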
<br/><br/>
We use that index to pick our new suffix, assign it to <code>next</code> and append it to the <code>words</code> slice.
<br/><br/>
Next, we <code>Shift</code> the new suffix onto the prefix just as we did in the <code>Build</code> method.
</step>

<step title="Returning the generated text" src="doc/codewalk/markov.go:/Join\(words/">
Before returning the generated text as a string, we use the <code>strings.Join</code> function to join the elements of the <code>words</code> slice together, separated by spaces.
</step>

<step title="Command-line flags" src="doc/codewalk/markov.go:/Register command-line flags/,/prefixLen/">
To make it easy to tweak the prefix and generated text lengths we use the <code><a href="/pkg/flag/">flag</a></code> package to parse command-line flags.
<br/><br/>
These calls to <code>flag.Int</code> register new flags with the <code>flag</code> package. The arguments to <code>Int</code> are the flag name, its default value, and a description. The <code>Int</code> function returns a pointer to an integer that will contain the user-supplied value (or the default value if the flag was omitted on the command line).
</step>

<step title="Program set up" src="doc/codewalk/markov.go:/flag.Parse/,/rand.Seed/">
The <code>main</code> function begins by parsing the command-line flags with <code>flag.Parse</code> and seeding the <code>rand</code> package's random number generator with the current time.
<br/><br/>
If the command-line flags provided by the user are invalid the <code>flag.Parse</code> function will print an informative usage message and terminate the program.
</step>

<step title="Creating and building a new Chain" src="doc/codewalk/markov.go:/c := NewChain/,/c\.Build/">
To create the new <code>Chain</code> we call <code>NewChain</code> with the value of the <code>prefix</code> flag.
<br/><br/>
To build the chain we call <code>Build</code> with <code>os.Stdin</code> (which implements <code>io.Reader</code>) so that it will read its input from standard input.
</step>

<step title="Generating and printing text" src="doc/codewalk/markov.go:/c\.Generate/,/fmt.Println/">
Finally, to generate text we call <code>Generate</code> with the value of the <code>words</code> flag and assign the result to the variable <code>text</code>.
<br/><br/>
Then we call <code>fmt.Println</code> to write the text to standard output, followed by a newline.
</step>

<step title="Using this program" src="doc/codewalk/markov.go">
To use this program, first build it with the <a href="/cmd/go/">go</a> command:
<pre>
$ go build markov.go</pre>
And then execute it while piping in some input text:
<pre>
$ echo "a man a plan a canal panama" \
	| ./markov -prefix=1
a plan a man a plan a canal panama</pre>
Here's a transcript of generating some text using the Go distribution's README file as source material:
<pre>
$ ./markov -words=10 &lt; $GOROOT/README
This is the source code repository for the Go source
$ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
This is the go directory (the one containing this README).
$ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
This is the variable if you have just untarred a</pre>
</step>

<step title="An exercise for the reader" src="doc/codewalk/markov.go">
The <code>Generate</code> method does a lot of allocations when it builds the <code>words</code> slice. As an exercise, modify it to take an <code>io.Writer</code> to which it incrementally writes the generated text with <code>Fprint</code>. Aside from being more efficient, this makes <code>Generate</code> more symmetrical to <code>Build</code>.
</step>

</codewalk>