Post by ch00beh on Mar 1, 2013 12:23:06 GMT -5
NLP being Natural Language Processing, the branch of computing that applies machine learning to natural languages.
Currently just screwing around with using Markov chains to generate words based off some input text. Skip to the quote blocks if you don't care about how Markov chains work and just want to see the examples so far.
Basically, a Markov model just tells you, given some input state, what the probable outcomes are. For example, given the letter A, the model might say that B follows A fairly often, but that A almost never follows A.
So I did that. Given a corpus of ~1 million words taken from excerpts of real-world sources (thanks, Brown University!), analyze each word: take the first letter and use it as a key in a table, where it points to a list. Take the second letter and put it in that list. Repeat until you're out of letters in the word, then repeat until you're out of words.
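In code, the table-building and sampling looks something like this (a stripped-down Python sketch, not my exact code; the little word list stands in for the Brown corpus):

    import random
    from collections import defaultdict

    START, END = "^", "$"

    def build_model(words):
        # Each letter keys a list of every letter observed right after
        # it; duplicates in the list are what encode the frequencies.
        model = defaultdict(list)
        for word in words:
            letters = [START] + list(word.lower()) + [END]
            for cur, nxt in zip(letters, letters[1:]):
                model[cur].append(nxt)
        return model

    def generate(model):
        # Walk the chain from the start marker, sampling each next
        # letter from the observed followers of the current one.
        out, cur = [], START
        while True:
            cur = random.choice(model[cur])
            if cur == END:
                return "".join(out)
            out.append(cur)

    corpus = ["hello", "world", "probability", "night"]  # stand-in for Brown
    model = build_model(corpus)
    print(", ".join(generate(model) for _ in range(10)))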
And what you get when you use that simple Markov model is............
her, ptagen, lesed, quee, l, publuditenve, joblot, juright, or, quing, but, quir, s, he, en, d, xforsed, s, he, fornioll, par, e, ure, e, knewithe
That's uh.... that's cool.
Wait a minute. Words aren't really built from just one letter following another; phonemes are often at least two letters long. What happens if we raise the order of the Markov chain? That is, instead of keying off a single letter, what if we do the probability analysis on which letter follows two consecutive letters?
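The only real change is keying the table on letter pairs instead of single letters; roughly this, reusing the imports and START/END markers from the first sketch:

    def build_model2(words):
        # Same idea, but the key is the previous TWO symbols.
        model = defaultdict(list)
        for word in words:
            letters = [START, START] + list(word.lower()) + [END]
            for i in range(len(letters) - 2):
                model[(letters[i], letters[i + 1])].append(letters[i + 2])
        return model

    def generate2(model):
        out, state = [], (START, START)
        while True:
            nxt = random.choice(model[state])
            if nxt == END:
                return "".join(out)
            out.append(nxt)
            state = (state[1], nxt)

    model2 = build_model2(corpus)
    print(", ".join(generate2(model2) for _ in range(10)))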
argin, in, ipeal, mintlen, ided, oky, jand, ld, early, zings, he, taked, y, ecopricir, munce, ld, zz, it, r, ivy, ceds, vionnothe, ad, n, palle
Hey, that actually has real words in it now, and some of the others might even convince a non-English speaker?
Hm. But I see another problem with all those one- and two-letter words. Also, what the hell is that "zz"? Words shouldn't be able to start with two consonants. Let's make it so that the word generator throws out any word shorter than 3 characters, rejects a repeated first letter, and checks that we always start with vowel-consonant or consonant-vowel.
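Those checks are easy to bolt on as a rejection filter; here's my reading of the rules as a sketch:

    VOWELS = set("aeiou")

    def looks_ok(word):
        if len(word) < 3:
            return False              # too short
        if word[0] == word[1]:
            return False              # repeated first letter, e.g. "zz"
        # must open with vowel-consonant or consonant-vowel
        return (word[0] in VOWELS) != (word[1] in VOWELS)

    # e.g. filtering some of the order-2 output from above:
    print([w for w in ["zz", "argin", "ld", "oky", "ivy"] if looks_ok(w)])
    # -> ['argin', 'oky', 'ivy']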
While we're at it, let's also do some probability analysis on word length. For some reason I can't get my code back into the broken state it was in last night, but basically I would often end up with words like "lentembeadectivemall". So anyway, when word generation starts, pick a target length according to that distribution. If the generator starts exceeding that length, try to stop (each letter knows the probability that it terminates a word, because in the corpus it gets followed by a space or punctuation, so if a terminator is present in the list of possible next letters, just stop).
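Sketched out, that's an empirical length distribution plus an early-out in the generator, bolted onto the order-2 model above (the exact cutoff rule is the part I'm least sure I've reproduced here):

    from collections import Counter

    def pick_length(words):
        # Sample a target length from the corpus's empirical
        # distribution of word lengths.
        counts = Counter(len(w) for w in words)
        lengths = list(counts)
        return random.choices(lengths, weights=[counts[n] for n in lengths])[0]

    def generate_capped(model, target):
        # Like generate2, but once past the target length, end the word
        # as soon as a terminator shows up among the possible next letters.
        out, state = [], (START, START)
        while True:
            followers = model[state]
            if len(out) >= target and END in followers:
                return "".join(out)
            nxt = random.choice(followers)
            if nxt == END:
                return "".join(out)
            out.append(nxt)
            state = (state[1], nxt)

    # usage: generate_capped(model2, pick_length(corpus))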
feetwis, migout, mity, cuse, quar, igai, hist, obit, kess, lood, sives, axand, zons, turn, migh, pents, jesen, gici, bighlenerly, peed, pand, hatio, reachin, aceig, vith
These might actually be pronounceable, more or less.
And that's pretty much where I am right now. Some other training sets used:
A short list of names from Game of Thrones that I manually listed out:
tarah, darys, dary, ater, oraned, margandor, aratheon, ark, heont, heon, enjenje, jory, enjeor, ery, enlyn, dor, bargary, ely, mard, rah, arger, arys, ark, sandor, do
A list of female names that I got from the internet:
ura, zorenvia, isse, lortrudita, aline, erlyndrubr, osteie, didali, mely, essi, hine, lerine, beta, eli, uldannell, kaynna, maly, ine, ellie, oldi, udi, renne, onidorele, ollorte, kassan
Female names with a 4th order Markov chain:
justa, ena, hannalia, nina, ondio, menicia, berriah, lie, olloita, ude, nista, etty, doria, ildet, nie, dynn, elliw, lottice, argarisabb, arlenni, armalle, izella, arbie, ace, karlate
5th order:
hila, nalee, hellalfre, nella, odele, erebelce, railym, ana, colian, ine, ulcinil, maroby, mary, anora, errannicqu, lia, lann, gara, helen, anci, ulinie, malettacy, ennistom, elbardy, aria
Portuguese texts:
porm, udoi, idad, est, lialinentes, riunt, tout, zara, guasguniva, pos, ivoc, ide, nist, depe, hame, wint, pet, yme, nuth, utra, kus, com, ecria, worar, harconio
The book of Genesis:
andi, nut, cit, uman, omme, pad, ber, kast, wand, varamie, ist, iches, dem, puut, kit, tag, ter, kom, ren, ine, lifund, nigh, jaask, lach, ypt
And to make myself feel 100% accurate, a 50,000-word corpus of lorem ipsum:
que, elis, harcu, cin, quisque, esus, laculi, illestius, nisit, vini, hicilis, ges, fus, isis, dolor, piestae, narem, odis, isci, aliscel, rabi, erra, tiendimi, vel, gue
The next step is to keep finding constraints I can place on words, so I know when to throw one out and when one looks good enough. Maybe do another round of machine learning where I pick the best names/words of the bunch and feed them back in as training data. Also, I want to see what happens when you do the analysis on the probability that two letters follow two other letters, then combine that with the single-letter set.
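For that last idea, I'm picturing something like backoff: try a pair-to-pair table first, and fall back to the single-letter table whenever the pair was never seen in training. Totally untested sketch:

    def next_chunk(pair_model, letter_model, prev2, prev1):
        # pair_model maps a letter pair to the list of observed
        # two-letter continuations; letter_model is the original
        # single-letter table. Back off when the pair is unseen.
        key = (prev2, prev1)
        if key in pair_model:
            return random.choice(pair_model[key])   # two letters at once
        return random.choice(letter_model[prev1])   # one letter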