Post by ch00beh on Mar 1, 2013 12:23:06 GMT -5
NLP being Natural Language Processing, the branch of computing that applies machine learning to natural languages.
Currently just screwing around with using Markov chains to generate words based off some input text. Skip to the quote blocks if you don't care about how Markov chains work and just want to see the examples so far.
Basically, a Markov model just tells you, given some input state, what the probable outcomes are. For example, given the letter A, the model might say that B follows A fairly often, but that A almost never follows A.
So I did that. Given a corpus of ~1 million words taken from excerpts of real-world sources (thanks, Brown University!), analyze each word: take the first letter and use it as a key in a table, where it points to a list. Take the second letter and put it in that list. Repeat until you're out of letters in the word, then repeat until you're out of words.
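In code, the table-building and sampling looks something like this (a stripped-down Python sketch, not my exact code; the little word list stands in for the Brown corpus):

    import random
    from collections import defaultdict

    START, END = "^", "$"

    def build_model(words):
        # Each letter keys a list of every letter observed right after
        # it; duplicates in the list are what encode the frequencies.
        model = defaultdict(list)
        for word in words:
            letters = [START] + list(word.lower()) + [END]
            for cur, nxt in zip(letters, letters[1:]):
                model[cur].append(nxt)
        return model

    def generate(model):
        # Walk the chain from the start marker, sampling each next
        # letter from the observed followers of the current one.
        out, cur = [], START
        while True:
            cur = random.choice(model[cur])
            if cur == END:
                return "".join(out)
            out.append(cur)

    corpus = ["hello", "world", "probability", "night"]  # stand-in for Brown
    model = build_model(corpus)
    print(", ".join(generate(model) for _ in range(10)))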
And what you get when you use that simple Markov model is............
her, ptagen, lesed, quee, l, publuditenve, joblot, juright, or, quing, but, quir, s, he, en, d, xforsed, s, he, fornioll, par, e, ure, e, knewithe
That's uh.... that's cool.
Wait a minute. Words aren't really built from just one letter following another; phonemes are often at least two letters long. What happens if we raise the order of the Markov chain? That is, instead of keying off a single letter, what if we do the probability analysis on which letter follows two consecutive letters?
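The only real change is keying the table on letter pairs instead of single letters; roughly this, reusing the imports and START/END markers from the first sketch:

    def build_model2(words):
        # Same idea, but the key is the previous TWO symbols.
        model = defaultdict(list)
        for word in words:
            letters = [START, START] + list(word.lower()) + [END]
            for i in range(len(letters) - 2):
                model[(letters[i], letters[i + 1])].append(letters[i + 2])
        return model

    def generate2(model):
        out, state = [], (START, START)
        while True:
            nxt = random.choice(model[state])
            if nxt == END:
                return "".join(out)
            out.append(nxt)
            state = (state[1], nxt)

    model2 = build_model2(corpus)
    print(", ".join(generate2(model2) for _ in range(10)))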
argin, in, ipeal, mintlen, ided, oky, jand, ld, early, zings, he, taked, y, ecopricir, munce, ld, zz, it, r, ivy, ceds, vionnothe, ad, n, palle
Hey, that actually has real words in it now, and some of the others might even convince a non-English speaker?
Hm. But I see another problem with all those one- and two-letter words. Also, what the hell is that "zz"? Words shouldn't be able to start with two consonants. Let's make it so that the word generator throws out any word shorter than 3 characters, rejects a repeated first letter, and checks that we always start with vowel-consonant or consonant-vowel.
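Those checks are easy to bolt on as a rejection filter; here's my reading of the rules as a sketch:

    VOWELS = set("aeiou")

    def looks_ok(word):
        if len(word) < 3:
            return False              # too short
        if word[0] == word[1]:
            return False              # repeated first letter, e.g. "zz"
        # must open with vowel-consonant or consonant-vowel
        return (word[0] in VOWELS) != (word[1] in VOWELS)

    # e.g. filtering some of the order-2 output from above:
    print([w for w in ["zz", "argin", "ld", "oky", "ivy"] if looks_ok(w)])
    # -> ['argin', 'oky', 'ivy']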
While we're at it, let's also do some probability analysis on word length. For some reason I can't get my code back into the broken state it was in last night, but basically I would often end up with words like "lentembeadectivemall". So anyway, when word generation starts, pick a target length according to that distribution. If the generator starts exceeding that length, try to stop (each letter knows the probability that it terminates a word, because in the corpus it gets followed by a space or punctuation, so if a terminator is present in the list of possible next letters, just stop).
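Sketched out, that's an empirical length distribution plus an early-out in the generator, bolted onto the order-2 model above (the exact cutoff rule is the part I'm least sure I've reproduced here):

    from collections import Counter

    def pick_length(words):
        # Sample a target length from the corpus's empirical
        # distribution of word lengths.
        counts = Counter(len(w) for w in words)
        lengths = list(counts)
        return random.choices(lengths, weights=[counts[n] for n in lengths])[0]

    def generate_capped(model, target):
        # Like generate2, but once past the target length, end the word
        # as soon as a terminator shows up among the possible next letters.
        out, state = [], (START, START)
        while True:
            followers = model[state]
            if len(out) >= target and END in followers:
                return "".join(out)
            nxt = random.choice(followers)
            if nxt == END:
                return "".join(out)
            out.append(nxt)
            state = (state[1], nxt)

    # usage: generate_capped(model2, pick_length(corpus))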
feetwis, migout, mity, cuse, quar, igai, hist, obit, kess, lood, sives, axand, zons, turn, migh, pents, jesen, gici, bighlenerly, peed, pand, hatio, reachin, aceig, vith
These might actually be pronounceable, more or less.
And that's pretty much where I am right now. Some other training sets used:
A short list of names from Game of Thrones that I manually listed out:
tarah, darys, dary, ater, oraned, margandor, aratheon, ark, heont, heon, enjenje, jory, enjeor, ery, enlyn, dor, bargary, ely, mard, rah, arger, arys, ark, sandor, do
A list of female names that I got from the internet:
ura, zorenvia, isse, lortrudita, aline, erlyndrubr, osteie, didali, mely, essi, hine, lerine, beta, eli, uldannell, kaynna, maly, ine, ellie, oldi, udi, renne, onidorele, ollorte, kassan
Female names with a 4th order Markov chain:
justa, ena, hannalia, nina, ondio, menicia, berriah, lie, olloita, ude, nista, etty, doria, ildet, nie, dynn, elliw, lottice, argarisabb, arlenni, armalle, izella, arbie, ace, karlate
5th order:
hila, nalee, hellalfre, nella, odele, erebelce, railym, ana, colian, ine, ulcinil, maroby, mary, anora, errannicqu, lia, lann, gara, helen, anci, ulinie, malettacy, ennistom, elbardy, aria
Portuguese texts:
porm, udoi, idad, est, lialinentes, riunt, tout, zara, guasguniva, pos, ivoc, ide, nist, depe, hame, wint, pet, yme, nuth, utra, kus, com, ecria, worar, harconio
The book of Genesis:
andi, nut, cit, uman, omme, pad, ber, kast, wand, varamie, ist, iches, dem, puut, kit, tag, ter, kom, ren, ine, lifund, nigh, jaask, lach, ypt
And to make myself feel 100% accurate, a 50,000-word corpus of lorem ipsum:
que, elis, harcu, cin, quisque, esus, laculi, illestius, nisit, vini, hicilis, ges, fus, isis, dolor, piestae, narem, odis, isci, aliscel, rabi, erra, tiendimi, vel, gue
The next step is to keep finding constraints I can place on words, so I know when to throw one out and when one looks good enough. Maybe do another round of machine learning where I pick the best names/words of the bunch and feed them back in as training data. Also, I want to see what happens when you do the analysis on the probability that two letters follow two other letters, then combine that with the single-letter set.
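For that last idea, I'm picturing something like backoff: try a pair-to-pair table first, and fall back to the single-letter table whenever the pair was never seen in training. Totally untested sketch:

    def next_chunk(pair_model, letter_model, prev2, prev1):
        # pair_model maps a letter pair to the list of observed
        # two-letter continuations; letter_model is the original
        # single-letter table. Back off when the pair is unseen.
        key = (prev2, prev1)
        if key in pair_model:
            return random.choice(pair_model[key])   # two letters at once
        return random.choice(letter_model[prev1])   # one letter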