How to parse ‘go’
Natural Language Processing in Ruby
Tom Cartwright
@tomcartwrightuk

keepmebooked
giveaiddirect.com
Python, surely?
Yes. The NLTK is awesome.
But you have a Ruby-based app.
Extracting meaning from human input
Summarisation
Extracting entities
Tagging text
Sentiment analysis
Filtering text
From document level to word level:
document > sentence > word
Chunking & segmenting
Breaking text into paragraphs, sentences and other zones
Start with a document/some text:
“The second nonabsolute number is the given time of
arrival, which is now known to be one of those most bizarre
of mathematical concepts, a recipriversexclusion, a number
whose existence can only be defined as being anything other
than itself…..”
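To see why a trained segmenter helps, here is a minimal sketch of the naive alternative: splitting on whitespace that follows a full stop, question mark or exclamation mark. The `naive_sentences` helper is hypothetical and uses only the standard library.

```ruby
# Naive sentence splitter: break on whitespace that follows . ? or !
# (hypothetical helper for illustration, not part of any gem).
def naive_sentences(text)
  text.split(/(?<=[.?!])\s+/)
end

naive_sentences("Time is an illusion. Lunchtime doubly so.")
# => ["Time is an illusion.", "Lunchtime doubly so."]

# Abbreviations trip it up:
naive_sentences("Dr. Dent arrived late. He was not amused.")
# => ["Dr.", "Dent arrived late.", "He was not amused."]
```

The second call shows the problem: an abbreviation is treated as a sentence boundary.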
Punkt sentence tokenizer to the rescue….

require 'punkt-segmenter'

text = "The second nonabsolute number is the given time of arrival..."

# The tokenizer learns its parameters from the text passed to new
tokenizer = Punkt::SentenceTokenizer.new(text)

result = tokenizer.sentences_from_text(text, :output => :sentences_text)
Training

trainer = Punkt::Trainer.new()
trainer.train(bistromatic_text)
Tokenising
Breaking text into words, phrases and symbols.
"Time is an illusion. Lunchtime doubly so.".split(" ")

#=> ["Time", "is", "an", "illusion.", "Lunchtime", "doubly", "so."]
Tokenizer gem
Regexes and rules
class Tokenizer
  FS = Regexp.new('[[:blank:]]+')
  PAIR_PRE = ['(', '{', '[']
  SIMPLE_POST = ['!', '?', ',', ':', ';', '.']
  PAIR_POST = [')', '}', ']']
  PRE_N_POST = ['"', "'"]
  …
tokenizer = Tokenizer::Tokenizer.new
tokenizer.tokenize("Time is an illusion. Lunchtime doubly so.")

#=> ["Time", "is", "an", "illusion", ".", "Lunchtime", "doubly", "so", "."]
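The idea of separating punctuation from words can be sketched with the standard library alone. This is a rough approximation, not the Tokenizer gem's actual rule set; `simple_tokenize` is a hypothetical helper.

```ruby
# Emit each run of letters/digits as one token and each punctuation
# character as its own token (hypothetical helper, stdlib only).
def simple_tokenize(text)
  text.scan(/[[:alnum:]]+|[[:punct:]]/)
end

simple_tokenize("Time is an illusion. Lunchtime doubly so.")
# => ["Time", "is", "an", "illusion", ".", "Lunchtime", "doubly", "so", "."]
```

A single regex like this also splits contractions and hyphenated words at the punctuation mark, which is where a rule-based tokenizer earns its keep.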

Stemming
Jogging => Jog
"jogging".gsub(/.ing/, "")
#=> "jog"

"bring".gsub(/.ing/, "")
#=> "b"
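The gsub approach breaks because it strips the suffix unconditionally. A slightly safer sketch, loosely modelled on one rule of the Porter algorithm (only strip "ing" when the remaining stem contains a vowel, then undouble a trailing consonant other than l, s or z), might look like this. `naive_stem` is a hypothetical helper and covers only this one suffix.

```ruby
# Strip a trailing "ing" only when the remaining stem contains a vowel,
# then collapse a doubled final consonant (except l, s, z).
# Hypothetical helper: a real stemmer applies many more rules.
def naive_stem(word)
  stem = word.sub(/ing\z/, "")
  return word if stem == word       # no "ing" suffix
  return word unless stem =~ /[aeiou]/ # "bring" -> "br" has no vowel
  stem.sub(/([^aeioulsz])\1\z/, '\1')
end

naive_stem("jogging")     # => "jog"
naive_stem("bring")       # => "bring"
naive_stem("programming") # => "program"
```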
1. Ruby-Stemmer: multi-language porter stemmer
2. Text: porter stemmer
stemmer = Lingua::Stemmer.new(:language => "en")
stemmer.stem("programming") #=> program
stemmer.stem("vimming") #=> vim
Parts-of-speech tagging
CC   conjunction                and, but
DET  determiner                 this, some
IN   preposition / conjunction  above, about
JJ   adjective                  orange, tiny
NNP  proper noun                Camden Pale Ale
A couple of methods

Regex tagger
/.*ing/ => VBG
/.*ed/ => VBD

Lookup on words
E.g.
calculating: { VBG: 6 }
orange: { JJ: 2, NN: 5 }
A tale of two taggers
EngTagger
• Probabilistic (uses the look-up table from the previous slide)
• Pure Ruby

rb-brill-tagger
• Rule based
• C extensions
• Brown corpus trained
Treat gem
Bundles many of the gems shown
Wraps them in a DSL
s = sentence("A really good sentence.")
s.do(:chunk, :segment, :tokenize, :parse)

stemming; tokenising; chunking; serialising;
tagging; text extraction from pdfs and html;
LRUG Sentiments
A tag

{NN}

Pass in regex => /({JJ}|{JJS})({NNS}|{NNP})/
And some tagged tokens
#=> [(Word @tag="JJ", @text="jolly"),
     (Word @tag="NN", @text="face")]
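One way to match a tag pattern against tagged tokens is to join the tags into a single string, run the regex over it, and map each match back to the tokens it covers. This is a guess at the general technique, not the LRUG Sentiments code: the `Word` struct, the `match_tags` helper and the escaped-brace pattern are all assumptions, and the pattern here also accepts {NN} so that the jolly/face pair matches.

```ruby
Word = Struct.new(:text, :tag)

# Returns the runs of words whose tag sequence matches the pattern.
def match_tags(words, pattern)
  offsets = []            # start offset of each word's "{TAG}" in the string
  tag_string = String.new
  words.each do |w|
    offsets << tag_string.length
    tag_string << "{#{w.tag}}"
  end

  results = []
  tag_string.scan(pattern) do
    m = Regexp.last_match
    first = offsets.index(m.begin(0))
    last  = offsets.index(m.end(0)) || words.length
    results << words[first...last]
  end
  results
end

words = [Word.new("a", "DET"), Word.new("jolly", "JJ"), Word.new("face", "NN")]
pattern = /(\{JJ\}|\{JJS\})(\{NN\}|\{NNS\}|\{NNP\})/

match_tags(words, pattern).map { |run| run.map(&:text) }
# => [["jolly", "face"]]
```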
Sentimental value
1.0       epic
1.0       good
0.21875   chance
0.21875   brisk
-1.0      slanderous
-1.0      piteous
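With word scores like these, a simple way to score a piece of text is to average the scores of the words it contains. The sketch below assumes exactly that; the averaging rule, the `SENTIMENT` constant and the `sentiment` helper are illustrative assumptions rather than the method used in the talk.

```ruby
# Word scores copied from the slide.
SENTIMENT = {
  "epic" => 1.0, "good" => 1.0,
  "chance" => 0.21875, "brisk" => 0.21875,
  "slanderous" => -1.0, "piteous" => -1.0
}.freeze

# Average the scores of known words; unknown words are ignored.
def sentiment(text)
  scores = text.downcase.scan(/[a-z]+/).map { |w| SENTIMENT[w] }.compact
  return 0.0 if scores.empty?
  scores.sum / scores.size
end

sentiment("An epic, brisk talk")  # => 0.609375
sentiment("Slanderous but good")  # => 0.0
```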
Results
• Ruby
• Practical Object-Oriented Design in Ruby
• Doctors
• Lrug
• recruiters (!)

• dedicated servers
• pdfs
• Surrey

• unsolicited phone calls from r********s
• clients
• Paypal
• XML
• geeks
Gems
Text - Paul Battley’s box of tricks
Treat
Tokenizer
Punkt segmenter
Chronic - for extracting dates
Other things you can do/I didn’t talk about
Calculate text edit distance
Extract entities using the Stanford libraries via RJB
Extract topic words (LDA)
Keyword extraction - TF-IDF
JRuby
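As an illustration of the first item, text edit distance can be computed with the classic Levenshtein dynamic programme. This is a standard-library sketch with a hypothetical `edit_distance` helper; in practice a gem such as Text would be the more likely choice.

```ruby
# Levenshtein distance: minimum number of single-character insertions,
# deletions and substitutions needed to turn a into b.
def edit_distance(a, b)
  prev = (0..b.length).to_a # distances from "" to each prefix of b
  a.each_char.with_index(1) do |ca, i|
    curr = [i]              # distance from a[0, i] to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,           # deletion
               curr[j - 1] + 1,       # insertion
               prev[j - 1] + cost].min # substitution or match
    end
    prev = curr
  end
  prev.last
end

edit_distance("kitten", "sitting") # => 3
edit_distance("ruby", "rugby")     # => 1
```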
Thank you for processing.
Questions?
@tomcartwrightuk

Thanks to Tim Cowlishaw and the HT dev
team for specialised rubber duck support
