A walkthrough example

The following examples give an overview of how to use py-processors.

Getting started

For annotation and parsing, py-processors communicates with processors-server using a REST interface.

The server can be run either directly via Java or in a Docker container. Let’s look at how to connect to the server.

Running the NLP server

Option 1: processors-server.jar

This method requires Java and a compatible processors-server.jar. An appropriate jar will be downloaded automatically if one is not found.

from processors import *
# The constructor requires you to specify a port for running the server.
API = ProcessorsAPI(port=8886)

NOTE: It may take a minute or so for the server to initialize as there are some large model files that need to be loaded.
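
If you want to be sure the models have finished loading before issuing real requests, one option is to poll the server with a trivial call. wait_for_server below is a hypothetical helper sketch, not part of py-processors:

import time

def wait_for_server(api, retries=12, delay=10):
    # retry a trivial annotation until the server responds with a Document
    for _ in range(retries):
        try:
            if api.annotate("ping") is not None:
                return True
        except Exception:
            pass
        time.sleep(delay)
    return False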

Option 2: docker container

You can pull the official container from Docker Hub:

docker pull myedibleenso/processors-server:latest

You can check py-processors for the appropriate version to retrieve:

import processors
# print the recommended processors-server version
print(processors.__ps_rec__)

Just replace latest in the command above with the appropriate version (3.1.0 onwards).
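
For example, if the recommended version printed above were 3.1.0, you would run:

docker pull myedibleenso/processors-server:3.1.0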

The following command will run the container in the background and expose the service on port 8886:

docker run -d -e _JAVA_OPTIONS="-Xmx3G" -p 127.0.0.1:8886:8888 --name procserv myedibleenso/processors-server

For a more detailed example showcasing configuration options, take a look at this docker-compose.yml file. You’ll need to map a local port to 8888 in the container.
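
As a rough sketch of such a file (the values below simply mirror the docker run command above; the linked file covers the full set of options):

version: "2"
services:
  procserv:
    image: myedibleenso/processors-server:latest
    environment:
      - _JAVA_OPTIONS=-Xmx3G
    ports:
      - "127.0.0.1:8886:8888"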

Once the container is running, you can connect to it via py-processors:

from processors import *
# provide the local port that you mapped to 8888 on the running container
API = ProcessorsBaseAPI(hostname="127.0.0.1", port=8886)
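
Under the hood, each py-processors call is just an HTTP request to the server. As a rough equivalent using requests (this assumes the /api/annotate route exposed by processors-server):

import requests

res = requests.post(
    "http://127.0.0.1:8886/api/annotate",
    json={"text": "Hello, world!"},
)
print(res.json())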

Annotating text

Text can be automatically annotated with linguistic attributes such as PoS tags, lemmas, named entities, and syntactic dependencies.

# try annotating some text using FastNLPProcessor (a CoreNLP wrapper)
doc = API.fastnlp.annotate("My name is Inigo Montoya.  You killed my father.  Prepare to die.")

# you can also annotate text already segmented into sentences
doc = API.fastnlp.annotate_from_sentences(["My name is Inigo Montoya.", "You killed my father.", "Prepare to die."])

# There should be 3 Sentence objects in this Document
doc.size

# A Document aggregates the annotations of its component Sentences,
# e.g. all of the labeled syntactic dependencies
doc.bag_of_labeled_deps

# We can access the named entities for the Document as a dictionary mapping an NE label -> list of named entities
doc.nes
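
# e.g., iterate over the label -> entities mapping (illustrative printing only)
for label, entities in doc.nes.items():
    print(label, entities)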

# A Sentence contains words, pos tags, lemmas, named entities, and syntactic dependencies
doc.sentences[0].lemmas

# get the first sentence
s = doc.sentences[0]

# the number of tokens in this sentence
s.length

# the named entities contained in this sentence
s.nes

# generate labeled dependencies using "words", "tags", "lemmas", "entities", or token index ("index")
s.bag_of_labeled_dependencies_using("tags")

# generate unlabeled dependencies using "words", "tags", "lemmas", "entities", or token index ("index")
s.bag_of_unlabeled_dependencies_using("lemmas")

# play around with the dependencies directly
deps = s.dependencies

# see what dependencies lead directly to the first token (i.e. token 0 is the dependent of what?)
deps.incoming[0]

# see what dependencies are originating from the first token (i.e. token 0 is the head of what?)
deps.outgoing[0]

# find all shortest paths between "name" and either "Inigo" or "Montoya".
deps.shortest_paths(start=1, end=[3,4])

# find the shortest path between "name" and either "Inigo" or "Montoya".  Prefer a path that involves a "nsubj" relation.
sp = deps.shortest_path(start=1, end=[3,4],
                        scoring_func=lambda path: 9000 if any(seg[1] == "nsubj" for seg in path) else 0)

# generate an Odin-like pattern with partial lexicalization
DependencyUtils.lexicalize_path(sentence=s, path=sp, lemmas=True, tags=True)

# limit lexicalization to tokens 1 and 4 (if present)
DependencyUtils.lexicalize_path(sentence=s, path=sp, lemmas=True, tags=True, limit_to=[1,4])

# run PageRank on the dependency graph to find nodes with the most activity.
# SPOILER: When using reverse=True, the nodes with the highest weight are usually the sentential predicate and its args
deps.pagerank(reverse=True)
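
# assuming pagerank returns a node -> weight mapping (as networkx's pagerank does),
# you can rank the nodes by weight like so:
for node, weight in sorted(deps.pagerank(reverse=True).items(), key=lambda kv: kv[1], reverse=True):
    print(node, weight)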

# find out which nodes are most central to the dependency graph
deps.degree_centrality()

# retrieve the likely semantic head for a sentence.
from processors.paths import HeadFinder
doc2 = API.annotate("acute renal failure")
sentence = doc2.sentences[0]
# select the graph to examine (default is "stanford-collapsed") and
# optionally limit to a set of PoS tags (regex or str)
head_idx = sentence.semantic_head(graph_name="stanford-collapsed", valid_tags=None)
head_word = sentence.words[head_idx] if head_idx is not None else None

# try using BioNLPProcessor
biodoc = API.bionlp.annotate("We next considered the effect of Ras monoubiquitination on GAP-mediated hydrolysis")

# check out the bio-specific entities
biodoc.nes

Serializing to/from JSON

Once you’ve annotated text, you can serialize it to JSON for later loading.

# serialize to/from JSON!
import json

json_file = "serialized_doc_example.json"
ross_doc = API.fastnlp.annotate("We don't make mistakes, just happy little accidents.")

# serialize to JSON
with open(json_file, "w") as out:
    out.write(ross_doc.to_JSON())

# load from JSON
with open(json_file, "r") as jf:
    d = Document.load_from_JSON(json.load(jf))
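
# quick sanity check on the roundtrip, using attributes shown earlier
assert d.size == ross_doc.size
print(d.sentences[0].words)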

Perform sentiment analysis

You can perform sentiment analysis using CoreNLP's tree-based system.

# get sentiment analysis scores
review = "The humans are dead."
doc = API.fastnlp.annotate(review)

# try Stanford's tree-based sentiment analysis
# you'll get a score for each Sentence
# scores range from 1 (very negative) to 5 (very positive)
scores = API.sentiment.corenlp.score_document(doc)

# you can pass text directly
scores = API.sentiment.corenlp.score_text(review)

# ... or a single sentence
score = API.sentiment.corenlp.score_sentence(doc.sentences[0])

# ... or from text already segmented into sentences
lyrics = ["My sugar lumps are two of a kind", "Sweet and white and highly refined", "Honeys try all kinds of tomfoolery", "to steal a feel of my family jewelry"]
scores = API.sentiment.corenlp.score_segmented_text(lyrics)
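
# score_segmented_text should return one score per input sentence,
# so the two lists can be zipped back together:
for line, score in zip(lyrics, scores):
    print(score, line)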

Rule-based information extraction (IE) with Odin

If you’re unfamiliar with writing Odin rules, see our manual for a primer on the language: http://arxiv.org/pdf/1509.07513v1.pdf

# Do rule-based IE with Odin!
# see http://arxiv.org/pdf/1509.07513v1.pdf for details
example_rule = """
- name: "ner-person"
  label: [Person, PossiblePerson, Entity]
  priority: 1
  type: token
  pattern: |
   [entity="PERSON"]+
   |
   [tag=/^N/]* [tag=/^N/ & outgoing="cop"] [tag=/^N/]*
"""

example_text = """
Barack Hussein Obama II is the 44th and current President of the United States and the first African-American to hold the office.
He is a Democrat.
Obama won the 2008 United States presidential election, on November 4, 2008.
He was inaugurated on January 20, 2009.
"""

# take a look at the .label, .labels, and .text attributes of each mention
mentions = API.odin.extract_from_text(example_text, example_rule)
# visualize the structure of a mention as colored output in the terminal
for m in mentions: print(m)
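
# mentions carry one or more labels, so you can filter on any of them
# (illustrative; uses the .labels attribute mentioned above)
people = [m for m in mentions if "Person" in m.labels]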

# Alternatively, you can provide a rule URL.  The URL should end with .yml or .yaml.
rules_url = "https://raw.githubusercontent.com/clulab/reach/508697db2217ba14cd1fa0a99174816cc3383317/src/main/resources/edu/arizona/sista/demo/open/grammars/rules.yml"

mentions = API.odin.extract_from_text(example_text, rules_url)

# You can also perform IE with Odin on a Document.
barack_doc = API.annotate(example_text)
mentions = API.odin.extract_from_document(barack_doc, rules_url)

# mentions can be serialized as well
mentions_json_file = "mentions.json"

with open(mentions_json_file, "w") as out:
    out.write(JSONSerializer.mentions_to_JSON(mentions))

# loading from a file is also handled via JSONSerializer
with open(mentions_json_file, "r") as jf:
    mentions = JSONSerializer.mentions_from_JSON(json.load(jf))

OpenIE for concept recognition

coming soon

Jupyter notebook visualizations

py-processors supports some custom notebook-based visualizations, but you’ll need to install the package with the [jupyter] extra in order to use them:

pip install "py-processors[jupyter]"

These visualizations make use of our fork of displaCy. You can visualize a Sentence’s dependency graph as an SVG image using visualization.JupyterVisualizer.display_graph():

from processors.visualization import JupyterVisualizer as viz
# run this snippet within a jupyter notebook
text = "To be loved by unicorns is the greatest gift of all."
doc = API.annotate(text)
viz.display_graph(doc.sentences[0], graph_name="stanford-collapsed")
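
# the same call works for any Sentence, so you can render the whole Document
for sent in doc.sentences:
    viz.display_graph(sent, graph_name="stanford-collapsed")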

Mentions can also be visualized in a notebook:

# run this snippet within a jupyter notebook
rules = """
rules:
  - name: "ner-location"
    label: [Location, PossibleLocation, Entity]
    priority: 1
    type: token
    pattern: |
      [entity="LOCATION"]+ | Twin Peaks

  - name: "ner-person"
    label: [Person, PossiblePerson, Entity]
    priority: 1
    type: token
    pattern: |
     [entity="PERSON"]+

  - name: "ner-org"
    label: [Organization, Entity]
    priority: 1
    type: token
    pattern: |
      [entity="ORGANIZATION"]+

  - name: "ner-date"
    label: [Date]
    priority: 1
    type: token
    pattern: |
      [entity="DATE"]+

  - name: "missing"
    label: Missing
    pattern: |
      trigger = [lemma=go] missing
      theme: Person = <xcomp nsubj
      date: Date? = prep_on
"""
mentions = API.odin.extract_from_text("FBI Special Agent Dale Cooper went missing on June 10, 1991.  He was last seen in the woods of Twin Peaks. ", rules=rules)

for m in mentions: viz.display_mention(m)

Other ways of initializing the server

Using a custom processors-server

When initializing the API, you can specify a path to a custom processors-server.jar using the jar_path parameter:

from processors import *

API = ProcessorsAPI(port=8886, jar_path="path/to/processors-server.jar")

Alternatively, you can set the PROCESSORS_SERVER environment variable. In your .bashrc (or equivalent), add the following line, pointing to the jar you wish py-processors to use:

export PROCESSORS_SERVER="path/to/processors-server.jar"

Remember to source your profile:

source path/to/your/.profile

py-processors will now prefer this jar whenever a new API is initialized.

NOTE: If you decide that you no longer want to use this environment variable, remember to both remove it from your profile and run unset PROCESSORS_SERVER from the shell.

Allocating memory

By default, the server runs with 3GB of RAM, though you might be able to get by with a little less. You can start the server with a different amount of memory using the jvm_mem parameter:

from processors import *
# run the server with 2GB of memory
API = ProcessorsAPI(port=8886, jvm_mem="-Xmx2G")

NOTE: This won’t have any effect if the server is already running on the given port.

Keeping the server running

If you’ve launched the server via Java, py-processors will by default attempt to shut down the server whenever an API instance goes out of scope (e.g., your script finishes or you exit the interpreter).

If you’d prefer to keep the server alive, you’ll need to initialize the API with keep_alive=True:

from processors import *

API = ProcessorsAPI(port=8886, keep_alive=True)

This is useful if you’re sharing access to the server on a network, or if you have a bunch of independent tasks and would prefer to avoid waiting for the server to initialize again and again.