API Reference

This section of the documentation provides detailed information on functions, classes, and methods.

Server Communication

Communicating with the NLP server (processors-server) is handled by the following classes:

ProcessorsBaseAPI

class processors.api.ProcessorsBaseAPI(**kwargs)[source]

Bases: object

Manages a connection with processors-server and provides an interface to the API.

Parameters:
  • port (int) – The port the server is running on or should be started on. Default is 8886.
  • hostname (str) – The host name to use for the server. Default is “localhost”.
  • log_file (str) – The path for the log file. Default is py-processors.log in the user’s home directory.
annotate(text)

Produces a Document from the provided text using the default processor.

clu.annotate(text)

Produces a Document from the provided text using CluProcessor.

fastnlp.annotate(text)

Produces a Document from the provided text using FastNLPProcessor.

bionlp.annotate(text)

Produces a Document from the provided text using BioNLPProcessor.
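
For example, assuming a processors-server instance is already running on localhost:8886, a client can be constructed and text annotated in a few lines (the example text is illustrative):

    from processors.api import ProcessorsBaseAPI

    # Connect to a running processors-server instance.
    API = ProcessorsBaseAPI(hostname="localhost", port=8886)

    # Annotate raw text using the default processor ...
    doc = API.annotate("My name is Inigo Montoya. You killed my father. Prepare to die.")

    # ... or using a specific processor.
    doc = API.fastnlp.annotate("My name is Inigo Montoya.")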

annotate_from_sentences(sentences)

Produces a Document from sentences (a list of text split into sentences). Uses the default processor.

fastnlp.annotate_from_sentences(sentences)

Produces a Document from sentences (a list of text split into sentences). Uses FastNLPProcessor.

bionlp.annotate_from_sentences(sentences)

Produces a Document from sentences (a list of text split into sentences). Uses BioNLPProcessor.
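
If the text has already been segmented, sentence splitting can be skipped. A short sketch (reusing the API client from the example above):

    # Each string is treated as a single sentence; no further splitting occurs.
    sentences = ["Pack my box with five dozen liquor jugs.", "This is a test."]
    doc = API.annotate_from_sentences(sentences)
    assert doc.size == len(sentences)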

corenlp.sentiment.score_sentence(sentence)

Produces a sentiment score for the provided sentence (an instance of Sentence).

corenlp.sentiment.score_document(doc)

Produces sentiment scores for the provided doc (an instance of Document). One score is produced for each sentence.

corenlp.sentiment.score_segmented_text(sentences)

Produces sentiment scores for the provided sentences (a list of text segmented into sentences). One score is produced for each item in sentences.
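
A sketch of the scoring methods (the attribute path follows the listing above; reusing the API client from the first example):

    # One score per sentence of an annotated Document.
    doc = API.annotate("I am very happy. This is terrible.")
    scores = API.corenlp.sentiment.score_document(doc)

    # Score a single Sentence ...
    score = API.corenlp.sentiment.score_sentence(doc.sentences[0])

    # ... or score pre-segmented text directly.
    scores = API.corenlp.sentiment.score_segmented_text(["I am very happy.", "This is terrible."])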

odin.extract_from_text(text, rules)

Produces a list of Mentions for matches of the provided rules on the text. rules can be a string of Odin rules, or a URL ending in .yml or .yaml.

odin.extract_from_document(doc, rules)

Produces a list of Mentions for matches of the provided rules on the doc (an instance of Document). rules can be a string of Odin rules, or a URL ending in .yml or .yaml.
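
A minimal sketch of rule-based extraction (the rule below is illustrative; see the Odin manual linked later in this section for the full grammar language):

    # A single token-pattern rule that matches PERSON named entities.
    rules = """
    - name: "person-rule"
      label: Person
      type: token
      pattern: |
        [entity="PERSON"]+
    """
    mentions = API.odin.extract_from_text("Barack Obama spoke on Tuesday.", rules)
    for m in mentions:
        print(m.label, m.words())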

ProcessorsAPI

class processors.api.ProcessorsAPI(**kwargs)[source]

Bases: processors.api.ProcessorsBaseAPI

Manages a connection with the processors-server jar and provides an interface to the API.

Parameters:
  • timeout (int) – The number of seconds to wait for the server to initialize. Default is 120.
  • jvm_mem (str) – The maximum amount of memory to allocate to the JVM for the server. Default is “-Xmx3G”.
  • jar_path (str) – The path to the processors-server jar. Default is the jar installed with the package.
  • keep_alive (bool) – Whether or not to keep the server running when the ProcessorsAPI instance goes out of scope. Default is False (the server is shut down).
  • log_file (str) – The path for the log file. Default is py-processors.log in the user’s home directory.
start_server(jar_path, **kwargs)

Starts the server using the provided jar_path. Optionally takes hostname, port, jvm_mem, and timeout.

stop_server()

Attempts to stop the server running at self.address.
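
A sketch of the server lifecycle (the jar path is illustrative):

    from processors.api import ProcessorsAPI

    # Starts the bundled jar by default; keep_alive=False shuts the
    # server down when the instance goes out of scope.
    API = ProcessorsAPI(port=8886, keep_alive=False, jvm_mem="-Xmx3G")

    # Alternatively, (re)start with a specific jar and allow extra startup time.
    API.start_server("/path/to/processors-server.jar", timeout=180)

    # ... annotate text, extract mentions, etc. ...

    API.stop_server()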

OdinAPI

class processors.api.OdinAPI(address)[source]

Bases: object

API for performing rule-based information extraction with Odin.

Parameters:
  • address (str) – The base address for the API (i.e., everything preceding /api/..).

OpenIEAPI

class processors.api.OpenIEAPI(address)[source]

Bases: object

SentimentAnalysisAPI

class processors.sentiment.SentimentAnalysisAPI(address)[source]

Bases: object

API for performing sentiment analysis.

Parameters:
  • address (str) – The base address for the API (i.e., everything preceding /api/..).
corenlp

processors.sentiment.CoreNLPSentimentAnalyzer – Service using [CoreNLP‘s tree-based system](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) for performing sentiment analysis.

Data Structures

NLPDatum

class processors.ds.NLPDatum[source]

Bases: object

Document

class processors.ds.Document(sentences)[source]

Bases: processors.ds.NLPDatum

Storage class for annotated text. Based on [org.clulab.processors.Document](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/Document.scala)

Parameters:
  • sentences ([processors.ds.Sentence]) – The sentences comprising the Document.
id

str or None – A unique ID for the Document.

size

int – The number of sentences.

sentences

[processors.ds.Sentence] – The sentences comprising the Document.

words

[str] – A list of the Document‘s tokens.

tags

[str] – A list of the Document‘s tokens represented using part of speech (PoS) tags.

lemmas

[str] – A list of the Document‘s tokens represented using lemmas.

_entities

[str] – A list of the Document‘s tokens represented using IOB-style named entity (NE) labels.

nes

dict – A dictionary of NE labels represented in the Document -> a list of corresponding text spans.

bag_of_labeled_deps

[str] – The labeled dependencies from all sentences in the Document.

bag_of_unlabeled_deps

[str] – The unlabeled dependencies from all sentences in the Document.

text

str or None – The original text of the Document.

bag_of_labeled_dependencies_using(form)

Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.

bag_of_unlabeled_dependencies_using(form)

Produces a list of syntactic dependencies where each edge is left unlabeled (its grammatical relation is omitted).
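
A sketch of inspecting an annotated Document (values depend on the processor used; the "lemmas" form argument is an assumption):

    doc = API.annotate("Houston, we have a problem.")

    doc.size     # number of sentences
    doc.words    # tokens across all sentences
    doc.nes      # e.g. {"LOCATION": ["Houston"]}

    # Dependency edges from all sentences, lexicalized using lemmas.
    doc.bag_of_labeled_dependencies_using("lemmas")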

Sentence

class processors.ds.Sentence(**kwargs)[source]

Bases: processors.ds.NLPDatum

Storage class for an annotated sentence. Based on [org.clulab.processors.Sentence](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/Sentence.scala)

Parameters:
  • text (str or None) – The text of the Sentence.
  • words ([str]) – A list of the Sentence‘s tokens.
  • startOffsets ([int]) – The character offsets starting each token (inclusive).
  • endOffsets ([int]) – The character offsets marking the end of each token (exclusive).
  • tags ([str]) – A list of the Sentence‘s tokens represented using part of speech (PoS) tags.
  • lemmas ([str]) – A list of the Sentence‘s tokens represented using lemmas.
  • chunks ([str]) – A list of the Sentence‘s tokens represented using IOB-style phrase labels (ex. B-NP, I-NP, B-VP, etc.).
  • entities ([str]) – A list of the Sentence‘s tokens represented using IOB-style named entity (NE) labels.
  • graphs (dict) – A dictionary of {graph-name -> {edges: [{source, destination, relation}], roots: [int]}}
text

str – The text of the Sentence.

startOffsets

[int] – The character offsets starting each token (inclusive).

endOffsets

[int] – The character offsets marking the end of each token (exclusive).

length

int – The number of tokens in the Sentence.

graphs

dict – A dictionary (str -> processors.ds.DirectedGraph) mapping the graph type/name to a processors.ds.DirectedGraph.

basic_dependencies

processors.ds.DirectedGraph – A processors.ds.DirectedGraph using basic Stanford dependencies.

collapsed_dependencies

processors.ds.DirectedGraph – A processors.ds.DirectedGraph using collapsed Stanford dependencies.

dependencies

processors.ds.DirectedGraph – A pointer to the preferred syntactic dependency graph type for this Sentence.

_entities

[str] – The IOB-style Named Entity (NE) labels corresponding to each token.

_chunks

[str] – The IOB-style chunk labels corresponding to each token.

nes

dict – A dictionary of NE labels represented in the Sentence -> a list of corresponding text spans (ex. {“PERSON”: [phrase 1, ..., phrase n]}). Built from Sentence._entities.

phrases

dict – A dictionary of chunk labels represented in the Sentence -> a list of corresponding text spans (ex. {“NP”: [phrase 1, ..., phrase n]}). Built from Sentence._chunks.

bag_of_labeled_dependencies_using(form)

Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.

bag_of_unlabeled_dependencies_using(form)

Produces a list of syntactic dependencies where each edge is left unlabeled (its grammatical relation is omitted).
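
A sketch of common Sentence attributes (continuing the Document example above):

    s = doc.sentences[0]

    s.length          # number of tokens
    s.words           # tokens for this sentence
    s.startOffsets    # character offset where each token begins
    s.dependencies    # the preferred DirectedGraph for this Sentence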

Edge

class processors.ds.Edge(source, destination, relation)[source]

Bases: processors.ds.NLPDatum

DirectedGraph

class processors.ds.DirectedGraph(kind, deps, words)[source]

Bases: processors.ds.NLPDatum

Storage class for directed graphs.

Parameters:
  • kind (str) – The name of the directed graph.
  • deps (dict) – A dictionary of {edges: [{source, destination, relation}], roots: [int]}
  • words ([str]) – A list of the word form of the tokens from the originating Sentence.
_words

[str] – A list of the word form of the tokens from the originating Sentence.

roots

[int] – A list of indices for the syntactic dependency graph’s roots. Generally this is a single token index.

edges

list[processors.ds.Edge] – A list of processors.ds.Edge objects.

incoming

dict – A dictionary of {int -> [int]} encoding the incoming edges for each node in the graph.

outgoing

dict – A dictionary of {int -> [int]} encoding the outgoing edges for each node in the graph.

labeled

[str] – A list of strings where each element in the list represents an edge encoded as source index, relation, and destination index (“source_relation_destination”).

unlabeled

[str] – A list of strings where each element in the list represents an edge encoded as source index and destination index (“source_destination”).

graph

networkx.Graph – A networkx.Graph representation of the DirectedGraph. Used by shortest_path.

bag_of_labeled_dependencies_from_tokens(form)

Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.

bag_of_unlabeled_dependencies_from_tokens(form)

Produces a list of syntactic dependencies where each edge is left unlabeled (its grammatical relation is omitted).
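
A sketch of inspecting a Sentence's dependency graph:

    dg = doc.sentences[0].dependencies

    dg.roots      # e.g. [2] (index of the root token)
    dg.labeled    # edges encoded as "source_relation_destination"
    dg.unlabeled  # edges encoded as "source_destination"
    dg.outgoing   # {node index -> [node indices reachable via outgoing edges]}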

Mention

class processors.odin.Mention(token_interval, sentence, document, foundBy, label, labels=None, trigger=None, arguments=None, paths=None, keep=True, doc_id=None)[source]

Bases: processors.ds.NLPDatum

A labeled span of text. Used to model textual mentions of events, relations, and entities.

Parameters:
  • token_interval (Interval) – The span of the Mention represented as an Interval.
  • sentence (int) – The sentence index that contains the Mention.
  • document (Document) – The Document in which the Mention was found.
  • foundBy (str) – The Odin IE rule that produced this Mention.
  • label (str) – The label most closely associated with this span. Usually the lowest hyponym of “labels”.
  • labels (list) – The list of labels associated with this span.
  • trigger (dict or None) – dict of JSON for Mention’s trigger (event predicate or word(s) signaling the Mention).
  • arguments (dict or None) – dict of JSON for Mention’s arguments.
  • paths (dict or None) – dict of JSON encoding the syntactic paths linking a Mention’s arguments to its trigger (applies to Mentions produced from type:”dependency” rules).
  • doc_id (str or None) – The ID of the Document.
tokenInterval

processors.ds.Interval – An Interval encoding the start and end of the Mention.

start

int – The token index that starts the Mention.

end

int – The token index that marks the end of the Mention (exclusive).

sentenceObj

processors.ds.Sentence – Pointer to the Sentence instance containing the Mention.

characterStartOffset

int – The index of the character that starts the Mention.

characterEndOffset

int – The index of the character that ends the Mention.

type

Mention.TBM or Mention.EM or Mention.RM – The type of the Mention.

See also

[Odin manual](https://arxiv.org/abs/1509.07513)

matches(label_pattern)

Test if the provided pattern, label_pattern, matches any element in Mention.labels.

overlaps(other)

Test whether other (a token index or Mention) overlaps with the span of this Mention.

copy(**kwargs)

Copy constructor for this Mention.

words()

Words for this Mention’s span.

tags()

Part of speech (PoS) tags for this Mention’s span.

lemmas()

Lemmas for this Mention’s span.

_chunks()

Chunk labels for this Mention’s span.

_entities()

NE labels for this Mention’s span.
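
A sketch of filtering and inspecting mentions (continuing the Odin example earlier in this section):

    for m in mentions:
        # matches() tests label_pattern against each entry in m.labels.
        if m.matches("Person"):
            print(m.label, m.words(), m.start, m.end)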

JSON serialization/deserialization is handled via processors.serialization.JSONSerializer.

Interval

class processors.ds.Interval(start, end)[source]

Bases: processors.ds.NLPDatum

Defines a token or character span.

Parameters:
  • start (int) – The token or character index where the interval begins.
  • end (int) – 1 + the index of the last token/character in the span (i.e., the end is exclusive).
contains(that)

Test whether this Interval contains that (an int or another Interval).

overlaps(that)

Test whether that (an int or Interval) overlaps with the span of this Interval. Equivalent Intervals overlap.
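
A short sketch of these semantics (end indices are exclusive):

    from processors.ds import Interval

    a = Interval(0, 3)  # covers indices 0, 1, 2
    b = Interval(2, 5)  # covers indices 2, 3, 4

    a.overlaps(b)               # True: both cover index 2
    a.contains(Interval(1, 3))  # True: [1, 3) lies within [0, 3)
    a.contains(b)               # False: b extends beyond a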

Annotators (Processors)

Text annotation is performed by communicating with one of the following annotators (“processors”).

CluProcessor

class processors.annotators.CluProcessor(address)[source]

Bases: processors.annotators.Processor

Processor for text annotation based on [org.clulab.processors.clu.CluProcessor](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/CluProcessor.scala)

Uses the Malt parser.

FastNLPProcessor

class processors.annotators.FastNLPProcessor(address)[source]

Bases: processors.annotators.Processor

Processor for text annotation based on [org.clulab.processors.fastnlp.FastNLPProcessor](https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/fastnlp/FastNLPProcessor.scala)

Uses the Stanford CoreNLP neural network parser.

BioNLPProcessor

class processors.annotators.BioNLPProcessor(address)[source]

Bases: processors.annotators.Processor

Processor for biomedical text annotation based on [org.clulab.processors.bionlp.BioNLPProcessor](https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/bionlp/BioNLPProcessor.scala)

CoreNLP-derived annotator.

Sentiment Analysis

SentimentAnalyzer

class processors.sentiment.SentimentAnalyzer(address)[source]

Bases: object

CoreNLPSentimentAnalyzer

class processors.sentiment.CoreNLPSentimentAnalyzer(address)[source]

Bases: processors.sentiment.SentimentAnalyzer

Bridge to [CoreNLP‘s tree-based sentiment analysis system](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf)

Paths

DependencyUtils

class processors.paths.DependencyUtils[source]

Bases: object

A set of utilities for analyzing syntactic dependency graphs.

build_networkx_graph(roots, edges, name)

Constructs a networkx.Graph.

shortest_path(g, start, end)

Finds the shortest path in a networkx.Graph between any element in a list of start nodes and any element in a list of end nodes.

retrieve_edges(dep_graph, path)

Converts output of shortest_path into a list of triples that include the grammatical relation (and direction) for each node-node “hop” in the syntactic dependency graph.

simplify_tag(tag)

Maps part of speech (PoS) tag to a subset of PoS tags to better consolidate categorical labels.

lexicalize_path(sentence, path, words=False, lemmas=False, tags=False, simple_tags=False, entities=False, limit_to=None)

Lexicalizes path in syntactic dependency graph using Odin-style token constraints.

pagerank(networkx_graph, alpha=0.85, personalization=None, max_iter=1000, tol=1e-06, nstart=None, weight='weight', dangling=None)

Measures node activity in a networkx.Graph using a thin wrapper around the networkx implementation of the PageRank algorithm (see networkx.algorithms.link_analysis.pagerank). Use with processors.ds.DirectedGraph.graph.
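
A sketch combining these utilities (node indices are illustrative; the methods are assumed to be static, as the signatures above suggest):

    from processors.paths import DependencyUtils

    dg = doc.sentences[0].dependencies

    # Shortest path between token 0 and token 3 in the dependency graph.
    path = DependencyUtils.shortest_path(dg.graph, start=[0], end=[3])

    # Grammatical relation (and direction) for each hop along the path.
    triples = DependencyUtils.retrieve_edges(dg, path)

    # Rank token nodes by connectivity.
    ranks = DependencyUtils.pagerank(dg.graph)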

HeadFinder

class processors.paths.HeadFinder[source]

Bases: object

Serialization

JSONSerializer

class processors.serialization.JSONSerializer[source]

Bases: object

Utilities for serialization/deserialization of data structures.
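
In practice, serialization is usually accessed through NLPDatum subclasses; a sketch (the to_JSON and load_from_JSON names follow py-processors conventions and should be verified against your version):

    import json

    from processors.ds import Document

    json_str = doc.to_JSON()                              # Document -> JSON string
    doc2 = Document.load_from_JSON(json.loads(json_str))  # JSON dict -> Document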

Visualization

JupyterVisualizer

class processors.Visualization.JupyterVisualizer[source]