API Reference

This section of the documentation provides detailed information on functions, classes, and methods.

Server Communication

Communicating with the NLP server (processors-server) is handled by the following classes.

ProcessorsBaseAPI

class processors.api.ProcessorsBaseAPI(**kwargs)[source]

Bases: object

Manages a connection with processors-server and provides an interface to the API.

Parameters:
  • port (int) – The port the server is running on or should be started on. Default is 8886.
  • hostname (str) – The host name to use for the server. Default is “localhost”.
  • log_file (str) – The path for the log file. Default is py-processors.log in the user’s home directory.
annotate(text)

Produces a Document from the provided text using the default processor.

clu.annotate(text)

Produces a Document from the provided text using CluProcessor.

fastnlp.annotate(text)

Produces a Document from the provided text using FastNLPProcessor.

bionlp.annotate(text)

Produces a Document from the provided text using BioNLPProcessor.

annotate_from_sentences(sentences)

Produces a Document from sentences (a list of text split into sentences). Uses the default processor.

fastnlp.annotate_from_sentences(sentences)

Produces a Document from sentences (a list of text split into sentences). Uses FastNLPProcessor.

bionlp.annotate_from_sentences(sentences)

Produces a Document from sentences (a list of text split into sentences). Uses BioNLPProcessor.

corenlp.sentiment.score_sentence(sentence)

Produces a sentiment score for the provided sentence (an instance of Sentence).

corenlp.sentiment.score_document(doc)

Produces sentiment scores for the provided doc (an instance of Document). One score is produced for each sentence.

corenlp.sentiment.score_segmented_text(sentences)

Produces sentiment scores for the provided sentences (a list of text segmented into sentences). One score is produced for each item in sentences.

odin.extract_from_text(text, rules)

Produces a list of Mentions for matches of the provided rules on the text. rules can be a string of Odin rules, or a url ending in .yml or .yaml.

odin.extract_from_document(doc, rules)

Produces a list of Mentions for matches of the provided rules on the doc (an instance of Document). rules can be a string of Odin rules, or a url ending in .yml or .yaml.
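For illustration, a rules string might look like the following YAML (the rule name, label, and pattern here are hypothetical; see the Odin manual for the full rule syntax):

```yaml
# A hypothetical Odin rule: match one or more contiguous tokens
# tagged with the PERSON named-entity label.
- name: example-person-rule
  label: Person
  type: token
  pattern: |
    [entity="PERSON"]+
```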

ProcessorsAPI

class processors.api.ProcessorsAPI(**kwargs)[source]

Bases: processors.api.ProcessorsBaseAPI

Manages a connection with the processors-server jar and provides an interface to the API.

Parameters:
  • timeout (int) – The number of seconds to wait for the server to initialize. Default is 120.
  • jvm_mem (str) – The maximum amount of memory to allocate to the JVM for the server. Default is “-Xmx3G”.
  • jar_path (str) – The path to the processors-server jar. Default is the jar installed with the package.
  • keep_alive (bool) – Whether or not to keep the server running when the ProcessorsAPI instance goes out of scope. Default is False (the server is shut down).
  • log_file (str) – The path for the log file. Default is py-processors.log in the user’s home directory.
start_server(jar_path, **kwargs)

Starts the server using the provided jar_path. Optionally takes hostname, port, jvm_mem, and timeout.

stop_server()

Attempts to stop the server running at self.address.

OdinAPI

class processors.api.OdinAPI(address)[source]

Bases: object

API for performing rule-based information extraction with Odin.

Parameters: address (str) – The base address for the API (i.e., everything preceding /api/..)

SentimentAnalysisAPI

class processors.sentiment.SentimentAnalysisAPI(address)[source]

Bases: object

API for performing sentiment analysis.

Parameters: address (str) – The base address for the API (i.e., everything preceding /api/..)
corenlp

processors.sentiment.CoreNLPSentimentAnalyzer – Service using [CoreNLP‘s tree-based system](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) for performing sentiment analysis.

Data Structures

Document

class processors.ds.Document(sentences)[source]

Bases: object

Storage class for annotated text. Based on [org.clulab.processors.Document](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/Document.scala)

Parameters: sentences ([processors.ds.Sentence]) – The sentences comprising the Document.
id

str or None – A unique ID for the Document.

size

int – The number of sentences.

sentences

[processors.ds.Sentence] – The sentences comprising the Document.

words

[str] – A list of the Document‘s tokens.

tags

[str] – A list of the Document‘s tokens represented using part of speech (PoS) tags.

lemmas

[str] – A list of the Document‘s tokens represented using lemmas.

_entities

[str] – A list of the Document‘s tokens represented using IOB-style named entity (NE) labels.

nes

dict – A dictionary of NE labels represented in the Document -> a list of corresponding text spans.

bag_of_labeled_deps

[str] – The labeled dependencies from all sentences in the Document.

bag_of_unlabeled_deps

[str] – The unlabeled dependencies from all sentences in the Document.

text

str or None – The original text of the Document.

bag_of_labeled_dependencies_using(form)

Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.

bag_of_unlabeled_dependencies_using(form)

Produces a list of syntactic dependencies where each edge is left unlabeled (its grammatical relation is omitted).

Sentence

class processors.ds.Sentence(**kwargs)[source]

Bases: object

Storage class for an annotated sentence. Based on [org.clulab.processors.Sentence](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/Sentence.scala)

Parameters:
  • text (str or None) – The text of the Sentence.
  • words ([str]) – A list of the Sentence‘s tokens.
  • startOffsets ([int]) – The character offsets starting each token (inclusive).
  • endOffsets ([int]) – The character offsets marking the end of each token (exclusive).
  • tags ([str]) – A list of the Sentence‘s tokens represented using part of speech (PoS) tags.
  • lemmas ([str]) – A list of the Sentence‘s tokens represented using lemmas.
  • chunks ([str]) – A list of the Sentence‘s tokens represented using IOB-style phrase labels (ex. B-NP, I-NP, B-VP, etc.).
  • entities ([str]) – A list of the Sentence‘s tokens represented using IOB-style named entity (NE) labels.
  • graphs (dict) – A dictionary of {graph-name -> {edges: [{source, destination, relation}], roots: [int]}}
text

str – The text of the Sentence.

startOffsets

[int] – The character offsets starting each token (inclusive).

endOffsets

[int] – The character offsets marking the end of each token (exclusive).

length

int – The number of tokens in the Sentence.

graphs

dict – A dictionary (str -> processors.ds.DirectedGraph) mapping the graph type/name to a processors.ds.DirectedGraph.

basic_dependencies

processors.ds.DirectedGraph – A processors.ds.DirectedGraph using basic Stanford dependencies.

collapsed_dependencies

processors.ds.DirectedGraph – A processors.ds.DirectedGraph using collapsed Stanford dependencies.

dependencies

processors.ds.DirectedGraph – A pointer to the preferred syntactic dependency graph type for this Sentence.

_entities

[str] – The IOB-style Named Entity (NE) labels corresponding to each token.

_chunks

[str] – The IOB-style chunk labels corresponding to each token.

nes

dict – A dictionary of NE labels represented in the Sentence -> a list of corresponding text spans (ex. {“PERSON”: [phrase 1, ..., phrase n]}). Built from Sentence._entities.

phrases

dict – A dictionary of chunk labels represented in the Sentence -> a list of corresponding text spans (ex. {“NP”: [phrase 1, ..., phrase n]}). Built from Sentence._chunks.
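The grouping of IOB-style labels into the label -> spans dictionaries described above can be sketched as follows (an illustrative reimplementation, not the library's own code; the function name is hypothetical):

```python
def spans_from_iob(words, labels):
    """Group IOB-style labels (e.g. B-PERSON, I-PERSON, O) into a dict of
    label -> list of text spans, mirroring the documented behavior of
    Sentence.nes and Sentence.phrases."""
    spans = {}
    current_label, current_words = None, []
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # a new span begins; flush any span in progress
            if current_label:
                spans.setdefault(current_label, []).append(" ".join(current_words))
            current_label, current_words = label[2:], [word]
        elif label.startswith("I-") and current_label == label[2:]:
            # the current span continues
            current_words.append(word)
        else:
            # outside any span (O tag or inconsistent I- tag)
            if current_label:
                spans.setdefault(current_label, []).append(" ".join(current_words))
            current_label, current_words = None, []
    if current_label:
        spans.setdefault(current_label, []).append(" ".join(current_words))
    return spans
```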

bag_of_labeled_dependencies_using(form)

Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.

bag_of_unlabeled_dependencies_using(form)

Produces a list of syntactic dependencies where each edge is left unlabeled (its grammatical relation is omitted).

DirectedGraph

class processors.ds.DirectedGraph(kind, deps, words)[source]

Bases: object

Storage class for directed graphs.

Parameters:
  • kind (str) – The name of the directed graph.
  • deps (dict) – A dictionary of {edges: [{source, destination, relation}], roots: [int]}
  • words ([str]) – A list of the word form of the tokens from the originating Sentence.
_words

[str] – A list of the word form of the tokens from the originating Sentence.

roots

[int] – A list of indices for the syntactic dependency graph’s roots. Generally this is a single token index.

edges

list[processors.ds.Edge] – A list of processors.ds.Edge

incoming

dict – A dictionary of {int -> [int]} encoding the incoming edges for each node in the graph.

outgoing

dict – A dictionary of {int -> [int]} encoding the outgoing edges for each node in the graph.

labeled

[str] – A list of strings where each element in the list represents an edge encoded as source index, relation, and destination index (“source_relation_destination”).

unlabeled

[str] – A list of strings where each element in the list represents an edge encoded as source index and destination index (“source_destination”).

graph

networkx.Graph – A networkx.Graph representation of the DirectedGraph. Used by shortest_path.

bag_of_labeled_dependencies_from_tokens(form)

Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.

bag_of_unlabeled_dependencies_from_tokens(form)

Produces a list of syntactic dependencies where each edge is left unlabeled (its grammatical relation is omitted).
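The labeled/unlabeled edge encodings and the incoming/outgoing dictionaries described above can be sketched from the deps parameter as follows (the function name is hypothetical; this is an illustration of the documented formats, not the library's implementation):

```python
def encode_edges(deps):
    """Build the documented DirectedGraph encodings from a deps dict of the
    form {edges: [{source, destination, relation}], roots: [int]}."""
    # "source_relation_destination" strings for the labeled encoding
    labeled = ["{}_{}_{}".format(e["source"], e["relation"], e["destination"])
               for e in deps["edges"]]
    # "source_destination" strings for the unlabeled encoding
    unlabeled = ["{}_{}".format(e["source"], e["destination"])
                 for e in deps["edges"]]
    # node index -> list of neighbor indices
    incoming, outgoing = {}, {}
    for e in deps["edges"]:
        outgoing.setdefault(e["source"], []).append(e["destination"])
        incoming.setdefault(e["destination"], []).append(e["source"])
    return labeled, unlabeled, incoming, outgoing
```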

Mention

class processors.odin.Mention(token_interval, sentence, document, foundBy, label, labels=None, trigger=None, arguments=None, paths=None, keep=True, doc_id=None)[source]

Bases: object

A labeled span of text. Used to model textual mentions of events, relations, and entities.

Parameters:
  • token_interval (Interval) – The span of the Mention represented as an Interval.
  • sentence (int) – The sentence index that contains the Mention.
  • document (Document) – The Document in which the Mention was found.
  • foundBy (str) – The Odin IE rule that produced this Mention.
  • label (str) – The label most closely associated with this span. Usually the lowest hyponym of “labels”.
  • labels (list) – The list of labels associated with this span.
  • trigger (dict or None) – dict of JSON for Mention’s trigger (event predicate or word(s) signaling the Mention).
  • arguments (dict or None) – dict of JSON for Mention’s arguments.
  • paths (dict or None) – dict of JSON encoding the syntactic paths linking a Mention’s arguments to its trigger (applies to Mentions produced from type:”dependency” rules).
  • doc_id (str or None) – The ID of the document.
tokenInterval

processors.ds.Interval – An Interval encoding the start and end of the Mention.

start

int – The token index that starts the Mention.

end

int – The token index that marks the end of the Mention (exclusive).

sentenceObj

processors.ds.Sentence – Pointer to the Sentence instance containing the Mention.

characterStartOffset

int – The index of the character that starts the Mention.

characterEndOffset

int – The index of the character that ends the Mention.

type

Mention.TBM or Mention.EM or Mention.RM – The type of the Mention.

See also

[Odin manual](https://arxiv.org/abs/1509.07513)

matches(label_pattern)

Tests whether the provided pattern, label_pattern, matches any element in Mention.labels.
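The idea behind matches can be sketched as follows (a standalone illustration assuming label_pattern is a string or compiled regex; the exact matching semantics belong to the library):

```python
import re

def matches(labels, label_pattern):
    """Return True if label_pattern matches any element of labels,
    mirroring the documented behavior of Mention.matches."""
    pattern = (re.compile(label_pattern)
               if isinstance(label_pattern, str) else label_pattern)
    return any(pattern.match(label) is not None for label in labels)
```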

JSON serialization/deserialization is handled via processors.serialization.JSONSerializer.

Interval

class processors.ds.Interval(start, end)[source]

Bases: object

Defines a token or character span.

Parameters:
  • start (int) – The token or character index where the interval begins.
  • end (int) – 1 + the index of the last token/character in the span (i.e., the end is exclusive).
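The half-open semantics (inclusive start, exclusive end) can be sketched with a minimal stand-in class (the size helper is hypothetical, not part of the library API):

```python
class Interval:
    """Minimal sketch of a half-open span: start is inclusive and
    end is exclusive (end = 1 + index of the last token/character)."""

    def __init__(self, start, end):
        if start >= end:
            raise ValueError("start must precede end")
        self.start = start
        self.end = end

    def size(self):
        # number of tokens/characters covered by the span
        return self.end - self.start
```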

DependencyUtils

class processors.paths.DependencyUtils[source]

Bases: object

A set of utilities for analyzing syntactic dependency graphs.

build_networkx_graph(roots, edges, name)

Constructs a networkx.Graph.

shortest_path(g, start, end)

Finds the shortest path in a networkx.Graph between any element in a list of start nodes and any element in a list of end nodes.
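The idea behind shortest_path can be sketched with a plain breadth-first search (the library itself delegates to networkx; the function name and edge representation here are illustrative):

```python
from collections import deque

def bfs_shortest_path(edges, start, end):
    """Find a shortest path between start and end over a graph given as
    (source, destination) pairs, treated as undirected."""
    adjacency = {}
    for source, destination in edges:
        adjacency.setdefault(source, set()).add(destination)
        adjacency.setdefault(destination, set()).add(source)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for neighbor in adjacency.get(path[-1], ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path between start and end
```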

retrieve_edges(dep_graph, path)

Converts output of shortest_path into a list of triples that include the grammatical relation (and direction) for each node-node “hop” in the syntactic dependency graph.

simplify_tag(tag)

Maps a part of speech (PoS) tag to a subset of PoS tags to better consolidate categorical labels.

lexicalize_path(sentence, path, words=False, lemmas=False, tags=False, simple_tags=False, entities=False)

Lexicalizes path in syntactic dependency graph using Odin-style token constraints.

pagerank(networkx_graph, alpha=0.85, personalization=None, max_iter=1000, tol=1e-06, nstart=None, weight='weight', dangling=None)

Measures node activity in a networkx.Graph using a thin wrapper around the networkx implementation of the PageRank algorithm (see networkx.algorithms.link_analysis.pagerank). Use with processors.ds.DirectedGraph.graph.

Annotators (Processors)

Text annotation is performed by communicating with one of the following annotators (“processors”).

CluProcessor

class processors.annotators.CluProcessor(address)[source]

Bases: processors.annotators.Processor

Processor for text annotation based on [org.clulab.processors.clu.CluProcessor](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/CluProcessor.scala)

Uses the Malt parser.

FastNLPProcessor

class processors.annotators.FastNLPProcessor(address)[source]

Bases: processors.annotators.Processor

Processor for text annotation based on [org.clulab.processors.fastnlp.FastNLPProcessor](https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/fastnlp/FastNLPProcessor.scala)

Uses the Stanford CoreNLP neural network parser.

BioNLPProcessor

class processors.annotators.BioNLPProcessor(address)[source]

Bases: processors.annotators.Processor

Processor for biomedical text annotation based on [org.clulab.processors.bionlp.BioNLPProcessor](https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/bionlp/BioNLPProcessor.scala)

CoreNLP-derived annotator.

Sentiment Analysis

Sentiment analysis is performed via SentimentAnalysisAPI (see above), using the CoreNLP tree-based sentiment model.

Serialization

class processors.serialization.JSONSerializer[source]

Bases: object

Utilities for serialization/deserialization of data structures.