API Reference¶
This section of the documentation provides detailed information on functions, classes, and methods.
Server Communication¶
Communicating with the NLP server (processors-server) is handled by the following classes:
ProcessorsBaseAPI
¶
-
class
processors.api.
ProcessorsBaseAPI
(**kwargs)[source]¶ Bases:
object
Manages a connection with processors-server and provides an interface to the API.
Parameters: - port (int) – The port the server is running on or should be started on. Default is 8886.
- hostname (str) – The host name to use for the server. Default is “localhost”.
- log_file (str) – The path for the log file. Default is py-processors.log in the user’s home directory.
-
annotate
(text)¶ Produces a Document from the provided text using the default processor.
-
clu.
annotate
(text)¶ Produces a Document from the provided text using CluProcessor.
-
fastnlp.
annotate
(text)¶ Produces a Document from the provided text using FastNLPProcessor.
-
bionlp.
annotate
(text)¶ Produces a Document from the provided text using BioNLPProcessor.
-
annotate_from_sentences
(sentences)¶ Produces a Document from sentences (a list of text split into sentences). Uses the default processor.
-
fastnlp.
annotate_from_sentences
(sentences)¶ Produces a Document from sentences (a list of text split into sentences). Uses FastNLPProcessor.
-
bionlp.
annotate_from_sentences
(sentences)¶ Produces a Document from sentences (a list of text split into sentences). Uses BioNLPProcessor.
-
corenlp.sentiment.
score_sentence
(sentence)¶ Produces a sentiment score for the provided sentence (an instance of Sentence).
-
corenlp.sentiment.
score_document
(doc)¶ Produces sentiment scores for the provided doc (an instance of Document). One score is produced for each sentence.
-
corenlp.sentiment.
score_segmented_text
(sentences)¶ Produces sentiment scores for the provided sentences (a list of text segmented into sentences). One score is produced for each item in sentences.
-
odin.
extract_from_text
(text, rules)¶ Produces a list of Mentions for matches of the provided rules on the text. rules can be a string of Odin rules, or a URL ending in .yml or .yaml.
-
odin.
extract_from_document
(doc, rules)¶ Produces a list of Mentions for matches of the provided rules on the doc (an instance of Document). rules can be a string of Odin rules, or a URL ending in .yml or .yaml.
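The call pattern described above (annotate text, then apply Odin rules to the resulting Document) can be sketched as follows. This is a minimal illustration, not the library's implementation: StubAPI and StubOdin are hypothetical offline stand-ins that only mirror the documented method names, and the rules URL is a placeholder.

```python
class StubOdin:
    """Hypothetical offline stand-in for the `odin` namespace."""

    def extract_from_document(self, doc, rules):
        # A real server would return a list of Mention objects.
        return []


class StubAPI:
    """Hypothetical stand-in exposing the method names documented above."""

    def __init__(self):
        self.odin = StubOdin()

    def annotate(self, text):
        # A real server would return a Document; a dict keeps the sketch runnable.
        return {"text": text}


def annotate_and_extract(api, text, rules):
    """Annotate `text` with the default processor, then apply Odin `rules`."""
    doc = api.annotate(text)
    mentions = api.odin.extract_from_document(doc, rules)
    return doc, mentions


# Placeholder URL; `rules` may also be a string of Odin rules.
RULES = "https://example.org/grammar.yml"
doc, mentions = annotate_and_extract(StubAPI(), "Gonzo married Camilla.", RULES)
```

With a running processors-server, the stub would presumably be replaced by an instance such as ProcessorsBaseAPI(port=8886, hostname="localhost"), using the parameters listed above.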
ProcessorsAPI
¶
-
class
processors.api.
ProcessorsAPI
(**kwargs)[source]¶ Bases:
processors.api.ProcessorsBaseAPI
Manages a connection with the processors-server jar and provides an interface to the API.
Parameters: - timeout (int) – The number of seconds to wait for the server to initialize. Default is 120.
- jvm_mem (str) – The maximum amount of memory to allocate to the JVM for the server. Default is “-Xmx3G”.
- jar_path (str) – The path to the processors-server jar. Default is the jar installed with the package.
- keep_alive (bool) – Whether or not to keep the server running when the ProcessorsAPI instance goes out of scope. Default is False (server is shut down).
- log_file (str) – The path for the log file. Default is py-processors.log in the user’s home directory.
-
start_server
(jar_path, **kwargs)¶ Starts the server using the provided jar_path. Optionally takes hostname, port, jvm_mem, and timeout.
-
stop_server
()¶ Attempts to stop the server running at self.address.
OdinAPI
¶
SentimentAnalysisAPI
¶
-
class
processors.sentiment.
SentimentAnalysisAPI
(address)[source]¶ Bases:
object
API for performing sentiment analysis
Parameters: address (str) – The base address for the API (i.e., everything preceding /api/..)
-
corenlp
¶ processors.sentiment.CoreNLPSentimentAnalyzer – Service using [CoreNLP’s tree-based system](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) for performing sentiment analysis.
Data Structures¶
Document
¶
-
class
processors.ds.
Document
(sentences)[source]¶ Bases:
processors.ds.NLPDatum
Storage class for annotated text. Based on [org.clulab.processors.Document](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/Document.scala)
Parameters: sentences ([processors.ds.Sentence]) – The sentences comprising the Document.
-
id
¶ str or None – A unique ID for the Document.
-
size
¶ int – The number of sentences.
-
sentences
¶ [processors.ds.Sentence] – The sentences comprising the Document.
-
words
¶ [str] – A list of the Document’s tokens.
-
tags
¶ [str] – A list of the Document’s tokens represented using part of speech (PoS) tags.
-
lemmas
¶ [str] – A list of the Document’s tokens represented using lemmas.
-
_entities
¶ [str] – A list of the Document’s tokens represented using IOB-style named entity (NE) labels.
-
nes
¶ dict – A dictionary of NE labels represented in the Document -> a list of corresponding text spans.
-
bag_of_labeled_deps
¶ [str] – The labeled dependencies from all sentences in the Document.
-
bag_of_unlabeled_deps
¶ [str] – The unlabeled dependencies from all sentences in the Document.
-
text
¶ str or None – The original text of the Document.
-
bag_of_labeled_dependencies_using
(form)¶ Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.
-
bag_of_unlabeled_dependencies_using
(form)¶ Produces a list of syntactic dependencies where each edge is left unlabeled (the grammatical relation is omitted).
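The relationship between IOB-style NE labels (_entities) and the label-to-spans dictionary (nes) can be sketched as below. This is an illustrative sketch only: it assumes labels carry B-/I- prefixes with "O" for tokens outside any entity, and that spans are tokens joined by a single space. group_iob_spans is a hypothetical helper, not part of the library.

```python
from collections import defaultdict


def group_iob_spans(words, labels):
    """Group tokens into {label: [span text, ...]} from IOB-style labels."""
    spans = defaultdict(list)
    current_label, current_tokens = None, []

    def flush():
        # Store the span collected so far (if any) and reset the state.
        nonlocal current_label, current_tokens
        if current_label is not None:
            spans[current_label].append(" ".join(current_tokens))
        current_label, current_tokens = None, []

    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new span.
            flush()
            current_label, current_tokens = label[2:], [word]
        elif label.startswith("I-") and current_label == label[2:]:
            # An I- tag continues the open span of the same label.
            current_tokens.append(word)
        else:
            # "O" (or an I- tag with no matching open span) closes the span.
            flush()
    flush()
    return dict(spans)


words = ["Barack", "Obama", "visited", "Paris", "."]
labels = ["B-PERSON", "I-PERSON", "O", "B-LOCATION", "O"]
nes = group_iob_spans(words, labels)
# nes == {"PERSON": ["Barack Obama"], "LOCATION": ["Paris"]}
```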
Sentence
¶
-
class
processors.ds.
Sentence
(**kwargs)[source]¶ Bases:
processors.ds.NLPDatum
Storage class for an annotated sentence. Based on [org.clulab.processors.Sentence](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/Sentence.scala)
Parameters: - text (str or None) – The text of the Sentence.
- words ([str]) – A list of the Sentence’s tokens.
- startOffsets ([int]) – The character offsets starting each token (inclusive).
- endOffsets ([int]) – The character offsets marking the end of each token (exclusive).
- tags ([str]) – A list of the Sentence’s tokens represented using part of speech (PoS) tags.
- lemmas ([str]) – A list of the Sentence’s tokens represented using lemmas.
- chunks ([str]) – A list of the Sentence’s tokens represented using IOB-style phrase labels (ex. B-NP, I-NP, B-VP, etc.).
- entities ([str]) – A list of the Sentence’s tokens represented using IOB-style named entity (NE) labels.
- graphs (dict) – A dictionary of {graph-name -> {edges: [{source, destination, relation}], roots: [int]}}
-
text
¶ str – The text of the Sentence.
-
startOffsets
¶ [int] – The character offsets starting each token (inclusive).
-
endOffsets
¶ [int] – The character offsets marking the end of each token (exclusive).
-
length
¶ int – The number of tokens in the Sentence.
-
graphs
¶ dict – A dictionary (str -> processors.ds.DirectedGraph) mapping the graph type/name to a processors.ds.DirectedGraph.
-
basic_dependencies
¶ processors.ds.DirectedGraph – A processors.ds.DirectedGraph using basic Stanford dependencies.
-
collapsed_dependencies
¶ processors.ds.DirectedGraph – A processors.ds.DirectedGraph using collapsed Stanford dependencies.
-
dependencies
¶ processors.ds.DirectedGraph – A pointer to the preferred syntactic dependency graph type for this Sentence.
-
_entities
¶ [str] – The IOB-style Named Entity (NE) labels corresponding to each token.
-
_chunks
¶ [str] – The IOB-style chunk labels corresponding to each token.
-
nes
¶ dict – A dictionary of NE labels represented in the Sentence -> a list of corresponding text spans (ex. {“PERSON”: [phrase 1, ..., phrase n]}). Built from Sentence._entities.
-
phrases
¶ dict – A dictionary of chunk labels represented in the Sentence -> a list of corresponding text spans (ex. {“NP”: [phrase 1, ..., phrase n]}). Built from Sentence._chunks.
-
bag_of_labeled_dependencies_using
(form)¶ Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.
-
bag_of_unlabeled_dependencies_using
(form)¶ Produces a list of syntactic dependencies where each edge is left unlabeled (the grammatical relation is omitted).
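Because startOffsets are inclusive and endOffsets are exclusive, each pair maps directly onto a Python slice of the sentence text. The sketch below illustrates that convention with plain lists; the example text and offsets are made up, and tokens_from_offsets is a hypothetical helper.

```python
def tokens_from_offsets(text, start_offsets, end_offsets):
    """Recover token strings from inclusive start / exclusive end offsets."""
    return [text[start:end] for start, end in zip(start_offsets, end_offsets)]


text = "Dogs bark."
start_offsets = [0, 5, 9]   # first character of each token (inclusive)
end_offsets = [4, 9, 10]    # one past the last character (exclusive)
tokens = tokens_from_offsets(text, start_offsets, end_offsets)
# tokens == ["Dogs", "bark", "."]
```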
Edge
¶
-
class
processors.ds.
Edge
(source, destination, relation)[source]¶ Bases:
processors.ds.NLPDatum
DirectedGraph
¶
-
class
processors.ds.
DirectedGraph
(kind, deps, words)[source]¶ Bases:
processors.ds.NLPDatum
Storage class for directed graphs.
Parameters: - kind (str) – The name of the directed graph.
- deps (dict) – A dictionary of {edges: [{source, destination, relation}], roots: [int]}
- words ([str]) – A list of the word form of the tokens from the originating Sentence.
-
_words
¶ [str] – A list of the word form of the tokens from the originating Sentence.
-
roots
¶ [int] – A list of indices for the syntactic dependency graph’s roots. Generally this is a single token index.
-
edges
¶ list[processors.ds.Edge] – A list of processors.ds.Edge
-
incoming
¶ A dictionary of {int -> [int]} encoding the incoming edges for each node in the graph.
-
outgoing
¶ A dictionary of {int -> [int]} encoding the outgoing edges for each node in the graph.
-
labeled
¶ [str] – A list of strings where each element in the list represents an edge encoded as source index, relation, and destination index (“source_relation_destination”).
-
unlabeled
¶ [str] – A list of strings where each element in the list represents an edge encoded as source index and destination index (“source_destination”).
-
graph
¶ networkx.Graph – A networkx.Graph representation of the DirectedGraph. Used by shortest_path.
-
bag_of_labeled_dependencies_from_tokens
(form)¶ Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.
-
bag_of_unlabeled_dependencies_from_tokens
(form)¶ Produces a list of syntactic dependencies where each edge is left unlabeled (the grammatical relation is omitted).
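The attributes above can all be derived from the deps dictionary. The following sketch shows one way to build the incoming and outgoing maps and the “source_relation_destination” and “source_destination” strings from a deps dictionary in the documented shape. index_edges is a hypothetical helper and the example edges are made up; the library's own construction may differ.

```python
from collections import defaultdict


def index_edges(deps):
    """Build incoming/outgoing maps and edge strings from a deps dictionary."""
    incoming = defaultdict(list)
    outgoing = defaultdict(list)
    labeled, unlabeled = [], []
    for edge in deps["edges"]:
        src, dst, rel = edge["source"], edge["destination"], edge["relation"]
        outgoing[src].append(dst)   # src -> dst
        incoming[dst].append(src)   # dst <- src
        labeled.append("{}_{}_{}".format(src, rel, dst))
        unlabeled.append("{}_{}".format(src, dst))
    return dict(incoming), dict(outgoing), labeled, unlabeled


# Toy graph for "Dogs bark ." with token 1 ("bark") as the root.
deps = {
    "edges": [
        {"source": 1, "destination": 0, "relation": "nsubj"},
        {"source": 1, "destination": 2, "relation": "punct"},
    ],
    "roots": [1],
}
incoming, outgoing, labeled, unlabeled = index_edges(deps)
```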
Mention
¶
-
class
processors.odin.
Mention
(token_interval, sentence, document, foundBy, label, labels=None, trigger=None, arguments=None, paths=None, keep=True, doc_id=None)[source]¶ Bases:
processors.ds.NLPDatum
A labeled span of text. Used to model textual mentions of events, relations, and entities.
Parameters: - token_interval (Interval) – The span of the Mention represented as an Interval.
- sentence (int) – The sentence index that contains the Mention.
- document (Document) – The Document in which the Mention was found.
- foundBy (str) – The Odin IE rule that produced this Mention.
- label (str) – The label most closely associated with this span. Usually the lowest hyponym of “labels”.
- labels (list) – The list of labels associated with this span.
- trigger (dict or None) – dict of JSON for Mention’s trigger (event predicate or word(s) signaling the Mention).
- arguments (dict or None) – dict of JSON for Mention’s arguments.
- paths (dict or None) – dict of JSON encoding the syntactic paths linking a Mention’s arguments to its trigger (applies to Mentions produced from type: “dependency” rules).
- doc_id (str or None) – the id of the document
-
tokenInterval
¶ processors.ds.Interval – An Interval encoding the start and end of the Mention.
-
start
¶ int – The token index that starts the Mention.
-
end
¶ int – The token index that marks the end of the Mention (exclusive).
-
sentenceObj
¶ processors.ds.Sentence – Pointer to the Sentence instance containing the Mention.
-
characterStartOffset
¶ int – The index of the character that starts the Mention.
-
characterEndOffset
¶ int – The index of the character that ends the Mention.
-
type
¶ Mention.TBM or Mention.EM or Mention.RM – The type of the Mention.
See also
[Odin manual](https://arxiv.org/abs/1509.07513)
-
matches
(label_pattern)¶ Test if the provided pattern, label_pattern, matches any element in Mention.labels.
-
overlaps
(other)¶ Test whether other (token index or Mention) overlaps with the span of this Mention.
-
copy
(**kwargs)¶ Copy constructor for this Mention.
-
words
()¶ Words for this Mention’s span.
-
tags
()¶ Part of speech for this Mention’s span.
-
lemmas
()¶ Lemmas for this Mention’s span.
-
_chunks
()¶ Chunk labels for this Mention’s span.
-
_entities
()¶ NE labels for this Mention’s span.
JSON serialization/deserialization is handled via processors.serialization.JSONSerializer.
Interval
¶
-
class
processors.ds.
Interval
(start, end)[source]¶ Bases:
processors.ds.NLPDatum
Defines a token or character span
Parameters: - start (int) – The token or character index where the interval begins.
- end (int) – One plus the index of the last token/character in the span (i.e., the end is exclusive).
-
contains
(that)¶ Test whether this Interval contains another.
-
overlaps
(that)¶ Test whether that (int or Interval) overlaps with the span of this Interval. Equivalent Intervals will overlap.
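The difference between containment and overlap on half-open spans can be sketched as below. This is a minimal sketch, assuming integer indices and an exclusive end. SimpleInterval is a hypothetical stand-in rather than the library's class, and accepting an int in both methods is a choice made here for illustration.

```python
class SimpleInterval:
    """Half-open span [start, end) over token or character indices."""

    def __init__(self, start, end):
        self.start = start
        self.end = end  # exclusive: one past the last index in the span

    def contains(self, that):
        """True if `that` (int or SimpleInterval) lies entirely inside this span."""
        if isinstance(that, int):
            return self.start <= that < self.end
        return self.start <= that.start and that.end <= self.end

    def overlaps(self, that):
        """True if `that` (int or SimpleInterval) shares at least one index with this span."""
        if isinstance(that, int):
            return self.start <= that < self.end
        return self.start < that.end and that.start < self.end


span = SimpleInterval(2, 5)                         # covers indices 2, 3, 4
inside = span.contains(SimpleInterval(3, 5))        # fully inside
partial = span.overlaps(SimpleInterval(4, 6))       # shares index 4 only
adjacent = span.overlaps(SimpleInterval(5, 7))      # starts where span ends
```

Because the end is exclusive, an interval that starts exactly at another's end shares no index with it, so the two do not overlap.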
Annotators (Processors)¶
Text annotation is performed by communicating with one of the following annotators (“processors”).
CluProcessor
¶
-
class
processors.annotators.
CluProcessor
(address)[source]¶ Bases:
processors.annotators.Processor
Processor for text annotation based on [org.clulab.processors.clu.CluProcessor](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/CluProcessor.scala)
Uses the Malt parser.
FastNLPProcessor
¶
-
class
processors.annotators.
FastNLPProcessor
(address)[source]¶ Bases:
processors.annotators.Processor
Processor for text annotation based on [org.clulab.processors.fastnlp.FastNLPProcessor](https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/fastnlp/FastNLPProcessor.scala)
Uses the Stanford CoreNLP neural network parser.
BioNLPProcessor
¶
-
class
processors.annotators.
BioNLPProcessor
(address)[source]¶ Bases:
processors.annotators.Processor
Processor for biomedical text annotation based on [org.clulab.processors.fastnlp.FastNLPProcessor](https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/fastnlp/FastNLPProcessor.scala)
CoreNLP-derived annotator.
Sentiment Analysis¶
CoreNLPSentimentAnalyzer
¶
-
class
processors.sentiment.
CoreNLPSentimentAnalyzer
(address)[source]¶ Bases:
processors.sentiment.SentimentAnalyzer
Bridge to [CoreNLP’s tree-based sentiment analysis system](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf)
paths
¶
DependencyUtils
¶
-
class
processors.paths.
DependencyUtils
[source]¶ Bases:
object
A set of utilities for analyzing syntactic dependency graphs.
-
build_networkx_graph
(roots, edges, name)¶ Constructs a networkx.Graph
-
shortest_path
(g, start, end)¶ Finds the shortest path in a networkx.Graph between any element in a list of start nodes and any element in a list of end nodes.
-
retrieve_edges
(dep_graph, path)¶ Converts output of shortest_path into a list of triples that include the grammatical relation (and direction) for each node-node “hop” in the syntactic dependency graph.
-
simplify_tag
(tag)¶ Maps part of speech (PoS) tag to a subset of PoS tags to better consolidate categorical labels.
-
lexicalize_path
(sentence, path, words=False, lemmas=False, tags=False, simple_tags=False, entities=False, limit_to=None)¶ Lexicalizes path in syntactic dependency graph using Odin-style token constraints.
-
pagerank
(networkx_graph, alpha=0.85, personalization=None, max_iter=1000, tol=1e-06, nstart=None, weight='weight', dangling=None)¶ Measures node activity in a networkx.Graph using a thin wrapper around the networkx implementation of the PageRank algorithm (see networkx.algorithms.link_analysis.pagerank). Use with processors.ds.DirectedGraph.graph.
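The idea behind shortest_path and retrieve_edges can be illustrated without networkx. The sketch below is a stand-in, not the library's code: it runs a plain breadth-first search over an undirected view of the dependency edges, then labels each hop with its relation and direction. The function names, the (source, relation, destination) tuple format, and the ">" / "<" direction markers are assumptions made for this example.

```python
from collections import deque


def shortest_path_between(edges, starts, ends):
    """Breadth-first search from any start node to any end node (undirected view)."""
    neighbors = {}
    for src, _rel, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    targets = set(ends)
    queue = deque([start] for start in starts)
    seen = set(starts)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in targets:
            return path
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path connects the two node sets


def label_hops(edges, path):
    """Turn a node path into (node, direction + relation, node) triples."""
    relation_of = {(src, dst): rel for src, rel, dst in edges}
    hops = []
    for a, b in zip(path, path[1:]):
        if (a, b) in relation_of:
            hops.append((a, ">" + relation_of[(a, b)], b))   # edge runs a -> b
        else:
            hops.append((a, "<" + relation_of[(b, a)], b))   # edge runs b -> a
    return hops


# Toy graph for "Dogs chase the cats" (token indices 0..3).
edges = [(1, "nsubj", 0), (1, "dobj", 3), (3, "det", 2)]
path = shortest_path_between(edges, [0], [2])
hops = label_hops(edges, path)
```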