API Reference¶
This section of the documentation provides detailed information on functions, classes, and methods.
Server Communication¶
Communicating with the NLP server (processors-server) is handled by the following classes:
ProcessorsBaseAPI¶
-
class
processors.api.ProcessorsBaseAPI(**kwargs)¶
Bases: object
Manages a connection with processors-server and provides an interface to the API.
Parameters: - port (int) – The port the server is running on or should be started on. Default is 8886.
- hostname (str) – The host name to use for the server. Default is “localhost”.
- log_file (str) – The path for the log file. Default is py-processors.log in the user’s home directory.
-
annotate(text)¶ Produces a Document from the provided text using the default processor.
-
clu.annotate(text)¶ Produces a Document from the provided text using CluProcessor.
-
fastnlp.annotate(text)¶ Produces a Document from the provided text using FastNLPProcessor.
-
bionlp.annotate(text)¶ Produces a Document from the provided text using BioNLPProcessor.
-
annotate_from_sentences(sentences)¶ Produces a Document from sentences (a list of text split into sentences). Uses the default processor.
-
fastnlp.annotate_from_sentences(sentences)¶ Produces a Document from sentences (a list of text split into sentences). Uses FastNLPProcessor.
-
bionlp.annotate_from_sentences(sentences)¶ Produces a Document from sentences (a list of text split into sentences). Uses BioNLPProcessor.
-
corenlp.sentiment.score_sentence(sentence)¶ Produces a sentiment score for the provided sentence (an instance of Sentence).
-
corenlp.sentiment.score_document(doc)¶ Produces sentiment scores for the provided doc (an instance of Document). One score is produced for each sentence.
-
corenlp.sentiment.score_segmented_text(sentences)¶ Produces sentiment scores for the provided sentences (a list of text segmented into sentences). One score is produced for each item in sentences.
-
odin.extract_from_text(text, rules)¶ Produces a list of Mentions for matches of the provided rules on the text. rules can be a string of Odin rules, or a url ending in .yml or .yaml.
-
odin.extract_from_document(doc, rules)¶ Produces a list of Mentions for matches of the provided rules on the doc (an instance of Document). rules can be a string of Odin rules, or a url ending in .yml or .yaml.
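The methods above can be combined as in the following minimal sketch. It assumes `api` is an already-connected object that behaves like ProcessorsBaseAPI as documented here; the helper name `summarize` is hypothetical, and the assumption that each Mention exposes a `label` attribute comes from the Mention parameters listed later in this reference, not from a confirmed attribute list.

```python
def summarize(api, text, rules):
    """Annotate text and apply Odin rules to it (hypothetical helper).

    `api` is assumed to behave like ProcessorsBaseAPI:
    `annotate(text)` returns a Document and
    `odin.extract_from_text(text, rules)` returns a list of Mentions.
    """
    doc = api.annotate(text)
    mentions = api.odin.extract_from_text(text, rules)
    return {
        "num_sentences": doc.size,      # Document.size: number of sentences
        "num_tokens": len(doc.words),   # Document.words: list of tokens
        # Assumption: each Mention exposes its `label` as an attribute.
        "labels": sorted({m.label for m in mentions}),
    }
```

With a running server, this might be called as `summarize(ProcessorsAPI(port=8886), text, rules)`, where `rules` is a string of Odin rules or a url ending in .yml or .yaml.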
ProcessorsAPI¶
-
class
processors.api.ProcessorsAPI(**kwargs)¶
Bases: processors.api.ProcessorsBaseAPI
Manages a connection with the processors-server jar and provides an interface to the API.
Parameters: - timeout (int) – The number of seconds to wait for the server to initialize. Default is 120.
- jvm_mem (str) – The maximum amount of memory to allocate to the JVM for the server. Default is “-Xmx3G”.
- jar_path (str) – The path to the processors-server jar. Default is the jar installed with the package.
- keep_alive (bool) – Whether or not to keep the server running when the ProcessorsAPI instance goes out of scope. Default is False (server is shut down).
- log_file (str) – The path for the log file. Default is py-processors.log in the user’s home directory.
-
start_server(jar_path, **kwargs)¶ Starts the server using the provided jar_path. Optionally takes hostname, port, jvm_mem, and timeout.
-
stop_server()¶ Attempts to stop the server running at self.address.
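When shutdown timing matters, it can be more predictable to stop the server explicitly than to rely on the instance going out of scope. The sketch below wraps any object with a `stop_server()` method in a context manager; `managed_server` is a hypothetical helper, not part of the library.

```python
from contextlib import contextmanager


@contextmanager
def managed_server(api):
    """Yield `api`, then always attempt to stop its server.

    Hypothetical helper: it relies only on `api.stop_server()`,
    as documented above.
    """
    try:
        yield api
    finally:
        # Runs on normal exit and when the body raises.
        api.stop_server()
```

The `finally` clause means the stop attempt is made even if annotation code inside the `with` block raises an exception.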
OdinAPI¶
SentimentAnalysisAPI¶
-
class
processors.sentiment.SentimentAnalysisAPI(address)¶
Bases: object
API for performing sentiment analysis.
Parameters: address (str) – The base address for the API (i.e., everything preceding /api/..)
-
corenlp¶ processors.sentiment.CoreNLPSentimentAnalyzer – Service using [CoreNLP’s tree-based system](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) for performing sentiment analysis.
Data Structures¶
Document¶
-
class
processors.ds.Document(sentences)¶
Bases: processors.ds.NLPDatum
Storage class for annotated text. Based on [org.clulab.processors.Document](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/Document.scala)
Parameters: sentences ([processors.ds.Sentence]) – The sentences comprising the Document.
-
id¶ str or None – A unique ID for the Document.
-
size¶ int – The number of sentences.
-
sentences¶ [processors.ds.Sentence] – The sentences comprising the Document.
-
words¶ [str] – A list of the Document’s tokens.
-
tags¶ [str] – A list of the Document’s tokens represented using part of speech (PoS) tags.
-
lemmas¶ [str] – A list of the Document’s tokens represented using lemmas.
-
_entities¶ [str] – A list of the Document’s tokens represented using IOB-style named entity (NE) labels.
-
nes¶ dict – A dictionary of NE labels represented in the Document -> a list of corresponding text spans.
-
bag_of_labeled_deps¶ [str] – The labeled dependencies from all sentences in the Document.
-
bag_of_unlabeled_deps¶ [str] – The unlabeled dependencies from all sentences in the Document.
-
text¶ str or None – The original text of the Document.
-
bag_of_labeled_dependencies_using(form)¶ Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.
-
bag_of_unlabeled_dependencies_using(form)¶ Produces a list of syntactic dependencies where each edge is left unlabeled without its grammatical relation.
Sentence¶
-
class
processors.ds.Sentence(**kwargs)¶
Bases: processors.ds.NLPDatum
Storage class for an annotated sentence. Based on [org.clulab.processors.Sentence](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/Sentence.scala)
Parameters: - text (str or None) – The text of the Sentence.
- words ([str]) – A list of the Sentence’s tokens.
- startOffsets ([int]) – The character offsets starting each token (inclusive).
- endOffsets ([int]) – The character offsets marking the end of each token (exclusive).
- tags ([str]) – A list of the Sentence’s tokens represented using part of speech (PoS) tags.
- lemmas ([str]) – A list of the Sentence’s tokens represented using lemmas.
- chunks ([str]) – A list of the Sentence’s tokens represented using IOB-style phrase labels (ex. B-NP, I-NP, B-VP, etc.).
- entities ([str]) – A list of the Sentence’s tokens represented using IOB-style named entity (NE) labels.
- graphs (dict) – A dictionary of {graph-name -> {edges: [{source, destination, relation}], roots: [int]}}
-
text¶ str – The text of the Sentence.
-
startOffsets¶ [int] – The character offsets starting each token (inclusive).
-
endOffsets¶ [int] – The character offsets marking the end of each token (exclusive).
-
length¶ int – The number of tokens in the Sentence
-
graphs¶ dict – A dictionary (str -> processors.ds.DirectedGraph) mapping the graph type/name to a processors.ds.DirectedGraph.
-
basic_dependencies¶ processors.ds.DirectedGraph – A processors.ds.DirectedGraph using basic Stanford dependencies.
-
collapsed_dependencies¶ processors.ds.DirectedGraph – A processors.ds.DirectedGraph using collapsed Stanford dependencies.
-
dependencies¶ processors.ds.DirectedGraph – A pointer to the preferred syntactic dependency graph type for this Sentence.
-
_entities¶ [str] – The IOB-style Named Entity (NE) labels corresponding to each token.
-
_chunks¶ [str] – The IOB-style chunk labels corresponding to each token.
-
nes¶ dict – A dictionary of NE labels represented in the Sentence -> a list of corresponding text spans (ex. {“PERSON”: [phrase 1, ..., phrase n]}). Built from Sentence._entities.
-
phrases¶ dict – A dictionary of chunk labels represented in the Sentence -> a list of corresponding text spans (ex. {“NP”: [phrase 1, ..., phrase n]}). Built from Sentence._chunks.
-
bag_of_labeled_dependencies_using(form)¶ Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.
-
bag_of_unlabeled_dependencies_using(form)¶ Produces a list of syntactic dependencies where each edge is left unlabeled without its grammatical relation.
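The `nes` and `phrases` dictionaries are described as being built from per-token IOB-style labels. The following is a minimal sketch of that kind of grouping, not the library's actual implementation: the function name `spans_from_iob` and the choice to treat an `I-` tag without a preceding `B-` tag as the start of a new span are assumptions made for illustration.

```python
from collections import defaultdict


def spans_from_iob(words, labels):
    """Group tokens into {label: [phrase, ...]} using IOB-style tags.

    Illustrative only; the library may handle edge cases differently.
    """
    spans = defaultdict(list)
    current_label = None
    current_tokens = []
    # A trailing "O" sentinel flushes any span still open at the end.
    for word, tag in zip(list(words) + [""], list(labels) + ["O"]):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current_label
        if not continues:
            # Close the open span (if any) before starting a new one.
            if current_label is not None:
                spans[current_label].append(" ".join(current_tokens))
            current_label, current_tokens = None, []
            # Assumption: a stray "I-" tag opens a new span, like "B-".
            if prefix in ("B", "I"):
                current_label = label
        if current_label is not None:
            current_tokens.append(word)
    return dict(spans)
```

For example, tokens labeled `B-PERSON, I-PERSON, O` would yield a single two-token phrase under the key `"PERSON"`.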
Edge¶
-
class
processors.ds.Edge(source, destination, relation)¶
Bases: processors.ds.NLPDatum
DirectedGraph¶
-
class
processors.ds.DirectedGraph(kind, deps, words)¶
Bases: processors.ds.NLPDatum
Storage class for directed graphs.
Parameters: - kind (str) – The name of the directed graph.
- deps (dict) – A dictionary of {edges: [{source, destination, relation}], roots: [int]}
- words ([str]) – A list of the word form of the tokens from the originating Sentence.
-
_words¶ [str] – A list of the word form of the tokens from the originating Sentence.
-
roots¶ [int] – A list of indices for the syntactic dependency graph’s roots. Generally this is a single token index.
-
edges¶ list[processors.ds.Edge] – A list of processors.ds.Edge
-
incoming¶ A dictionary of {int -> [int]} encoding the incoming edges for each node in the graph.
-
outgoing¶ A dictionary of {int -> [int]} encoding the outgoing edges for each node in the graph.
-
labeled¶ [str] – A list of strings where each element in the list represents an edge encoded as source index, relation, and destination index (“source_relation_destination”).
-
unlabeled¶ [str] – A list of strings where each element in the list represents an edge encoded as source index and destination index (“source_destination”).
-
graph¶ networkx.Graph – A networkx.Graph representation of the DirectedGraph. Used by shortest_path.
-
bag_of_labeled_dependencies_from_tokens(form)¶ Produces a list of syntactic dependencies where each edge is labeled with its grammatical relation.
-
bag_of_unlabeled_dependencies_from_tokens(form)¶ Produces a list of syntactic dependencies where each edge is left unlabeled without its grammatical relation.
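To make the relationship between the `deps` dictionary and the derived attributes concrete, here is a small sketch that computes `incoming`/`outgoing` style adjacency maps and the `labeled`/`unlabeled` string encodings from a dictionary shaped like `{edges: [{source, destination, relation}], roots: [int]}`. The function names and the optional `tokens` argument are hypothetical; the real class may store additional information per edge.

```python
from collections import defaultdict


def adjacency(deps):
    """Return (incoming, outgoing) as {int -> [int]} maps (illustrative)."""
    incoming = defaultdict(list)
    outgoing = defaultdict(list)
    for edge in deps["edges"]:
        outgoing[edge["source"]].append(edge["destination"])
        incoming[edge["destination"]].append(edge["source"])
    return dict(incoming), dict(outgoing)


def encode_edges(deps, tokens=None):
    """Return (labeled, unlabeled) string encodings of each edge.

    Nodes are rendered as indices, or as token forms when `tokens`
    (a hypothetical list of strings indexed by token position) is given.
    """
    def node(i):
        return str(i) if tokens is None else tokens[i]

    labeled = [
        "{}_{}_{}".format(node(e["source"]), e["relation"], node(e["destination"]))
        for e in deps["edges"]
    ]
    unlabeled = [
        "{}_{}".format(node(e["source"]), node(e["destination"]))
        for e in deps["edges"]
    ]
    return labeled, unlabeled
```

Passing token forms instead of indices mirrors the idea behind the `form` argument of the bag-of-dependencies methods, under the assumption that `form` selects how each node is rendered.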
Mention¶
-
class
processors.odin.Mention(token_interval, sentence, document, foundBy, label, labels=None, trigger=None, arguments=None, paths=None, keep=True, doc_id=None)¶
Bases: processors.ds.NLPDatum
A labeled span of text. Used to model textual mentions of events, relations, and entities.
Parameters: - token_interval (Interval) – The span of the Mention represented as an Interval.
- sentence (int) – The sentence index that contains the Mention.
- document (Document) – The Document in which the Mention was found.
- foundBy (str) – The Odin IE rule that produced this Mention.
- label (str) – The label most closely associated with this span. Usually the lowest hyponym of “labels”.
- labels (list) – The list of labels associated with this span.
- trigger (dict or None) – dict of JSON for Mention’s trigger (event predicate or word(s) signaling the Mention).
- arguments (dict or None) – dict of JSON for Mention’s arguments.
- paths (dict or None) – dict of JSON encoding the syntactic paths linking a Mention’s arguments to its trigger (applies to Mentions produced from type:“dependency” rules).
- doc_id (str or None) – the id of the document
-
tokenInterval¶ processors.ds.Interval – An Interval encoding the start and end of the Mention.
-
start¶ int – The token index that starts the Mention.
-
end¶ int – The token index that marks the end of the Mention (exclusive).
-
sentenceObj¶ processors.ds.Sentence – Pointer to the Sentence instance containing the Mention.
-
characterStartOffset¶ int – The index of the character that starts the Mention.
-
characterEndOffset¶ int – The index of the character that ends the Mention.
-
type¶ Mention.TBM or Mention.EM or Mention.RM – The type of the Mention.
See also
[Odin manual](https://arxiv.org/abs/1509.07513)
-
matches(label_pattern)¶ Test if the provided pattern, label_pattern, matches any element in Mention.labels.
-
overlaps(other)¶ Test whether other (token index or Mention) overlaps with the span of this Mention.
-
copy(**kwargs)¶ Copy constructor for this Mention.
-
words()¶ Words for this Mention’s span.
-
tags()¶ Part of speech for this Mention’s span.
-
lemmas()¶ Lemmas for this Mention’s span.
-
_chunks()¶ Chunk labels for this Mention’s span.
-
_entities()¶ NE labels for this Mention’s span.
JSON serialization/deserialization is handled via processors.serialization.JSONSerializer.
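The token-level and character-level attributes of a Mention can be related to the containing Sentence as in the sketch below. It assumes that the token span is half-open (`end` is exclusive, as documented), that the character start offset is the start offset of the first token, and that the character end offset is the (exclusive) end offset of the last token. The function name `mention_span` is hypothetical.

```python
def mention_span(words, start_offsets, end_offsets, start, end):
    """Return tokens and character offsets for the token span [start, end).

    Illustrative only: `words`, `start_offsets`, and `end_offsets` play
    the role of Sentence.words, Sentence.startOffsets, and
    Sentence.endOffsets.
    """
    if not 0 <= start < end <= len(words):
        raise ValueError("span must satisfy 0 <= start < end <= len(words)")
    return {
        "words": words[start:end],
        # Assumption: character offsets come from the boundary tokens.
        "characterStartOffset": start_offsets[start],
        "characterEndOffset": end_offsets[end - 1],
    }
```

Under these assumptions, slicing the original text with the two character offsets recovers the surface string of the span.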
Interval¶
-
class
processors.ds.Interval(start, end)¶
Bases: processors.ds.NLPDatum
Defines a token or character span.
Parameters: - start (int) – The token or character index where the interval begins.
- end (int) – One more than the index of the last token/character in the span (i.e., the end is exclusive).
-
contains(that)¶ Test whether this Interval contains that (int or Interval).
-
overlaps(that)¶ Test whether that (int or Interval) overlaps with the span of this Interval. Equivalent Intervals will overlap.
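A half-open span with `contains` and `overlaps` checks can be sketched as follows. This is one plausible reading of the semantics, not the library's code: the class name `Span` is a hypothetical stand-in for Interval, and a bare integer is treated as a span covering a single position.

```python
class Span:
    """Half-open interval [start, end); illustrative stand-in for Interval."""

    def __init__(self, start, end):
        if end < start:
            raise ValueError("end must not precede start")
        self.start = start
        self.end = end

    @staticmethod
    def _bounds(that):
        # Assumption: an int stands for the single-position span [that, that + 1).
        if isinstance(that, int):
            return that, that + 1
        return that.start, that.end

    def contains(self, that):
        """True if `that` lies entirely within this span."""
        start, end = self._bounds(that)
        return self.start <= start and end <= self.end

    def overlaps(self, that):
        """True if the two spans share at least one position."""
        start, end = self._bounds(that)
        return self.start < end and start < self.end
```

Because the end is exclusive, adjacent spans such as `[2, 5)` and `[5, 7)` do not overlap, while two equivalent non-empty spans do.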
Annotators (Processors)¶
Text annotation is performed by communicating with one of the following annotators (“processors”).
CluProcessor¶
-
class
processors.annotators.CluProcessor(address)¶
Bases: processors.annotators.Processor
Processor for text annotation based on [org.clulab.processors.clu.CluProcessor](https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/CluProcessor.scala)
Uses the Malt parser.
FastNLPProcessor¶
-
class
processors.annotators.FastNLPProcessor(address)¶
Bases: processors.annotators.Processor
Processor for text annotation based on [org.clulab.processors.fastnlp.FastNLPProcessor](https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/fastnlp/FastNLPProcessor.scala)
Uses the Stanford CoreNLP neural network parser.
BioNLPProcessor¶
-
class
processors.annotators.BioNLPProcessor(address)¶
Bases: processors.annotators.Processor
Processor for biomedical text annotation based on [org.clulab.processors.fastnlp.FastNLPProcessor](https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/fastnlp/FastNLPProcessor.scala)
CoreNLP-derived annotator.
Sentiment Analysis¶
CoreNLPSentimentAnalyzer¶
-
class
processors.sentiment.CoreNLPSentimentAnalyzer(address)¶
Bases: processors.sentiment.SentimentAnalyzer
Bridge to [CoreNLP’s tree-based sentiment analysis system](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf)
paths¶
DependencyUtils¶
-
class
processors.paths.DependencyUtils¶
Bases: object
A set of utilities for analyzing syntactic dependency graphs.
-
build_networkx_graph(roots, edges, name)¶ Constructs a networkx.Graph
-
shortest_path(g, start, end)¶ Finds the shortest path in a networkx.Graph between any element in a list of start nodes and any element in a list of end nodes.
-
retrieve_edges(dep_graph, path)¶ Converts output of shortest_path into a list of triples that include the grammatical relation (and direction) for each node-node “hop” in the syntactic dependency graph.
-
simplify_tag(tag)¶ Maps part of speech (PoS) tag to a subset of PoS tags to better consolidate categorical labels.
-
lexicalize_path(sentence, path, words=False, lemmas=False, tags=False, simple_tags=False, entities=False, limit_to=None)¶ Lexicalizes path in syntactic dependency graph using Odin-style token constraints.
-
pagerank(networkx_graph, alpha=0.85, personalization=None, max_iter=1000, tol=1e-06, nstart=None, weight='weight', dangling=None)¶ Measures node activity in a networkx.Graph using a thin wrapper around networkx implementation of pagerank algorithm (see networkx.algorithms.link_analysis.pagerank). Use with processors.ds.DirectedGraph.graph.
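The idea behind `shortest_path` and `retrieve_edges` can be illustrated without networkx. The sketch below swaps in a plain breadth-first search over an undirected view of the dependency edges, then annotates each hop with its relation and direction. It is not the library's implementation: the function names, the edge dictionaries, and the `>`/`<` direction prefixes (borrowed from Odin-style path notation) are assumptions made for illustration.

```python
from collections import defaultdict, deque


def bfs_shortest_path(edges, starts, ends):
    """Shortest path from any node in `starts` to any node in `ends`.

    Treats edges as undirected; returns a list of node indices or None.
    """
    neighbors = defaultdict(set)
    for e in edges:
        neighbors[e["source"]].add(e["destination"])
        neighbors[e["destination"]].add(e["source"])
    targets = set(ends)
    previous = {s: None for s in starts}  # also serves as the visited set
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:  # walk back to a start node
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbors[node]):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


def label_hops(edges, path):
    """Convert a node path into (node, direction+relation, node) triples."""
    relation = {(e["source"], e["destination"]): e["relation"] for e in edges}
    hops = []
    for a, b in zip(path, path[1:]):
        if (a, b) in relation:
            hops.append((a, ">" + relation[(a, b)], b))  # follows edge direction
        else:
            hops.append((a, "<" + relation[(b, a)], b))  # goes against it
    return hops
```

Starting the search from all start nodes at once means the first end node reached is at minimal distance from the nearest start node, which matches the "any element to any element" behavior described above.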