

A pipeline's components create different annotations on the Doc: the parser, for example, sets Token.head, Token.dep, Doc.sents and Doc.noun_chunks, while custom components can assign custom attributes, methods or properties.

The capabilities of a processing pipeline always depend on the components, their models and how they were trained. For example, a pipeline for named entity recognition needs to include a trained named entity recognizer component with a statistical model and weights that enable it to make predictions of entity labels. This is why each pipeline specifies its components and their settings in its config.

The statistical components like the tagger or parser are typically independent and don't share any data between each other. For example, the named entity recognizer doesn't use any features set by the tagger and parser, and so on. This means that you can swap them, or remove single components from the pipeline without affecting the others. However, components may share a "token-to-vector" component like Tok2Vec or Transformer. You can read more about this in the docs on embedding layers.
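To make this concrete, here is a minimal sketch, assuming the en_core_web_sm pipeline is installed (the printed component list reflects its default config), that inspects a loaded pipeline and removes a single component without affecting the rest:

```python
import spacy

# Load a trained pipeline; its config determines which components it contains.
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

# Components are largely independent, so one can be removed
# without breaking the others.
nlp.remove_pipe("ner")
print(nlp.pipe_names)  # 'ner' is gone; the remaining components still run
```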

Custom components may also depend on annotations set by other components. For example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll only work if it's added after the tagger. The parser will respect pre-defined sentence boundaries, so if a previous component in the pipeline sets them, its dependency predictions may be different. Similarly, it matters if you add the EntityRuler before or after the statistical entity recognizer: if it's added before, the entity recognizer will take the existing entities into account when making predictions. The EntityLinker, which resolves named entities to knowledge base IDs, should be preceded by a pipeline component that recognizes entities, such as the EntityRecognizer. A sketch of controlling this ordering follows below.

The tokenizer is a "special" component and isn't part of the regular pipeline. It also doesn't show up in nlp.pipe_names. The reason is that there can only really be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a Doc. You can still customize the tokenizer, though: nlp.tokenizer is writable, so you can either create your own tokenizer class from scratch or replace it with an entirely custom function, as in the second sketch below.
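To control where a component sits in the pipeline, add_pipe accepts placement arguments such as before and after. This sketch adds an EntityRuler before the statistical ner component so its matches are taken into account; the pattern and sample sentence are made up for the example:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Insert the rule-based EntityRuler *before* the statistical "ner"
# component, so its entities are set first and "ner" respects them.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion AI"}])  # example pattern

doc = nlp("Explosion AI develops spaCy.")
print([(ent.text, ent.label_) for ent in doc.ents])
```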
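And because nlp.tokenizer is writable, you can swap in your own implementation. The whitespace-only tokenizer below is a deliberately naive sketch to show the interface (string in, Doc out), not something to use in production:

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Toy tokenizer: takes a string and must return a Doc."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)  # nlp.tokenizer is writable
doc = nlp("What's happened to me? he thought.")
print([token.text for token in doc])  # tokens split on spaces only
```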

When you call nlp on a text, spaCy will tokenize it and then call each component on the Doc, in order. It then returns the processed Doc that you can work with.

When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts. spaCy's nlp.pipe method takes an iterable of texts and yields processed Doc objects, batching them internally. So instead of `docs = [nlp(text) for text in texts]`, write `docs = list(nlp.pipe(texts))`.

💡 Tips for efficient processing

- Process the texts as a stream using nlp.pipe and buffer them in batches, instead of one-by-one.
- Only apply the pipeline components you need. Getting predictions from the model that you don't actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable components you don't need – either when loading a pipeline, or during processing. See the section on disabling pipeline components for more details and examples.

In the example below, we're using nlp.pipe to process a (potentially very large) iterable of texts as a stream. Because we're only accessing the named entities in doc.ents (set by the ner component), we'll disable all other components during processing. nlp.pipe yields Doc objects, so we can iterate over them and access the named entity predictions. If you disabled the ner component as well, you'd see that the doc.ents are now empty, because the entity recognizer didn't run.
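Here is a runnable reconstruction of that example. The sample texts are invented, and the exact list of disabled components is an assumption (the names match the default en_core_web_sm pipeline):

```python
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
# Keep only "ner" running; disable everything else for speed.
for doc in nlp.pipe(
    texts,
    disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
```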
Important note: when using nlp.pipe, keep in mind that it returns a generator that yields Doc objects – not a list. So if you want to use it like a list, you'll have to call list() on it.
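For instance, a trivial sketch with made-up texts:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["This is a text.", "So is this."]

docs_gen = nlp.pipe(texts)  # a generator: Docs are yielded lazily
docs = list(docs_gen)       # call list() if you need indexing or len()
print(len(docs), docs[0].text)
```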