The CLARIN Resource Families initiative provides a user-friendly overview of the available language resources in the CLARIN infrastructure for researchers from digital humanities, social sciences and human language technologies. This month we highlight the overview of part-of-speech taggers and lemmatizers. The CLARIN infrastructure offers 65 tools for part-of-speech tagging or lemmatization for over 50 languages.
Part-of-speech tagging is the automatic text annotation process in which words or tokens are assigned part-of-speech tags. These tags typically correspond to the main syntactic categories of a language (e.g., noun, verb) and often to subtypes of a category that are distinguished by morphosyntactic features (e.g., number, tense). Lemmatization is the process by which the inflected forms of a lexeme are grouped together under a base dictionary form. Both part-of-speech tagging and lemmatization are crucial steps in linguistic pre-processing.
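To make the two definitions concrete, here is a minimal sketch in Python. It assumes a tiny hand-written lexicon (`LEXICON`) and two illustrative helper functions (`annotate`, `group_by_lemma`); these names, the tag labels and the feature labels are assumptions made for this example and do not come from any CLARIN tool. Real taggers and lemmatizers typically rely on statistical or neural models and use sentence context to resolve ambiguity, which a plain lookup like this cannot do.

```python
from collections import defaultdict

# Hypothetical toy lexicon: surface form -> (POS tag, morphosyntactic features, lemma).
# The entries and label names are illustrative assumptions only.
LEXICON = {
    "the": ("DET", {}, "the"),
    "cat": ("NOUN", {"Number": "Sing"}, "cat"),
    "cats": ("NOUN", {"Number": "Plur"}, "cat"),
    "sits": ("VERB", {"Tense": "Pres"}, "sit"),
    "sat": ("VERB", {"Tense": "Past"}, "sit"),
    "on": ("ADP", {}, "on"),
    "mat": ("NOUN", {"Number": "Sing"}, "mat"),
}


def annotate(tokens):
    """Assign a POS tag, features and a lemma to each token by lexicon lookup."""
    annotations = []
    for token in tokens:
        key = token.lower()
        # Unknown words get the placeholder tag "X" and are used as their own lemma.
        pos, feats, lemma = LEXICON.get(key, ("X", {}, key))
        annotations.append(
            {"form": token, "pos": pos, "feats": feats, "lemma": lemma}
        )
    return annotations


def group_by_lemma(annotations):
    """Group the lowercased surface forms under their shared base form."""
    groups = defaultdict(set)
    for item in annotations:
        groups[item["lemma"]].add(item["form"].lower())
    return dict(groups)


tokens = "The cats sat on the mat while the cat sits".split()
annotated = annotate(tokens)
groups = group_by_lemma(annotated)

for item in annotated:
    print(item["form"], item["pos"], item["feats"], item["lemma"])
print(groups)
```

In this sketch, "cats" and "cat" receive the same category (noun) but different number features, and both are grouped under the lemma "cat", while "sat" and "sits" are grouped under "sit". The word "while" is not in the toy lexicon, so it falls back to the placeholder tag.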