NLP is used in many areas today, including voice assistants, automatic text translation, and text filtering. Its three main areas are Speech Recognition, Natural Language Understanding, and Natural Language Generation.

What is Natural Language Processing
Natural Language Processing (NLP) lies at the intersection of machine learning and mathematical linguistics, and it studies methods for the analysis and synthesis of natural language.

Stemming

A language contains a very large number of valid word forms whose meanings are similar but whose spellings differ in suffixes, prefixes, endings, and so on, which complicates the creation of dictionaries and further processing. Stemming reduces a word to its base form. The essence of the approach is to find the stem of the word by sequentially cutting parts off its end and beginning. The clipping rules for a stemmer are created in advance, most often as regular expressions, which makes the approach laborious: new linguistic research is needed every time another language is added. The second drawback of the approach is possible information loss when parts are cut off; for example, we can lose information about the part of speech.
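
As a minimal sketch of stemming in Python, here is an example assuming NLTK's SnowballStemmer; the library choice is an illustrative assumption, since the article does not name a specific stemmer.

```python
# A minimal stemming sketch using NLTK's SnowballStemmer; the library
# choice is an assumption, the article does not prescribe one.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

for word in ["connection", "connected", "connecting", "connections"]:
    print(word, "->", stemmer.stem(word))
# All four forms reduce to the single stem "connect".
```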

Vectorization

Most mathematical models work in high-dimensional vector spaces, so text must be mapped into a vector space. The main approach is the bag-of-words: a vector with the dimensionality of the dictionary is formed for each document, each word is allocated its own dimension, and the feature recorded for the document is how often the word occurs in it. The most common method for calculating a feature is TF-IDF [4] (TF: term frequency, IDF: inverse document frequency). TF can be computed, for example, as a simple count of the word's occurrences in the document. IDF is usually calculated as the logarithm of the number of documents in the corpus divided by the number of documents that contain the word. Thus, if a word occurs in every document in the corpus, its IDF is log(1) = 0, and it contributes nothing to any document vector.
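
A short sketch of this pipeline, assuming scikit-learn as an illustrative choice of library; note that scikit-learn smooths IDF by default, so its values differ slightly from the plain log(N / df) formula above.

```python
# A bag-of-words / TF-IDF sketch using scikit-learn; the library choice
# is illustrative. scikit-learn smooths IDF by default, so the values
# differ slightly from the classic log(N / df) formula.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats are friendly pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # one dimension per word
print(X.toarray())                         # one TF-IDF vector per document
```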

The advantage of the bag of words is its simple implementation; however, the method loses some information, for example, word order. To reduce this loss, you can use a bag of N-grams (adding not only words but also phrases), as in the sketch below, or use vector representations of words (embeddings), which can, for example, reduce errors on words with the same spelling but different meanings.
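
Here is a bag-of-N-grams sketch, again assuming scikit-learn for illustration: counting bigrams alongside unigrams partially preserves word order.

```python
# A bag-of-N-grams sketch: bigrams alongside unigrams partially
# preserve word order (an illustrative scikit-learn example).
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(["the cat sat", "sat the cat"])

print(vectorizer.get_feature_names_out())
# The unigram counts of the two documents are identical, but the bigrams
# "cat sat" and "sat the" now tell them apart.
```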

Deduplication

Since the number of similar documents in a large corpus can be large, it is necessary to get rid of duplicates. Because each document can be represented as a vector, we can measure their proximity using cosine similarity or another metric. The downside is that for large corpora, exhaustive pairwise comparison of all documents becomes infeasible. As an optimization, you can use locality-sensitive hashing (LSH), which places similar objects into the same buckets.
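
A near-duplicate detection sketch via cosine similarity over TF-IDF vectors; the toy corpus and the use of scikit-learn are illustrative assumptions.

```python
# Near-duplicate detection via cosine similarity over TF-IDF vectors
# (corpus and library choice are illustrative assumptions).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over a lazy dog",
    "an entirely unrelated sentence about something else",
]

X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)  # full pairwise similarity matrix

print(sim.round(2))
# The two near-duplicate documents score far higher against each other
# than either does against the unrelated third one; for large corpora,
# this exhaustive comparison is replaced by locality-sensitive hashing.
```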

Semantic Analysis

Semantic analysis of text is the extraction of semantic relations and the formation of a semantic representation. In the general case, the semantic representation is a graph, a semantic network, whose edges reflect binary relations between nodes, the semantic units of the text. The depth of semantic analysis can vary, and in real systems most often only a syntactic-semantic representation of the text or of individual sentences is built. Semantic analysis is used in sentiment analysis tasks, for example, to automatically determine whether reviews are positive.
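
As a tiny sentiment-analysis sketch, here is an example assuming NLTK's VADER analyzer; the tool is an illustrative assumption, not one named in the article.

```python
# A tiny sentiment-analysis sketch using NLTK's VADER analyzer; the tool
# is an illustrative assumption, not one named in the article.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

reviews = ["Great product, works perfectly!", "Terrible, it broke after one day."]
for review in reviews:
    compound = analyzer.polarity_scores(review)["compound"]
    print(review, "->", compound)  # > 0 means positive, < 0 means negative
```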