Carnegie Mellon University

Multilingual Machine Translation and Cross-Media Search Based on Large NLP Models

Time flies, and before we knew it, MBP has been running for 20 years. Through MBP, we first cooperated on digitizing multilingual books in Chinese, English, and other languages. Later, with the support of UNESCO, we built an international engineering knowledge center through the Chinese Academy of Engineering to provide knowledge services to the engineering science and technology communities in China and around the world. The tools we used have also witnessed the remarkable development of technology over the past 20 years: from OCR to search engines, and from character recognition based on handcrafted features to deep neural networks and intelligent recommendation. Today, standing at a new starting point defined by open source software and large natural language processing models, how can we inherit and carry forward the vision of MBP and realize the ideal of knowledge benefiting the world?

Let me share my thinking from the last one to two years. First, with the transformer model, we can use AI supercomputing power and massive data to train an excellent NLP model that helps the knowledge center deliver its existing content to users all over the world. Whether users work in the language of the content in the database or in their own language, we should provide them with reading and retrieval services, especially users who speak less widely used languages.

Training a good model requires multi-party cooperation to obtain enough data. To this end, we first address two problems: first, how to get enough labeled training data for low-resource languages; second, how to protect the privacy of data owners, so that their data cannot be illegally copied by other users while it is in use. We address the first problem with a bridge language, and the second with DPI and waterproof Fort technology. In addition, because many engineering science and technology materials are illustrated, and some also contain audio and video, we also need to support cross-media search.
The relevant technical details will be given in the report.