A material named entity recognition model using domain embedded bi-lstm and hybrid cosine with wordvector similarity approach
| Main Author: | |
|---|---|
| Format: | Thesis |
| Language: | English |
| Published: | 2024 |
| Subjects: | |
| Online Access: | http://umpir.ump.edu.my/id/eprint/44642/ http://umpir.ump.edu.my/id/eprint/44642/1/A%20material%20named%20entity%20recognition%20model%20using%20domain%20embedded%20bi-lstm%20and%20hybrid%20cosine%20with%20wordvector%20similarity%20approach.pdf |
| Summary: | In recent years, the field of energy storage has witnessed a significant surge in research activity. Concurrently, there is a growing demand for the exploration of materials and processes employed in supercapacitors. Given the multitude of available materials and processes, it is difficult to experiment with each one in a short period of time. Nonetheless, researchers employ diverse approaches in exploring the processes and publish their findings within the same domain, thereby providing a valuable source of information. However, extracting and identifying these different approaches and their results from the vast corpus of published scientific articles poses a formidable challenge. The challenges include variations in reported data formats, an inability to keep pace with the growth in publication volume, limited domain knowledge, insufficient semantic representation for selecting relevant articles, and unreliable named entity disambiguation. Existing work in knowledge extraction has its own limitations. Rule- and dictionary-based techniques rely on static, fixed information, which prevents them from dynamically extracting new data. Deep learning-based approaches, in turn, rely on generic embeddings that lack the specific detail needed for the domain of interest, and they struggle to extract nested named entities accurately. Therefore, this research proposes an automated knowledge extraction process that employs deep learning to recognize materials, processes, values, and the relationships between entities and their associated values in scientific articles. The aim is to develop a knowledge extraction process that incorporates a hybrid similarity approach, combining cosine similarity over word vectors with domain-specific keywords, and a domain-embedded Bidirectional Long Short-Term Memory (Bi-LSTM) model. 
The process begins by collecting scientific articles in PDF format and converting them into plain text. Relevant domain-specific articles are then selected using the hybrid similarity technique. Subsequently, a domain-embedded Bi-LSTM Named Entity Recognition (NER) model is developed to extract material named entities from the scientific articles. Four datasets are used in the experiments: three curated by domain experts, namely the Electric Double Layer Capacitor (EDLC), drug-disease, and target-precursor datasets, and one NER benchmark, the Groningen Meaning Bank (GMB) dataset. The proposed model, called matRec, is evaluated using Precision, Recall, and F1 score. The matRec model achieves an F1 score of 96% for entity recognition in the EDLC domain, surpassing other deep neural network models such as BERT and sciBERT. The model’s strong performance extends to other domains, with F1 scores of 98% and 99% on the drug-disease and target-precursor datasets, respectively, and 99% on the benchmark dataset. The proposed approach sustains its performance across different scientific domains because it identifies relevant domain-specific articles and applies the domain-embedded Bi-LSTM model to the full text of the articles. It also addresses the limitations inherent in manual and existing knowledge extraction methods. This adaptability highlights the potential of the matRec model for efficient knowledge extraction in diverse scientific fields. |
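The article-selection step in the summary combines cosine similarity over word vectors with domain-specific keywords, but the record does not say how the two signals are merged. The following is a minimal sketch of one plausible reading, not the thesis's actual method. The weighted sum, the `alpha` parameter, the toy three-dimensional vectors, the keyword set, and all function names are assumptions made for illustration.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values; a real pipeline
# would load embeddings trained on a domain corpus).
WORD_VECTORS = {
    "supercapacitor": [0.9, 0.1, 0.0],
    "electrode": [0.8, 0.2, 0.1],
    "carbon": [0.7, 0.3, 0.0],
    "capacitance": [0.9, 0.0, 0.1],
    "patient": [0.0, 0.9, 0.2],
    "dose": [0.1, 0.8, 0.3],
}

# Hypothetical domain keyword list for the supercapacitor domain.
DOMAIN_KEYWORDS = {"supercapacitor", "electrode", "capacitance"}


def mean_vector(tokens):
    """Average the vectors of known tokens; return None if none are known."""
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_score(doc_tokens, alpha=0.5):
    """Assumed hybrid: alpha * vector similarity + (1 - alpha) * keyword coverage."""
    doc_vec = mean_vector(doc_tokens)
    domain_vec = mean_vector(sorted(DOMAIN_KEYWORDS))
    vec_sim = cosine(doc_vec, domain_vec) if doc_vec and domain_vec else 0.0
    # Fraction of domain keywords that appear verbatim in the document.
    coverage = len(DOMAIN_KEYWORDS & set(doc_tokens)) / len(DOMAIN_KEYWORDS)
    return alpha * vec_sim + (1 - alpha) * coverage


on_topic = ["carbon", "electrode", "capacitance"]
off_topic = ["patient", "dose"]
print(hybrid_score(on_topic), hybrid_score(off_topic))
```

In this sketch an article would be kept when its score exceeds a chosen threshold. The keyword term rewards exact domain vocabulary, while the vector term can still credit related words such as "carbon" that are not on the keyword list.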