Session: 17-01-01: Research Posters
Paper Number: 149312
149312 - Leveraging Large Language Models and Knowledge Graphs for Enhanced Technical Document Classification
i) Introduction, Motivation, and Innovative Contribution
Extracting and classifying information from technical documents is a critical task in many industries, particularly for managing the design phase of product development. Automated information retrieval from such documents has traditionally relied on conventional Natural Language Processing (NLP) techniques, including Named Entity Recognition (NER) and custom pre-built models. These approaches have well-known drawbacks. They typically require extensive training data specific to each domain or document type, which makes them resource-intensive to develop and maintain. The dependence on large, carefully curated datasets often yields models that are inflexible, generalize poorly across different types of technical documents, and adapt slowly to new domains. Their performance can also be inconsistent when faced with the diverse and highly specialized vocabulary of technical documentation. These limitations have historically constrained the adoption of automated information extraction in industries that handle complex and varied technical documents. This research proposes a methodology that leverages Large Language Models (LLMs), combined with Knowledge Graphs, for the automated classification of technical documents, with the aim of addressing these challenges and improving the accuracy and consistency of information extraction.
ii) Methodology
The proposed approach integrates LLMs with Knowledge Graphs through a multi-phase process. In the pre-processing phase, the content of each technical document is extracted as plain text. This text is then passed to a fine-tuned LLM via a carefully designed prompt. The LLM is trained specifically to return structured output, which reduces variability across responses and mitigates the inherently stochastic behavior of language models. In the verification phase, a Knowledge Graph is queried to check the correctness and consistency of the LLM's output. This step acts as a control mechanism that prevents inconsistencies and redundancies in the extracted data. The process ends with validated and enriched specifications in a structured format, ready for use in subsequent workflows. A key innovation of this methodology is that the extracted, validated information can be integrated directly into Product Lifecycle Management (PLM) systems. This integration streamlines the retrieval and use of technical information from documents and improves the efficiency of information management in product development processes.
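The phases described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the triple-based graph, the property names, the prompt wording, and the functions `build_prompt`, `call_llm`, `verify`, and `classify_document` are all hypothetical, and `call_llm` is a stub that returns a canned answer in place of a real fine-tuned model.

```python
import json

# Hypothetical knowledge graph, stored as (subject, predicate, object) triples.
KNOWLEDGE_GRAPH = {
    ("Bolt", "has_property", "thread_size"),
    ("Bolt", "has_property", "material"),
    ("thread_size", "allowed_unit", "mm"),
    ("material", "allowed_value", "steel"),
    ("material", "allowed_value", "aluminium"),
}


def objects_of(subject, predicate):
    """Return every object linked to `subject` by `predicate` in the graph."""
    return {o for s, p, o in KNOWLEDGE_GRAPH if s == subject and p == predicate}


def build_prompt(document_text):
    """Wrap the pre-processed plain text in a prompt that requests JSON only."""
    return (
        "Extract the component and its specifications from the text below. "
        'Answer with JSON of the form {"component": str, "specs": '
        '[{"property": str, "value": str, "unit": str or null}]}.\n\n'
        + document_text
    )


def call_llm(prompt):
    """Stand-in for the fine-tuned LLM (assumption): returns a canned answer."""
    return json.dumps({
        "component": "Bolt",
        "specs": [
            {"property": "thread_size", "value": "8", "unit": "mm"},
            {"property": "material", "value": "steel", "unit": None},
            {"property": "colour", "value": "red", "unit": None},
        ],
    })


def verify(record):
    """Verification phase: keep only specs that are consistent with the graph."""
    known = objects_of(record["component"], "has_property")
    validated, rejected, seen = [], [], set()
    for spec in record["specs"]:
        prop = spec["property"]
        units = objects_of(prop, "allowed_unit")
        values = objects_of(prop, "allowed_value")
        if prop not in known:
            reason = "unknown property"
        elif prop in seen:
            reason = "duplicate property"
        elif units and spec["unit"] not in units:
            reason = "unit not allowed"
        elif values and spec["value"] not in values:
            reason = "value not allowed"
        else:
            reason = None
        if reason is None:
            seen.add(prop)
            validated.append(spec)
        else:
            rejected.append({"spec": spec, "reason": reason})
    return validated, rejected


def classify_document(document_text):
    """Run the pipeline: prompt, structured LLM output, graph verification."""
    record = json.loads(call_llm(build_prompt(document_text)))
    validated, rejected = verify(record)
    return {"component": record["component"], "specs": validated, "rejected": rejected}


result = classify_document("M8 steel bolt, red finish")
print(json.dumps(result, indent=2))
```

In this sketch the graph plays two roles at once: it defines which properties a component may carry, and it constrains the units and values those properties may take. Rejected entries are returned alongside the validated ones rather than silently dropped, so that a reviewer or a later PLM import step could inspect them.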
iii) Preliminary Results and Conclusions
Preliminary results indicate that pre-trained language models, when properly fine-tuned, require minimal additional resources to perform specific tasks in technical document classification effectively. Knowledge Graphs play a crucial role in maintaining data integrity and consistency throughout the process: they provide a robust structure for interpreting domain-specific terms, while the LLM contributes flexibility and advanced NLP capabilities. The findings also point to several areas for future development. LLM fine-tuning techniques need to be improved to produce more structured outputs tailored to technical documentation. Developing and implementing ontologies specific to particular products or domains is expected to further improve classification accuracy. The efficient integration of extracted data into PLM workflows remains a priority for future research and development. Overall, the study represents an advance in the application of AI technologies to engineering documentation processes. By streamlining the extraction, validation, and integration of technical information from documents directly into PLM systems, the approach has the potential to reduce manual effort, minimize errors, and accelerate product development cycles.
Presenting Author: Alessandro Stefanone Politecnico di Milano
Presenting Author Biography: PhD student at Politecnico di Milano. He completed his MSc in Mechanical Engineering at Politecnico di Milano in 2023 with a focus on Virtual Prototyping. Since September 2023, his main research interests have been at the intersection of product development processes and artificial intelligence algorithms.
Authors:
Alessandro Stefanone Politecnico di Milano
Paper Type
Poster Presentation