Analyze NLP Methods for Extracting Data from ORCA Log Files as Alternative to Rule-Based Parsing #98

SiriChandanaGarimella · 2024-11-11T17:12:52Z

Is your feature request related to a problem? Please describe.
Data extraction from ORCA computational chemistry log files currently relies on rule-based pattern matching. This approach has several limitations, including constant maintenance of the rules and limited flexibility across different ORCA versions.
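
To make the limitation concrete, here is a minimal sketch of what a rule-based extractor typically looks like. It is not the project's actual parser: `ENERGY_RULE` and `extract_final_energy` are hypothetical names, and the log excerpt and its energy value are made up for illustration.

```python
import re

# Hypothetical rule: expects the label exactly as written, followed by a decimal number.
ENERGY_RULE = re.compile(
    r"^FINAL SINGLE POINT ENERGY\s+(-?\d+\.\d+)\s*$", re.MULTILINE
)


def extract_final_energy(log_text):
    """Return the last final single point energy found (Hartree), or None."""
    matches = ENERGY_RULE.findall(log_text)
    return float(matches[-1]) if matches else None


# Made-up excerpt in the style of an ORCA log.
sample = """\
-------------------------   --------------------
FINAL SINGLE POINT ENERGY       -76.326512345678
-------------------------   --------------------
"""

# Invented label change, standing in for a formatting difference between versions.
variant = sample.replace("FINAL SINGLE POINT ENERGY", "FINAL SINGLE-POINT ENERGY")

print(extract_final_energy(sample))   # value is found
print(extract_final_energy(variant))  # rule no longer matches, so nothing is extracted
```

The second call fails silently, which is the maintenance burden in a nutshell: every wording or layout change needs a new or updated rule.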

Describe the solution you'd like
We need to analyze NLP approaches that could provide more robust and flexible data extraction:

Text Chunking/Segmentation Methods
- Evaluate algorithms for identifying section boundaries
- Compare with the current rule-based approach
Section Classification Methods
- TF-IDF vs Word Embeddings
- BERT vs simpler approaches
Data Structure Recognition
- Methods for table/coordinate data extraction
- Accuracy requirements
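
As a starting point for the comparison, the chunking and section classification items above could be prototyped roughly as follows. This is only a sketch under several assumptions: chunks are split on blank lines (a naive stand-in for real boundary detection), TF-IDF is hand-rolled with the standard library instead of an NLP library, and the `REFERENCE` snippets, labels, threshold, and sample log are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical labeled reference text, one snippet per section type.
REFERENCE = {
    "coordinates": "CARTESIAN COORDINATES ANGSTROEM C H O N atom x y z",
    "scf": "SCF ITERATIONS ITER Energy Delta-E Max-DP RMS-DP convergence",
    "frequencies": "VIBRATIONAL FREQUENCIES cm**-1 normal modes",
}


def tokenize(text):
    """Lowercase alphabetic tokens only; numbers and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency over tokenized documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector; tokens outside the reference vocabulary are ignored."""
    tf = Counter(tok for tok in tokens if tok in idf)
    return {tok: count * idf[tok] for tok, count in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def split_chunks(log_text):
    """Naive segmentation: split on blank lines."""
    return [c.strip() for c in re.split(r"\n\s*\n", log_text) if c.strip()]


def classify(chunk, ref_vectors, idf, threshold=0.1):
    """Return the best-matching label, or "unknown" below the similarity threshold."""
    vec = vectorize(tokenize(chunk), idf)
    label, score = max(
        ((lbl, cosine(vec, rv)) for lbl, rv in ref_vectors.items()),
        key=lambda pair: pair[1],
    )
    return label if score >= threshold else "unknown"


ref_tokens = {lbl: tokenize(text) for lbl, text in REFERENCE.items()}
idf = build_idf(list(ref_tokens.values()))
ref_vectors = {lbl: vectorize(toks, idf) for lbl, toks in ref_tokens.items()}

# Made-up excerpt in the style of an ORCA log.
log = """\
CARTESIAN COORDINATES (ANGSTROEM)
---------------------------------
  O      0.000000    0.000000    0.117000
  H      0.000000    0.757000   -0.469000
  H      0.000000   -0.757000   -0.469000

SCF ITERATIONS
--------------
ITER       Energy         Delta-E
  0   -76.0000000000   0.000000e+00
"""

for chunk in split_chunks(log):
    print(classify(chunk, ref_vectors, idf), "<-", chunk.splitlines()[0])
```

The same harness could be reused to score word embeddings or a BERT-based classifier against the rule-based baseline by swapping out `vectorize` and `cosine`.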

Acceptance Criteria

- Analysis document comparing different NLP methods
- Recommendations for the best approach
- Implementation considerations

SiriChandanaGarimella · 2024-12-17T06:09:26Z

nlp-analysis.pdf

@kungfuchicken - Attached is my NLP analysis documentation. Please check and let me know if any changes are needed.

SiriChandanaGarimella self-assigned this Nov 11, 2024

SiriChandanaGarimella mentioned this issue Nov 11, 2024

Implement NLP-Based Data Extraction System for ORCA Log Files #99

Open

SiriChandanaGarimella mentioned this issue Dec 17, 2024

Added NLP-Based Section Matching and Data Extraction Logic Proof of Concept #125

Open
