Analyze NLP Methods for Extracting Data from ORCA Log Files as Alternative to Rule-Based Parsing #98

Open
SiriChandanaGarimella opened this issue Nov 11, 2024 · 1 comment

@SiriChandanaGarimella
Collaborator

Is your feature request related to a problem? Please describe.
Data extraction from ORCA computational chemistry log files currently relies on rule-based pattern matching. This approach has several limitations, including the constant maintenance the rules require and limited flexibility across different ORCA versions.
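
For context, a rule-based extractor of the kind described above might look roughly like the sketch below. The sample text is a hypothetical excerpt loosely modeled on ORCA output, and `ENERGY_RULE` and `extract_final_energy` are illustrative names, not this project's actual parser.

```python
import re

# Hypothetical excerpt, loosely modeled on an ORCA output file. Real logs
# may differ between ORCA versions, which is the maintenance problem.
SAMPLE_LOG = """\
-------------------------
FINAL SINGLE POINT ENERGY       -76.326543210987
-------------------------
"""

# One hand-written rule per quantity. The rule depends on the exact label,
# its casing, and the line layout, so any change in these breaks it.
ENERGY_RULE = re.compile(
    r"^FINAL SINGLE POINT ENERGY\s+(-?\d+\.\d+)\s*$",
    re.MULTILINE,
)


def extract_final_energy(log_text):
    """Return the last matching energy as a float, or None if the rule fails."""
    matches = ENERGY_RULE.findall(log_text)
    return float(matches[-1]) if matches else None
```

A label that differs only in casing or punctuation, such as `Final Single Point Energy: -76.3`, makes this rule return `None`, and every such variation needs a new or adjusted pattern.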

Describe the solution you'd like
We need to analyze NLP approaches that could make data extraction more robust and flexible:

  1. Text Chunking/Segmentation Methods
    • Evaluate algorithms for identifying section boundaries
    • Compare with the current rule-based approach
  2. Section Classification Methods
    • TF-IDF vs Word Embeddings
    • BERT vs simpler approaches
  3. Data Structure Recognition
    • Methods for table/coordinate data extraction
    • Accuracy requirements
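
As a rough illustration of the TF-IDF option for section classification, the sketch below labels a text chunk by its nearest training snippet under cosine similarity, using only the standard library. The training snippets, the label names, and all function names are assumptions made for this example. A real evaluation would use annotated chunks from actual ORCA logs and likely an established library.

```python
import math
import re
from collections import Counter

# Hypothetical labeled snippets, invented for illustration only.
TRAINING = [
    ("energy", "FINAL SINGLE POINT ENERGY -76.32"),
    ("energy", "Total Energy : -76.32 Eh"),
    ("geometry", "CARTESIAN COORDINATES (ANGSTROEM) O 0.000 0.000 0.117"),
    ("geometry", "CARTESIAN COORDINATES (A.U.) H 0.000 1.430 -0.887"),
]


def tokenize(text):
    # Keep lowercase alphabetic tokens; numeric values are dropped.
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vector(tokens, idf):
    # Term frequency times IDF, ignoring tokens unseen during training.
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def build_model(training):
    docs = [(label, tokenize(text)) for label, text in training]
    n = len(docs)
    # Document frequency: number of snippets containing each token.
    df = Counter(t for _, toks in docs for t in set(toks))
    # Smoothed IDF, so that no weight is zero.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [(label, tf_idf_vector(toks, idf)) for label, toks in docs]
    return idf, vectors


def classify(text, model):
    """Return the label of the most similar training snippet, or None."""
    idf, vectors = model
    query = tf_idf_vector(tokenize(text), idf)
    label, score = max(
        ((lbl, cosine(query, vec)) for lbl, vec in vectors),
        key=lambda pair: pair[1],
    )
    return label if score > 0 else None
```

With `model = build_model(TRAINING)`, a call such as `classify("SCF ENERGY converged", model)` falls into the `energy` class because it shares the token `energy` with those snippets, while a chunk with no known tokens yields `None`. Embedding or BERT-based classifiers would replace `tf_idf_vector` and `cosine` with learned representations, at the cost of heavier dependencies.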

Acceptance Criteria

  1. Analysis document comparing different NLP methods
  2. Recommendations for best approach
  3. Implementation considerations
@SiriChandanaGarimella
Collaborator Author

nlp-analysis.pdf

@kungfuchicken - Attached is my NLP analysis documentation. Please check and let me know if any changes are needed.
