ssun32/clir_scripts

Used tools
----------
hanziconv: converts Traditional Chinese to Simplified Chinese
http://github.com/berniey/hanziconv

Stanford Segmenter: tokenizes Chinese text

filter.py
---------
Filter out ill-formed documents, such as pages in the Wikipedia:, Help:, and Portal: namespaces.
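
The filtering idea can be sketched as a title-prefix check. This is only an illustration, not the code in filter.py: the prefix list, the function names, and the assumption that each document is a (title, text) pair are all hypothetical.

```python
# Hypothetical namespace prefixes; the real list used by filter.py may differ.
NAMESPACE_PREFIXES = ("Wikipedia:", "Help:", "Portal:")


def is_ill_formed(title):
    """Return True if the title starts with a non-article namespace prefix."""
    return title.strip().startswith(NAMESPACE_PREFIXES)


def filter_documents(docs):
    """Keep only (title, text) pairs whose title looks like a regular article."""
    return [(title, text) for title, text in docs if not is_ill_formed(title)]
```

Passing a tuple to `str.startswith` checks all prefixes in one call, so extending the list needs no change to the logic.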

fix_query.py
------------
Regenerate the queries from the raw Wikipedia documents.
Previously, title words were not properly removed from the queries.
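
A minimal sketch of removing title words from a query, assuming both the query and the title are already tokenized into lists of strings (for Chinese, after segmentation). The function name and the case-insensitive matching are assumptions, not taken from fix_query.py.

```python
def remove_title_words(query_tokens, title_tokens):
    """Return the query tokens that do not occur in the title.

    Matching is case-insensitive (an assumption); token order is preserved.
    """
    title_set = {token.lower() for token in title_tokens}
    return [token for token in query_tokens if token.lower() not in title_set]
```

For example, `remove_title_words(["Beijing", "is", "the", "capital"], ["beijing"])` drops only the first token.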

fix_rel.py
----------
Scan all the rel and document files and remove the lines that refer to deleted Wikipedia articles.
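
The clean-up step might look like the sketch below. The file layout is an assumption: lines are taken to be tab-separated, with the article ID in a known column. The function name and the `id_column` parameter are hypothetical and do not come from fix_rel.py.

```python
def drop_deleted(lines, deleted_ids, id_column):
    """Return the lines whose article ID is not in deleted_ids.

    Assumes tab-separated fields with the article ID at index id_column.
    Lines with too few fields are kept unchanged.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > id_column and fields[id_column] in deleted_ids:
            continue  # this line refers to a deleted article
        kept.append(line)
    return kept
```

The same helper can serve both file types by changing `id_column`, for example 1 for a rel file laid out as query ID, article ID, relevance, and 0 for a document file that starts with the article ID.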
