public abstract class ASegment extends Object implements ISegment
| Modifier and Type | Field and Description |
|---|---|
| protected String | behindLatin: The global Latin word that follows the CJK word. Added on 2016/11/22 for a better mixed-word implementation. |
| protected JcsegTaskConfig | config |
| protected int | ctrlMask: Control mask for the segmentation runtime functions. |
| protected ADictionary | dic: The dictionary and task configuration instance. |
| protected IntArrayList | ialist |
| protected int | idx: The index of the current position in the input stream, mainly used to track the start position of each token. |
| protected IStringBuffer | isb |
| protected IPushbackReader | reader |
| protected LinkedList<IWord> | wordPool: CJK word cache pool, reusable string buffer, and the array list for basic integers. |
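
The `ctrlMask` field is described only as a control mask for runtime functions. A minimal sketch of how such a mask is commonly handled, assuming it is used as a set of bit flags: the class name, flag names, and bit values below are hypothetical and are not the constants defined by Jcseg.

```java
// Illustrative only: FLAG_A, FLAG_B and FLAG_C are hypothetical names
// with assumed bit values; they are not the Jcseg mask constants.
class ControlMaskSketch {
    static final int FLAG_A = 1 << 0;
    static final int FLAG_B = 1 << 1;
    static final int FLAG_C = 1 << 2;

    // All flags start cleared.
    private int ctrlMask = 0;

    // Turn a flag on without touching the other bits.
    void enable(int flag) { ctrlMask |= flag; }

    // Turn a flag off without touching the other bits.
    void disable(int flag) { ctrlMask &= ~flag; }

    // A flag is enabled when its bit is set in the mask.
    boolean isEnabled(int flag) { return (ctrlMask & flag) != 0; }

    int mask() { return ctrlMask; }

    public static void main(String[] args) {
        ControlMaskSketch s = new ControlMaskSketch();
        s.enable(FLAG_A);
        s.enable(FLAG_C);
        s.disable(FLAG_A);
        System.out.println(s.isEnabled(FLAG_A)); // prints false
        System.out.println(s.isEnabled(FLAG_C)); // prints true
    }
}
```

Packing several on/off switches into one `int` keeps the per-token checks cheap, which may be why a single mask field is used here.
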
Fields inherited from interface ISegment: CHECK_CE_MASk, CHECK_CF_MASK, CHECK_EC_MASK, START_SS_MASK

| Constructor and Description |
|---|
| ASegment(JcsegTaskConfig config, ADictionary dic) |
| ASegment(Reader input, JcsegTaskConfig config, ADictionary dic): Initializes the segment. |
| Modifier and Type | Method and Description |
|---|---|
| protected void | appendLatinSyn(IWord w): Checks for and appends the synonyms of the specified word, covering both CJK and basic Latin words. All synonyms share the same position, part of speech, and word type as the original word. |
| protected void | appendWordFeatures(IWord word): Checks for and appends the pinyin and the synonyms of the specified word. |
| protected IWord | enSecondSeg(IWord w, boolean retfw): Performs a secondary split of the specified complex Latin word. This splits a word composed of English letters, Arabic numerals, and punctuation into multiple simple parts; for example, 'qq2013' is split into 'qq' and '2013'. |
| protected String | findCHName(char[] chars, int index, IChunk chunk): Finds a Chinese name starting from the current position of the input chars. |
| boolean | findCHName(IWord w, IChunk chunk): Deprecated. |
| protected abstract IChunk | getBestCJKChunk(char[] chars, int index): An abstract method to get a CJK word from the current position. |
| JcsegTaskConfig | getConfig(): Gets the current task configuration instance. |
| ADictionary | getDict(): Gets the current dictionary instance. |
| protected IWord | getNextCJKWord(int c, int pos): Gets the next CJK word from the current position of the input stream. |
| protected IWord | getNextLatinWord(int c, int pos): Gets the next Latin word from the current position of the input stream. |
| protected IWord[] | getNextMatch(char[] chars, int index): Matches the next CJK word in the dictionary. |
| protected IWord | getNextMixedWord(char[] chars, int cjkidx): Gets the next mixed word, such as CJK-English, CJK-English-CJK, or any other combination. |
| protected IWord | getNextPunctuationPairWord(int c, int pos): Gets the next punctuation pair word from the current position of the input stream. |
| protected String | getPairPunctuationText(int c): Finds the matching punctuation for the given punctuation char in order to get the text between the pair. |
| int | getStreamPosition(): Gets the current length of the stream. |
| IWord | next(): Segments a word from a char array, starting from a specified position. |
| protected char[] | nextCJKSentence(int c): Loads a CJK char list from the stream, starting from the current position and continuing until a char is not a CJK char. |
| protected String | nextCNNumeric(char[] chars, int index): Finds the Chinese number starting from the current position, counting until the char at the specified position is whitespace or is not an other number. |
| protected String | nextLatinString(int c): A simple version of the basic Latin fetch logic that returns the next Latin string together with the kept punctuation after it. |
| protected IWord | nextLatinWord(int c, int pos): Finds the letter or digit word starting from the current position, counting until the char is whitespace or is not a letter or digit. |
| protected String | nextLetterNumber(int c): Finds the letter number starting from the current position, counting until the char at the specified position is whitespace or is not a letter number. |
| protected String | nextOtherNumber(int c): Finds the other number starting from the current position, counting until the char at the specified position is whitespace or is not an other number. |
| protected void | pushBack(int data): Pushes the data back to the stream. |
| protected void | pushBack(String str): Pushes a string back to the stream. |
| protected int | readNext(): Reads the next char from the current position. |
| void | reset(Reader input): Resets the input stream and reader. |
| void | setConfig(JcsegTaskConfig config): Sets the current task configuration instance. |
| void | setDict(ADictionary dic): Sets the current dictionary. |
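
The secondary split documented for enSecondSeg ('qq2013' becomes 'qq' and '2013') can be illustrated with a small standalone sketch. This is not the Jcseg implementation: the class and method names are hypothetical, it works on plain strings rather than IWord instances, and it assumes a part boundary falls wherever the character class (digit, letter, other) changes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jcseg.
class SecondarySplitSketch {
    // Character classes used to decide where one part ends and the next begins.
    private static int classOf(char ch) {
        if (Character.isDigit(ch)) return 0;  // digits
        if (Character.isLetter(ch)) return 1; // letters
        return 2;                             // everything else, e.g. punctuation
    }

    /** Splits a word wherever the character class changes. */
    static List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        if (word.isEmpty()) return parts;
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            if (classOf(word.charAt(i)) != classOf(word.charAt(i - 1))) {
                parts.add(word.substring(start, i));
                start = i;
            }
        }
        parts.add(word.substring(start)); // the trailing part
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("qq2013")); // prints [qq, 2013]
    }
}
```

The real method also assigns each sub-word the type and part of speech of the original word, which this sketch leaves out.
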
protected int idx
protected IPushbackReader reader
protected LinkedList<IWord> wordPool
protected IStringBuffer isb
protected IntArrayList ialist
protected String behindLatin
protected int ctrlMask
protected ADictionary dic
protected JcsegTaskConfig config
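
The `reader` and `idx` fields work together with readNext() and pushBack(int): chars are read one at a time, may be pushed back, and the index tracks the current stream position. A minimal sketch of that pattern using the JDK's `java.io.PushbackReader` rather than Jcseg's IPushbackReader; the class name, the pushback capacity parameter, and the starting index of -1 are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical wrapper, not part of Jcseg.
class PositionTrackingReaderSketch {
    private final PushbackReader reader;
    // Index of the most recently read char; -1 before anything is read (assumed).
    private int idx = -1;

    PositionTrackingReaderSketch(Reader input, int pushbackCapacity) {
        this.reader = new PushbackReader(input, pushbackCapacity);
    }

    // Reads the next char and advances the index, unless the stream has ended.
    int readNext() throws IOException {
        int c = reader.read();
        if (c != -1) idx++;
        return c;
    }

    // Returns one char to the stream and steps the index back.
    void pushBack(int data) throws IOException {
        reader.unread(data);
        idx--;
    }

    int position() { return idx; }

    public static void main(String[] args) throws IOException {
        PositionTrackingReaderSketch r =
                new PositionTrackingReaderSketch(new StringReader("ab"), 4);
        r.readNext();                     // reads 'a', index is now 0
        int b = r.readNext();             // reads 'b', index is now 1
        r.pushBack(b);                    // 'b' goes back, index is 0 again
        System.out.println(r.position()); // prints 0
    }
}
```

Keeping the index in step with every read and pushback is what lets a tokenizer report the start position of each token it emits.
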

public ASegment(Reader input, JcsegTaskConfig config, ADictionary dic) throws IOException
Parameters: input; config - Jcseg task configuration instance; dic - Jcseg dictionary instance
Throws: IOException

public ASegment(JcsegTaskConfig config, ADictionary dic) throws IOException
Throws: IOException
See Also: ASegment(Reader, JcsegTaskConfig, ADictionary)

public void reset(Reader input) throws IOException
Specified by: reset in interface ISegment
Parameters: input
Throws: IOException

protected int readNext() throws IOException
Throws: IOException

protected void pushBack(int data) throws IOException
Parameters: data
Throws: IOException

protected void pushBack(String str)
Parameters: str

public int getStreamPosition()
Specified by: getStreamPosition in interface ISegment

public void setDict(ADictionary dic)
Parameters: dic

public ADictionary getDict()

public void setConfig(JcsegTaskConfig config)
Parameters: config

public JcsegTaskConfig getConfig()

public IWord next() throws IOException
Specified by: next in interface ISegment
Throws: IOException
See Also: ISegment.next()

protected IWord getNextCJKWord(int c, int pos) throws IOException
Parameters: c, pos
Throws: IOException

protected IWord getNextLatinWord(int c, int pos) throws IOException
Parameters: c, pos
Throws: IOException

protected IWord getNextMixedWord(char[] chars, int cjkidx) throws IOException
Parameters: chars, cjkidx
Throws: IOException

protected IWord getNextPunctuationPairWord(int c, int pos) throws IOException
Parameters: c, pos
Throws: IOException

protected void appendWordFeatures(IWord word)
Parameters: word

protected void appendLatinSyn(IWord w)
Parameters: w

protected IWord enSecondSeg(IWord w, boolean retfw)
Performs a secondary split of the specified complex Latin word. This splits a word composed of English letters, Arabic numerals, and punctuation into multiple simple parts; for example, 'qq2013' is split into 'qq' and '2013'.
All sub-words share the same type and part of speech as the original word. Check config.EN_SECOND_SEG before invoking this method.
Parameters: w; retfw - whether to return the fword

protected IWord[] getNextMatch(char[] chars, int index)
Parameters: chars, index

protected String findCHName(char[] chars, int index, IChunk chunk)
Parameters: chars, index, chunk

@Deprecated
public boolean findCHName(IWord w, IChunk chunk)
Parameters: chunk - the best chunk

protected char[] nextCJKSentence(int c) throws IOException
Parameters: c
Throws: IOException

protected IWord nextLatinWord(int c, int pos) throws IOException
Parameters: c, pos
Throws: IOException

protected String nextLatinString(int c) throws IOException
Parameters: c
Throws: IOException

protected String nextLetterNumber(int c) throws IOException
Parameters: c
Throws: IOException

protected String nextOtherNumber(int c) throws IOException
Parameters: c
Throws: IOException

protected String nextCNNumeric(char[] chars, int index) throws IOException
Parameters: chars - char array of CJK items; index
Throws: IOException

protected String getPairPunctuationText(int c) throws IOException
Parameters: c
Throws: IOException

protected abstract IChunk getBestCJKChunk(char[] chars, int index) throws IOException
Parameters: chars, index
Throws: IOException

Copyright © 2017. All Rights Reserved.