Understanding TF-IDF, with a Java Implementation Example

Date: 2021-03-04 02:53

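[b]TF-IDF[/b]

[b]Preface[/b]

I recently went back over my old notes on TF-IDF and am publishing them here on the blog. Knowledge needs to be revisited regularly, otherwise it starts to feel unfamiliar.

[b]Understanding TF-IDF[/b]

TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. Its main idea is this: if a word or phrase appears with high frequency (TF) in one article and rarely appears in other articles, it is considered to discriminate well between classes and to be suitable for classification. TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency with which a term appears in document d. The main idea of IDF is that the fewer documents contain term t, that is, the smaller n is, the larger the IDF, which indicates that t discriminates well between classes.

Suppose m documents in some class C contain term t, and k documents in the other classes contain t. The total number of documents containing t is then n = m + k. When m is large, n is also large, so the IDF formula yields a small value, which suggests that t discriminates poorly. In practice, however, a term that appears frequently in the documents of one class represents the text of that class well. Such terms should receive a higher weight and be selected as feature words that distinguish the class from the others. This is the shortcoming of IDF.

TF formula: [img]http://files.jb51.net/file_images/article/201711/2017111683754320.png?2017101683814[/img]

In the formula above, [img]http://files.jb51.net/file_images/article/201711/2017111683901236.png?2017101683916[/img] is the number of occurrences of the term in document [img]http://files.jb51.net/file_images/article/201711/2017111683943213.png?2017101683956[/img], and the denominator is the total number of occurrences of all terms in document [img]http://files.jb51.net/file_images/article/201711/2017111684020662.png?2017101684036[/img].

IDF formula: [img]http://files.jb51.net/file_images/article/201711/2017111684108691.png?2017101684119[/img]

|D|: the total number of documents in the corpus.

[img]http://files.jb51.net/file_images/article/201711/2017111684207302.png?2017101684218[/img]: the number of documents that contain the term t_i (that is, the number of documents for which n_i,j is not 0). If the term does not occur in the corpus, the divisor is zero, so in general the following is used instead: [img]http://files.jb51.net/file_images/article/201711/2017111684420055.png?2017101684433[/img]

Then: [img]http://files.jb51.net/file_images/article/201711/2017111684500069.png?2017101684511[/img]

[b]TF-IDF Implementation (Java)[/b]

The implementation uses the external library IKAnalyzer-2012.jar for word segmentation. The code is as follows: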

package tfidf;
import java.io.*;
import java.util.*;
import org.wltea.analyzer.lucene.IKAnalyzer;
public class ReadFiles {
 // Absolute paths of every file found by readDirs.
 // The list is static, so repeated readDirs calls keep appending to it.
 private static ArrayList<String> FileList = new ArrayList<String>();
 //get list of file for the directory, including sub-directory of it
 public static List<String> readDirs(String filepath) throws FileNotFoundException, IOException
 {
  try
  {
   File file = new File(filepath);
   if(!file.isDirectory())
   {
    System.out.println("The input path should be a directory.");
    System.out.println("filepath:" + file.getAbsolutePath());
   } else
   {
    String[] flist = file.list();
    // list() returns null when the directory cannot be read
    for (int i = 0; flist != null && i < flist.length; i++)
    {
     // File(parent, child) uses the platform separator instead of a hard-coded "\\"
     File newfile = new File(filepath, flist[i]);
     if(!newfile.isDirectory())
     {
      FileList.add(newfile.getAbsolutePath());
     } else //if file is a directory, call readDirs recursively
     {
      readDirs(newfile.getPath());
     }
    }
   }
  }
  catch(FileNotFoundException e)
  {
   System.out.println(e.getMessage());
  }
  return FileList;
 }
 //read file
 public static String readFile(String file) throws FileNotFoundException, IOException
 {
  // StringBuilder is the unsynchronized counterpart of StringBuffer; no thread safety is needed here
  StringBuilder strSb = new StringBuilder();
  // try-with-resources closes the reader even if reading fails;
  // the byte stream is decoded to characters as GBK, as in the original listing
  try (BufferedReader br = new BufferedReader(
    new InputStreamReader(new FileInputStream(file), "gbk"))) {
   String line = br.readLine();
   while(line != null){
    strSb.append(line).append("\r\n");
    line = br.readLine();
   }
  }
  return strSb.toString();
 }
 //word segmentation
 public static ArrayList<String> cutWords(String file) throws IOException{
  String text = ReadFiles.readFile(file);
  IKAnalyzer analyzer = new IKAnalyzer();
  // split(String) is called exactly as in the original article; check that the
  // IKAnalyzer jar on your classpath actually provides this method.
  ArrayList<String> words = analyzer.split(text);
  return words;
 }
 //term frequency in a file, times for each word
 public static HashMap<String, Integer> normalTF(ArrayList<String> cutwords){
  HashMap<String, Integer> resTF = new HashMap<String, Integer>();
  for (String word : cutwords){
   if(resTF.get(word) == null){
    resTF.put(word, 1);
    System.out.println(word);
   } else{
    resTF.put(word, resTF.get(word) + 1);
    System.out.println(word.toString());
   }
  }
  return resTF;
 }
 //term frequency in a file, frequency of each word
 public static HashMap<String, Float> tf(ArrayList<String> cutwords){
  HashMap<String, Float> resTF = new HashMap<String, Float>();
  int wordLen = cutwords.size();
  HashMap<String, Integer> intTF = ReadFiles.normalTF(cutwords);
  // tf = (occurrences of the word) / (total number of words in the file)
  for (Map.Entry<String, Integer> entry : intTF.entrySet()){
   float value = entry.getValue().floatValue() / wordLen;
   resTF.put(entry.getKey(), value);
   System.out.println(entry.getKey() + " = " + value);
  }
  return resTF;
 }
 //tf times for file
 public static HashMap<String, HashMap<String, Integer>> normalTFAllFiles(String dirc) throws IOException{
  HashMap<String, HashMap<String, Integer>> allNormalTF = new HashMap<String, HashMap<String,Integer>>();
  List<String> filelist = ReadFiles.readDirs(dirc);
  for (String file : filelist){
   HashMap<String, Integer> dict = new HashMap<String, Integer>();
   ArrayList<String> cutwords = ReadFiles.cutWords(file);
   //get cut word for one file
   dict = ReadFiles.normalTF(cutwords);
   allNormalTF.put(file, dict);
  }
  return allNormalTF;
 }
 //tf for all file
 public static HashMap<String,HashMap<String, Float>> tfAllFiles(String dirc) throws IOException{
  HashMap<String, HashMap<String, Float>> allTF = new HashMap<String, HashMap<String, Float>>();
  List<String> filelist = ReadFiles.readDirs(dirc);
  for (String file : filelist){
   ArrayList<String> cutwords = ReadFiles.cutWords(file);
   //get cut words for one file
   HashMap<String, Float> dict = ReadFiles.tf(cutwords);
   allTF.put(file, dict);
  }
  return allTF;
 }
 public static HashMap<String, Float> idf(HashMap<String,HashMap<String, Float>> all_tf){
  HashMap<String, Float> resIdf = new HashMap<String, Float>();
  // dict maps each word to the number of files that contain it
  HashMap<String, Integer> dict = new HashMap<String, Integer>();
  int docNum = FileList.size();
  for (int i = 0; i < docNum; i++){
   HashMap<String, Float> temp = all_tf.get(FileList.get(i));
   for (String word : temp.keySet()){
    if(dict.get(word) == null){
     dict.put(word, 1);
    } else {
     dict.put(word, dict.get(word) + 1);
    }
   }
  }
  System.out.println("IDF for every word is:");
  for (Map.Entry<String, Integer> entry : dict.entrySet()){
   // idf = ln(total files / files containing the word); the count is at least 1 here
   float value = (float)Math.log(docNum / entry.getValue().floatValue());
   resIdf.put(entry.getKey(), value);
   System.out.println(entry.getKey() + " = " + value);
  }
  return resIdf;
 }
 public static void tf_idf(HashMap<String,HashMap<String, Float>> all_tf,HashMap<String, Float> idfs){
  HashMap<String, HashMap<String, Float>> resTfIdf = new HashMap<String, HashMap<String, Float>>();
  int docNum = FileList.size();
  for (int i = 0; i < docNum; i++){
   String filepath = FileList.get(i);
   HashMap<String, Float> tfidf = new HashMap<String, Float>();
   HashMap<String, Float> temp = all_tf.get(filepath);
   for (Map.Entry<String, Float> entry : temp.entrySet()){
    String word = entry.getKey();
    // tf-idf = tf of the word in this file * idf of the word
    float value = entry.getValue() * idfs.get(word);
    tfidf.put(word, value);
   }
   resTfIdf.put(filepath, tfidf);
  }
  System.out.println("TF-IDF for Every file is :");
  DisTfIdf(resTfIdf);
 }
 public static void DisTfIdf(HashMap<String, HashMap<String, Float>> tfidf){
  for (Map.Entry<String, HashMap<String, Float>> entrys : tfidf.entrySet()){
   System.out.println("FileName: " + entrys.getKey());
   System.out.print("{");
   for (Map.Entry<String, Float> entry : entrys.getValue().entrySet()){
    System.out.print(entry.getKey() + " = " + entry.getValue() + ", ");
   }
   System.out.println("}");
  }
 }
 public static void main(String[] args) throws IOException {
  // directory that holds the text files to analyze
  String file = "D:/testfiles";
  HashMap<String,HashMap<String, Float>> all_tf = tfAllFiles(file);
  System.out.println();
  HashMap<String, Float> idfs = idf(all_tf);
  System.out.println();
  tf_idf(all_tf, idfs);
 }
}
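
For reference, the formulas that the theory part of this article shows only as images can also be written out in text. The block below is a reconstruction, assuming the images show the standard definitions implied by the symbol descriptions (n_{i,j} as the count of term t_i in document d_j, and |D| as the corpus size). It is not copied from the images themselves.

```latex
\begin{align}
\mathrm{tf}_{i,j} &= \frac{n_{i,j}}{\sum_{k} n_{k,j}} \\
\mathrm{idf}_{i} &= \log \frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right|} \\
\mathrm{idf}_{i} &= \log \frac{|D|}{1 + \left|\{\, j : t_i \in d_j \,\}\right|}
  \quad \text{(smoothed form, avoids a zero divisor)} \\
\mathrm{tfidf}_{i,j} &= \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}
\end{align}
```

The ReadFiles listing above uses the unsmoothed form with the natural logarithm (Math.log). That is safe there because every word it scores comes from at least one file, so the divisor is never zero.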

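The ReadFiles listing depends on IKAnalyzer and on files under D:/testfiles, so it is hard to try in isolation. The following is a minimal, dependency-free sketch of the same TF, IDF and TF-IDF steps on an in-memory corpus. The class TfIdfSketch and all of its method names are hypothetical and not part of the article's code. It also assumes that tokens are separated by whitespace, which stands in for IKAnalyzer's segmentation and is not suitable for Chinese text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the article's ReadFiles class.
final class TfIdfSketch {

    // Assumption: whitespace-separated tokens (the article uses IKAnalyzer instead).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // tf(t, d) = occurrences of t in d / total number of tokens in d
    static Map<String, Double> tf(List<String> tokens) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : tokens) tf.merge(t, 1.0, Double::sum);
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            e.setValue(e.getValue() / tokens.size());
        }
        return tf;
    }

    // idf(t) = ln(|D| / df(t)), where df(t) is the number of documents containing t
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // a set, so each document counts a term at most once
            for (String t : new HashSet<>(doc)) df.merge(t, 1, Integer::sum);
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    // Smoothed variant ln(|D| / (1 + df)), usable for a term absent from the corpus (df = 0).
    static double smoothedIdf(int docCount, int df) {
        return Math.log((double) docCount / (1 + df));
    }

    // One map of term -> tf * idf per document, in corpus order.
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        Map<String, Double> idf = idf(docs);
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Double> e : tf(doc).entrySet()) {
                weights.put(e.getKey(), e.getValue() * idf.get(e.getKey()));
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            tokenize("apple banana apple"),
            tokenize("banana cherry"));
        for (Map<String, Double> weights : tfIdf(docs)) {
            System.out.println(weights);
        }
    }
}
```

In this two-document corpus, "banana" occurs in both documents, so its IDF is ln(2/2) = 0 and its TF-IDF weight is 0 everywhere. "apple" occurs only in the first document, where it gets (2/3) * ln 2, roughly 0.462. This is the behavior the theory section describes: terms concentrated in few documents receive the higher weight.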
[b]The output of the ReadFiles program is shown below:[/b] [img]http://files.jb51.net/file_images/article/201711/2017111684828709.jpg?2017101684840[/img]

Common problems

The lucene jar has not been added to the project: [img]http://files.jb51.net/file_images/article/201711/2017111684910923.png?2017101684927[/img]

The versions of the lucene jar and the je jar are incompatible: [img]http://files.jb51.net/file_images/article/201711/2017111684949153.png?201710168501[/img]

[b]Summary[/b]

That is all for this article on understanding TF-IDF and its Java implementation. I hope it is helpful. If you find any shortcomings, please leave a comment.