Zheqiao Chen's Personal Website

zh-late-chunking: Late Chunking for Chinese

2025-03-15T00:00:00+00:00

Since reading this article, Late Chunking in Long-Context Embedding Models, I have been very interested in creating a Chinese version of it. I’m glad there is finally one :)

A brief description of this project: readme_en.md
An introduction to Late Chunking: Late Chunking: Revolutionizing Text Retrieval with Long-Context Embeddings

Why We Should Not Do Overlap in Chunking (and What to Do Instead)

2025-03-07T00:00:00+00:00

When I was working on chunking long texts for embedding models, I tried to chunk texts into smaller segments and add some overlap. Because, as mentioned in many documents, articles, and tutorials, we should add overlaps to keep contextual information. However, this is not an elegant way to do chunking.

Overlapping chunks do help in preserving context, but they introduce a trade-off:

Too short can cause the loss of critical context.
Too long wastes computational resources and may degrade retrieval performance by generating larger chunks. FYI, Chunk large documents for vector search solutions in Azure AI Search.

A Structure-Aware Approach

Rather than relying on arbitrary overlaps, I found that leveraging the structure of a document can produce more semantically coherent segments. A simple pipeline is like:

Markdown OCR: Use an OCR engine that outputs markdown to capture the document’s natural hierarchy. This preserves important formatting like heading levels (#, ##, ###).
Structure Extraction: Identify and extract text segments between hierarchical markers. This creates natural boundaries.
Adaptive Chunking:
- For shorter sections, retain them as single, coherent chunks.
- For longer sections, apply large-context LLMs (for example, Gemini 1.5, which is free) to segment the text along semantic boundaries.

Several considerations

OCR Selection: Choose an OCR model that supports markdown output. I use MinerU for this purpose.
Length Thresholds: Carefully choose the threshold for long and short texts. Use character counts instead of token counts. (Or maybe dynamically adjusting it based on the long text length).

This approach provides semantically coherent segments. While I won’t provide detailed step-by-step instructions here, my experiments have shown that this strategy works pretty well.

To Do List: Late Chunking for Some Other Embedding Models

2025-02-24T00:00:00+00:00

3/15/2025: Anyway, an update can be found here: zh-late-chunking: Late Chunking for Chinese

3/10/2025: I decided not to do this project, since there’s a better way to do chunking as specified in this article: Why We Should Not Do Overlap in Chunking (and What to Do Instead).

This article is a little reminder to myself.

I have been working on some projects where I need to chunk long texts and retrieve relevant information from them. Surprisingly, I came across this article: Late Chunking in Long-Context Embedding Models. It introduces a new method to chunk long texts while keeping the context information. The method sounds promising and has two pros:

It requires far fewer computing resources that llm aided chunking.
It does as good as or even better than llm aided chunking according to Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models.

However, even though the authors claimed that it works on any embedding models that use avarage pooling technique, it still requires a lot of work to implement it. Plus, there is a language problem, because Chinese and other languages in late chunking differs from English. Therefore, to my knowledge, the only model supports this method is the jina-embedding series from Jina AI. Which I tried, and it does not work as well as the leading embedding models such as openai/text-embedding-3 and BAAI/bge-m3.

In the next few months, if I have time, I will try to implement this method for some other embedding models.

中国每天在发生什么：中国社会事件数据库

2025-01-07T00:00:00+00:00

2025.3.9: 抱歉网站已停止更新，2025.1-2025.3期间的数据仍然可以查看，您可以根据repository自行部署

项目网站为：csed.zheqiaoc.com

项目介绍在：这里

2025-01-31: 在Github开源，并做了很多优化

中国社会事件数据库的开发源于我的两个问题：中国每天在发生什么？民众每天在互联网上关注和接收哪些信息？

多数的传播学或政治学研究似乎更多关注特定事件的报道，而非整体的信息分布。因此，我希望可以通过这个项目，从事件的层次汇总数据，并进行分析。

这一项目有以下特点：

每日自动汇总信息，以时间线的形式进行展示。
政府回应检测，有小黄星的条目包含了政府的回应。
点击帖子标题可以跳转微博原帖。
对移动端和桌面端都有较好的支持。

未来希望可以完成的事情：

提供数据下载的页面或者API接口。
增加更多的功能，如事件分类，事件地图等。
增加更多的数据源，如公众号，抖音等。
开源（抱歉代码写得太差，没仔细检查之前不好意思开源）

我目前正在上学，知识了解十分有限，并且没有充足的时间进行后续开发维护，因此最终效果或许不尽如人意。如果你对这一项目有兴趣和想法，欢迎通过邮件联系我，可以在About页面找到我的邮箱。

Something New: China Social Event Database

2025-01-04T00:00:00+00:00

2025.3.9: Sorry the website is no longer updated, but data from 1/2025 - 3/2025 is still available

To visit the project website, please go to csed.zheqiaoc.com.

An introduction is available here

2025-01-31: The project is open-sourced at Github, many improvements have been made.

A core question in my mind is: What do people know about society and the world? Especially in China, after many years of information control, what do 1.4 billion people see every day?

Previous data and services have several limitations:

They usually focus on a single event rather than multiple events.
They do not pay much attention to information flow.
Data accessibility is a big problem.

Therefore, I think it would be valuable to have a system that monitors the Chinese internet, keeps track of what’s happening in China, and aggregates this information at the event level. This is what the CSED (China Social Event Database) does. it monitors the Weibo platform, tracks a list of key accounts, automatically aggregates information, and presents social media dynamics on a daily basis.

This project is still in its early stages, and here are a few things I hope to complete in the near future:

A comprehensive introduction to its methodology.
A user-friendly API for data downloads.
Additional features and algorithm adjustments.

Once again, you can visit the website at csed.zheqiaoc.com. Have fun :)

A Toolkit for OpenAI Batch

2024-11-21T00:00:00+00:00

The online service is no longer available

NEWS: I deployed the app online to make it easier to use. Visit openaibatch.vercel.app to give it a try!

The online service may not be able to process large CSV files. If you encounter any error, please clone the app and run it on your local device.

What is OpenAI Batch?

When using the OpenAI API for NLP tasks in social science research, I typically use the openai package with pandas to process CSV files, reading text and writing labels from OpenAI API responses. However, when a CSV file gets too large, the processing speed drops dramatically after handling a certain number of tasks. This is due to rate limits, as detailed in the rate limits documentation.

To tackle large workloads more efficiently, the best approach is to use OpenAI Batch. With OpenAI Batch, users can upload a JSONL file in a specific format. OpenAI processes the JSONL file (slower than standard API calls but more efficient for large jobs) and returns the results in the same format.

Key Advantages of OpenAI Batch

Higher daily usage limits compared to standard API calls.
Faster processing for large-scale NLP projects.

Workflow for Using OpenAI Batch

Here’s a simple workflow I follow to classify a set of sentences using OpenAI Batch:

Create a CSV file with my target sentences in one column.
Configure the task parameters and convert the CSV file to JSONL format.
Upload the JSONL file to the OpenAI Batch service. If there’s no error, wait for the results.
Download the processed JSONL file from the server and convert it back to a CSV file.

What Can OpenAI Batch Tools Do?

The above process requires some coding, and dealing with JSONL format and batch service limits can get pretty annoying. That’s why I made this app, which you can download here.

The app menu looks like this:

It contains three tools to streamline the workflow:

CSV to JSONL Converter
Converts a CSV file to JSONL format, which is required for OpenAI Batch processing.
JSONL File Splitter
Splits a JSONL file into smaller files of equal size. If the JSONL file exceeds batch service limits, you can split it into smaller files, register new OpenAI accounts to process them separately, and combine the results later.
JSONL Response Extractor
After downloading batch outcomes from the OpenAI server, this feature extracts the responses and converts them back to a CSV file.

How to Use

First, prepare a CSV file as input. Specify the column that contains the text you want to analyze in the “Text Column” field, configure the other parameters, and click the “Convert” button to generate a JSONL file.

# Example of the parameters

# Model: I often use gpt-4o-mini, which is cheap and strikes a good balance between speed and quality.
# Max Tokens: This is the maximum number of tokens shared between the prompt and the response. 
# One token is roughly 4 characters in English.
# Temperature: 1 is the default. A higher value gives more creative responses, a lower value gives more conservative ones.

Once the JSONL file is ready, you can upload it to the OpenAI Batch service to start processing.

After the batch service finishes processing, download the results and use the “JSONL Response Extractor” tool to convert them back to a CSV file.

Hope it can save you some time!

Try Requests If You Don’t Like Selenium

2024-10-07T00:00:00+00:00

Recently, I struggled to scrape data from a website built with Vue.js. When I tried to scrape the web data using the traditional Selenium approach, I encountered two problems:

Captchas appeared when I tried to load more pages.
Since the site is built with a JavaScript framework, the web page is not static. Accessing information on subpages using Selenium took a lot of time, and the URL did not change, which posed an additional challenge for Selenium.

I’d like to share how I dealt with these issues.

Captcha Solver

Initially, I tried using Chrome extensions to automatically solve captchas while using Selenium to load pages, but they didn’t work well. So, I turned to alternative solutions. I found a Chinese company that provides an excellent service. They offer a Python code snippet that solves captchas with the following process:

Locate the captcha’s XPath
Download the captcha image
Upload the captcha image to their server
Return the solved captcha value

They charge for this service, and I processed around 21,000 captchas, which cost $3—a fair price. In the next part, I’ll talk about how to integrate it with the Requests package, or with Selenium if you prefer to stick to the traditional method.

The service I used is Chaojiying. It provides a Python function, which can be imported to your python program like this:

!pip install requests
import requests
import chaojiying # Download this part at Chaojiying website
from chaojiying import Chaojiying_Client

def solveCaptcha():
    url1 = "URL FOR FETCHING CAPTCHA MODULE"
    params1 = {
        'type': 'test'  
    }

    response = requests.get(url1, headers=headers, cookies=cookies, params=params1)
    data = response.json()
    img = data['data']['img']
    img_id = data['data']['id']
    
    # Decode and save the captcha image
    img_bytes = base64.b64decode(img)
    with open('captcha.png', 'wb') as f:
        f.write(img_bytes)

    with open('captcha.png', 'rb') as f:
        im = f.read()
    
    chaojiying = Chaojiying_Client('CHAOJIYING USER NAME', 'CHAOJIYING USER PASSWORD', 'CHAOJIYING ID')  # Replace with valid credentials, have to sign up the account first

    res = chaojiying.PostPic(im, 1004)  # Adjust captcha type ID (1004) if needed
    
    # Submit the solved captcha
    url2 = "URL FOR CHECKING CAPTCHA MODULE"  
    params2 = {
        'id': img_id,
        'captcha': res["pic_str"],
        'type': 'test'
    }
    
    # Send the solved captcha
    response = requests.get(url2, headers=headers, cookies=cookies, params=params2)

There are many other companies that provide such services, but I haven’t tried them yet. I list them below for your reference:

Try Requests if You Have Trouble Setting Selenium

If you’re having trouble with Selenium, or if you’re tired of dealing with XPaths, you might want to try directly retrieving information from the server database. The way to do this is to use an API.

First, you need to figure out the API format. Right-click on the web page, select “Inspect” in Chrome, and click the Network tab at the top. Now, you can inspect the communication between the front end and the back end.

For example, I use Jekyll Talk. When I click on the link of interest, the Network panel changes.

This gives us information about the API request headers and response headers. Therefore, by reconstructing this information ourselves, we can simulate API requests and retrieve responses in JSON format.

Here’s a simple code snippet demonstrating how to use the Requests package in Python to fetch information from the server:

import requests

headers = {INSERT YOUR HEADERS HERE} # You can find your headers in the Inspect panel, or you can fake one
cookies = {INSERT YOUR COOKIES HERE} # You can also find your cookies in the Inspect panel; cookies may change with your login status
url = ""  # Insert the API endpoint here
params = {}  # Parameters of the website can be seen in 'General -> Request URL'

response = requests.get(url, headers=headers, cookies=cookies, params=params)

print(response.json())  # Get your response in JSON format

Some benefits of doing so:

Easily get all content directly from the database, which also includes some non-displayed or hard-to-fetch content.
It’s way faster than Selenium.
It’s way easier than Selenium.

Captcha Solver + Requests

If we want to combine the two, the logic would be simple:

if Captcha:
    get captcha from web server
    download captcha
    upload captcha to captcha solver server
    get captcha key
    send captcha key
else:
    get content with requests

This article will be continued and refined. Feel free to email me or leave a comment if you have any questions.

我的政治学、传播学、计算社会科学美国硕士项目申请心得

2024-03-19T00:00:00+00:00

前言

信息无论对谁都是至关重要的，然而在获取信息这件事上，总是存在各种困难和障碍，一些朋友们虽然有很好的bg但是因为信息差没有去申请更好更适合的项目。

在这个申请季我也曾在互联网上通宵翻阅各个论坛博客试图获取项目和申请的信息，得到了一些素不相识的，慷慨的申请者们的无私帮助。于是有了在申请季结束后把我掌握的信息分享给大家的想法，如果能对之后政治学和传播学的申请者，甚至其他社会科学专业的朋友有所帮助，那真是太棒了。

本文中的信息主要来源于我的互联网检索，我并没有和朋友们交流过这些信息，因此可能存在一些主观的甚至错误的判断，本文权当抛砖引玉，希望大家可以批判性阅读，也期待大家的指正!

我会继续更新这篇文章，最后一次修改的时间为：2024-7-1

一、背景、申请项目与录取结果

1. 我的背景

在申请时我就读于中国传媒大学的行政管理专业，课程设置会偏向政治学，又因为传媒大学的传播学特色，在4年里读了不少政治传播的文献，逐渐对这个方向产生了兴趣，最后有了申请political science和media and communication研究生项目的想法。另一方面又深感自己quantitative training的不足，于是同时申请了computational social science项目。

我的GPA尚可，但不是顶尖，曾有美国交换经历和PKU-UChicago暑校经历，但是偷懒未考GRE，且托福成绩仅及格而已。我用心打磨了SoP，表达了自己的学术兴趣和科研能力。我的研究方向是政治传播，主要使用causal inference和computational methods。

2. 申请结果

2024fall，主申美国，也申请了英国、欧陆、香港的项目。

美国项目

UChicago, MAPSS - Political Science track, 20k scholarship
UCSD, Master of Chinese Economic and Political Affairs (MCEPA), 1/3 scholarship
Columbia University, Political Science
UCLA, Social Science
UT-Austin, Journalism and Media - MA Research and Theory
NYU, MA in Politics
UMass at Amherst, MS in Data Analytics and Computational Social Science (DACSS)
Boston University, EMS, 10k scholarship
~~U Wisconsin-Madison, Journalism and Mass Communication~~
~~U Minnesota-Twin Cities, Mass Communication~~
~~NYU, ASSR~~

其他项目

Hertie School, MIA - dual degree track with Syracuse University
UManchester, MS in Social Research Methods and Statistics
HKU, MSocSc in Social Data Analytics

关于中介

我曾经接触过数家中介，但是感觉很难在服务质量和价格上达到平衡。大多数中介可能对社科，尤其是政治学领域不太了解，无法提供太多帮助，而能提供帮助的中介往往较昂贵（就我身边的案例来看棕榈是一个例子），我想可能半diy是一个比较好的办法。
我没有找中介，但是找了有经验的申请者帮我看文书，也找了native speaker润色，在这方面的投入可能1000左右，付出的时间成本挺多的。我身边有申请者找了中介一手包办，然后撒手享受大四的自由时光，我也很羡慕。
总之根据个人的状况做取舍，我的建议是想申学术项目的申请者们不论是否选择找中介，多投入一点时间了解申请相关的信息没有坏处。

二、申请之前的选校

我申请的项目主要为研究型，因此在这一节只谈面向学术型的master项目。在选校方面我花费了最大量的时间，遍历了美国Top院校的政治学、传播学、计算社会科学项目，也看了很多的就读体验，在这里想感谢一些博主的文章给我的帮助，我会在下面的参考资源中列出。
同时也感谢寄托、一亩三分地、小红书为我提供了申请季中的大部分信息。

1. 一些选校的小建议

不要为了申请而申请：
在选择之前不妨问问自己，“如果被录取的话我会不会选择这所学校？”避免浪费100美金左右的申请费。

看看Funding：
比如NYU的MA Politics一视同仁不给funding，纽约的生活成本让人望而却步，就算拿到了offer也不太可能去读，因此申请之前建议看看学校funding状况如何。

看看Faculty：
如果没有自己感兴趣的教授，申请前请三思。申请博士时会关注硕士期间的产出，如果没有和自己研究方向一致的教授，不仅无法在做研究的时候得到帮助，而且在申请写文书的时候也会面临不小的挑战。此处的教授仅包括：Professor, Associate Professor, and Assistant Professor。

申请传播学中政治传播方向的同学可以参考Jing Zhang写的帖子：总结：软科专排UStop50含政治传播方向项目与老师总结

不妨列一张表格：
用Google Sheets或者Excel管理申请的项目，可以列一列申请学校，截止日期（按照截止日期正序排列），申请费，语言要求，GRE要求，WES要求，SoP要求，WS要求，LoR要求，申请页面链接，学费，学习时长等等。这样对自己的申请可以有更加直观的掌握。

大胆申请：
不要在意他人给你的定位，也不要在意别人发的求定位帖子，如果有心爱的项目就大胆申请，申请了就有录取的可能，但是不申请录取的可能一定是0。对我来说最重要的参考标准是寄托、一亩三分地、thegradcafe上过往录取者的bg，但是也只是参考而已，每个个体都不一样，不要让他人的失败/成功影响你的申请。

选校数量：
我比较没自信，所以当时选了蛮多所学校申请的，其实现在回看一下并不需要申请这么多。另一个极端是在寄托看到的一位同学，仅申请了一个项目，最后得到录取，在佩服他勇气的同时也不禁为他捏了把汗。
我想，可能对跨专业的master申请者来说8所左右的学校足矣，否则申请费、托福送分费用是一个问题，不方便要推荐信是另一个问题。

2. 美国Political Science硕士项目

比较好的美国polisci master项目可能是这些学校的：MIT, Duke, UChicago, Columbia, NYU, UvA, GWU, Georgetown等。

4-5月（应该）会写一写我当初整理的其他硕士项目，包括综合排名稍低但是学术上不错的项目。

3. 美国Media and Communication硕士项目

同样，4-5月（应该）会写一写

请参考我的另一篇文章：美国computational social science硕士项目。

三、申请要素：GPA、语言、GRE、CV、SoP、PHS、WS、RL等

1. 硬实力

硬实力包括三部分：GPA、GRE、出身校。

GPA和GRE其实目的都是证明你的学习能力，如果有了高GPA，其实GRE可以是optional的，如果没有足够的时间备考其实也可以选择optional。只要不是申请顶级的院校或者顶级的项目，3.6或者3.7的GPA是完全够用的。

关于出身校，我的直观感受是除了那几所顶尖学校，或者在业内特别有名的学校之外，大家的差距都不是很大，不必因为出身校差而给自己减分，大胆申请吧！

2. 软实力

在申请季，GPA已经确定了，剩下文书等资料就是你的软实力了。大多数学术型项目会要求SoP (statement of purpose)或者PS (personal statement)，也有部分会要求PHS (personal history statement)，这些是你可以做文章提升竞争力的地方。我自己的SoP写作流程大致是：

写英文草稿 $\longrightarrow$ 填充细节得到长文章草稿 $\longrightarrow$ 找有经验的人帮忙看结构 $\longrightarrow$ 修改成一篇新长文章草稿 $\longrightarrow$ 找native speaker修改语法 $\longrightarrow$ 修改成一篇完整的长文章 $\longrightarrow$ 根据目标院校字数要求删减成一篇短文章 $\longrightarrow$ 再次找native speaker修改语法

经过这个流程之后，你应该会有一篇长文章和一篇短文章，之后还需要经过多次的打磨，最终得到可以提交的版本。不同学校对SoP的长度要求不同，我申请的学校大部分设置了500或者1000的词数限制，也有一些学校会有800词的限制，1000和500词长度的文章经过调整应该可以符合大多数学校SoP的字数要求。

3. 对各部分的一些小建议

成绩单
首先，关于陆本学生是否要WES认证的问题，我想还是认证一下比较好。用一百多美金换一个成绩提升的可能性，应该比较值得。

其次，在提交成绩的时候，可能会纠结提交哪一个成绩：WES前的，WES后的，或者是百分制的。我的建议是，哪一个成绩的竞争力最强就提交哪个，不必在意是百分制还是4分制，因为最后学校可能会根据成绩单重新计算一次你的GPA。

语言成绩
早准备，早考，最好线下考！我第二次考托福是线上的，考到一半莫名其妙被考官掐掉，追问了两周没有任何回音（该死的ETS），之后线下考试一次过。

但是我还想补充一点：语言成绩在申请中的作用其实没有那么高。对于绝大多数项目来说，只要达到官网给的标准线就够了；对于少数项目来说，甚至略低于官网的标准线也可以录取。比如NYU的Politics项目虽然说托福卡100分，但是也写了鼓励托福成绩不达标的申请者申请。

就我所知，语言成绩要求比较高的是Purdue的PoliSci和Communication项目，口语卡27分。除此之外，UChicago的MAPSS要求口语25分才能给录取（23-25分的可以参加UChicago的AEPA口语测试）。

CV
一定要多找有经验的人修改。曾经我觉得自己的CV写的不错了，直到另一位同学给我提了三点很关键的建议，我才发现自己的CV原来还有很多进步空间。

如果对CV怎么写没有把握，可以随便访问一所学校的对应department，然后看看这些学校graduate students的personal pages，那里可能会有他们的个人网站/CV，可以酌情参考。比如这里Standord Politcal Science Graduate Students。

这里也有一些NYU graduate school applicants的优秀CV可以参考CVs

SoP (statement of purpose)
多改多润色。私以为一篇好的SoP应该做到以下几点：

阐述自己的1-2个研究兴趣以及具体的研究问题
阐述兴趣产生的motivation
阐述为了回答这些研究问题你做了哪些准备（研究经历）
阐述你未来打算怎么回答这些研究问题（研究计划）
解释为什么申请这个项目，并指出2位匹配你兴趣或对你的研究有帮助的faculties
展望毕业之后的去向（就业、读博）

PHS (personal history statement)
一些学校可能在SoP之外还要求PHS。如果说SoP讲的是你的研究，那么PHS讲的就是你的生活。不要认为自己很普通，每个人走到今天一定都会经历过独一无二的困难和挑战，请记录它们。

这里有一些示例：GETTING PREPARED FOR GRADUATE SCHOOL

WS (writing sample)
WS是展现你学术能力的地方，可以是课堂论文、毕业论文的选段，也可以只是一篇research proposal。我自己提交了一篇research proposal，没有quantitative analysis的部分，但是综述自认写得不错，建议大家在提交writing sample之前也找人润色一下。

RL (reference letter)
i人推荐使用interfolio，60$订阅一个申请季。可以在线储存教授的推荐信，然后提交到指定项目，不用每次去联系教授催她/他上传推荐信。缺点是有一些项目不接受interfolio（我自己遇到的是BU的传播学系不接受，其他都可以），而且rating的部分会跳过。

如果不使用interfolio的话，建议提早规划，给教授们多一些时间写作。请列一张表格，注明所有需要她/他提交推荐信的项目和截止日期。在推荐人的选择上，陆本学生建议优先选择熟悉你的老师而不是title高的老师。

四、一些对我有帮助的资源

一位台湾同学的政治学申请指南：美國、英國政治學領域碩士申請心得與流程指南
徐轶青老师的社科申请Q&A：徐轶青：如何申请北美社科类博士项目？
寄托上一位同学写的政治学国际关系申请指南：政治学与国际关系 MPP/MIA/EAS/IR/PS-Master项目申请“不完全指北”（美英新加）
Hongtao Hao写的传播学申请心路历以及申请材料：我的美国博士申请之路 , 美国博士申请的一些经验和我的申请材料
Jing Zhang写的美国传播学系政治传播方向的faculty介绍：总结：软科专排UStop50含政治传播方向项目与老师总结（part1）, 总结：软科专排UStop50含政治传播方向项目与老师总结（part2）
一位台湾同学写的SoP写作超详细教程：留學–SOP解析
Yuhua Wang老师整理的优秀政治学博士申请文书：For future graduate students in political science
Lijin Zhang写的PhD申请指南：Tutorial-on-PhD-Application
一位在PhD招生委员会工作过的“局内人”写的申请Q&A：FAQs on PhD applications
一位计量心理学方向的同学写的PhD申请指南：北美博士项目申请经验
一位经济学方向同学写的PhD申请指南：如何申请一个北美PhD项目
优秀文书网站（偏向PS）：Best Personal Statements are Here
留学准备美国传播学选校信息（学术向）
北美/香港传播学申请保姆级经验贴（2022）
2022 Fall 北美电影/媒介研究 PhD申请个人经验总结

美国Computational Social Science硕士项目

2024-03-15T13:53:00+00:00

一、前言

由于美国纯计算社会科学硕士项目着实不多，因此稍微扩大了一下范围，囊括了一些教统计方法，但是没有那么硬核的项目。

二、正文

MACSS 两年制，在Social Science学院下，知名读博跳板项目，网上能找到不少申请攻略和就读体验，例如这里 MACSS就读体验

2. UChicago, Master of Science in Computational Analysis and Public Policy, MSCAPP

MSCAPP 两年制，在Harris公共政策学院下，就读体验也很多，例如 UChicago MSCAPP感受

MACSS 一年制，2024年秋新开的项目，可能略微实践导向，有些好奇这个项目的bar，等24fall申请季结束后大家可以关注一下

CSS 一年制，课程范围很广，而且很前沿，可以享受UCSD的中国研究方面强大的师资

DACSS 一年到两年，允许part-time，一个刚开不久的项目（感觉staff不是很上心，我之前发的几封询问信息的邮件都没有得到回复）

Computational Social Science Concentration 没有了解过，可以作为CSS方面的保底项目

A3SR 两年制，允许part-time，课程更偏向统计学，可能比较适合想做Social Science的有理工科背景的学生。看项目官网比较直观的感受是强调social impact，可能实践色彩会比较浓

8. UMich, Master of Science in Survey and Data Science

Survey and Data Science 一年到两年，偏向统计学，size较小

MSSP+DA 没有了解过，每年申请人很多，寄托上也有很多信息

10. Georgetown, Master of Science in Data Science for Public Policy

Data Science for Public Policy 没有了解过，申请人也很多

11. UMaryland-College Park, Master’s in Survey and Data Science

Survey and Data Science 两年制，主要是就业导向。有三个方向：社会和心理学、调查统计和数据科学

三、结语

第一版写得比较粗糙，之后会再次更新。作为学科交叉的新方向，预计计算社会科学的申请者会越来越多，希望这篇文章对25fall及以后的申请者们有所帮助。

整理了一些中国研究数据库

2023-10-16T00:00:00+00:00

收集了一些中国研究的数据库，简介着实比较简单潦草，有一些信息也没有及时更新，如果未来有时间（可能）会慢慢修正并（可能会）添加一些的

一

中文名：中国综合社会调查
英文名：CGSS
官网： http://cgss.ruc.edu.cn/
简介：人大的调查项目，比较权威，目前最新是2018
数据下载： http://www.cnsda.org/index.php?r=projects/view&id=35694191

二

中文名：中国社会状况综合调查
英文名：CSS
官网： http://csqr.cass.cn/index.jsp
简介：社科院社会学所的项目，目前最新是2019
数据下载： http://csqr.cass.cn/DataExplore/?ProjectID=2018061909463245927261066314

三

中文名：中国家庭追踪调查
英文名：CFPS
官网： http://www.isss.pku.edu.cn/cfps/index.htm
简介：北大的项目，目前最新是2020
数据下载： https://opendata.pku.edu.cn/dataset.xhtml?persistentId=doi:10.18170/DVN/45LCSO

四

英文名：WVS
官网： https://www.worldvaluessurvey.org/wvs.jsp
简介：Inglehart主持的世界价值观调查，目前最新是第七波（2017-2022）
数据下载： https://www.worldvaluessurvey.org/WVSContents.jsp

五

中文名：亚洲民主动态调查（亚洲晴雨表）
英文名：Asian Barometer
简介：台大胡佛中心的项目，目前最新是第五波（中国数据最新是第四波2014-2016）
数据下载： https://www.asianbarometer.org/data?page=d10

六

英文名：East Asian Social Survey
简介：中、韩、日、台，最新2018
数据下载： https://www.icpsr.umich.edu/web/ICPSR/studies/38489/summary

七

中文名：中国反腐数据库
英文名：China’s Anti-Corruption Campaign
简介：小数据库，up-to-date
数据下载： https://www.chinafile.com/infographics/visualizing-chinas-anti-corruption-campaign

八

中文名：中国家庭收入调查
英文名：CHIP
简介：研究收入分配，最新为2018
数据下载：http://www.ciidbnu.org/chip

九

中文名：中国抗议数据集
英文名：China Dissent
简介：从2022.5.18开始的中国抗议数据
数据下载：https://chinadissent.net/zh

十

中文名：中国腐败调查数据
英文名：Catching Tigers and Flies
简介：China File的数据可视化项目，可以填表下载data
数据下载：https://www.chinafile.com/infographics/visualizing-chinas-anti-corruption-campaign

十一

中文名：中国历史传记数据
英文名：China Biographical Database Project (CBDB)
简介：Harvard的项目，记录了历史人物生平及其社会网络
数据下载：https://projects.iq.harvard.edu/cbdb/home

十二

中文名：中国历史地理信息系统
英文名：The China Historical Geographic Information System, CHGIS
简介：提供了一个基础 GIS 平台，用于空间分析或将中国的历史划分可视化为数字地图
数据下载：https://chgis.fairbank.fas.harvard.edu/

十三

中文名：中国私营企业调查
英文名：China Private Enterprise Survey
简介：社科院的数据
数据下载：https://cpes.zkey.cc/

Databases for Chinese Research